On 24/09/2009, Ken Krugler <[email protected]> wrote:
> Hi Tobi,
>
> First, I'd suggest getting and reading through the sources of existing
> Java-based web crawlers. They all use HttpClient, and thus would provide
> much useful example code:
>
> Nutch (Apache)
> Droids (Apache)
> Heritrix (Archive)
> Bixo (http://bixo.101tec.com)
>
> Some comments below:
>
> On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:
> >
> > Hi Guys,
> >
> > I am working on a parallel webcrawler implementation in Java. I could use
> > some help with some design questions and a bug that costs me sleep ;-)
> >
> > First thing, this is my design: I have a list, which stores URLs that
> > have been crawled already. Further, I have a Queue which is responsible for
> > providing the crawler with the next URL to fetch. Then I have a
> > ThreadController which spawns new crawler threads until a maximum number is
> > reached. Finally, there are crawler threads that process a URL given by the
> > queue. They work until the queue size is zero and then the system stops.
> >
> > Following is my question: I am using (basically) the following statements.
> > As I am new to HttpClient this could probably be a dumb approach, and I am
> > happy for feedback.
> >
> > <snip from WebCrawlerThread>
> > DefaultHttpClient client;
> > HttpGet get;
> >
> > public void run() {
> >     client = new DefaultHttpClient();
> >     HttpResponse response = client.execute(get);
> >     HttpEntity entity = response.getEntity();
> >     String mimetype = entity.getContentType().getValue();
> >     String rawPage = EntityUtils.toString(entity);
> >     client.getConnectionManager().shutdown();
> >
> >     (...) doing crawler things
> > }
> > </snap>
> >
> > First thing: Is the thread the right place to host the client object, or
> > should it be shared?
> >
> You should use the ThreadSafeClientConnManager, and reuse the same
> DefaultHttpClient instance for all threads.
>
> See the init() method of Bixo's SimpleHttpFetcher class for an example of
> setting this up.
>
> > Second: Would it enhance performance if I reuse the connection somehow?
> >
> Yes, via keep-alive. Though you then have to be a bit more careful about
> handling stale connections (ones that the server has shut down).
>
> Again, take a look at the Bixo SimpleHttpFetcher class for some code that
> tries (at least) to do this properly.
>
> > And most important, the bug: With an increasing number of pages I receive
> > zillions of
> >
> > "java.net.BindException: Address already in use: connect"
> >

I've seen this error generated when a WinXP host runs out of sockets, i.e.
the message is misleading in this case.

> No idea, sorry.
>
> But I think that by default HttpClient limits the number of parallel
> requests to one host to two. Not sure if that would be a factor in your
> case, given how you're creating a new client for each request.
>
> -- Ken
>
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
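
For readers following the thread, here is a minimal sketch of the shared-client
setup Ken describes, written against the HttpClient 4.0 API current at the time.
It is not Bixo's actual SimpleHttpFetcher.init() code; the factory class name
and the pool limits (100 total connections, 10 per route) are illustrative
choices, not values from the original mail.

import org.apache.http.client.HttpClient;
import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.params.ConnPerRouteBean;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

// Hypothetical helper class: builds ONE DefaultHttpClient backed by a
// ThreadSafeClientConnManager, to be handed to every crawler thread.
public class SharedClientFactory {

    public static HttpClient createSharedClient() {
        HttpParams params = new BasicHttpParams();
        // Illustrative limits. The default per-route maximum is 2, which is
        // the "two parallel requests per host" limit Ken mentions; raise it
        // if the crawler should fetch more pages from one host concurrently.
        ConnManagerParams.setMaxTotalConnections(params, 100);
        ConnManagerParams.setMaxConnectionsPerRoute(params, new ConnPerRouteBean(10));

        SchemeRegistry registry = new SchemeRegistry();
        registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));

        ClientConnectionManager cm = new ThreadSafeClientConnManager(params, registry);
        return new DefaultHttpClient(cm, params);
    }
}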

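And a rough sketch of how the crawler thread's run() method could use that
shared client, so connections go back to the pool for keep-alive reuse instead
of being opened and torn down once per page. That per-page churn is what tends
to exhaust the ephemeral ports on a WinXP box and produce the BindException.
The constructor and field layout here are assumptions for the example, not
code from the original post.

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class WebCrawlerThread implements Runnable {

    private final HttpClient client;   // the shared, pooled instance from above
    private final String url;          // next URL handed out by the queue

    public WebCrawlerThread(HttpClient client, String url) {
        this.client = client;
        this.url = url;
    }

    public void run() {
        HttpGet get = new HttpGet(url);
        try {
            HttpResponse response = client.execute(get);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                String mimetype = (entity.getContentType() != null)
                        ? entity.getContentType().getValue() : null;
                // Fully consuming the entity releases the connection back to
                // the pool for reuse. Do NOT call
                // client.getConnectionManager().shutdown() per request, or
                // pooling is defeated and every fetch burns a fresh socket.
                String rawPage = EntityUtils.toString(entity);
                // (...) doing crawler things with mimetype and rawPage
            }
        } catch (IOException e) {
            // Aborting ensures a half-read connection is not returned to the
            // pool in an unusable state.
            get.abort();
        }
    }
}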