On 24/09/2009, Ken Krugler <[email protected]> wrote:
> Hi Tobi,
>
> First, I'd suggest getting and reading through the sources of existing
> Java-based web crawlers. They all use HttpClient, and thus would provide
> much useful example code:
>
> Nutch (Apache)
> Droids (Apache)
> Heritrix (Archive)
> Bixo (http://bixo.101tec.com)
>
> Some comments below:
>
> On Sep 23, 2009, at 9:10pm, Tobias N. Sasse wrote:
> >
> > Hi Guys,
> >
> > I am working on a parallel webcrawler implementation in Java. I could use
> > some help with some design questions and a bug that costs me sleep ;-)
> >
> > First thing, this is my design: I have a list, which stores URLs that
> > have been crawled already. Further, I have a Queue which is responsible for
> > providing the crawler with the next URL to fetch. Then I have a
> > ThreadController which spawns new crawler threads until a maximum number is
> > reached. Finally, there are crawler threads that process a URL given by the
> > queue. They work until the queue size is zero and then the system stops.
> >
> > Following is my question: I am using (basically) the following statements.
> > As I am new to HttpClient this could probably be a dumb approach, and I am
> > happy for feedback.
> >
> > <snip from WebCrawlerThread>
> > DefaultHttpClient client;
> > HttpGet get;
> >
> > public void run() {
> >     client = new DefaultHttpClient();
> >     HttpResponse response = client.execute(get);
> >     HttpEntity entity = response.getEntity();
> >     String mimetype = entity.getContentType().getValue();
> >     String rawPage = EntityUtils.toString(entity);
> >     client.getConnectionManager().shutdown();
> >
> >     (...) doing crawler things
> > }
> > </snap>
> >
> > First thing: Is the thread the right place to host the client object, or
> > should it be shared?
> >
> You should use the ThreadSafeClientConnManager, and reuse the same
> DefaultHttpClient instance for all threads.
>
> See the init() method of Bixo's SimpleHttpFetcher class for an example of
> setting this up.
>
> > Second: Would it enhance performance if I reuse the connection somehow?
> >
> Yes, via keep-alive. Though you then have to be a bit more careful about
> handling stale connections (ones that the server has shut down).
>
> Again, take a look at the Bixo SimpleHttpFetcher class for some code that
> tries (at least) to do this properly.
>
> > And most important, the bug: With an increasing number of pages I receive
> > zillions of
> >
> > "java.net.BindException: Address already in use: connect"
> >

I've seen this error generated when a WinXP host runs out of sockets, i.e.
the message is misleading in this case.

> No idea, sorry.
>
> But I think that by default HttpClient limits the number of parallel
> requests to one host to two. Not sure if that would be a factor in your
> case, given how you're creating a new client for each request.
>
> -- Ken
>
>
> --------------------------
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-210-6378
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
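
For readers following the thread, here is a minimal sketch of the shared-client
setup Ken describes, written against the HttpClient 4.0 API current at the time.
It is not Bixo's actual SimpleHttpFetcher.init() code; the factory class name
and the pool limits (100 total connections, 10 per route) are illustrative
choices, not values from the original mail.

import org.apache.http.client.HttpClient;
import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.conn.params.ConnManagerParams;
import org.apache.http.conn.params.ConnPerRouteBean;
import org.apache.http.conn.scheme.PlainSocketFactory;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager;
import org.apache.http.params.BasicHttpParams;
import org.apache.http.params.HttpParams;

// Hypothetical helper class: builds ONE DefaultHttpClient backed by a
// ThreadSafeClientConnManager, to be handed to every crawler thread.
public class SharedClientFactory {

    public static HttpClient createSharedClient() {
        HttpParams params = new BasicHttpParams();
        // Illustrative limits. The default per-route maximum is 2, which is
        // the "two parallel requests per host" limit Ken mentions; raise it
        // if the crawler should fetch more pages from one host concurrently.
        ConnManagerParams.setMaxTotalConnections(params, 100);
        ConnManagerParams.setMaxConnectionsPerRoute(params, new ConnPerRouteBean(10));

        SchemeRegistry registry = new SchemeRegistry();
        registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80));

        ClientConnectionManager cm = new ThreadSafeClientConnManager(params, registry);
        return new DefaultHttpClient(cm, params);
    }
}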

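And a rough sketch of how the crawler thread's run() method could use that
shared client, so connections go back to the pool for keep-alive reuse instead
of being opened and torn down once per page. That per-page churn is what tends
to exhaust the ephemeral ports on a WinXP box and produce the BindException.
The constructor and field layout here are assumptions for the example, not
code from the original post.

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class WebCrawlerThread implements Runnable {

    private final HttpClient client;   // the shared, pooled instance from above
    private final String url;          // next URL handed out by the queue

    public WebCrawlerThread(HttpClient client, String url) {
        this.client = client;
        this.url = url;
    }

    public void run() {
        HttpGet get = new HttpGet(url);
        try {
            HttpResponse response = client.execute(get);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                String mimetype = (entity.getContentType() != null)
                        ? entity.getContentType().getValue() : null;
                // Fully consuming the entity releases the connection back to
                // the pool for reuse. Do NOT call
                // client.getConnectionManager().shutdown() per request, or
                // pooling is defeated and every fetch burns a fresh socket.
                String rawPage = EntityUtils.toString(entity);
                // (...) doing crawler things with mimetype and rawPage
            }
        } catch (IOException e) {
            // Aborting ensures a half-read connection is not returned to the
            // pool in an unusable state.
            get.abort();
        }
    }
}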