A general best practice for crawlers is that no single process runs for more
than an hour, or five at most.  Every crawler process updates a central state
store with its progress, and exits when it reaches its time limit, knowing
that another process will pick up the work where it left off.  This avoids a
multitude of ills.
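For concreteness, here is a minimal sketch of that pattern.  All names here
are hypothetical, and a toy in-memory queue stands in for the central state
store (in practice that would be a database table or similar):

```python
import time
from collections import deque


class StateStore:
    """Toy in-memory stand-in for a shared, central state store."""

    def __init__(self, batches):
        self.pending = deque(batches)
        self.done = []

    def claim_next_batch(self):
        return self.pending.popleft() if self.pending else None

    def mark_done(self, batch):
        self.done.append(batch)


def run_worker(store, fetch, time_limit=3600.0):
    """Crawl batches until the time budget is spent, checkpointing each one."""
    start = time.monotonic()
    while time.monotonic() - start < time_limit:
        batch = store.claim_next_batch()
        if batch is None:
            break  # nothing left to do
        fetch(batch)            # crawl this batch of URLs
        store.mark_done(batch)  # checkpoint so a successor can resume
    # Hitting the limit just means returning; another worker takes over.
```

The key design point is that the worker never holds state only in memory:
every completed batch is checkpointed, so a replacement process loses at most
one batch of work when this one exits.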

On Tue, Sep 21, 2010 at 11:53 AM, Tim Robertson
<timrobertson...@gmail.com> wrote:

> > On the topic of your application, why are you using processes instead of
> > threads?  With threads, you can get your memory overhead down to tens of
> > kilobytes as opposed to tens of megabytes.
>
> I am just prototyping scaling out to many processes, potentially
> across multiple machines.  Our live crawler runs in a single JVM, but
> some of these crawls take 4-6 weeks, and such long-running processes
> block others, so I was looking at alternatives.  Our live crawler also
> uses DOM-based XML parsing, so it hits memory limits; SAX would
> address this.  We also want to be able to deploy patches to the
> crawlers without interrupting those long-running jobs, if possible.
