Hi Brian,
On Thursday, July 18, 2013, brian4 <[email protected]> wrote:
> On one machine, nutch just suddenly started freezing during the generator
> job.
Are these continuous crawls? What values do
you have set for generate.max.count? I ask as calls must be made to the
backend to determine a limit for URLs to generate into batches... I suppose
if you're running with a -1 value for this figure the call could be
expensive as well.
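
For reference, this is roughly where I would expect that setting to live. It is only a sketch: the value 1000 is an arbitrary illustration rather than a recommendation, and I am assuming you override it in conf/nutch-site.xml and that generate.count.mode is available in your Nutch version.

```xml
<!-- conf/nutch-site.xml: illustrative values only, not a recommendation -->
<property>
  <name>generate.max.count</name>
  <!-- -1 means unlimited; 1000 here is an arbitrary example cap -->
  <value>1000</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- assumed: count URLs per host when applying the cap above -->
  <value>host</value>
</property>
```

If the property is absent from your nutch-site.xml, the value from nutch-default.xml applies.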
>
> I can also run the same crawl (using all of the same programs and files)
> from another machine and it runs fine.  Although it is one machine for now,
> I am worried that it might randomly happen on other machines at some point
> as well, so I can't rely on it for regular crawling.

Mmm. So maybe you are not doing continuous large scale crawls as I thought
above?

> Looking at the dumps, it looks like it may be due to / related to a deadlock
> caused by a zookeeper/hbase issue listed at the following link, but maybe it
> can be avoided in the nutch generator itself.
>
> https://issues.apache.org/jira/browse/HBASE-2966

Yep

>
>
> However even if that is the cause we would have to wait for gora to be
> updated to use the fixed hbase once it's fixed and then for nutch to be
> updated to use the updated gora, so I am hoping maybe someone has an idea of
> a workaround I could use now.

I've not heard of anyone coming here with a similar problem! I am confused on
this one.

>
> Otherwise I am thinking of trying to switch to another data store.  Which
> data store is most reliable and does not have such deadlock issues?

If this is a problem with a zookeeper server then it may not be linked to
Gora. There is not one line of zookeeper code within Gora. I would check
your hbase/zk installation before you think about ditching everything and
jumping ship.

> It seems like maybe a lot of people use Cassandra, but I had the impression
> there were more issues getting it to work correctly than with HBase.

Each to their own I suppose here. There are a number of *stable* backends
which can be used. If getting things working easily is your primary
criterion then I wouldn't say there is much between the available options.
hth

-- 
*Lewis*
