Same question for me ... I lost a 48h crawl recently, so I'd like to
understand how to do it in a more incremental way.

2009/5/19 Larsson85 <[email protected]>

>
> I'm about to do a pretty big crawl, and when I generate my segments it says
> "jobtracker is 'local', generating exactly one partition".
>
> My problem is that I can't bet on the crawler not crashing at some point
> during the period I'm about to crawl. And from what I can understand, a
> crash while crawling a segment means that I have to redo the whole segment.
> (Is this wrong?)
>
> So my idea was to create many segments and write a batch file which runs
> them one right after another; if one crashes, that's no problem, I just
> continue and redo the segments that didn't get crawled.
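>
> Roughly what I have in mind is a loop like this (an untested sketch,
> assuming the usual bin/nutch generate/fetch/updatedb commands and a crawl/
> directory layout; the -topN value and thread count are just placeholders):
>
>   #!/bin/sh
>   # Do several small generate/fetch/updatedb rounds instead of one huge
>   # segment, so a crash only loses the segment currently being fetched.
>   for round in 1 2 3 4 5; do
>     bin/nutch generate crawl/crawldb crawl/segments -topN 50000
>     segment=`ls -d crawl/segments/2* | tail -1`   # newest segment dir
>     bin/nutch fetch $segment -threads 10
>     bin/nutch updatedb crawl/crawldb $segment
>   done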
>
> From reading on the net I've realized that I can't use -numFetchers on
> local jobs, and that I have to set it to -ndfs. But I just can't seem to
> get this to work.
>
> bin/nutch generate -ndfs <nameserver:port> crawl/crawldb crawl/segments -numFetchers 10
>
> is basically what I would like to do, but I have no clue what the
> <nameserver:port> is. The more I read about ndfs, the more I start to
> doubt that's really what I want to do.
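>
> For what it's worth, my current reading is that the "jobtracker is
> 'local'" message means Hadoop is running everything in local mode, so
> -numFetchers is simply ignored. Below is a rough sketch of what I think
> would have to go into conf/hadoop-site.xml to point Nutch at a real
> (pseudo-)distributed Hadoop setup; the host names and ports are made-up
> placeholders, not something I have verified:
>
>   <configuration>
>     <property>
>       <name>fs.default.name</name>
>       <value>hdfs://namenode-host:9000</value>
>     </property>
>     <property>
>       <name>mapred.job.tracker</name>
>       <value>jobtracker-host:9001</value>
>     </property>
>   </configuration>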
>
> Is there perhaps a way to split segments after they're generated, just as
> there's a way to merge them with mergesegs?
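>
> From what little I've found, mergesegs also seems to accept a -slice
> option that writes the merged output as several fixed-size segments,
> which sounds like it could double as a way to split. Something like the
> line below is what I was hoping would work (untested, and I'm not sure
> the option really behaves the way I think):
>
>   bin/nutch mergesegs crawl/segments_sliced -dir crawl/segments -slice 50000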
>
> Why is this so hard? Have I missed something? I can't be the first who
> wants to do a fail-safe crawl and doesn't want to lose all the work if
> the connection or computer crashes.
