I'm about to do a pretty big crawl, and when I generate my segments it says "jobtracker is 'local', generating exactly one partition."
My problem is that I can't bet on the crawler not crashing at some point during the period I'm about to crawl, and from what I understand, a crash while crawling a segment means I have to redo the whole segment (is this wrong?). So my idea was to generate many segments and write a batch file that fetches them one right after another; if the crawler crashes, that's no problem, I just continue and redo the segments that didn't get crawled (see the sketch at the end of this message).

From reading around on the net I've gathered that -numFetchers has no effect on local jobs, and that I have to use NDFS instead. But I just can't seem to get this to work. Something like

  bin/nutch generate -ndfs <nameserver:port> crawl/crawldb crawl/segments -numFetchers 10

is basically what I would like to do, but I have no clue what the <nameserver:port> should be. The more I read about NDFS, the more I start to doubt that it's really what I want.

Is there perhaps a way to split a segment after it has been generated, just like there's a way to merge them with mergesegs? Why is this so hard? Have I missed something? I can't be the first who wants to do a fail-safe crawl and doesn't want to lose all the work if the connection or the computer crashes.
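To make it concrete, this is roughly the batch file I have in mind. It's only a sketch and untested; the -topN value, the number of rounds, and the crawl/ paths are placeholders for whatever I end up using:

  #!/bin/sh
  # Generate and fetch several small segments one after another, so that a
  # crash only costs the segment that was being fetched at that moment.
  for i in 1 2 3 4 5; do
    # generate a new (small) segment from the crawldb
    bin/nutch generate crawl/crawldb crawl/segments -topN 10000
    # pick the segment that was just created (segments are named by timestamp)
    segment=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done

The idea being that if a fetch dies, I just rerun the loop and only the interrupted segment has to be redone.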
