Briggs wrote:
>> Are you running this in a distributed setup, or in "local" mode? Local
>> mode is not designed to cope with such large datasets, so it's likely
>> that you will be getting OOM errors during sorting ... I can only
>> recommend that you use a distributed setup with several machines, and
>> adjust RAM consumption with the number of reduce tasks.
>
> Currently we are running in local mode. We do not have the setup for
> distributing. That is why I want to merge these segments. Would that
> not help? Instead of having potentially tens of thousands of
> segments, I want to create several large segments and index those.

Yes, it makes perfect sense, but you are probably hitting the limits of a single machine. I suggest doing the merging in several steps. By trial and error, find the maximum number of segments that SegmentMerger can process without failing. In the first pass, merge the small segments into larger ones in batches of that size; then, in a second pass, merge those larger segments into the really large ones.

> Sorry for my ignorance, but not really sure how to scale nutch
> correctly. Do you know of a document, or some pointers as to how
> segment/index data should be stored?

Most of this information is already available on the Nutch Wiki. All I can say is that there is certainly a limit to what you can do in "local" mode: if you need to handle large numbers of pages, you will need to migrate to a distributed setup.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
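
As an illustration of the multi-pass merging suggested in the reply, here is a minimal Python sketch that only plans the merge commands for one pass; it does not run anything. It assumes the merger is invoked as `bin/nutch mergesegs <output_dir> <seg1> <seg2> ...`, which should be checked against the usage message of the Nutch version in use. The helper names (`chunk`, `plan_merge_pass`), the `batch-NNNN` output naming, and the example paths are hypothetical. The batch size is the value found by trial and error.

```python
from typing import List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def plan_merge_pass(segments: Sequence[str], out_root: str, batch_size: int,
                    nutch_cmd: str = "bin/nutch") -> List[List[str]]:
    """Return one merge command (as an argv list) per batch.

    Assumption: the command line has the form
    `bin/nutch mergesegs <output_dir> <seg1> <seg2> ...`.
    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = []
    for n, batch in enumerate(chunk(segments, batch_size)):
        # Hypothetical naming scheme: one output directory per batch.
        out_dir = f"{out_root}/batch-{n:04d}"
        commands.append([nutch_cmd, "mergesegs", out_dir, *batch])
    return commands


if __name__ == "__main__":
    # Seven small segments merged three at a time (example paths only).
    small = [f"segments/{i:03d}" for i in range(7)]
    for cmd in plan_merge_pass(small, "merged-pass1", batch_size=3):
        print(" ".join(cmd))
```

A second pass would apply the same planning to whatever segment directories the first pass produced, possibly with a smaller batch size, since the inputs are larger.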
