Dear All,

We need to crawl a number of sites amounting to around 2 million pages and several GB of data. I know this is well within the ability of Nutch, but I have a question about the right strategy to pursue so that I can crawl and index with the least amount of bother.
Obviously I would need a single web database, create several manageably sized segments, run the fetcher on each one, and subsequently update the database; no problem so far. But what would then be the right strategy for producing an index that can be searched? Should I:

1) merge all the segments and then index them,
2) index each segment individually and then merge the indexes, keeping the segments separate, or
3) index each segment separately, keep both segments and indexes separate, and search across multiple indexes (though I have heard there are issues with ranking in that case)?

Please let me know your views! Thanks a lot!

Regards

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
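For reference, the generate/fetch/update cycle I have in mind looks roughly like the sketch below. This is a command fragment, not a runnable script: it assumes a Nutch installation with the `bin/nutch` CLI on the path, and the exact subcommand names and arguments (`admin`, `inject`, `generate`, `fetch`, `updatedb`, `index`, `merge`) may differ between Nutch versions, so please treat it as an illustration of the workflow rather than exact syntax.

```shell
# Create the web database once (assumed 0.x-era syntax)
bin/nutch admin db -create

# Seed it with the start URLs (urls.txt is a hypothetical seed file)
bin/nutch inject db -urlfile urls.txt

# Repeat this cycle until the crawl is deep enough:
# 1. generate a manageably sized segment from the db
bin/nutch generate db segments
# 2. fetch the most recently generated segment
segment=`ls -d segments/* | tail -1`
bin/nutch fetch $segment
# 3. fold the fetched links back into the db
bin/nutch updatedb db $segment

# Then, one of the strategies from the question, e.g. option 2:
# index each segment individually...
for s in segments/*; do
  bin/nutch index $s
done
# ...and merge the per-segment indexes into one searchable index
bin/nutch merge index segments/*
```

My question is essentially about the last two steps: whether this per-segment indexing plus index merge is preferable to merging the segments first, or to skipping the merge and searching the per-segment indexes directly.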
