Keith,
just copy all your segment folders into one folder.
Then create a new, empty webdb. Next, you can simply update your new webdb with all your existing segments.
The result is a merge of your 2 crawl directories.
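The steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested recipe: the helper name `merge_segments` and the target folder layout are made up for illustration, and the `bin/nutch admin <db> -create` and `bin/nutch updatedb <db> <segment>` command forms are assumed from the Nutch tutorial of this era and may differ in your version. The script copies the segment folders and only returns the commands it would run, so you can check them before executing anything.

```python
import shutil
from pathlib import Path


def merge_segments(crawl_dirs, target):
    """Copy every segment folder from each crawl directory into target/segments.

    Returns the Nutch commands (as argument lists) that would then build a
    fresh webdb from the copied segments. Nothing is executed here.
    """
    target = Path(target)
    merged = target / "segments"
    merged.mkdir(parents=True, exist_ok=True)

    copied = []
    for crawl_dir in crawl_dirs:
        for segment in sorted((Path(crawl_dir) / "segments").iterdir()):
            if not segment.is_dir():
                continue
            dest = merged / segment.name
            # Refuse to overwrite if two crawl directories hold a segment
            # folder with the same name.
            if dest.exists():
                raise FileExistsError(f"segment name collision: {segment.name}")
            shutil.copytree(segment, dest)
            copied.append(dest)

    db = str(target / "db")
    # Assumed command forms: create an empty webdb, then update it once per segment.
    commands = [["bin/nutch", "admin", db, "-create"]]
    commands += [["bin/nutch", "updatedb", db, str(seg)] for seg in copied]
    return commands
```

Calling `merge_segments(["crawl_directory1", "crawl_directory2"], "crawl_merged")` would copy the segments and return one create command followed by one update command per segment; print them and run them by hand once they look right for your installation.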
BTW, there was some discussion on the developer mailing list about how to do location-limited crawling.
HTH Stefan
On 03.03.2005 at 06:18, Keith Campbell wrote:
Hi,
I'm a new user of Nutch running a beta Boston-based search engine here: http://www.beantownsearch.com/. I've been evaluating Nutch and testing various "crawl strategies".
Because of current bandwidth/hardware limitations, I've had to devise a methodical and selective "crawl strategy" to index the most relevant (and comprehensive) Boston-specific websites. (So far so good. My results already appear to be on par with Google's results.) In order to continue this strategy (and to see if it is possible to offer a better local search than Google's) I need to learn how to combine multiple crawl directories (each many gigabytes).
crawl_directory1
  - db
  - segments
crawl_directory2
  - db
  - segments
etc ...
The first directory was seeded with a very large number of Boston-specific URLs, which were then used for "intranet crawling" followed by db analysis and numerous rounds of "whole web crawling". I've created additional crawl directories by "intranet crawling" other large collections of distinct Boston URLs. I've now reached the stage where I need to be able to combine these crawl directories into one crawl directory which can then be analyzed and used for continued "whole web crawling". This step is important because it's the only method I could think of to keep my "whole web crawling" on target (i.e. in the Boston area).
So my question is: How do I combine multiple crawl directories into one directory which can be used for additional "whole web crawling"?
Thanks in advance for any help, Keith Campbell
-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
-----------information technology-------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
