I'd like to 1) inject URLs from a database, and 2) add a RegexURLFilter rule for each URL so that only pages under that URL's TLD are indexed.

For the first, looking at the code, I suppose one way is to subclass or customize WebDBInjector, adding a method that reads URLs from the DB and calls addFile() on each one. So that's OK, but is there a better way? I wish WebDBInjector could be refactored into something a little more extensible with respect to data sources, along the lines of DmozURLSource and FileURLSource.

For the second, using RegexURLFilter while indexing a million URLs at once quickly becomes untenable, since all the rules are held in memory and every URL has to be matched against every rule. One idea is to index the URLs one at a time, adding a TLD regex rule for the URL currently being indexed and deleting that rule before the next URL starts; in other words, modifying the rule set while indexing. Any ideas on a smarter way to do this?

Thanks,
k

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
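
P.S. To make the two ideas above a bit more concrete, here is a rough, standalone sketch. It does not use any Nutch classes: UrlSource, ListUrlSource, ruleFor and accept are all hypothetical names invented for illustration, the list-backed source is only a stand-in for a real database query, and I'm reading "TLD" loosely as the seed URL's host. The point is just that, with one seed processed at a time, only a single rule needs to be active.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch only; none of these types exist in Nutch.
final class SeedSketch {

    /** Hypothetical pluggable source of seed URLs (file, DMOZ dump, database, ...). */
    interface UrlSource {
        Iterator<String> urls();
    }

    /** Stand-in for a database-backed source; a real one would run a query instead. */
    static final class ListUrlSource implements UrlSource {
        private final List<String> rows;

        ListUrlSource(List<String> rows) {
            this.rows = rows;
        }

        @Override
        public Iterator<String> urls() {
            return rows.iterator();
        }
    }

    /** Builds the single rule that is active while one seed is being processed. */
    static Pattern ruleFor(String seedUrl) {
        String host = URI.create(seedUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in seed URL: " + seedUrl);
        }
        // Accept http/https URLs on exactly this host: optional port, then a path or nothing.
        return Pattern.compile("^https?://" + Pattern.quote(host) + "(:\\d+)?(/.*)?$");
    }

    /** Returns the candidates accepted by the rule for the given seed. */
    static List<String> accept(String seedUrl, List<String> candidates) {
        Pattern rule = ruleFor(seedUrl);
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (rule.matcher(candidate).matches()) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        UrlSource source = new ListUrlSource(
                Arrays.asList("http://example.com/", "http://example.org/docs/"));
        List<String> outlinks = Arrays.asList(
                "http://example.com/a",
                "http://example.org/docs/b",
                "http://example.com.evil.net/");

        // One seed at a time: the rule is created for the seed and discarded afterwards.
        for (Iterator<String> it = source.urls(); it.hasNext();) {
            String seed = it.next();
            System.out.println(seed + " -> " + accept(seed, outlinks));
        }
    }
}
```

Whether something like this can be wired into WebDBInjector and the existing URL filter chain is exactly the part I'm unsure about.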
