You need to do both: seed the WebDB with the 14k URLs extracted from the DMOZ content file, AND filter newly discovered URLs against the URLs in the MySQL database using the urlfilter-db plugin.
This is significantly faster than adding the 14k URLs to the regex-urlfilter.txt file and checking against that. If you would like to boost performance any further, consider pre-initializing the urlfilter-db cache when the plugin loads, and remove the code that queries the database every time a URL is not found in the cache. In my experience this improves performance even more.

HTH,
Thomas D.

On 3/1/06, Brent Parker <[EMAIL PROTECTED]> wrote:
>
> I have approx. 14k URLs (in a MySQL db) that I extracted from the DMOZ
> content file. I intended to dump them to a txt file in order to seed the
> WebDB using: bin/nutch inject db/ -urlfile urls.txt
>
> I was wondering... if instead I did a whole-web crawl using the full DMOZ
> content file, but filtered it with the urlfilter-db plugin using my 14k
> URLs in MySQL, would I obtain similar results?
>
> I am a bit unsure as to what is going on under the hood, so I am looking
> for the best approach. If they do in fact give similar results, is one
> more efficient, etc.?
>
> Thanks for any/all advice!
> Brent Parker
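To make the cache-preloading idea concrete, here is a minimal Java sketch. The class and constructor are hypothetical, not the actual plugin's API: a real Nutch URL filter implements org.apache.nutch.net.URLFilter, whose filter(String) returns the URL to keep it or null to drop it, and the real plugin would fill the cache from MySQL via JDBC at load time rather than taking a list as an argument.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of pre-initializing the whole URL cache at plugin
// load, so that a cache miss never triggers a per-URL database round trip.
public class PreloadedDbUrlFilter {
    private final Set<String> allowed = new HashSet<>();

    // In the real plugin this would run once when the plugin loads,
    // filling the cache from MySQL (e.g. via a JDBC SELECT over the
    // URL table) instead of from an in-memory list.
    public PreloadedDbUrlFilter(Iterable<String> seedUrls) {
        for (String u : seedUrls) {
            allowed.add(u);
        }
    }

    // Pure in-memory lookup: a miss simply means the URL is rejected,
    // mirroring the URLFilter convention of returning null to drop a URL.
    public String filter(String urlString) {
        return allowed.contains(urlString) ? urlString : null;
    }

    public static void main(String[] args) {
        PreloadedDbUrlFilter f = new PreloadedDbUrlFilter(
                Arrays.asList("http://example.com/", "http://dmoz.org/"));
        System.out.println(f.filter("http://example.com/")); // kept
        System.out.println(f.filter("http://other.org/"));   // null -> dropped
    }
}
```

With 14k URLs the whole set fits comfortably in memory, which is why dropping the fallback database query on a cache miss is safe here.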
