You need to do both: seed the WebDB with the 14k URLs extracted from the
dmoz content file AND filter newly discovered URLs against the URLs in the
MySQL database using the urlfilter-db plugin.

This is significantly faster than adding the 14k URLs to the
regex-urlfilter.txt file and checking against that.
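To see why, compare the two lookup strategies. A rough sketch (the class
and method names below are illustrative, not the actual Nutch filter code):
the regex file is scanned pattern by pattern for every URL, while a
set-backed filter does a single hash lookup.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class FilterCost {

    // regex-urlfilter.txt approach: each URL is tested against every
    // pattern in turn, so cost grows with the number of patterns (14k here).
    static String regexFilter(List<Pattern> patterns, String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return url;
            }
        }
        return null;
    }

    // DB/set approach: one hash lookup per URL, independent of set size.
    static String setFilter(Set<String> allowed, String url) {
        return allowed.contains(url) ? url : null;
    }
}
```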

If you'd like to boost performance any further, consider pre-initializing
the cache in urlfilter-db when the plugin loads, and remove the code that
goes to the database every time a URL is not found in the cache. In my
experience this improves performance even more.
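In outline, that change could look like the sketch below. This is not the
actual urlfilter-db source; the class name, the `urls` table, and the `url`
column are assumptions. The point is that the whole table is read once at
startup, and a cache miss is treated as a rejection rather than triggering
a database round-trip.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

/** Sketch of a DB-backed URL filter whose cache is filled once at load time. */
public class CachedDbUrlFilter {
    private final Set<String> cache;

    public CachedDbUrlFilter(Set<String> preloadedUrls) {
        this.cache = preloadedUrls;
    }

    /** One-time load at plugin init; table/column names are assumptions. */
    public static Set<String> loadFromDb(String jdbcUrl, String user, String pass)
            throws SQLException {
        Set<String> urls = new HashSet<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT url FROM urls")) {
            while (rs.next()) {
                urls.add(rs.getString(1));
            }
        }
        return urls;
    }

    /** No per-miss database query: a miss simply means "filter out". */
    public String filter(String url) {
        return cache.contains(url) ? url : null;
    }
}
```

With 14k URLs the whole set fits comfortably in memory, which is what makes
dropping the fallback query safe here.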

HTH Thomas D.

On 3/1/06, Brent Parker <[EMAIL PROTECTED]> wrote:
>
> I have approx. 14k URLs (in a MySQL db) that I extracted from the dmoz
> content file. I intended to dump them to a txt file in order to seed the
> WebDB using:  bin/nutch inject db/ -urlfile urls.txt
>
> I was wondering... if instead I did a whole-web crawl using the full dmoz
> content file, but filtered it with the urlfilter-db plugin against my 14k
> URLs in MySQL, would I obtain similar results?
>
> I am a bit unsure as to what is going on under the hood, so I am looking
> for the best approach. If they do in fact give similar results, is one
> more efficient, etc.?
>
> Thanks for any/all advice!
> Brent Parker
>
>
