I notice that filtering URLs is done in the output format until
parsing. Wouldn't it be better to filter them until updating crawlDb?
"Until" == "during" ?
Sorry, yes during!
As you observed, doing it at this stage saves space in segment
data, and in consequence saves on processing time (no CPU/IO needed
to process useless data, throw away junk as soon as possible).
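To illustrate the point about throwing junk away early, here is a minimal sketch in plain Java. It does not use the Nutch API: the class EarlyUrlFilter, the method keepAccepted, and the accept rule are all hypothetical names chosen for this example. The idea is only that outlinks rejected by the filter are dropped before anything is written, so later steps never have to read or process them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch: drops rejected URLs before they are stored.
final class EarlyUrlFilter {

    // Returns only the URLs the filter accepts, preserving input order.
    static List<String> keepAccepted(List<String> outlinks, Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (url != null && accept.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/page.html",
            "http://example.com/image.gif",
            "ftp://example.com/file");
        // Assumed rule for illustration: keep only http URLs that are not GIF images.
        Predicate<String> accept = u -> u.startsWith("http://") && !u.endsWith(".gif");
        System.out.println(keepAccepted(outlinks, accept));
    }
}
```

If the same predicate were applied only when updating the db, the rejected entries would already have been written to the segment and read back once, which is the cost described above.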
Makes sense, thanks for the hint. I guess now, with a published db
filter tool for Nutch 0.7 and 0.8, people will be able to clean up
their web and crawl databases.
Stefan
- Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/or... Stefan Groschupf