Enzo Michelangeli wrote:
> Is there a way of parallelizing URLFiltering over multiple threads? 
> After all, the URLFilters themselves must already be thread-safe, or 
> else they would have problems during fetching.
> 
> The reason I'm asking is that I have a custom URLFilter that needs to 
> make calls to the DNS resolver, and multi-threading the URLFiltering 
> would greatly speed up some filtering procedures that, unlike fetching, 
> appear to be single-threaded: "mergedb -filter", inject, generate, 
> "updatedb -filter" etc. (The most important is of course "generate" or, 
> even better, "updatedb -filter", to prevent undesired URLs from reaching 
> the crawldb in the first place.)

Are you running jobs in the "local" mode? In distributed mode filtering 
is naturally parallel, because you have as many concurrent lookups as 
there are map tasks.

In my experience, using multiple threads for DNS lookups doesn't help 
that much. What helps A LOT (like several orders of magnitude) is using 
a local DNS cache, or even a two-level DNS cache (one cache per node, one 
cache per cluster).
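
To illustrate the caching idea inside a filter, here is a minimal sketch of 
an in-process lookup cache in Java. It is not Nutch code: the class name 
CachingResolver, its systemLookup helper and the pluggable lookup function 
are all hypothetical, and it is no substitute for a real caching name 
server on the node or the cluster. It only shows how repeated lookups of 
the same host (including failed ones) can be answered without asking the 
resolver again.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper: memoizes host lookups, including failures. */
class CachingResolver {
    // Optional.empty() records a host that failed to resolve (negative caching).
    private final ConcurrentMap<String, Optional<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;

    /** The lookup function returns an address string, or null on failure. */
    CachingResolver(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Default lookup through the JVM resolver; returns null if unresolved. */
    static String systemLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    /** Returns the cached address, or null if the host did not resolve. */
    String resolve(String host) {
        String key = host.toLowerCase(Locale.ROOT); // host names are case-insensitive
        return cache
                .computeIfAbsent(key, h -> Optional.ofNullable(lookup.apply(h)))
                .orElse(null);
    }

    public static void main(String[] args) {
        CachingResolver resolver = new CachingResolver(CachingResolver::systemLookup);
        System.out.println(resolver.resolve("localhost")); // goes to the resolver
        System.out.println(resolver.resolve("localhost")); // answered from the cache
    }
}
```

A URLFilter could keep one such instance and call resolve() for each host 
it sees. The sketch never expires entries, so it only suits short-lived 
jobs; a long-running process would need a TTL. Note also that the JVM 
already caches InetAddress lookups to some degree (see the 
networkaddress.cache.ttl security property).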


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
