On 7/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I haven't found the answer to my question yet, but I have made some progress:
> Using a profiler, I saw that a lot of time is spent looking for URLs in the 
> text document (using regular expressions). This is something Lucene doesn't do.
> However, after recompiling the text parser without this scanning, it is still 
> just as slow.
> Now it seems that a lot of time is spent in the Hadoop framework (compared to, 
> say, the indexing by Lucene and the loading of documents from the file system).
>
> Would that mean that the overhead of the Hadoop framework is killing the 
> performance on a single box?

It is hard to say anything without knowing exactly what you are comparing.

1) How are you indexing pages with Lucene (which analyzers, etc.)?

2) Is the 4H10M spent only in the indexing job, or is it the total
duration of the entire crawl?

3) How big are your crawldb, linkdb, etc.? The indexer reads a lot of
different structures to combine data, so perhaps I/O takes a lot of
time...
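
For question 3, a rough way to get those numbers is to sum the on-disk
size of each structure. This is only a minimal Python sketch: it assumes
the crawl directory contains crawldb, linkdb and segments
sub-directories, and the default directory name "crawl" and the helper
names dir_size_bytes and report are made up for illustration.

```python
import os
import sys


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    # os.walk yields nothing for a missing directory, so the result is 0.
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def report(crawl_dir, parts=("crawldb", "linkdb", "segments")):
    """Map each sub-directory name (assumed layout) to its size in bytes."""
    return {part: dir_size_bytes(os.path.join(crawl_dir, part)) for part in parts}


if __name__ == "__main__":
    # Hypothetical default: a crawl directory named "crawl" in the cwd.
    crawl = sys.argv[1] if len(sys.argv) > 1 else "crawl"
    for part, size in report(crawl).items():
        print(f"{part}: {size / 1e6:.1f} MB")
```

If these sizes are large compared to the raw document set, I/O during
the indexing job becomes a more plausible suspect.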

>
>
> -------- Original Message --------
> From: Brette, Marc
> Date: Mon. 23/07/2007 12:08
> To: [EMAIL PROTECTED]
> Subject: Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene 
> ?)
>
> Hi all,
> I performed a little test where I indexed the same set of documents with Nutch 
> (0.9) and Lucene.
> This is a set of documents from TREC, 134 000+ short text documents.
>
> With Lucene, it took 1H. With Nutch using the file:/ protocol, it took 4H10.
>
> Could anyone explain why there is such a difference, and whether there is some 
> way to eliminate part of this overhead?
>
> Regards,
> --
> Marc
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general