-----Original message-----
> From:kemical <mickael.lume...@gmail.com>
> Sent: Fri 08-Feb-2013 10:53
> To: user@nutch.apache.org
> Subject: Best Practice to optimize Parse reduce step / ParseoutputFormat
> 
> Hi,
> 
> I've been looking for some time now into why the Parse reduce step takes a lot
> of time. I've found lots of different suggestions but not much feedback
> on which ones actually work.
> 
> 
> First, here is a list of the threads I've found, plus the NUTCH-1314 patch:
> 
> http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
> http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
> http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html
> https://issues.apache.org/jira/browse/NUTCH-1314
> 
> Here are some questions about what I've found in them:
> 
> - It seems that parse reduce time is mainly due to long URLs
> => Can anyone who has excluded long URLs (with the patch, a regex, or
> whatever) confirm that they now get better performance?

Most certainly!

> 
> - The normalizing step occurs before filtering:
> => If so, is there any real benefit to filtering URLs with a regex (like the
> -^.{350,}$ expression)?

The sooner you can reject long URLs, the better.
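
As a rough illustration of what that regex rule matches, here is a small Python sketch. It is not Nutch code: the helper name keep_url and the example.com URLs are made up for the example, and the 350-character threshold is simply the one from the expression quoted above.

```python
import re

# Same pattern as the "-^.{350,}$" rule quoted above, minus the leading "-".
# In regex-urlfilter.txt the "-" prefix marks the rule as an exclusion.
LONG_URL = re.compile(r"^.{350,}$")

def keep_url(url: str) -> bool:
    """Hypothetical helper: False for URLs of 350 or more characters."""
    return LONG_URL.search(url) is None

urls = [
    "http://example.com/page",             # short, kept
    "http://example.com/?q=" + "a" * 400,  # far over the limit, rejected
]
kept = [u for u in urls if keep_url(u)]
print(kept)
```

Only the short URL survives, so a rule like this keeps the very long ones from ever reaching the later, more expensive steps.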

> 
> - Patch 1314 seems to apply when you parse with parse-html
> => I'm using boilerpipe with patch NUTCH-961; should patch 1314 work
> with it? (I guess not.) And what change should I make? (I'm quite afraid to
> write a patch/plugin myself.)

It will help a little, but I don't think you'll gain much compared with
filtering via the regex filter.

> 
> This is not an exhaustive list of questions, so if you have questions and/or
> recommendations, please add them.
> 
> 
> 
> Sorry for starting a new thread, since this could have been added as a reply
> to my last one:
> http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html
> but I think the title of this one could be useful to more people (mine was
> too specific).
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Best-Practice-to-optimize-Parse-reduce-step-ParseoutputFormat-tp4039200.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
