-----Original message-----
> From: kemical <mickael.lume...@gmail.com>
> Sent: Fri 08-Feb-2013 10:53
> To: user@nutch.apache.org
> Subject: Best Practice to optimize Parse reduce step / ParseoutputFormat
>
> Hi,
>
> I've been trying for some time now to find out why the Parse reduce step
> takes so long. I've found lots of different suggestions, but not much
> feedback on which of them actually work.
>
> First, here is a list of the threads I've found, and also patch 1314:
>
> http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
> http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
> http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html
> https://issues.apache.org/jira/browse/NUTCH-1314
>
> Here are some questions about what I've found in them:
>
> - It seems that parse reduce time is mainly due to long URLs
> => Can anyone who has excluded long URLs (with a patch, a regex, or
> whatever) confirm that performance is now better?

Most certainly!

>
> - The normalizing step occurs before filtering:
> => If so, is there any real benefit in filtering URLs with a regex (like
> the -^.{350,}$ expression)?

The sooner you can reject long URLs, the better.

>
> - Patch 1314 seems to apply when you parse with parse-html
> => I'm using boilerpipe with patch NUTCH-961; should patch 1314 work
> with it? (I guess not.) And what changes should I make? (I'm quite wary
> of writing a patch/plugin myself.)

It will help a little, but I don't think you'll gain much compared with filtering by regex filter.

>
> This is not an exhaustive list of questions, so if you have questions and/or
> recommendations, please add them.
>
> Sorry to start a new thread, since this could have been added as an answer
> to my last one:
> http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html
> but I think the title of this one could be useful to more people (mine was
> too specific).
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Best-Practice-to-optimize-Parse-reduce-step-ParseoutputFormat-tp4039200.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
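
To make the length rule concrete, here is a small standalone sketch of what the quoted -^.{350,}$ expression matches. It is plain Python rather than Nutch code: the leading "-" in the rule marks it as a reject rule, so only the regex part is compiled, and the helper name keep_url and the example.com URLs are made up for illustration.

```python
import re

# Regex part of the "-^.{350,}$" rule quoted in the message. The leading "-"
# means "reject" in the filter file, so it is not part of the pattern itself.
LONG_URL = re.compile(r"^.{350,}$")


def keep_url(url: str) -> bool:
    """Hypothetical helper: False when the URL has 350 or more characters."""
    return LONG_URL.search(url) is None


if __name__ == "__main__":
    short_url = "http://example.com/page"
    long_url = "http://example.com/?q=" + "a" * 400
    for url in (short_url, long_url):
        verdict = "keep" if keep_url(url) else "reject"
        print(f"{verdict}: {len(url)} chars")
```

Since the threshold is just the lower bound of the quantifier, changing 350 to another value moves the cut-off; URLs of 349 characters or fewer pass this particular rule untouched.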