RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
Oh, I'd like to add that the biggest problem is memory, plus the possibility that a parser hangs, consumes resources, times out everything else and destroys the segment. -Original message- > From:Weilei Zhang > Sent: Sat 09-Feb-2013 23:40 > To: user@nutch.apache.org > Subject: Re:
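
To illustrate the hanging-parser risk mentioned above, here is a minimal sketch of bounding a parse call with a timeout so that one stuck document cannot stall the rest of the task. This is not Nutch's code: the class name BoundedParse, the method runWithTimeout and the parseTask parameter are hypothetical names chosen for illustration, and only standard java.util.concurrent calls are used.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: runs one parse task with a time limit.
final class BoundedParse {

    private BoundedParse() {}

    /**
     * Runs parseTask on a separate daemon thread and waits at most
     * timeoutMillis for it. Returns an empty Optional if the task times out,
     * fails, or the calling thread is interrupted.
     */
    static <T> Optional<T> runWithTimeout(Callable<T> parseTask, long timeoutMillis) {
        // Daemon thread so a parser that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-parse");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Ask the stuck task to stop; this only helps if it responds to interruption.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The parse itself threw; treat the document as unparsed.
            return Optional.empty();
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap each document's parse in runWithTimeout and skip the document when the result is empty. Note that a timeout only frees the caller: a parser thread that ignores interruption keeps holding memory until it finishes, which is in line with the memory concern raised in the message.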

RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
A parsing fetcher does everything in the mapper. Please check the output() method from around line 1012 onwards: http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup Parsing, signature, outlink processing (using code in ParseOutputFormat) all happen th

Re: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Weilei Zhang
This is indeed helpful. Thanks Lewis. On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney wrote: > I've eventually added this to our FAQ's > > http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F > > This should explain for you. > Lewis > > On Wed, Feb 6, 2013 at 6:31 PM,

Re: How to get page content of crawled pages

2013-02-09 Thread Lewis John Mcgibbney
Hi, Once I get access to my office I am going to build the patches from trunk. Is it trunk that you are using? Thanks Lewis On Fri, Feb 8, 2013 at 9:00 PM, peterbarretto wrote: > Hi Lewis, > > I managed to get the code working by adding the below function to > MongodbWriter.java in the public cla