Great, I now remove crawl_generate and it saves a bit of space. I run Nutch commands with -D mapreduce.map.output.compress=true but don't see any significant space drop. Is this enough to enable compression? Thanks.
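For context: mapreduce.map.output.compress=true only compresses the intermediate map output that is shuffled between map and reduce tasks, so it cuts shuffle I/O but leaves the segment and crawldb files on disk uncompressed. To shrink the files themselves, as Markus describes below, the job-output compression properties have to be set as well. A minimal sketch, assuming the Snappy native libraries are available to Hadoop; the updatedb invocation and paths are placeholders:

    bin/nutch updatedb \
      -D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      crawl/crawldb crawl/segments/<segment>

The same three -D options apply to generate, fetch and invertlinks; setting them once in mapred-site.xml (or nutch-site.xml) avoids repeating them on every command.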
2016-02-24 21:39 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:

> Oh, I forgot the following: enable Hadoop's snappy compression on in- and
> output files. It reduced our storage requirements to 10% of the original
> file size. Apparently Nutch's data structures are easily compressed. It
> also greatly reduces I/O, thus speeding up all load times. CPU usage is
> negligible compared to I/O wait.
>
> Markus
>
> -----Original message-----
> > From:Tomasz <polish.software.develo...@gmail.com>
> > Sent: Wednesday 24th February 2016 15:46
> > To: user@nutch.apache.org
> > Subject: Re: Nutch single instance
> >
> > Markus, thanks for sharing. Changing the topic a bit: a few messages
> > earlier I asked about storing only the links between pages, without
> > content. With your great help I run Nutch with fetcher.store.content =
> > false and fetcher.parse = true and omit the parse step in the
> > generate/fetch/update cycle. What's more, I remove parse_text from the
> > segments directory after each cycle to save space, but the space used
> > by segments is growing rapidly and I wonder if I really need all the
> > data. Let me summarise my case - I crawl only to get connections
> > between pages (inverted links with anchors) and I don't need the
> > content. I run the generate/fetch/update cycle continuously (I've set
> > a time limit for the fetcher of max 90 min). Is there a way I can save
> > more storage space? Thanks.
> >
> > Tomasz
> >
> > 2016-02-24 12:09 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> >
> > > Hi - see inline.
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Tomasz <polish.software.develo...@gmail.com>
> > > > Sent: Wednesday 24th February 2016 11:54
> > > > To: user@nutch.apache.org
> > > > Subject: Nutch single instance
> > > >
> > > > Hello,
> > > >
> > > > After a few days of testing Nutch with Amazon EMR (1 master and 2
> > > > slaves) I had to give up. It was extremely slow (avg. fetching
> > > > speed of 8 urls/sec counting both slaves) and, along with the
> > > > map-reduce overhead, the whole solution didn't satisfy me at all.
> > > > I moved the Nutch crawl databases and segments to a single EC2
> > > > instance and it works pretty fast now, reaching 35 fetched
> > > > pages/sec with an avg. of 25/sec. I know that Nutch is designed to
> > > > work in a Hadoop environment and regret it didn't work out in my
> > > > case.
> > >
> > > Setting up Nutch the correct way is a delicate matter and takes
> > > quite some trial and error. In general, more machines are faster,
> > > but in some cases one fast beast can easily outperform a few less
> > > powerful machines.
> > >
> > > > Anyway, I would like to know if I'm alone with this approach or
> > > > whether everybody sets up Nutch with Hadoop. If not, and some of
> > > > you run Nutch on a single instance, maybe you can share some best
> > > > practices, e.g. do you use the crawl script or run
> > > > generate/fetch/update continuously, perhaps using some cron jobs?
> > >
> > > Well, in both cases you need some script(s) to run the jobs. We
> > > have a lot of complicated scripts that get stuff from everywhere.
> > > We have integrated Nutch in our Sitesearch platform, so it has to
> > > be coupled to a lot of different systems. We still rely on bash
> > > scripts, but Python is probably easier if scripts are complicated.
> > > Ideally, in a distributed environment, you use Apache Oozie to run
> > > the crawls.
> > >
> > > > Btw. I can see retry 0, retry 1, retry 2 and so on in the crawldb
> > > > stats - what exactly does it mean?
> > >
> > > These are transient errors, e.g. connection timeouts and connection
> > > resets, but also 5xx errors, which are usually transient. Those
> > > records are eligible for recrawl 24 hours later. By default, after
> > > retry 3, a record goes from db_unfetched to db_gone.
> > >
> > > > Regards,
> > > > Tomasz
> > > >
> > > > Here are my current crawldb stats:
> > > > TOTAL urls: 16347942
> > > > retry 0:    16012503
> > > > retry 1:    134346
> > > > retry 2:    106037
> > > > retry 3:    95056
> > > > min score:  0.0
> > > > avg score:  0.04090025
> > > > max score:  331.052
> > > > status 1 (db_unfetched):   14045806
> > > > status 2 (db_fetched):     1769382
> > > > status 3 (db_gone):        160768
> > > > status 4 (db_redir_temp):  68104
> > > > status 5 (db_redir_perm):  151944
> > > > status 6 (db_notmodified): 151938
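For a single-instance, links-only setup like the one discussed above, the continuous generate/fetch/update cycle can be driven by a small bash script under cron. A rough sketch, not a tested implementation: the paths, the -topN value and the cleanup list are hypothetical, and it assumes the fetcher.parse=true, fetcher.store.content=false and 90-minute fetcher time limit settings mentioned in the thread.

    #!/bin/bash
    # Hypothetical layout; adjust CRAWLDB/LINKDB/SEGMENTS to your own paths.
    CRAWLDB=crawl/crawldb
    LINKDB=crawl/linkdb
    SEGMENTS=crawl/segments

    # Parse while fetching, skip storing raw content, cap each fetch at 90 min.
    # These can also live in nutch-site.xml instead of being passed per job.
    OPTS="-D fetcher.parse=true -D fetcher.store.content=false -D fetcher.timelimit.mins=90"

    while true; do
      # generate exits non-zero when there is nothing left to fetch
      bin/nutch generate $OPTS "$CRAWLDB" "$SEGMENTS" -topN 50000 || break
      # segments are named by timestamp, so the newest sorts last
      SEGMENT="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -1)"
      bin/nutch fetch $OPTS "$SEGMENT"
      bin/nutch updatedb $OPTS "$CRAWLDB" "$SEGMENT"
      bin/nutch invertlinks $OPTS "$LINKDB" "$SEGMENT"
      # Only the inverted links are needed here, so drop the bulky segment
      # subdirectories once the crawldb and linkdb have been updated.
      rm -rf "$SEGMENT"/parse_text "$SEGMENT"/crawl_generate
    done

Since invertlinks reads anchors from parse_data, that directory has to survive until invertlinks has run for the segment; after that it can presumably be removed as well if nothing else consumes it. The retry threshold Markus mentions (a record moves to db_gone after retry 3) corresponds to the default of the db.fetch.retry.max property.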