RE: Nutch single instance

2016-03-01 Thread Markus Jelsma
Yes, GZip will certainly help a lot until you get compression sorted out. Note
that GZip is not splittable, though, so you have to decompress each segment
before loading it again.

Re: Nutch single instance

2016-03-01 Thread Tomasz
Since I didn't manage to enable the compression, I worked out another solution
to save space, or at least to save some time until I get it working. After each
generate/fetch/update/invertlinks cycle I gzip the most recent segment
directory, since I won't need it for 30 days (the next fetch time). I'm not
giving up on setting up Nutch in pseudo-distributed mode to get its benefits,
especially compression that works.
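
A minimal sketch of that per-segment approach (the crawl/segments layout and
paths are assumptions, not from this thread): compress the newest segment once
a cycle finishes, and unpack it again shortly before its next fetch time:

  SEG=$(ls -d crawl/segments/* | sort | tail -n 1)   # most recent segment
  tar -czf "$SEG.tar.gz" -C crawl/segments "$(basename "$SEG")" && rm -r "$SEG"
  # ~30 days later, before those URLs are due for refetching:
  tar -xzf "$SEG.tar.gz" -C crawl/segments && rm "$SEG.tar.gz"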


RE: Nutch single instance

2016-02-26 Thread Markus Jelsma
I am not sure it will work on a single-node / local instance. But it would be a
good idea to run things on YARN and HDFS anyway, even in local mode. It has
some benefits, and perhaps even compression that works.
Markus
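
As a hedged sketch, moving to pseudo-distributed mode for Nutch 1.x looks
roughly like this (paths and the seed directory are assumptions; it presumes a
configured single-node Hadoop whose scripts are on the PATH):

  start-dfs.sh && start-yarn.sh                 # start HDFS and YARN locally
  hdfs dfs -mkdir -p urls && hdfs dfs -put seeds.txt urls/
  # run from runtime/deploy so bin/nutch submits the .job file via 'hadoop jar'
  # instead of the local job runner; mapreduce.* settings then take effect
  cd runtime/deploy
  bin/nutch inject crawl/crawldb urls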
 

Re: Nutch single instance

2016-02-25 Thread Tomasz
Thanks for the hint, but it still doesn't work. I ran the commands with the
following arguments:

-D mapreduce.map.output.compress=true -D mapreduce.output.fileoutputformat.compress=false
and
-D mapreduce.map.output.compress=true -D mapreduce.output.fileoutputformat.compress=true

Used space didn't change regardless of the true/false value of the second
parameter, and each generate/fetch/update cycle consumes about 1-1.5GB, which
means I will run out of disk space in a few days. I'm not even sure whether
compression is available on the machine, but on the other hand I didn't notice
any errors or warnings. I don't use slaves; it's a single-node instance, and
maybe the mapreduce arguments don't work in such an environment? Markus, what
should I do?

Tomasz
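
One hedged way to check whether compression codecs are actually available on
the machine, assuming Hadoop 2.3 or later where the checknative subcommand
exists:

  hadoop checknative -a
  # prints a true/false line per codec (zlib, snappy, lz4, bzip2, ...)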


RE: Nutch single instance

2016-02-25 Thread Markus Jelsma
Hi - no, not just that. My colleague tells me you also need 
mapreduce.output.fileoutputformat.compress.
Markus 
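
A hedged way to verify whether output compression took effect: Hadoop
SequenceFiles store the codec class name in plain text in their header, so it
shows up in the first bytes of a data file (the part-00000 path below is an
assumption about the local-mode crawldb layout):

  head -c 300 crawl/crawldb/current/part-00000/data | strings | grep compress
  # e.g. org.apache.hadoop.io.compress.DefaultCodec means compression is on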
 

Re: Nutch single instance

2016-02-25 Thread Tomasz
Great, I removed crawl_generate and it helps a bit to save space. I run the
nutch commands with -D mapreduce.map.output.compress=true but don't see any
significant drop in space usage. Is this enough to enable compression? Thanks.


RE: Nutch single instance

2016-02-24 Thread Markus Jelsma
Oh, I forgot the following: enable Hadoop's Snappy compression on input and
output files. It reduced our storage requirements to 10% of the original file
size; apparently Nutch's data structures are easily compressed. It also greatly
reduces I/O, thus speeding up all load times. CPU usage is negligible compared
to I/O wait.

Markus 
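
A hedged example of the full property set for Snappy on both intermediate and
final output (the segment path is hypothetical; SnappyCodec requires the native
Hadoop libraries, which 'hadoop checknative' can confirm):

  bin/nutch updatedb \
    -D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    crawl/crawldb crawl/segments/20160224000000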
 

RE: Nutch single instance

2016-02-24 Thread Markus Jelsma
Hello - it seems you only need some of the segment subdirectories. I am sure
you can remove crawl_generate, but I'm not immediately sure about some of the
others. You would need to check FetchOutputFormat and ParseOutputFormat to see
which directory contains the data structures you need. Or maybe there was a
page on the wiki that explains the precise contents of each dir.

Markus
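
For reference, and as background rather than from this thread: a full Nutch 1.x
segment holds crawl_generate, crawl_fetch, content, parse_text, parse_data, and
crawl_parse. Invertlinks reads parse_data (which carries outlinks and anchors),
while updatedb reads crawl_fetch and crawl_parse. A quick way to see which
subdirectories dominate the space (GNU sort assumed):

  du -sh crawl/segments/*/*/ | sort -h | tail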
 

Re: Nutch single instance

2016-02-24 Thread Tomasz
Markus, thanks for sharing. Changing the topic a bit: a few messages earlier I
asked about storing only the links between pages, without the content. With
your great help I now run Nutch with fetcher.store.content = false and
fetcher.parse = true, and omit the parse step in the generate/fetch/update
cycle. What's more, I remove parse_text from the segments directory after each
cycle to save space, but the space used by segments is still growing rapidly
and I wonder if I really need all the data. Let me summarise my case: I crawl
only to get the connections between pages (inverted links with anchors) and I
don't need the content. I run the generate/fetch/update cycle continuously
(I've set a time limit for the fetcher to run at most 90 min). Is there a way I
can save more storage space? Thanks.

Tomasz
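
A minimal sketch of that continuous cycle (directory layout, topN, and thread
count are assumptions; fetcher.store.content=false and fetcher.parse=true are
set in nutch-site.xml, so the standalone parse step is skipped):

  while true; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    SEG=$(ls -d crawl/segments/* | sort | tail -n 1)  # newest segment
    bin/nutch fetch "$SEG" -threads 50    # fetcher.timelimit.mins caps runtime
    bin/nutch updatedb crawl/crawldb "$SEG"
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    rm -rf "$SEG"/parse_text              # not needed for a links-only crawl
  done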


RE: Nutch single instance

2016-02-24 Thread Markus Jelsma
Hi - see inline.
Markus
 
-Original message-
> From:Tomasz 
> Sent: Wednesday 24th February 2016 11:54
> To: user@nutch.apache.org
> Subject: Nutch single instance
> 
> Hello,
> 
> After a few days of testing Nutch with Amazon EMR (1 master and 2 slaves) I
> had to give up. It was extremely slow (avg. fetching speed of 8 urls/sec
> across those 2 slaves), and along with the map-reduce overhead the whole
> solution didn't satisfy me at all. I moved the Nutch crawl databases and
> segments to a single EC2 instance, and it works pretty fast now, reaching 35
> fetched pages/sec with an avg. of 25/sec. I know that Nutch is designed to
> work in a Hadoop environment and regret it didn't work in my case.

Setting up Nutch the correct way is a delicate matter and takes quite some
trial and error. In general, more machines are faster, but in some cases one
fast beast can easily outperform a few less powerful machines.

> 
> Anyway, I would like to know whether I'm alone with this approach or whether
> everybody sets up Nutch with Hadoop. If some of you run Nutch on a single
> instance, maybe you can share some best practices, e.g. do you use the crawl
> script or run generate/fetch/update continuously, perhaps via cron jobs?

Well, in both cases you need some script(s) to run the jobs. We have a lot of
complicated scripts that get stuff from everywhere. We have integrated Nutch
into our Sitesearch platform, so it has to be coupled to a lot of different
systems. We still rely on bash scripts, but Python is probably easier if the
scripts get complicated. Ideally, in a distributed environment, you use Apache
Oozie to run the crawls.
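
As a hedged illustration of the cron option (script path and schedule are
hypothetical; flock -n skips a run if the previous cycle is still going):

  0 */2 * * *  flock -n /tmp/nutch-cycle.lock /opt/nutch/crawl-cycle.sh >> /var/log/nutch-cycle.log 2>&1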

> 
> Btw. I can see retry 0, retry 1, retry 2 and so on in the crawldb stats -
> what exactly do they mean?

These are transient errors, e.g. connection timeouts and connection resets, but
also 5xx errors, which are usually transient too. They are eligible for recrawl
24 hours later. By default, after retry 3 a record goes from db_unfetched to
db_gone.

> 
> Regards,
> Tomasz
> 
> Here are my current crawldb stats:
> TOTAL urls:                16347942
> retry 0:                   16012503
> retry 1:                   134346
> retry 2:                   106037
> retry 3:                   95056
> min score:                 0.0
> avg score:                 0.04090025
> max score:                 331.052
> status 1 (db_unfetched):   14045806
> status 2 (db_fetched):     1769382
> status 3 (db_gone):        160768
> status 4 (db_redir_temp):  68104
> status 5 (db_redir_perm):  151944
> status 6 (db_notmodified): 151938
>