Re: very slow generator step

2012-11-12 Thread Tejas Patil
If you were observing low performance with urlfilter-regex, switching directly
to urlfilter-automaton may or may not help. As Julien pointed out, the bad
performance might be caused by a few nasty URLs which consume a lot of time. To
check this, you can run the urlfilter-regex plugin standalone (
http://wiki.apache.org/nutch/bin/nutch%20plugin) and pass all the URLs to it.
With a minor tweak you can dump the time taken for each URL. If you are sure
that the low performance is not due to nasty URLs, switching to
urlfilter-automaton is the best thing to do. You must design the rules in
automaton-urlfilter.txt carefully, as it has limited capabilities.
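The standalone per-URL timing idea can be sketched outside Nutch with plain
java.util.regex; the rule and URLs below are placeholders, not Nutch's actual
default filters:

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterTimer {
    // Placeholder rule; substitute the patterns from your regex-urlfilter.txt.
    static final Pattern RULE = Pattern.compile("^https?://[^/]+/.*\\.html$");

    public static void main(String[] args) {
        // Example URLs; in practice, pipe the crawldb dump through this.
        List<String> urls = List.of(
                "http://example.com/a/b/c.html",
                "http://example.com/" + "x/".repeat(500) + "page.html");
        for (String url : urls) {
            long start = System.nanoTime();
            boolean accepted = RULE.matcher(url).find();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(micros + " us  accepted=" + accepted + "  " + url);
        }
    }
}
```

Any URL whose timing stands out by orders of magnitude is a candidate "nasty" URL.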

Crawl-space expansion could also be a reason, i.e. Nutch suddenly discovered a
huge number of URLs. This happened to me when Nutch crawled sitemap pages with
enormous numbers of outlinks. You can check this by comparing the fetched
counts of the earlier rounds with the most recent round.
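To compare rounds, dumping crawldb statistics before and after the suspect
crawl shows whether the crawl space exploded (the crawldb path below is
illustrative):

```shell
# Dump crawldb statistics; compare db_unfetched / db_fetched across rounds.
# A sudden jump in TOTAL urls or db_unfetched points at crawl-space expansion.
bin/nutch readdb ./crawldb -stats
```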

On top of everything, I agree with what Markus suggested, i.e. using the
-noFilter option for generate. It gives good performance.
The update phase already prevents unwanted URLs from being added, so there is
no need to filter again in generate (unless you want to do a custom crawl of
some specific hosts or URLs and get their data quickly).
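A generate invocation with -noFilter might look like this (the paths and the
-topN value are example values, not prescriptions):

```shell
# Skip URL filtering during generation; the URLs were already filtered
# at parse/update time. Paths and -topN are example values.
bin/nutch generate ./crawldb ./segments -topN 3000 -noFilter
```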

thanks,
Tejas


RE: very slow generator step

2012-11-12 Thread Markus Jelsma
You may need to change your expressions but it is performant. Not all features 
of traditional regex are supported.
http://wiki.apache.org/nutch/RegexURLFiltersBenchs

 
 

Re: very slow generator step

2012-11-12 Thread Mohammad wrk


That's good thinking. I have never used the urlfilter-automaton plugin. Where
can I find more info?

Thanks,
Mohammad



Re: very slow generator step

2012-11-12 Thread Julien Nioche
Could be that a particularly long and tricky URL got into your crawldb and
put the regex into a spin. I'd use the url-filter automaton instead as it
is much faster. Would be interesting to know what caused the regex to take
so much time, in case you fancy a bit of debugging ;-)

Julien
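For the curious, the "regex into a spin" failure mode Julien describes is
catastrophic backtracking. A minimal, self-contained illustration (the pattern
is a textbook pathological case, not one of Nutch's actual rules):

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers plus a near-miss input force java.util.regex
        // to try exponentially many ways to split the 'a's before failing.
        Pattern evil = Pattern.compile("(a+)+b");
        String nearMiss = "a".repeat(24); // no trailing 'b' => worst case
        long start = System.nanoTime();
        boolean matched = evil.matcher(nearMiss).find();
        long ms = (System.nanoTime() - start) / 1_000_000;
        // Each extra 'a' roughly doubles the time; try 30+ to watch it hang.
        System.out.println("matched=" + matched + " in " + ms + " ms");
    }
}
```

One sufficiently long, tricky URL hitting a rule like this explains how a
single crawl round can jump from minutes to hours with the CPU pinned in
Matcher.find().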




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: very slow generator step

2012-11-12 Thread Mohammad wrk
Thanks for the tip. It went down to 2 minutes :-)

What I don't understand is how everything was working fine with the default
configuration for about four days and then, all of a sudden, one crawl caused a
jump of 100 minutes.

Cheers,
Mohammad




RE: very slow generator step

2012-11-12 Thread Markus Jelsma
Hi - Please use the -noFilter option. It is usually useless to filter in the
generator because the URLs have already been filtered in the parse and/or
update steps.

-Original message-
> From:Mohammad wrk 
> Sent: Mon 12-Nov-2012 18:43
> To: user@nutch.apache.org
> Subject: very slow generator step
> 
> Hi,
> 
> The generator time has gone from 8 minutes to 106 minutes a few days ago and 
> has stayed there since. AFAIK, I haven't made any configuration changes 
> recently (attached you can find some of the configurations that I thought 
> might be related). 
> 
> A quick CPU sampling shows that most of the time is spent on 
> java.util.regex.Matcher.find(). Since I'm using default regex configurations 
> and my crawldb has only 3,052,412 urls, I was wondering if this is a known 
> issue with nutch-1.5.1 ?
> 
> Here is some more information that might help:
> 
> = Generator logs
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 03:14:50
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: 
> segments/20121109032340
> 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 03:23:47, elapsed: 00:08:56
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 05:35:14
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: 
> segments/20121109072143
> 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 07:21:51, elapsed: 01:46:36
> 
> = CrawlDb statistics
> CrawlDb statistics start: ./crawldb
> Statistics for CrawlDb: ./crawldb
> TOTAL urls:3052412
> retry 0:3047404
> retry 1:338
> retry 2:1192
> retry 3:822
> retry 4:336
> retry 5:2320
> min score:0.0
> avg score:0.015368268
> max score:48.608
> status 1 (db_unfetched):2813249
> status 2 (db_fetched):196717
> status 3 (db_gone):14204
> status 4 (db_redir_temp):10679
> status 5 (db_redir_perm):17563
> CrawlDb statistics: done
> 
> = System info
> Memory: 4 GB
> CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 
> Available diskspace: 171.7 GB
> OS: Release 12.10 (quantal) 64-bit
> 
> 
> Thanks,
> Mohammad
>