Re: Generator taking time

Markus Jelsma Thu, 22 Mar 2012 04:02:44 -0700

If the URLs in your CrawlDB are already normalized, do not use a
normalizer in the generator unless you really have to. The same is true
for filtering in this step.
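
As a rough sketch of what that could look like on the command line: this
assumes a Nutch 1.x install whose generate usage lists the -noNorm and
-noFilter switches (run bin/nutch generate without arguments to check
your version). The crawldb/ path is taken from your stats output; the
segments/ directory is a placeholder.

```shell
# Assumption: Nutch 1.x, run from the Nutch home directory.
# crawldb/  : CrawlDb path as shown in the readdb stats output
# segments/ : placeholder for your segments directory
# -noNorm   : skip URL normalization during generation
# -noFilter : skip URL filtering during generation
bin/nutch generate crawldb/ segments/ -noNorm -noFilter
```

This is only safe if the URLs were already normalized and filtered when
they were injected or updated into the CrawlDB.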


On Thursday 22 March 2012 11:48:40 James Ford wrote:
> Hello,
> 
> I am having problems with the Generator step of my crawls. It takes a lot
> of time compared to indexing and fetching. Right now the generator step is
> taking about 50 min, compared to fetching, parsing and indexing, which only
> take about 5-10 min. It seems like the "RegexURLNormalizer" is taking up
> the time:
> 
> 2012-03-22 11:13:28,277 INFO  regex.RegexURLNormalizer - can't find rules
> for scope 'partition', using default
> 2012-03-22 11:16:00,734 INFO  crawl.FetchScheduleFactory - Using
> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule -
> defaultInterval=2592000
> 2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule -
> maxInterval=7776000
> 
> Crawldb dump:
> 
> 2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - Statistics for
> CrawlDb: crawldb/
> 2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - TOTAL urls: 7819485
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 0:    7811052
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 1:    2994
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 2:    1214
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 3:    1125
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 4:    1124
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 5:    1303
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 6:    673
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - min score:  0.0
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - avg score:
> 0.0015287232
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - max score:  2.0
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 1
> (db_unfetched):    6946135
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 2
> (db_fetched):      795070
> 2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 3 (db_gone):
> 34358
> 2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 4
> (db_redir_temp):   21861
> 2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 5
> (db_redir_perm):   22044
> 2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 6
> (db_notmodified):  17
> 2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - CrawlDb statistics:
> done
> 
> Does anyone have a clue how to fix this?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

-- 
Markus Jelsma - CTO - Openindex
