If the URLs in your CrawlDB are already normalized, do not run a normalizer unless you really have to. The same goes for filtering in this step.
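As a rough sketch of what that looks like on the command line, assuming a Nutch 1.x install whose generator accepts the -noNorm and -noFilter options (check the usage output of "bin/nutch generate" for your version). The NUTCH_HOME default and the segments directory name are placeholders, not taken from your setup:

```shell
#!/bin/sh
# Sketch only: skip URL normalization and filtering in the generate step.
# Assumption: Nutch 1.x, where the generator takes -noNorm and -noFilter.
NUTCH_HOME="${NUTCH_HOME:-.}"   # hypothetical install location
CRAWLDB="crawldb"               # CrawlDb path as it appears in the stats dump
SEGMENTS="segments"             # hypothetical segments directory

GENERATE_CMD="$NUTCH_HOME/bin/nutch generate $CRAWLDB $SEGMENTS -noNorm -noFilter"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  # Paths contain no spaces, so plain word splitting is fine here.
  $GENERATE_CMD
else
  # No Nutch binary found: just show the command that would be run.
  echo "dry run: $GENERATE_CMD"
fi
```

This only makes sense if the URLs were already normalized and filtered when they were injected or updated into the CrawlDB; otherwise unwanted URLs can end up in the fetch list.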

On Thursday 22 March 2012 11:48:40 James Ford wrote:
> Hello,
>
> I am having problems with the Generator step of my crawls. It takes a lot
> of time compared to indexing and fetching? Right now the generator step is
> taking about 50min compared to fetching, parsing and indexing that only
> takes about 5-10mins. It seems like the "RegexUrlNormalizer" is taking up
> the time:
>
> 2012-03-22 11:13:28,277 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> 2012-03-22 11:16:00,734 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>
> Crawldb dump:
>
> 2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
> 2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - TOTAL urls: 7819485
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 0: 7811052
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 1: 2994
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 2: 1214
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 3: 1125
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 4: 1124
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 5: 1303
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 6: 673
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - min score: 0.0
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - avg score: 0.0015287232
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - max score: 2.0
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 6946135
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 2 (db_fetched): 795070
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 3 (db_gone): 34358
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 21861
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 22044
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 17
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - CrawlDb statistics: done
>
> Does anyone have a clue how to fix this?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex