Hello,

I am having problems with the Generator step of my crawls: it takes far longer than fetching and indexing. Right now the generator step takes about 50 minutes, while fetching, parsing and indexing together take only about 5-10 minutes. It seems like the RegexURLNormalizer is taking up the time:

2012-03-22 11:13:28,277 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2012-03-22 11:16:00,734 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule - maxInterval=7776000

Crawldb dump:

2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - TOTAL urls: 7819485
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 0: 7811052
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 1: 2994
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 2: 1214
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 3: 1125
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 4: 1124
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 5: 1303
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 6: 673
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - min score: 0.0
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - avg score: 0.0015287232
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - max score: 2.0
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 6946135
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 2 (db_fetched): 795070
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 3 (db_gone): 34358
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 21861
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 22044
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 17
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - CrawlDb statistics: done

Does anyone have a clue how to fix this?