Hello,

I am having problems with the Generator step of my crawls. It takes far
longer than fetching and indexing: right now the generator step takes about
50 minutes, whereas fetching, parsing and indexing together take only about
5-10 minutes. It seems like the RegexURLNormalizer is taking up the time:

2012-03-22 11:13:28,277 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2012-03-22 11:16:00,734 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
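
To show where the time goes, the gaps between consecutive log timestamps can
be measured. Below is a minimal Python sketch, not part of Nutch; the helper
names parse_ts and gaps are made up for illustration, and it assumes the
timestamp format seen in the excerpt above. On that excerpt it reports
roughly 152 seconds between the RegexURLNormalizer line and the next line.

```python
from datetime import datetime

# Assumption: every log entry starts with a 23-character timestamp
# such as "2012-03-22 11:13:28,277" (as in the excerpt above).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:23], LOG_TS_FORMAT)
    except ValueError:
        # Wrapped continuation lines and blank lines carry no timestamp.
        return None


def gaps(lines):
    """Yield (seconds since the previous timestamped line, line)."""
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None:
            yield (ts - prev).total_seconds(), line
        prev = ts


sample = [
    "2012-03-22 11:13:28,277 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default",
    "2012-03-22 11:16:00,734 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule",
    "2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000",
]

result = list(gaps(sample))
for seconds, line in result:
    # Print the delay that preceded each line, then the start of the line.
    print(f"{seconds:10.3f}s  {line[:70]}")
```

Run against the full log instead of the sample (for example by passing an
open file object to gaps), the largest values point at the slow phases.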

CrawlDb statistics:

2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - TOTAL urls: 7819485
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 0:    7811052
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 1:    2994
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 2:    1214
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 3:    1125
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 4:    1124
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 5:    1303
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 6:    673
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - min score:  0.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - avg score:  0.0015287232
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - max score:  2.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    6946135
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 2 (db_fetched):      795070
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 3 (db_gone):         34358
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 4 (db_redir_temp):   21861
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm):   22044
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 6 (db_notmodified):  17
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - CrawlDb statistics: done

Does anyone have a clue how to fix this?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
Sent from the Nutch - User mailing list archive at Nabble.com.
