Hi - curTime has not yet reached fetchTime, so the record is not due for fetch.
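
You can verify what is stored for that URL with the crawldb reader, e.g.
'bin/nutch readdb <crawldb> -url <url>', which should print the CrawlDatum
including its fetch time.

For illustration, here is a minimal Java sketch of the eligibility test behind
the "shouldFetch rejected" debug line (simplified, not the actual Nutch
source), using the two timestamps from your log:

  // Simplified due-for-fetch check; timestamps copied from the log below.
  public class ShouldFetchSketch {
    public static void main(String[] args) {
      long fetchTime = 1359626286623L; // next scheduled fetch (epoch millis)
      long curTime   = 1355738313780L; // time the Generator ran (epoch millis)

      boolean due = fetchTime <= curTime;                       // false -> rejected
      long daysUntilDue = (fetchTime - curTime) / 86_400_000L;  // roughly 45 days here

      System.out.println("due for fetch: " + due);
      System.out.println("days until next fetch: " + daysUntilDue);
    }
  }

In your log the URL was already fetched in the first round ("fetching
http://www.lequipe.fr/Football/"), and the updatedb step then scheduled its
next fetch weeks ahead, so the Generator skipping it in round 2 is expected
and harmless. If you want URLs to become due again sooner, lower
db.fetch.interval.default in nutch-site.xml (assuming the default fetch
schedule), or simply wipe crawldb/segments/linkdb before each run, as
Sebastian suggests below.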
 
 
-----Original message-----
> From: Jan Philippe Wimmer <i...@jepse.net>
> Sent: Mon 17-Dec-2012 13:31
> To: user@nutch.apache.org
> Subject: Re: shouldFetch rejected
> 
> Hi again.
> 
> I still have that issue. I start with a completely new crawl directory 
> structure and get the following error:
> 
> -shouldFetch rejected 'http://www.lequipe.fr/Football/', 
> fetchTime=1359626286623, curTime=1355738313780
> 
> Full-Log:
> crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300
> rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300
> threads = 20
> depth = 3
> solrUrl=http://192.168.1.144:8983/solr/
> topN = 400
> Injector: starting at 2012-12-17 10:57:36
> Injector: crawlDb: 
> /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
> Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14
> Generator: starting at 2012-12-17 10:57:51
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 400
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: 
> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in 
> 'http.robots.agents' property.
> Fetcher: starting at 2012-12-17 10:58:06
> Fetcher: segment: 
> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> Using queue mode : byHost
> Fetcher: threads: 20
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> fetching http://www.lequipe.fr/Football/
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold: -1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2012-12-17 10:58:13, elapsed: 00:00:07
> ParseSegment: starting at 2012-12-17 10:58:13
> ParseSegment: segment: 
> /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07
> CrawlDb update: starting at 2012-12-17 10:58:20
> CrawlDb update: db: 
> /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
> CrawlDb update: segments: 
> [/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13
> Generator: starting at 2012-12-17 10:58:33
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 400
> Generator: jobtracker is 'local', generating exactly one partition.
> -shouldFetch rejected 'http://www.lequipe.fr/Football/', 
> fetchTime=1359626286623, curTime=1355738313780
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting at 2012-12-17 10:58:40
> LinkDb: linkdb: /opt/project/current/crawl_project/nutch/crawl/1300/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment: 
> file:/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
> LinkDb: finished at 2012-12-17 10:58:47, elapsed: 00:00:07
> SolrIndexer: starting at 2012-12-17 10:58:47
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> SolrIndexer: finished at 2012-12-17 10:59:09, elapsed: 00:00:22
> SolrDeleteDuplicates: starting at 2012-12-17 10:59:09
> SolrDeleteDuplicates: Solr url: http://192.168.1.144:8983/solr/
> SolrDeleteDuplicates: finished at 2012-12-17 10:59:47, elapsed: 00:00:37
> 
> On 25.11.2012 21:02, Sebastian Nagel wrote:
> >> But I create a completely new crawl dir for every crawl.
> > Then all should work as expected.
> >
> >> why the crawler marks a "page to fetch" as rejected, because obviously
> >> the crawler has never seen this page before (since I deleted all the old
> >> crawl dirs).
> >> In the crawl log I see many pages to fetch, but in the end all of them are
> >> rejected
> > Are you sure they aren't fetched at all? This debug log output in the
> > Generator mapper is also shown for URLs fetched in previous cycles. You
> > should check the complete log for the "rejected" URLs.
> >
> >
> > On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
> >> Hey Sebastian! Thanks for your answer.
> >>
> >> But I create a completely new crawl dir for every crawl. In other words,
> >> I only have the crawl data of the current, running crawl process. When I
> >> recrawl a URL set, I delete the old crawl dir and create a new one. At the
> >> end of each crawl I index it into Solr, so I keep all crawled content in
> >> the index. I don't need any Nutch crawl dirs, because I want to crawl all
> >> relevant pages in every crawl process, again and again.
> >>
> >> I totally don't understand why the crawler marks a "page to fetch" as
> >> rejected, because obviously the crawler has never seen this page before
> >> (since I deleted all the old crawl dirs). In the crawl log I see many
> >> pages to fetch, but in the end all of them are rejected. Any ideas?
> >>
> >> On 24.11.2012 16:36, Sebastian Nagel wrote:
> >>>> I want my crawler to crawl the complete page without setting up
> >>>> schedulers at all. Every crawl process should crawl every page again
> >>>> without having to set up wait intervals.
> >>> That's quite easy: remove all data and launch the crawl again.
> >>> - Nutch 1.x : remove crawldb, segments, and linkdb
> >>> - 2.x : drop 'webpage' (or similar, depends on the chosen data store)
> >>>
> >>> On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
> >>>> Hi there,
> >>>>
> >>>> how can I avoid the following error:
> >>>> -shouldFetch rejected 'http://www.page.com/shop', 
> >>>> fetchTime=1356347311285, curTime=1353755337755
> >>>>
> >>>> I want my crawler to crawl the complete page without setting up
> >>>> schedulers at all. Every crawl process should crawl every page again
> >>>> without having to set up wait intervals.
> >>>>
> >>>> Any solutions?
> 
> 
