The CrawlDB contains information on all URLs and their status: for example, the HTTP code they returned, the fetch interval, some metadata, and their fetch time. Use the readdb command with its -url option to inspect a specific URL.
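
To make the eligibility rule from the quoted message below concrete, here is a rough Python sketch. It is not Nutch code: the Record class, the eligible_for_fetch function, and the fixed 24-hour retry delay are all assumptions made for illustration, and the real generator and fetch schedule are more involved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    """Hypothetical stand-in for one CrawlDB entry (not the Nutch class)."""
    url: str
    fetched_ok: bool        # False models a transient failure
    fetch_time: datetime    # time of the previous fetch attempt
    interval: timedelta     # fetch interval stored with the record


def eligible_for_fetch(rec: Record, now: datetime,
                       retry_delay: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the record would be selected for fetching at `now`.

    Assumption: a failed fetch becomes eligible again once `retry_delay`
    has passed; a successful fetch only once fetch_time + interval is exceeded.
    """
    if not rec.fetched_ok:
        return now >= rec.fetch_time + retry_delay
    return now > rec.fetch_time + rec.interval


# A page fetched successfully yesterday with a 30-day interval is skipped.
now = datetime(2013, 3, 21)
page = Record("http://wikipedia.org/", True, datetime(2013, 3, 20), timedelta(days=30))
print(eligible_for_fetch(page, now))  # False
```

Under these assumptions, rerunning the crawl right away would skip the page, and it would be picked up again only after the interval has passed.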

-----Original message-----
> From:kamaci <furkankam...@gmail.com>
> Sent: Wed 20-Mar-2013 23:52
> To: user@nutch.apache.org
> Subject: Re: Does Nutch Checks Whether A Page crawled before or not
>
> Where does Nutch stores that information?
>
> 2013/3/21 Markus Jelsma-2 [via Lucene] <
> ml-node+s472066n4049568...@n3.nabble.com>
>
> > Nutch selects records that are eligible for fetch. It's either due to a
> > transient failure or if the fetch interval has been expired. This means
> > that failed fetches due to network issues are refetched within 24 hours.
> > Successfully fetched pages are only refetched if the current time exceeds
> > the previously fetchTime + interval.
> >
> > -----Original message-----
> > > From:kamaci <[hidden email]>
> > > Sent: Wed 20-Mar-2013 23:46
> > > To: [hidden email]
> > > Subject: Does Nutch Checks Whether A Page crawled before or not
> > >
> > > Lets assume that I am crawling wikipedia.org with depth 1 and topN 1.
> > > After it finishes crawling if I rerun that command and after finishes
> > > again and again. What happens? Does Nutch skips previous fetched pages
> > > or try to crawl same pages again?
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.