I can answer this one myself: it would. Thanks.
On Sat, Nov 16, 2013 at 4:52 PM, Amit Sela <[email protected]> wrote:

> Would _pst_ exist in metadata even if I'm crawling with:
> db.update.additions.allowed=false
>
> (I have a use case where I don't really crawl, but actually just fetch,
> and sometimes the list is too long for one execution so I have to
> re-execute on the same crawlDB but I don't want to crawl outside the seed
> list).
>
> Thanks.
>
>
> On Fri, Nov 15, 2013 at 10:05 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Amit,
>>
>> here is the answer for Nutch 1.7
>> (or are you using 2.x?):
>>
>> Every URL is stored in CrawlDb even with
>> http.redirect.max = 10
>>
>> For redirects, the target URL is stored in CrawlDatum's
>> metadata under key _pst_ (protocol status):
>>
>> http://issues.apache.org/jira/browse/NUTCH Version: 7
>> Status: 4 (db_redir_temp)
>> Fetch time: Sun Dec 15 20:38:53 CET 2013
>> Modified time: Fri Nov 15 20:38:53 CET 2013
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 0.00941915
>> Signature: null
>> Metadata:
>> Content-Type=text/html
>> _maxdepth_=1000
>> _pst_=temp_moved(13), lastModified=0:
>> https://issues.apache.org/jira/browse/NUTCH
>> _depth_=2
>>
>> Sebastian
>>
>> On 11/14/2013 12:56 PM, Amit Sela wrote:
>> > Hi all,
>> >
>> > I'm reading the crawldb as CrawledPage and I see the fetched URL, content
>> > etc.
>> > In case of a redirection (I allow 10 redirections in nutch-site.xml) the
>> > fetched URL is not the original URL the Fetcher turned to, and I would
>> > like to get that as well.
>> >
>> > Does Nutch store it somewhere? I'm basically looking for a mapping between
>> > URLs attempted to fetch and actually fetched.
>> >
>> > Thanks,
>> >
>> > Amit.
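
For anyone finding this thread later: below is a minimal sketch of how the "attempted URL -> actually fetched URL" mapping could be pulled out of a plain-text CrawlDb dump shaped like the record Sebastian quoted. It is not part of Nutch. The names `iter_records` and `redirect_targets` are hypothetical, and the regular expressions assume the `_pst_` value looks exactly like the sample above (`status(code), lastModified=N:` followed by the target URL, possibly wrapped onto the next line). Other Nutch versions may format this differently.

```python
import re

# Assumption: each record starts with "<url> Version: <n>", as in the sample.
RECORD_START = re.compile(r"^(\S+)\s+Version:\s*\d+$")
# Assumption: the redirect target follows "_pst_=<name>(<code>), lastModified=<n>:".
PST_TARGET = re.compile(r"_pst_=\w+\(\d+\),\s*lastModified=\d+:\s*(\S+)")


def iter_records(dump_text):
    """Yield (url, body) pairs, with the body lines joined by single spaces."""
    url, body = None, []
    for line in dump_text.splitlines():
        line = line.strip()
        match = RECORD_START.match(line)
        if match:
            if url is not None:
                yield url, " ".join(body)
            url, body = match.group(1), []
        elif url is not None:
            body.append(line)
    if url is not None:
        yield url, " ".join(body)


def redirect_targets(dump_text):
    """Map each redirected source URL to the target recorded under _pst_."""
    mapping = {}
    for url, body in iter_records(dump_text):
        match = PST_TARGET.search(body)
        if match:
            mapping[url] = match.group(1)
    return mapping


# The record from this thread, abbreviated, used as sample input.
SAMPLE = """\
http://issues.apache.org/jira/browse/NUTCH Version: 7
Status: 4 (db_redir_temp)
Metadata:
 Content-Type=text/html
 _pst_=temp_moved(13), lastModified=0:
 https://issues.apache.org/jira/browse/NUTCH
 _depth_=2
"""

if __name__ == "__main__":
    for source, target in redirect_targets(SAMPLE).items():
        print(source, "->", target)
```

Joining each record's lines before searching is what lets the sketch cope with the target URL being wrapped onto its own line, as it is in the quoted output. Records whose `_pst_` value carries no target after the colon are simply skipped.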

