Re: Get original URL from crawldb in case of redirect

Amit Sela Sun, 17 Nov 2013 01:37:40 -0800

I can answer this one myself: it would.
Thanks.
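
For anyone who wants to build the attempted-to-fetched mapping from this, here is a minimal sketch in Python. It assumes a CrawlDb record dumped as text in the layout Sebastian shows below, and it is based on that single sample only. The names redirect_mapping, PST_RE and SAMPLE are made up for illustration and are not part of Nutch. It copes with the target URL being wrapped onto its own line, as it is in the quoted mail.

```python
import re

# Layout of the _pst_ entry, taken from the one sample record in this thread:
#   _pst_=temp_moved(13), lastModified=0: <target URL>
# Assumption: other protocol statuses follow the same "name(code)" layout.
PST_RE = re.compile(
    r"_pst_=(?P<status>\w+)\((?P<code>\d+)\),\s*lastModified=\d+"
    r"(?::\s*(?P<target>\S+))?"
)


def redirect_mapping(record_text):
    """Return (original_url, status_name, target_url) for one dumped record.

    Returns None if the record is empty or carries no _pst_ entry.
    target_url is None when the _pst_ entry has no URL after it.
    """
    tokens = record_text.split()
    if not tokens:
        return None
    # The record starts with the URL that was attempted.
    original_url = tokens[0]
    # Collapse all whitespace so a URL wrapped onto the next line is still found.
    flat = " ".join(tokens)
    match = PST_RE.search(flat)
    if match is None:
        return None
    return original_url, match.group("status"), match.group("target")


# Shortened copy of the record quoted in this thread.
SAMPLE = """\
http://issues.apache.org/jira/browse/NUTCH      Version: 7
Status: 4 (db_redir_temp)
Metadata:
        Content-Type=text/html
        _pst_=temp_moved(13), lastModified=0:
https://issues.apache.org/jira/browse/NUTCH
        _depth_=2
"""

if __name__ == "__main__":
    print(redirect_mapping(SAMPLE))
```

Feeding it one record at a time gives the attempted URL, the protocol status name and the redirect target, which is the mapping I was after.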


On Sat, Nov 16, 2013 at 4:52 PM, Amit Sela <[email protected]> wrote:

> Would _pst_ exist in the metadata even if I'm crawling with
> db.update.additions.allowed=false?
>
> (I have a use case where I don't really crawl, but actually just fetch,
> and sometimes the list is too long for one execution so I have to
> re-execute on the same crawlDB but I don't want to crawl outside the seed
> list).
>
> Thanks.
>
>
> On Fri, Nov 15, 2013 at 10:05 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Amit,
>>
>> here is the answer for Nutch 1.7
>> (or are you using 2.x?):
>>
>> Every URL is stored in CrawlDb even with
>>   http.redirect.max = 10
>>
>> For redirects, the target URL is stored in CrawlDatum's
>> metadata under key _pst_ (protocol status):
>>
>> http://issues.apache.org/jira/browse/NUTCH      Version: 7
>> Status: 4 (db_redir_temp)
>> Fetch time: Sun Dec 15 20:38:53 CET 2013
>> Modified time: Fri Nov 15 20:38:53 CET 2013
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 0.00941915
>> Signature: null
>> Metadata:
>>         Content-Type=text/html
>>         _maxdepth_=1000
>>         _pst_=temp_moved(13), lastModified=0: https://issues.apache.org/jira/browse/NUTCH
>>         _depth_=2
>>
>> Sebastian
>>
>> On 11/14/2013 12:56 PM, Amit Sela wrote:
>> > Hi all,
>> >
>> > I'm reading the crawldb as CrawledPage and I see the fetched URL, content
>> > etc.
>> > In case of a redirection (I allow 10 redirections in nutch-site.xml) the
>> > fetched URL is not the original URL the Fetcher turned to, and I would
>> > like to get that as well.
>> >
>> > Does Nutch store it somewhere? I'm basically looking for a mapping between
>> > the URLs it attempted to fetch and the URLs it actually fetched.
>> >
>> > Thanks,
>> >
>> > Amit.
>> >
>>
>>
>
