Re: Nutch 1.6 - sequence in which crawler works its way to a URL

A Laxmi Thu, 01 Aug 2013 03:58:31 -0700

Is there any way to find an *inlink* of a crawled site?
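
To make the question concrete, here is a minimal sketch of the lookup being asked for. It assumes the inlinks have already been exported (for example from a dump produced by Nutch's `readlinkdb` command) and parsed into a plain dictionary that maps each URL to the URLs linking to it. The function name `trace_path`, the dictionary layout, and the example.org URLs are all hypothetical and are not part of Nutch.

```python
from collections import deque


def trace_path(inlinks, seed, target):
    """Return one shortest chain of URLs from seed to target, or None.

    inlinks: dict mapping a URL to the list of URLs that link to it
             (assumed to be parsed from a linkdb dump beforehand).
    """
    # Walk backwards from the target, following inlinks breadth-first.
    # next_hop[url] is the URL that `url` links to on the way to the target.
    next_hop = {target: None}
    queue = deque([target])
    while queue:
        url = queue.popleft()
        if url == seed:
            # Rebuild the chain in crawl order: seed -> ... -> target.
            path = []
            while url is not None:
                path.append(url)
                url = next_hop[url]
            return path
        for source in inlinks.get(url, []):
            if source not in next_hop:
                next_hop[source] = url
                queue.append(source)
    return None  # the target is not reachable from the seed in this data


# Hypothetical inlink data, for illustration only.
example_inlinks = {
    "http://example.org/c": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/a", "http://example.org/"],
    "http://example.org/a": ["http://example.org/"],
}
print(trace_path(example_inlinks, "http://example.org/", "http://example.org/c"))
```

A result of `None` would mean the exported link data holds no chain from the seed to that URL, which would itself be a hint that the malformed URL was produced some other way (for instance by URL normalization or parsing) rather than copied from a page.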


On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:

> Thanks for your help, Ahme! I am interested in more than a timestamp. I
> would like to understand how a particular URL was crawled, or more
> precisely, the sequence of links by which Nutch ended up with a particular
> URL in its crawldb.
>
> My problem is that the list of crawled URLs contains one with a badly
> malformed format, something like
> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?".
> As you can see, it contains a run of backslashes for some reason. I tried
> to reach that URL by browsing from the landing page
> "www.domainabc.com/level1/level2/", but I could not find any link with such
> a bad format. So I want to know how Nutch reached that URL. Is there some
> page Nutch crawled that contains the link
> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
> somewhere? In what sequence did Nutch crawl, starting from the seed URL, to
> arrive at such a URL? I hope I have made this clear. Please let me know if
> you have any questions. Any help is much appreciated.
>
>
> On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ
> <[email protected]> wrote:
>
>> Hello,
>>
>> Does timestamp give what you need? There should be a timestamp indicating
>> the time of the operation.
>>
>>
>>
>>
>> ----- Original Message -----
>> From: "A Laxmi" <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, 31 July 2013 17:55:45
>> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
>>
>> Hello,
>>
>> For example, I have a single *seed* URL, say "http://nutch.apache.org/",
>> and I am crawling it "n" times. At the end of the crawl, I have 1220 new
>> URLs generated/fetched/updated from that single seed URL. Looking at
>> these 1220 new URLs, I would like to know how a particular one, e.g.
>> "www.abc/xy.com", was crawled. A better question would be: in what
>> sequence did the crawler work its way to the particular URL
>> "www.abc/xy.com"?
>>
>> Thanks for your help!
>>
>
>
