Thanks for your help, Ahme! I am interested in more than a timestamp.
I would like to understand how a particular URL was crawled, or more
precisely, the sequence of links by which Nutch ended up with a particular
URL in its crawldb.

My problem is that, in the list of crawled URLs, I found one site with a
badly malformed URL, something like
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?". As
you can see, this link contains a run of backslashes for some reason. I tried
to reach that URL by browsing from the landing page
"www.domainabc.com/level1/level2/", but I could not find any link in such a
bad format. So I want to know how Nutch reached that URL. Did Nutch crawl
some page that contains the URL
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
somewhere? In what sequence did Nutch crawl, starting from a seed URL, to
arrive at such a URL? I hope that is clear. Please let me know if you
have any questions. Any help is much appreciated.
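
To make the question concrete, here is a rough Python sketch of the kind of
trace I am after. It assumes I could somehow obtain, for every crawled URL,
the list of pages that link to it (I believe the linkdb holds inlinks, but
that is an assumption on my part). The function name trace_to_seed, the
inlinks dictionary, and the sample URLs "some-page" and "bad-url" are all
made up for illustration; none of this is Nutch code.

```python
from collections import deque

def trace_to_seed(inlinks, target, seeds):
    """Walk backwards from `target` over inlinks until a seed is reached.

    inlinks: dict mapping a URL to the list of URLs that link to it
             (hypothetical data, e.g. extracted by hand from a link dump).
    Returns the chain of URLs from the seed to the target, or None if
    no seed can be reached.
    """
    queue = deque([target])
    # next_hop[u] is the URL that u links to on the way towards the target.
    next_hop = {target: None}
    while queue:
        url = queue.popleft()
        if url in seeds:
            # Follow next_hop from the seed forward to the target.
            path = []
            while url is not None:
                path.append(url)
                url = next_hop[url]
            return path
        for parent in inlinks.get(url, []):
            if parent not in next_hop:  # skip pages already visited
                next_hop[parent] = url
                queue.append(parent)
    return None

# Hypothetical example data, not taken from a real crawl.
example_inlinks = {
    "some-page": ["www.domainabc.com/level1/level2/"],
    "bad-url": ["some-page"],
}
example_path = trace_to_seed(
    example_inlinks, "bad-url", {"www.domainabc.com/level1/level2/"}
)
print(example_path)
```

Because the search is breadth-first, the chain it returns would be a shortest
one in terms of link hops, which is not necessarily the order in which the
crawler actually fetched the pages.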


On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:

> Hello,
>
> Does the timestamp give you what you need? There should be a timestamp
> indicating the time of the operation.
>
>
>
>
> ----- Original Message -----
> From: "A Laxmi" <[email protected]>
> To: [email protected]
> Sent: Wednesday, 31 July 2013 17:55:45
> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
>
> Hello,
>
> For example, I have a single *seed* URL, say "http://nutch.apache.org/", and
> I am crawling it "n" times. At the end of the crawl, I have 1220 new
> URLs generated/fetched/updated from that single seed URL. Looking at
> these 1220 new URLs, I would like to know how a particular site, e.g.
> "www.abc/xy.com", was crawled. A better question would be: in what
> sequence did the crawler work its way to a particular URL such as
> "www.abc/xy.com"?
>
> Thanks for your help!
>
