Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Julien Nioche Thu, 01 Aug 2013 06:29:42 -0700

Why don't you create a linkdb then read it with the nutch readlinkdb
command?
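
A minimal sketch of how such a dump could answer the "in what sequence" question. In Nutch 1.x a linkdb is typically built with `bin/nutch invertlinks` and written out as text with `bin/nutch readlinkdb <linkdb> -dump <out_dir>`. The Python below assumes each record in the dump is a URL followed by `Inlinks:` and then indented `fromUrl: ... anchor: ...` lines. That layout, the function names and the example.org URLs are assumptions for illustration, so check them against your own dump before relying on this.

```python
from collections import deque


def parse_linkdb_dump(text):
    """Parse an (assumed) readlinkdb text dump into {url: [inlink source urls]}."""
    inlinks = {}
    current = None
    for line in text.splitlines():
        if line.rstrip().endswith("Inlinks:"):
            # Assumed record header: "<target url>\tInlinks:"
            current = line.split("\t", 1)[0].strip()
            inlinks.setdefault(current, [])
        elif line.strip().startswith("fromUrl:") and current is not None:
            # Assumed entry: " fromUrl: <source url> anchor: <anchor text>"
            rest = line.strip()[len("fromUrl:"):].strip()
            source = rest.split(" anchor:", 1)[0].strip()
            inlinks[current].append(source)
    return inlinks


def trace_to_seed(inlinks, target, seed):
    """Search backwards over inlinks (breadth first).

    Returns the shortest seed -> target chain of URLs, or None if the
    linkdb does not connect the two.
    """
    queue = deque([[target]])
    seen = {target}
    while queue:
        path = queue.popleft()
        if path[-1] == seed:
            return list(reversed(path))
        for source in inlinks.get(path[-1], []):
            if source not in seen:
                seen.add(source)
                queue.append(path + [source])
    return None


if __name__ == "__main__":
    # Hypothetical dump content, not real crawl output.
    sample = (
        "http://example.org/a\tInlinks:\n"
        " fromUrl: http://example.org/ anchor: A\n"
        "\n"
        "http://example.org/a/odd\tInlinks:\n"
        " fromUrl: http://example.org/a anchor: odd link\n"
    )
    links = parse_linkdb_dump(sample)
    print(trace_to_seed(links, "http://example.org/a/odd", "http://example.org/"))
```

The `fromUrl` entries for the odd URL show which page carried the malformed link. Depending on configuration (for example the `db.ignore.internal.links` property), links within the same host may not be recorded in the linkdb at all, in which case no chain will be found.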



On 1 August 2013 11:57, A Laxmi <[email protected]> wrote:

> Is there any way to find an *inlink* of a crawled site?
>
>
> On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
>
> > Thanks for your help, Ahme! I would be interested in more than a
> > timestamp. I would like to understand how a particular URL was crawled,
> > or in better terms, the sequence by which Nutch ended up with a
> > particular link in its crawldb.
> >
> > My problem is that I found one URL in the crawled list with a horrible
> > format, something like
> > "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?".
> > As you can see, this link contains a run of backslashes for some reason.
> > I tried to reach that URL by browsing from the landing page
> > "www.domainabc.com/level1/level2/" but could not find any link with such
> > a bad format. So I want to know how Nutch reached that URL. Did Nutch
> > crawl some page that contains a link to
> > "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
> > somewhere? In what sequence did Nutch crawl from a seed URL to reach
> > such a URL? I hope I have made this clear. Please let me know if you
> > have any questions. Any help is much appreciated.
> >
> >
> > On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> Does the timestamp give you what you need? There should be a timestamp
> >> indicating the time of the operation.
> >>
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: "A Laxmi" <[email protected]>
> >> To: [email protected]
> >> Sent: Wednesday, 31 July 2013 17:55:45
> >> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
> >>
> >> Hello,
> >>
> >> For example, I have a single *seed* URL, say "http://nutch.apache.org/",
> >> and I am crawling it "n" times. At the end of the crawl, I have 1220 new
> >> URLs generated/fetched/updated from that single seed URL. Looking at
> >> these 1220 new URLs, I am interested to know how a particular site, e.g.
> >> "www.abc/xy.com", has been crawled. A better question would be: in what
> >> sequence did the crawler work its way to a particular URL such as
> >> "www.abc/xy.com"?
> >>
> >> Thanks for your help!
> >>
> >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
