Re: Nutch 1.6 - sequence in which crawler works its way to a URL

A Laxmi Thu, 01 Aug 2013 06:36:26 -0700

Hi Julien. Thanks for the suggestion! Could you please provide a reference
link on how to use linkdb?
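
For context, a rough sketch of what reading a linkdb could look like. In Nutch 1.x the linkdb is normally built with `bin/nutch invertlinks <linkdb> -dir <segments_dir>` and dumped to text with `bin/nutch readlinkdb <linkdb> -dump <output_dir>`. The Python below is not part of Nutch. It assumes each dump record is a line `<url>` TAB `Inlinks:` followed by indented ` fromUrl: <source> anchor: <text>` lines, and the helper names `parse_linkdb_dump` and `trace_path` are made up for illustration. Depending on the `db.ignore.internal.links` setting, same-host links may be missing from the linkdb, so an empty result is not proof that no such link exists.

```python
import re
from collections import deque

# Assumed layout of one record in a `readlinkdb -dump` text file:
#   <target url>\tInlinks:
#    fromUrl: <source url> anchor: <anchor text>
FROM_RE = re.compile(r"^\s*fromUrl:\s*(\S+)\s+anchor:")


def parse_linkdb_dump(text):
    """Return {target_url: [source_url, ...]} parsed from dump text."""
    inlinks = {}
    current = None
    for line in text.splitlines():
        if "\t" in line and line.rstrip().endswith("Inlinks:"):
            # Header line: the target URL is everything before the tab.
            current = line.split("\t", 1)[0].strip()
            inlinks.setdefault(current, [])
        else:
            match = FROM_RE.match(line)
            if match and current is not None:
                inlinks[current].append(match.group(1))
    return inlinks


def trace_path(inlinks, seed, target):
    """Walk inlinks backwards from target; return a seed-to-target path or None."""
    next_hop = {target: None}  # url -> the page it links to on the way to target
    queue = deque([target])
    while queue:
        url = queue.popleft()
        if url == seed:
            path = []
            while url is not None:
                path.append(url)
                url = next_hop[url]
            return path
        for source in inlinks.get(url, []):
            if source not in next_hop:
                next_hop[source] = url
                queue.append(source)
    return None
```

Calling `trace_path(parse_linkdb_dump(dump_text), seed_url, odd_url)` on the contents of a dump file would return one chain of pages leading from the seed to the odd URL, if the linkdb recorded those links.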



On Thu, Aug 1, 2013 at 9:28 AM, Julien Nioche <[email protected]> wrote:

> Why don't you create a linkdb then read it with the nutch readlinkdb
> command?
>
>
> On 1 August 2013 11:57, A Laxmi <[email protected]> wrote:
>
> > Is there any way to find an *inlink* of a crawled site?
> >
> >
> > On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
> >
> > > Thanks for your help, Ahme! I would be interested in more than a
> > > timestamp. I would like to understand how a particular URL was crawled;
> > > in better terms, the sequence, or how Nutch ended up with a particular
> > > link in its crawldb.
> > >
> > > My problem is that I found one URL in the crawled list with a
> > > horrible format, something like
> > > "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
> > > <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>.
> > > As you can see, this link contains backslashes for some reason. I
> > > tried to reach that URL starting from the landing page
> > > "www.domainabc.com/level1/level2/" but I could not find a link with
> > > such a bad format. So I want to know how Nutch reached that URL. Is
> > > there some page Nutch crawled that contains this URL somewhere? In what
> > > sequence did Nutch crawl, starting from a seed URL, to reach such a URL?
> > > I hope I have made this clear. Please let me know if you have any
> > > questions. Any help is much appreciated.
> > >
> > >
> > > On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
> > >
> > >> Hello,
> > >>
> > >> Does the timestamp give you what you need? There should be a timestamp
> > >> indicating the time of the operation.
> > >>
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >> From: "A Laxmi" <[email protected]>
> > >> To: [email protected]
> > >> Sent: Wednesday, 31 July 2013 17:55:45
> > >> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
> > >>
> > >> Hello,
> > >>
> > >> For example, I have a single *seed* URL, say "http://nutch.apache.org/",
> > >> and I am crawling it "n" times. At the end of the crawl, I have 1220 new
> > >> URLs generated/fetched/updated from that single seed URL. Looking at
> > >> these 1220 new URLs, I would like to know how a particular site, e.g.
> > >> "www.abc/xy.com", was crawled. A better question would be: in what
> > >> sequence did the crawler work its way to the particular URL
> > >> "www.abc/xy.com"?
> > >>
> > >> Thanks for your help!
> > >>
> > >
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
