Re: Nutch 1.6 - sequence in which crawler works its way to a URL

A Laxmi Thu, 01 Aug 2013 06:48:27 -0700

Julien - Sure. Thanks for your help! Your response at least gave me a
direction: linkdb.



On Thu, Aug 1, 2013 at 9:40 AM, Julien Nioche <[email protected]> wrote:

> What about Googling, reading the wiki, or doing a bit of research yourself
> before asking questions on the mailing list?
>
> On 1 August 2013 14:34, A Laxmi <[email protected]> wrote:
>
> > Hi Julien. Thanks for the suggestion! Could you please provide a
> > reference link on how to use linkdb?
> >
> >
> > On Thu, Aug 1, 2013 at 9:28 AM, Julien Nioche <[email protected]> wrote:
> >
> > > Why don't you create a linkdb then read it with the nutch readlinkdb
> > > command?
> > >
> > >
> > > On 1 August 2013 11:57, A Laxmi <[email protected]> wrote:
> > >
> > > > Is there any way to find an *inlink* of a crawled site?
> > > >
> > > >
> > > > On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
> > > >
> > > > > Thanks for your help, Ahme! I would be interested in more than a
> > > > > timestamp. I would like to understand how a particular URL was
> > > > > crawled; in other words, the sequence in which Nutch ended up with
> > > > > a particular link in its crawldb.
> > > > >
> > > > > My problem is that I found one site in the crawled list of URLs
> > > > > with a horrible URL format, something like
> > > > > 'www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1'
> > > > > <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>.
> > > > > As you can see, this link contains some backslashes for some
> > > > > reason. I tried to reach that URL starting from the landing page
> > > > > "www.domainabc.com/level1/level2/", but I could not find a link
> > > > > with such a bad format. So I want to know: how did Nutch reach that
> > > > > URL? Did Nutch crawl some page that contains a link to
> > > > > 'www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1'?
> > > > > In what sequence did Nutch crawl, starting from a seed URL, to
> > > > > arrive at such a URL? I hope I made it clear. Please let me know if
> > > > > you have any questions. Any help is much appreciated.
> > > > >
> > > > >
> > > > > On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> Does the timestamp give you what you need? There should be a
> > > > >> timestamp indicating the time of the operation.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> ----- Original Message -----
> > > > >> From: "A Laxmi" <[email protected]>
> > > > >> To: [email protected]
> > > > >> Sent: Wednesday, 31 July 2013 17:55:45
> > > > >> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
> > > > >>
> > > > >> Hello,
> > > > >>
> > > > >> For example, I have a single *seed* URL, say
> > > > >> "http://nutch.apache.org/", and I crawl it "n" times. At the end
> > > > >> of the crawl, I have 1220 new URLs generated/fetched/updated from
> > > > >> that single seed URL. Looking at these 1220 new URLs, I am
> > > > >> interested to know how a particular site, e.g. "www.abc/xy.com",
> > > > >> was crawled. A better question would be: in what sequence did the
> > > > >> crawler work its way to a particular URL, "www.abc/xy.com"?
> > > > >>
> > > > >> Thanks for your help!
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
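
For anyone following the linkdb suggestion in this thread, here is a rough sketch of how the inlink data could be used to trace a URL back to a seed. It assumes a linkdb was already built with something like `bin/nutch invertlinks crawl/linkdb -dir crawl/segments` and dumped to text with `bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump` (the `crawl/...` and `linkdb_dump` paths are placeholders; check the usage output of your Nutch version). The dump layout parsed below (a `url<TAB>Inlinks:` line followed by ` fromUrl: ... anchor: ...` lines) is an assumption about the text output, and `parse_linkdb_dump`, `trace_to_seed`, and the example.com URLs are hypothetical names made up for illustration.

```python
from collections import deque

# Hypothetical sample in the ASSUMED readlinkdb -dump text layout.
SAMPLE = (
    "http://example.com/b\tInlinks:\n"
    " fromUrl: http://example.com/a anchor: B page\n"
    "\n"
    "http://example.com/a\tInlinks:\n"
    " fromUrl: http://example.com/ anchor: A page\n"
)


def parse_linkdb_dump(text):
    """Build {url: [urls of pages linking to it]} from dump text."""
    inlinks = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("fromUrl:"):
            if current is None:
                continue  # inlink line without a preceding URL line
            rest = stripped[len("fromUrl:"):]
            # The source URL is everything before the " anchor:" marker.
            source = rest.split(" anchor:", 1)[0].strip()
            inlinks[current].append(source)
        elif "\tInlinks:" in line:
            current = line.split("\t", 1)[0].strip()
            inlinks.setdefault(current, [])
    return inlinks


def trace_to_seed(inlinks, target, seed):
    """Breadth-first search backwards over inlinks.

    Returns one shortest chain [seed, ..., target], or None if the
    linkdb holds no chain connecting the two.
    """
    queue = deque([[target]])
    seen = {target}
    while queue:
        path = queue.popleft()
        if path[0] == seed:
            return path
        for source in inlinks.get(path[0], []):
            if source not in seen:
                seen.add(source)
                queue.append([source] + path)
    return None


if __name__ == "__main__":
    links = parse_linkdb_dump(SAMPLE)
    print(trace_to_seed(links, "http://example.com/b", "http://example.com/"))
```

The chain found this way is only one possible route through the link graph, not necessarily the order in which the fetcher actually visited the pages. Also note that if `db.ignore.internal.links` is set to true in the Nutch configuration, links between pages on the same host are left out of the linkdb, so a same-site inlink like the one asked about above may not show up.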
