Re: Nutch 1.6 - sequence in which crawler works its way to a URL

Julien Nioche Thu, 01 Aug 2013 06:42:28 -0700

What about Googling / reading the wiki / doing a bit of research yourself
before asking questions on the mailing list?
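
For reference, the linkdb approach suggested further down the thread could be sketched roughly as follows. The linkdb is built with `bin/nutch invertlinks crawl/linkdb -dir crawl/segments` and dumped to text with `bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump`; the `crawl/...` and `linkdb_dump` paths are placeholders. If I recall correctly, links within the same host are ignored by default (`db.ignore.internal.links`), so that property may need to be set to false first. The Python below is only an illustration: the function names are made up, and the dump layout it parses (a `<url>` TAB `Inlinks:` line followed by ` fromUrl: ... anchor: ...` lines) is an assumption that should be checked against the actual dump. It walks the inlinks backwards from the odd URL until it reaches the seed.

```python
from collections import deque


def parse_linkdb_dump(text):
    """Map each URL to the list of pages that link to it.

    Assumed dump layout (check it against your own readlinkdb output):
        <url><TAB>Inlinks:
         fromUrl: <source url> anchor: <anchor text>
    """
    inlinks = {}
    current = None
    for line in text.splitlines():
        if line.endswith("\tInlinks:"):
            # Start of a new record: the key is the target URL.
            current = line.split("\t", 1)[0]
            inlinks[current] = []
        elif current is not None and line.strip().startswith("fromUrl:"):
            # Keep only the source URL, dropping the anchor text.
            rest = line.strip()[len("fromUrl:"):].strip()
            source = rest.split(" anchor:", 1)[0].strip()
            inlinks[current].append(source)
    return inlinks


def trace_to_seed(inlinks, target, seed):
    """Breadth-first walk backwards over inlinks.

    Returns one shortest chain [seed, ..., target], or None if the
    linkdb holds no chain of inlinks connecting the two.
    """
    queue = deque([[target]])
    seen = {target}
    while queue:
        path = queue.popleft()
        if path[0] == seed:
            return path
        for source in inlinks.get(path[0], []):
            if source not in seen:
                seen.add(source)
                queue.append([source] + path)
    return None


if __name__ == "__main__":
    # Hypothetical two-record dump, inlined so the sketch runs on its own.
    sample = (
        "http://example.com/a\tInlinks:\n"
        " fromUrl: http://example.com/ anchor: A\n"
        "\n"
        "http://example.com/a/b\tInlinks:\n"
        " fromUrl: http://example.com/a anchor: B\n"
    )
    links = parse_linkdb_dump(sample)
    print(trace_to_seed(links, "http://example.com/a/b", "http://example.com/"))
```

For a single URL, `bin/nutch readlinkdb crawl/linkdb -url <url>` prints just its inlinks, which may be enough to spot the page carrying the malformed link.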


On 1 August 2013 14:34, A Laxmi <[email protected]> wrote:

> Hi Julien. Thanks for the suggestion! Could you please provide a reference
> link on how to use linkdb?
>
>
> On Thu, Aug 1, 2013 at 9:28 AM, Julien Nioche <[email protected]> wrote:
>
> > Why don't you create a linkdb then read it with the nutch readlinkdb
> > command?
> >
> >
> > On 1 August 2013 11:57, A Laxmi <[email protected]> wrote:
> >
> > > Is there any way to find an *inlink* of a crawled site?
> > >
> > >
> > > On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
> > >
> > > > > Thanks for your help, Ahme! I am interested in more than a
> > > > > timestamp. I would like to understand how a particular URL was
> > > > > crawled or, to put it better, the sequence by which Nutch ended up
> > > > > with a particular link in its crawldb.
> > > >
> > > > > My problem is that I found one site in the list of crawled URLs with
> > > > > a horribly malformed URL, something like
> > > > > 'www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?'
> > > > > <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>.
> > > > > As you can see, this link contains a run of backslashes for some
> > > > > reason. I tried to reach that URL starting from the landing page
> > > > > "www.domainabc.com/level1/level2/" but could not find any URL with
> > > > > such a bad format. So I want to know how Nutch reached that URL. Is
> > > > > there some page Nutch crawled that contains this URL somewhere? In
> > > > > what sequence did Nutch crawl, starting from a seed URL, to arrive
> > > > > at such a URL? I hope I have made this clear. Please let me know if
> > > > > you have any questions. Any help is much appreciated.
> > > >
> > > >
> > > > > On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > > >> Does the timestamp give you what you need? There should be a
> > > > >> timestamp indicating the time of the operation.
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> ----- Original Message -----
> > > > >> From: "A Laxmi" <[email protected]>
> > > > >> To: [email protected]
> > > > >> Sent: Wednesday, 31 July 2013 17:55:45
> > > > >> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
> > > >>
> > > >> Hello,
> > > >>
> > > > >> For example, I have a single *seed* URL, say "http://nutch.apache.org/",
> > > > >> and I crawl it "n" times. At the end of the crawl, I have 1220 new
> > > > >> URLs generated/fetched/updated from that single seed URL. Looking at
> > > > >> these 1220 new URLs, I would like to know how a particular site, e.g.
> > > > >> "www.abc/xy.com", was crawled. A better question would be: in what
> > > > >> sequence did the crawler work its way to the particular URL
> > > > >> "www.abc/xy.com"?
> > > >>
> > > >> Thanks for your help!
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
