Julien - Sure. Thanks for your help! Your response at least gave me a direction - linkdb.

On Thu, Aug 1, 2013 at 9:40 AM, Julien Nioche <[email protected]> wrote:

> What about Googling / reading the WIKI / doing a bit of research yourself
> before asking questions on the mailing list?
>
> On 1 August 2013 14:34, A Laxmi <[email protected]> wrote:
>
> > Hi Julien. Thanks for the suggestion! Could you please provide a
> > reference link on how to use linkdb?
> >
> > On Thu, Aug 1, 2013 at 9:28 AM, Julien Nioche <[email protected]> wrote:
> >
> > > Why don't you create a linkdb then read it with the nutch readlinkdb
> > > command?
> > >
> > > On 1 August 2013 11:57, A Laxmi <[email protected]> wrote:
> > >
> > > > Is there any way to find an *inlink* of a crawled site?
> > > >
> > > > On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
> > > >
> > > > > Thanks for your help, Ahme! I would be interested in more than a
> > > > > timestamp. I would like to understand how a particular URL was
> > > > > crawled, or in better terms, the sequence by which nutch ended up
> > > > > with a particular link in its crawldb.
> > > > >
> > > > > My problem is that I found one site in the crawled list of URLs
> > > > > with a horrible URL format, something like
> > > > > "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1
> > > > > <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>?"
> > > > > - as you can see, this link has got some backslashes in it for
> > > > > some reason. I tried to reach that url starting from the landing
> > > > > page "www.domainabc.com/level1/level2/" but I could not find a URL
> > > > > with such a bad format. So, I want to know how nutch reached that
> > > > > url. Is there some link nutch crawled which has the url
> > > > > "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1
> > > > > <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>?"
> > > > > somewhere? In what sequence did nutch do the crawling, starting
> > > > > from a seed url, to crawl such a url? I hope I made it clear.
> > > > > Please let me know if you have any questions. Any help is much
> > > > > appreciated.
> > > > >
> > > > > On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Does the timestamp give you what you need? There should be a
> > > > > > timestamp indicating the time of the operation.
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: "A Laxmi" <[email protected]>
> > > > > > To: [email protected]
> > > > > > Sent: Wednesday, 31 July 2013 17:55:45
> > > > > > Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > For example, I have a single *seed* url, say
> > > > > > "http://nutch.apache.org/", and I am crawling it "n" times. At
> > > > > > the end of the crawl, I have 1220 new urls
> > > > > > generated/fetched/updated from a single seed url. While looking
> > > > > > at these 1220 new urls, I am interested to know how a particular
> > > > > > site, e.g. "www.abc/xy.com", has been crawled. A better question
> > > > > > would be: in what sequence did the crawler work its way to a
> > > > > > particular url "www.abc/xy.com"?
> > > > > >
> > > > > > Thanks for your help!
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
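
For anyone finding this thread later, here is a rough sketch of how the linkdb suggestion above could be used to trace a URL back to the seed. It assumes the linkdb was built and dumped with something like `bin/nutch invertlinks crawl/linkdb -dir crawl/segments` followed by `bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump` (the `crawl/...` and `linkdb_dump` paths are placeholders, not anything from the thread). It also assumes the dump is plain text where each entry is a `<url>	Inlinks:` header followed by ` fromUrl: <url> anchor: <text>` lines; please check that against your own dump before relying on it. The function names and the sample URLs are made up for illustration.

```python
from collections import deque

# Hypothetical sample in the assumed readlinkdb dump layout:
# "<url>\tInlinks:" followed by " fromUrl: <url> anchor: <text>" lines.
SAMPLE = (
    "http://example.com/a\tInlinks:\n"
    " fromUrl: http://example.com/ anchor: A\n"
    "\n"
    "http://example.com/a/b\tInlinks:\n"
    " fromUrl: http://example.com/a anchor: B\n"
    "\n"
)


def parse_linkdb_dump(text):
    """Parse dump text into {url: [url of each page linking to it, ...]}."""
    inlinks = {}
    current = None
    for line in text.splitlines():
        if line.endswith("Inlinks:"):
            # Header line: everything before "Inlinks:" is the target URL.
            current = line[: -len("Inlinks:")].strip()
            inlinks.setdefault(current, [])
        elif current is not None and line.strip().startswith("fromUrl:"):
            # Entry line: keep the source URL, drop the anchor text.
            rest = line.strip()[len("fromUrl:"):].strip()
            from_url = rest.split(" anchor:", 1)[0].strip()
            inlinks[current].append(from_url)
    return inlinks


def trace_to_seed(inlinks, target, seed):
    """Breadth-first search backwards over inlinks, from target to seed.

    Returns one shortest chain [seed, ..., target], or None if the
    linkdb contains no chain connecting the two.
    """
    queue = deque([[target]])
    seen = {target}
    while queue:
        path = queue.popleft()
        if path[0] == seed:
            return path
        for parent in inlinks.get(path[0], []):
            if parent not in seen:
                seen.add(parent)
                queue.append([parent] + path)
    return None


if __name__ == "__main__":
    links = parse_linkdb_dump(SAMPLE)
    print(trace_to_seed(links, "http://example.com/a/b", "http://example.com/"))
```

For a single URL, `bin/nutch readlinkdb crawl/linkdb -url <url>` should print its inlinks directly without a dump. If the linkdb shows no inlinks from the same host, the `db.ignore.internal.links` property may be worth checking, since as far as I know Nutch 1.x ignores same-host links by default.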

