Thanks, Talat! I am using Nutch 1.6. Is Hive a good fit for Nutch 1.6?

On Thu, Aug 1, 2013 at 8:48 AM, Talat UYARER <[email protected]> wrote:

> Hi,
>
> I had the same problem. I solved it with Hive: I mapped the HBase table
> to Hive and then wrote a small query. If you use Hive, I can help you.
> But your problem is a url-validation plugin problem. You should add it
> in your nutch-site.xml; it doesn't come by default.
>
> On 01-08-2013 13:57, A Laxmi wrote:
>
>> Is there any way to find an *inlink* of a crawled site?
>>
>> On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
>>
>>> Thanks for your help, Ahme! I would be interested in more than a
>>> timestamp. I would like to understand how a particular URL was
>>> crawled; in better terms, the sequence, or how Nutch ended up with a
>>> particular link in its crawldb.
>>>
>>> My problem is that I found one site in the crawled list of URLs with
>>> a horrible URL format, something like
>>> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
>>> <http://www.domainabc.com/level1/level2/level3%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C%5C/level4_viewid=1>
>>>
>>> As you can see, this link has picked up some backslashes for some
>>> reason. I tried to reach that URL starting from the landing page
>>> "www.domainabc.com/level1/level2/", but I could not find a URL with
>>> such a bad format. So I want to know how Nutch reached that URL. Is
>>> there some link Nutch crawled that contains the URL
>>> "www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
>>> somewhere? In what sequence did Nutch do the crawling, starting from
>>> a seed URL, to reach such a URL? I hope I made it clear. Please let
>>> me know if you have any questions. Any help is much appreciated.
>>>
>>> On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ
>>> <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Does the timestamp give you what you need? There should be a
>>>> timestamp indicating the time of the operation.
>>>>
>>>> ----- Original Message -----
>>>> From: "A Laxmi" <[email protected]>
>>>> To: [email protected]
>>>> Sent: Wednesday, 31 July 2013 17:55:45
>>>> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
>>>>
>>>> Hello,
>>>>
>>>> For example, I have a single *seed* URL, say
>>>> "http://nutch.apache.org/", and I am crawling it "n" times. At the
>>>> end of the crawl, I have 1220 new URLs generated/fetched/updated
>>>> from a single seed URL. While looking at these 1220 new URLs, I am
>>>> interested to know how a particular site, e.g. "www.abc/xy.com", was
>>>> crawled. A better question would be: in what sequence did the
>>>> crawler work its way to a particular URL, "www.abc/xy.com"?
>>>>
>>>> Thanks for your help!

