Re: Nutch 1.6 - sequence in which crawler works its way to a URL

A Laxmi Thu, 01 Aug 2013 06:02:26 -0700

Thanks, Talat! I am using Nutch 1.6. Is Hive a good fit for Nutch 1.6?


On Thu, Aug 1, 2013 at 8:48 AM, Talat UYARER <[email protected]> wrote:

> Hi,
>
> I had the same problem and solved it with Hive: I mapped the HBase table
> to Hive and then wrote a small query. If you use Hive, I can help you.
> However, your problem is a url-validation plugin problem. You should add
> it to your nutch-site.xml; it is not enabled by default.
>
>
>
> On 01-08-2013 at 13:57, A Laxmi wrote:
>
>> Is there any way to find an *inlink* of a crawled site?
>>
>>
>>
>> On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:
>>
>>> Thanks for your help, Ahme! I would be interested in more than a
>>> timestamp. I would like to understand how a particular URL was crawled;
>>> in other words, the sequence by which Nutch ended up with a particular
>>> link in its crawldb.
>>>
>>> My problem is that I found one site in the list of crawled URLs with a
>>> horrible URL format, something like
>>> 'www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?'
>>>
>>> As you can see, this link has some backslashes in it for some reason. I
>>> tried to reach that URL starting from the landing page
>>> "www.domainabc.com/level1/level2/", but I could not find a URL with such
>>> a bad format. So I want to know how Nutch reached that URL. Is there
>>> some link Nutch crawled that contains the URL
>>> 'www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?'
>>> somewhere? In what sequence did Nutch crawl from a seed URL to reach
>>> such a URL? I hope I have made it clear. Please let me know if you
>>> have any questions. Any help is much appreciated.
>>>
>>>
>>> On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Does the timestamp give you what you need? There should be a timestamp
>>>> indicating the time of each operation.
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "A Laxmi" <[email protected]>
>>>> To: [email protected]
>>>> Sent: Wednesday, 31 July 2013 17:55:45
>>>> Subject: Nutch 1.6 - sequence in which crawler works its way to a URL
>>>>
>>>> Hello,
>>>>
>>>> For example, I have a single *seed* URL, say "http://nutch.apache.org/",
>>>> and I am crawling it "n" times. At the end of the crawl, I have 1220 new
>>>> URLs generated/fetched/updated from that single seed URL. Looking at
>>>> these 1220 new URLs, I would like to know how a particular site, e.g.
>>>> "www.abc/xy.com", was crawled. A better question would be: in what
>>>> sequence did the crawler work its way to the particular URL
>>>> "www.abc/xy.com"?
>>>>
>>>> Thanks for your help!
>>>>
>>>>
>>>
>
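
For readers following the thread: the question of how the crawler reached a
URL amounts to walking the inlink graph backwards from that URL to the seed.
Below is a minimal Python sketch of that idea; it is not something posted in
the thread. It assumes the inlinks have already been exported into an
in-memory dict (in Nutch 1.x the LinkDb built by `bin/nutch invertlinks` can
be inspected with `bin/nutch readlinkdb`). The function name `trace_path`,
the dict layout, and the example.org URLs are all hypothetical.

```python
from collections import deque


def trace_path(inlinks, seed, target):
    """Return one shortest chain of URLs from seed to target, or None.

    `inlinks` maps each URL to the URLs of pages that link to it.
    This layout is an assumption for illustration, not a Nutch format.
    """
    # Walk backwards from the target along inlinks (breadth-first),
    # recording for every page the next hop on the way to the target.
    next_hop = {target: None}
    queue = deque([target])
    while queue:
        url = queue.popleft()
        if url == seed:
            # Follow the recorded hops forward, from seed to target.
            path = []
            while url is not None:
                path.append(url)
                url = next_hop[url]
            return path
        for source in inlinks.get(url, ()):
            if source not in next_hop:
                next_hop[source] = url
                queue.append(source)
    return None  # the target is not reachable from the seed in this map


if __name__ == "__main__":
    # Hypothetical link data: "/" links to "/a", "/a" and "/c" link to "/b".
    example = {
        "http://example.org/a": ["http://example.org/"],
        "http://example.org/b": ["http://example.org/a", "http://example.org/c"],
        "http://example.org/c": ["http://example.org/b"],
    }
    print(trace_path(example, "http://example.org/", "http://example.org/b"))
```

The result is one shortest chain of links, which is not necessarily the order
in which the fetcher actually visited the pages. Links that were filtered out
before reaching the LinkDb (for example, depending on the
db.ignore.internal.links setting) will not appear in it.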
