Why would you want to do that?
------Original Message------
From: sgirao
To: [email protected]
ReplyTo: [email protected]
Subject: Re: How to get the html that i crawled
Sent: Apr 28, 2009 17:36
Hello Ray, thank you for answering me.
Yes, what I want is the HTML content of the pages I crawled.
I run bin/nutch readseg -dump crawl/segments/xxxxx/ output -nofetch
-nogenerate -noparse -noparsedata -noparsetext and the output contains
all the information
for the pages I crawled. But what I need is to retrieve one HTML document
for each crawled URL. For example:
if I crawled url1 and url2, I would like to retrieve the content as url1.html
and url2.html.
If you can help me, I would appreciate it.
Thank you.
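
One possible way to get one HTML file per URL is to post-process the text
file that the readseg -dump command above writes. The Python sketch below is
only an illustration under stated assumptions, not something from Nutch
itself: it assumes the dump is a single text file (commonly output/dump),
that each record starts with a "Recno:: " line followed by a "URL:: " line,
and that the raw page follows a line containing only "Content:". The names
iter_pages and write_pages are hypothetical helpers, and the file naming
(percent-encoded URL plus ".html") is just one choice.

```python
from pathlib import Path
from urllib.parse import quote


def iter_pages(dump_text):
    """Yield (url, html) pairs from readseg dump text.

    Assumed layout (not guaranteed for every Nutch version):
    a "Recno:: N" line opens a record, "URL:: ..." names the page,
    and the raw content follows a line that is exactly "Content:".
    """
    url, body, in_body = None, [], False
    for line in dump_text.splitlines():
        if line.startswith("Recno:: "):
            # A new record begins: emit the previous one, if complete.
            if url is not None and in_body:
                yield url, "\n".join(body).strip()
            url, body, in_body = None, [], False
        elif not in_body and line.startswith("URL:: "):
            url = line[len("URL:: "):].strip()
        elif not in_body and line.strip() == "Content:":
            in_body = True
        elif in_body:
            body.append(line)
    # Emit the last record in the file.
    if url is not None and in_body:
        yield url, "\n".join(body).strip()


def write_pages(dump_path, out_dir):
    """Write each page from the dump file to its own .html file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(dump_path).read_text(encoding="utf-8", errors="replace")
    written = []
    for url, html in iter_pages(text):
        # Percent-encode the whole URL so it is a safe file name.
        target = out / (quote(url, safe="") + ".html")
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

Calling write_pages("output/dump", "pages") would then create one file per
crawled URL in the pages directory. Note the limits of this sketch: a page
whose own text contains a line starting with "Recno:: " would be split
incorrectly, binary content is not handled, and very long URLs may exceed
file name length limits.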
Raymond Balmès wrote:
>
> Well, you should look into segments/xxxxxxxx/content/part-xxxxx if I'm not
> mistaken, but you don't get the HTML, only the content and/or the
> meta-data,
> if I understand Nutch correctly.
> Not sure why you want to read the HTML.
>
> -Ray-
>
> 2009/4/27 sgirao <[email protected]>
>
>>
>> Hello, I'm new at this. I'm using Nutch version 1.0, and I want to
>> retrieve the HTML that I crawl.
>> I use the wiki http://wiki.apache.org/nutch/ to understand how Nutch
>> works.
>> I know that the crawled data is in the segments folder, but I was
>> searching for how to get the HTML and I can't find anything!
>> If anyone can help me, I would appreciate it.
>>
>> P.S. - Forgive my English.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-get-the-html-that-i-crawled-tp23254318p23254318.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>