Re: How to get the html that i crawled

fadzi Tue, 28 Apr 2009 00:40:01 -0700

Why would you want to do that?

------Original Message------
From: sgirao
To: [email protected]
ReplyTo: [email protected]
Subject: Re: How to get the html that i crawled
Sent: Apr 28, 2009 17:36



Hello Ray, thank you for answering me.

Yes, what I want is the HTML content of the pages I crawled.
I ran bin/nutch readseg -dump crawl/segments/xxxxx/ output -nofetch
-nogenerate -noparse -noparsedata -noparsetext, and the output contains
all the information for the pages I crawled. But what I need is to retrieve
one HTML document for each crawled URL. Example:

   If I crawled url1 and url2, I would like to retrieve their content as
url1.html and url2.html.

If you can help me, I would appreciate it.

Thank you.
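
One possible way to get one file per URL is to post-process the text dump
written by the readseg command above. What follows is only a minimal sketch
in Python, not a Nutch feature. It assumes the dump is a text file (for
example output/dump), that each record starts with a "Recno:: " line followed
by a "URL:: " line, and that the raw page body follows a line containing only
"Content:". Check these markers against your own dump before relying on it.
The names split_dump, url_to_filename and write_pages are made up for this
illustration.

```python
import re
from pathlib import Path


def split_dump(dump_text):
    """Return {url: html} parsed from the text of a readseg dump.

    Assumed layout (verify against your own dump): each record starts with
    "Recno:: N", then "URL:: <url>", and the page body comes after a line
    that contains only "Content:".
    """
    pages = {}
    url = None
    body = None  # list of body lines while inside a page body, else None
    for line in dump_text.splitlines():
        if line.startswith("Recno:: "):
            # A new record begins: store the previous page, if there was one.
            if url is not None and body is not None:
                pages[url] = "\n".join(body).strip() + "\n"
            url, body = None, None
        elif body is not None:
            body.append(line)
        elif line.startswith("URL:: "):
            url = line[len("URL:: "):].strip()
        elif line.strip() == "Content:" and url is not None:
            body = []
    # Store the last record, which has no following "Recno:: " line.
    if url is not None and body is not None:
        pages[url] = "\n".join(body).strip() + "\n"
    return pages


def url_to_filename(url):
    """Turn a URL into a file name ending in .html (hypothetical naming scheme)."""
    name = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)  # drop the scheme
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return (name or "index") + ".html"


def write_pages(dump_path, out_dir):
    """Write one .html file per URL found in the dump file."""
    text = Path(dump_path).read_text(encoding="utf-8", errors="replace")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url, html in split_dump(text).items():
        (out / url_to_filename(url)).write_text(html, encoding="utf-8")


# Example call (both paths are assumptions):
# write_pages("output/dump", "html_pages")
```

Two different URLs can map to the same file name with this naming scheme, and
anything that is not text (images, PDFs) will not survive a text dump, so
treat the result as a starting point only.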



Raymond Balmès wrote:
> 
> Well, you should look into segments/xxxxxxxx/content/part-xxxxx if I'm not
> mistaken, but you don't get the HTML, only the content and/or the
> meta-data, if I understand Nutch correctly.
> Not sure why you want to read the HTML.
> 
> -Ray-
> 
> 2009/4/27 sgirao <[email protected]>
> 
>>
>> Hello, I'm new at this. I'm using Nutch version 1.0, and I want to
>> retrieve the HTML that I crawl.
>> I used the wiki http://wiki.apache.org/nutch/ to understand how Nutch
>> works.
>> I know the things that were crawled are in the segments folder, but I
>> searched for how to get the HTML and didn't find anything!
>> If anyone can help me, I would appreciate it.
>>
>> P.S. -  Forgive my English.
>>
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-get-the-html-that-i-crawled-tp23254318p23254318.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-get-the-html-content-that-i-crawled-tp23254318p23271758.html
Sent from the Nutch - User mailing list archive at Nabble.com.



