To do that you would need to write an MR job that grabs the bytes from the Nutch content and writes them out to files. But the question is why you want individual files; you are going to take a performance hit if you process them that way.

Dennis
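As a lighter-weight alternative to a full MR job, the readseg dump mentioned below can be post-processed into one file per URL with a short script. This is only a sketch: it assumes (without checking the Nutch source) that each record in the dump starts with a "Recno::" line and contains a "URL::" header, with the page content following until the next record. The function names are my own; verify the record layout against your actual dump before relying on this.

```python
# Sketch: split a Nutch "readseg -dump" output file into one file per URL.
# ASSUMED dump layout (verify against your own dump!):
#   Recno:: 0
#   URL:: http://example.com/
#   ...content lines...
#   Recno:: 1
#   ...
import os
import re


def safe_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe name ending in .html."""
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", url)
    return name.strip("_") + ".html"


def split_dump(dump_path: str, out_dir: str) -> int:
    """Write one <url>.html file per record; return the number written."""
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    url, body = None, []

    def flush():
        nonlocal written, url, body
        if url is not None:
            out_path = os.path.join(out_dir, safe_filename(url))
            with open(out_path, "w") as f:
                f.write("\n".join(body))
            written += 1
        url, body = None, []

    with open(dump_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("Recno::"):
                flush()  # a new record begins; write out the previous one
            elif line.startswith("URL::"):
                url = line.split("URL::", 1)[1].strip()
            elif url is not None:
                body.append(line)
    flush()  # write out the final record
    return written
```

Note that the dump's content block also includes protocol metadata, not just raw HTML, so you may want to trim the header lines of each record as well.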

sgirao wrote:
Hello Ray, thank you for answering me.

Yes, what I want is the content of the crawled HTML.
I ran bin/nutch readseg -dump crawl/segments/xxxxx/ output -nofetch
-nogenerate -noparse -noparsedata -noparsetext, and the output contains
all the information about the pages I crawled. But what I need is to
retrieve one HTML document per crawled URL. For example: if I crawled
url1 and url2, I would like to retrieve the content in url1.html and
url2.html.

If you can help me, I would appreciate it.

Thank you.



Raymond Balmès wrote:
Well, you should look into segments/xxxxxxxx/content/part-xxxxx if I'm not
mistaken, but you don't get the HTML, only the content and/or the
metadata, if I understand Nutch correctly.
I'm not sure why you want to read the HTML.

-Ray-

2009/4/27 sgirao <[email protected]>

Hello, I'm new at this. I'm using Nutch version 1.0, and I want to
retrieve the HTML that I crawl.
I used the wiki http://wiki.apache.org/nutch/ to understand how Nutch
works.
I know the things that were crawled are in the segments folder, but I was
searching for how to get the HTML and I can't find anything!
If anyone can help me, I would appreciate it.

P.S. - Forgive my English.

--
View this message in context:
http://www.nabble.com/How-to-get-the-html-that-i-crawled-tp23254318p23254318.html
Sent from the Nutch - User mailing list archive at Nabble.com.
