You may want to look at Tika's HtmlParser to strip out the HTML tags and 
return only the raw text content of the crawled pages.  That text could then be 
written out to sequence files with the URL as the key and the raw text as the value.



________________________________
 From: Pat Ferrel <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Thursday, April 12, 2012 7:06 PM
Subject: Recommended way to consume Nutch data in Mahout
 
I'd like to use Nutch to gather data to process with Mahout. Nutch creates 
parsed text for the pages it crawls. Nutch also has several command-line tools 
to turn the data into a text file (readseg, for instance). The tools I've found 
either create one big text file with markers in it for records or allow you to 
get one record from the big text file. Mahout expects a sequence file or a 
directory full of text files, and includes at least one special-purpose reader 
for Wikipedia dump files.

Does anyone have a simple way to turn the Nutch data into sequence files? I'd 
ideally like to preserve the URLs for use with named vectors later in the 
pipeline. It seems like a simple tool to write, but maybe it already exists 
somewhere?
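
For illustration only, here is a rough sketch of what such a tool might look 
like. It is not an existing Nutch or Mahout utility. The class name 
ReadSegDumpSplitter is hypothetical, and the record layout is an assumption: 
each record in the readseg text dump is taken to start with a "Recno::" line, 
followed by a "URL::" line and a "ParseText::" marker before the parsed text. 
Check a real dump before relying on those markers. The sketch uses only the 
JDK and stops at URL/text pairs; writing them out is left as a comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits one big readseg text dump into URL -> text pairs.
final class ReadSegDumpSplitter {

    /** Splits dump lines into URL -> parsed text, preserving record order. */
    static Map<String, String> split(List<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        String url = null;
        StringBuilder text = new StringBuilder();
        boolean inText = false;

        for (String line : lines) {
            if (line.startsWith("Recno::")) {
                // Assumed record marker: store the previous record, if any.
                if (url != null) records.put(url, text.toString().trim());
                url = null;
                text.setLength(0);
                inText = false;
            } else if (line.startsWith("URL::")) {
                url = line.substring("URL::".length()).trim();
                inText = false;
            } else if (line.equals("ParseText::")) {
                // Assumed marker: the lines that follow are the parsed text.
                inText = true;
            } else if (line.endsWith("::") && !line.contains(" ")) {
                // Any other section marker of the assumed form "Name::" ends the text.
                inText = false;
            } else if (inText) {
                text.append(line).append('\n');
            }
        }
        // Store the last record; a repeated URL overwrites the earlier entry.
        if (url != null) records.put(url, text.toString().trim());
        return records;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (Map.Entry<String, String> e : split(lines).entrySet()) {
            // With Hadoop on the classpath, each pair could instead be appended
            // to a SequenceFile.Writer as a Text key (URL) and Text value (text).
            System.out.println(e.getKey() + "\t" + e.getValue().length());
        }
    }
}
```

Run with the path to the dump file as the only argument, it prints each URL 
with the length of its text. Keeping the URL as the map key is what would let 
it become the sequence file key, and later the vector name, further down the 
pipeline.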
