One solution is to use Solr, which integrates nicely with Nutch. You can then
read the data back out of Solr using the SolrReader API.
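
If going through Solr is not an option, the readseg route mentioned in the
quoted message could be adapted with a small script. Below is a minimal sketch,
not a tested tool. It assumes the text dump marks each record with a
"Recno:: N" line followed by a "URL:: ..." line and labelled sections such as
"ParseText::"; check that against your own dump before relying on it. The
function names parse_dump and write_text_files are made up for illustration.

```python
import re
from pathlib import Path
from urllib.parse import quote

# Assumed dump layout: every record begins with "Recno:: N", then "URL:: ...",
# then labelled sections ("CrawlDatum::", "Content::", "ParseData::", "ParseText::").
RECORD_START = re.compile(r"^Recno::[ \t]*\d+[ \t]*$", re.MULTILINE)
SECTION_HEADER = re.compile(r"^(CrawlDatum|Content|ParseData|ParseText)::\s*$")


def parse_dump(dump_text):
    """Return a list of (url, parsed_text) pairs found in the dump text."""
    records = []
    for chunk in RECORD_START.split(dump_text):
        url = None
        in_text = False
        lines = []
        for line in chunk.splitlines():
            if url is None and line.startswith("URL:: "):
                url = line[len("URL:: "):].strip()
            elif SECTION_HEADER.match(line):
                # Only the ParseText section is collected; other sections are skipped.
                in_text = line.startswith("ParseText::")
            elif in_text:
                lines.append(line)
        if url is not None:
            records.append((url, "\n".join(lines).strip()))
    return records


def write_text_files(records, out_dir):
    """Write one text file per record, named after the percent-encoded URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url, text in records:
        # Percent-encoding keeps the URL recoverable from the file name.
        # Very long URLs may exceed the file system's name length limit.
        (out / quote(url, safe="")).write_text(text, encoding="utf-8")
```

The resulting directory of text files is the kind of input Mahout's
seqdirectory command takes (mahout seqdirectory -i <input dir> -o <output dir>),
and as far as I know it derives the sequence file keys from the file names, so
the encoded URLs should carry through to later steps.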

On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel <[email protected]> wrote:

> I'd like to use Nutch to gather data to process with Mahout. Nutch creates
> parsed text for the pages it crawls. Nutch also has several command-line
> tools to turn the data into a text file (readseg, for instance). The tools
> I've found either create one big text file with markers in it for records or
> allow you to get one record from the big text file. Mahout expects a sequence
> file or a directory full of text files, and includes at least one
> special-purpose reader for Wikipedia dump files.
>
> Does anyone have a simple way to turn the Nutch data into sequence files?
> I'd ideally like to preserve the URLs for use with named vectors later in
> the pipeline. It seems like a simple tool to write, but maybe it's already
> there somewhere?
>
>
