One solution is to use Solr, which integrates nicely with Nutch. Index the crawl into Solr, then read the data back out of Solr using the SolrReader API.
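
As a rough illustration of reading the data back out, here is a minimal sketch that pages through Solr's standard HTTP select handler instead of a Java reader API. The endpoint URL, the page size, and the field names `url` and `content` are assumptions on my part (they follow the usual Nutch Solr schema, and `content` has to be a stored field); adjust them to your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed for illustration: a local Solr instance and Nutch-style field names.
SOLR_SELECT = "http://localhost:8983/solr/select"


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_docs(base_url=SOLR_SELECT, rows=100, fetch=fetch_json):
    """Yield (url, content) pairs by paging through every indexed document."""
    start = 0
    while True:
        params = urlencode({
            "q": "*:*",            # match all documents
            "fl": "url,content",   # only return the fields we need
            "wt": "json",
            "start": start,
            "rows": rows,
        })
        docs = fetch(base_url + "?" + params)["response"]["docs"]
        if not docs:
            break
        for doc in docs:
            # Missing fields fall back to an empty string.
            yield doc.get("url", ""), doc.get("content", "")
        start += rows
```

Each pair keeps the URL next to the parsed text, so the URL can serve as the key when the records are written out for Mahout, whether as a sequence file or as a directory of text files.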

On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel <[email protected]> wrote:
> I'd like to use Nutch to gather data to process with Mahout. Nutch creates
> parsed text for the pages it crawls. Nutch also has several cl tools to
> turn the data into a text file (readseg for instance). The tools I've found
> either create one big text file with markers in it for records or allow you
> to get one record from the big text file. Mahout expects a sequence file or
> a directory full of text files and includes at least one special purpose
> reader for wikipedia dump files.
>
> Does anyone have a simple way to turn the nutch data into sequence files?
> I'd ideally like to preserve the urls for use with named vectors later in
> the pipeline. It seems a simple tool to write but maybe it's already there
> somewhere?
