Recommended way to consume Nutch data in Mahout

Pat Ferrel Fri, 13 Apr 2012 07:03:57 -0700

I'd like to use Nutch to gather data to process with Mahout. Nutch creates parsed text for the pages it crawls. Nutch also has several command-line tools to turn the data into a text file (readseg, for instance). The tools I've found either create one big text file with markers in it for records or allow you to get one record from the big text file. Mahout expects a sequence file or a directory full of text files and includes at least one special-purpose reader for Wikipedia dump files.

Does anyone have a simple way to turn the Nutch data into sequence files? Ideally, I'd like to preserve the URLs for use with named vectors later in the pipeline. It seems like a simple tool to write, but maybe it's already there somewhere?
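
For illustration only, here is a minimal sketch of the kind of tool being asked about. It assumes the readseg text dump marks each record with lines beginning `Recno::`, `URL::` and `ParseText::`; that layout is an assumption and should be checked against the actual dump. The function names `split_dump` and `write_records` are hypothetical. The sketch splits the single dump file into one text file per page, named after the percent-encoded URL, which yields the "directory full of text files" form that Mahout can read.

```python
import os
from urllib.parse import quote


def split_dump(lines):
    """Yield (url, text) pairs from an iterable of dump lines.

    Assumes each record starts with 'Recno::', carries a 'URL::' line,
    and that the parsed text follows a 'ParseText::' line (assumed layout).
    """
    url, text, in_text = None, [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("Recno::"):
            # A new record begins: emit the previous one, if any.
            if url is not None:
                yield url, "\n".join(text).strip()
            url, text, in_text = None, [], False
        elif line.startswith("URL::"):
            url = line[len("URL::"):].strip()
        elif line.startswith("ParseText::"):
            in_text = True
        elif in_text:
            text.append(line)
    # Emit the last record in the file.
    if url is not None:
        yield url, "\n".join(text).strip()


def write_records(dump_path, out_dir):
    """Write one text file per record; return the number of files written."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(dump_path, encoding="utf-8", errors="replace") as dump:
        for url, text in split_dump(dump):
            # Percent-encode the URL so it is a valid, reversible file name.
            name = quote(url, safe="")
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as out:
                out.write(text)
            count += 1
    return count
```

The resulting directory could then be passed to Mahout's `seqdirectory` job, and the URL recovered later from the file name with `urllib.parse.unquote`. Two known limits of this sketch: page text that itself contains a line starting with one of the markers would be split incorrectly, and very long URLs may exceed the file system's file-name length limit.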
