I'd like to use Nutch to gather data to process with Mahout. Nutch creates parsed text for the pages it crawls, and it has several command-line tools (readseg, for instance) that turn that data into a text file. The tools I've found either create one big text file with markers separating the records, or let you get a single record out of that big text file. Mahout, on the other hand, expects a sequence file or a directory full of text files, and it includes at least one special-purpose reader for Wikipedia dump files.

Does anyone have a simple way to turn the Nutch data into sequence files? Ideally I'd like to preserve the URLs so I can use them with named vectors later in the pipeline. It seems like a simple tool to write, but maybe it already exists somewhere?
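
To make the question concrete, here is a rough sketch of the stopgap I have in mind: split the big readseg dump into one text file per page, and keep a separate index that maps each file name back to its URL. It assumes the dump marks records with lines of the form "Recno:: N", "URL:: ..." and "ParseText::", and that only the parse text section was dumped; I have not checked this against every Nutch version. The names split_dump, write_docs, doc-NNNNNN.txt and the index file are made up for illustration.

```python
import os
import re

# Assumed record marker in the readseg dump, e.g. "Recno:: 0".
RECORD_START = re.compile(r"^Recno:: \d+\s*$")


def split_dump(lines):
    """Yield (url, text) pairs from an iterable of dump lines.

    Assumes each record looks like:
        Recno:: 0
        URL:: http://example.com/

        ParseText::
        ...page text...
    """
    url, text, in_text = None, [], False
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line):
            # A new record begins: emit the previous one, if any.
            if url is not None:
                yield url, "\n".join(text).strip()
            url, text, in_text = None, [], False
        elif in_text:
            # Everything after "ParseText::" belongs to the page text.
            text.append(line)
        elif line.startswith("URL:: "):
            url = line[len("URL:: "):].strip()
        elif line.startswith("ParseText::"):
            in_text = True
    if url is not None:
        yield url, "\n".join(text).strip()


def write_docs(records, out_dir, index_path):
    """Write one text file per record and a tab-separated name-to-URL index.

    The index is kept outside out_dir so it is not mistaken for a document.
    Returns the number of documents written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with open(index_path, "w", encoding="utf-8") as index:
        for url, text in records:
            name = "doc-%06d.txt" % count
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
                f.write(text)
            index.write("%s\t%s\n" % (name, url))
            count += 1
    return count
```

Calling write_docs(split_dump(open("dump")), "docs", "index.tsv") would give a directory that Mahout's seqdirectory job should be able to read. The catch is that the sequence file keys would then come from the file names rather than the URLs, so the index would have to be joined back in later, which is why a direct Nutch-to-sequence-file tool would be nicer.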
