Does anyone have a simple way to turn the Nutch data into sequence files? Ideally I'd like to preserve the URLs for use with named vectors later in the pipeline. It seems like a simple tool to write, but maybe it already exists somewhere?
I'd like to use Nutch to gather data to process with Mahout. Nutch
creates parsed text for the pages it crawls. Nutch also has several
command-line tools to turn the data into a text file (readseg, for
instance). The tools I've found either create one big text file with
markers in it for records, or allow you to get one record from the big
text file. Mahout expects a sequence file or a directory full of text
files, and includes at least one special-purpose reader for Wikipedia
dump files.
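One possible stopgap, sketched below, is to split the single readseg text dump into a directory of one text file per page, which is one of the two inputs Mahout is described as accepting above. This is only a sketch under assumptions: it assumes each record in the dump starts with a `Recno:: N` line, is followed by a `URL:: ...` line, and carries its text after a `ParseText::` marker, and that the dump holds only the parsed-text section. The function names `split_dump` and `write_docs`, the SHA-1 file naming, and the separate tab-separated URL index are all hypothetical choices, not part of Nutch or Mahout.

```python
import hashlib
import os
import re

# Assumed record markers in a readseg-style text dump (not verified
# against every Nutch version).
RECORD_START = re.compile(r"^Recno::\s*\d+\s*$")
URL_LINE = re.compile(r"^URL::\s*(\S+)\s*$")


def split_dump(dump_text):
    """Yield (url, text) pairs from a readseg-style dump string."""
    url, lines, in_text = None, [], False
    for line in dump_text.splitlines():
        if RECORD_START.match(line):
            # A new record begins: emit the previous one, if any.
            if url is not None:
                yield url, "\n".join(lines).strip()
            url, lines, in_text = None, [], False
            continue
        match = URL_LINE.match(line)
        if match and not in_text:
            url = match.group(1)
        elif line.startswith("ParseText::"):
            in_text = True
        elif in_text:
            lines.append(line)
    # Emit the final record.
    if url is not None:
        yield url, "\n".join(lines).strip()


def write_docs(dump_text, out_dir, index_path):
    """Write one text file per record and a file-name-to-URL index.

    The index is written outside out_dir so that a tool reading every
    file in out_dir does not treat it as a document.
    """
    os.makedirs(out_dir, exist_ok=True)
    index = {}
    for url, text in split_dump(dump_text):
        # Hash the URL to get a file name that is safe on any filesystem.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
        index[name] = url
    with open(index_path, "w", encoding="utf-8") as f:
        for name, url in sorted(index.items()):
            f.write(f"{name}\t{url}\n")
    return index
```

The URL index is kept so that the file names can be mapped back to URLs later, for example when naming vectors. Writing the sequence file directly with the URL as the key would avoid that extra lookup, but needs the Hadoop libraries and is not shown here.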
- Recommended way to consume Nutch data in Mahout Pat Ferrel
