Nutch has scalability limits that other crawlers are able to avoid, so it isn't quite as fashionable lately.
Ken Krugler's work with Common Crawl and Bixo is a bit more current.

On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel <[email protected]> wrote:
> Thanks, I'll check that out.
>
> Actually it was pretty easy to write a custom
> SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
> using crawled data from Nutch already.
>
> On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
>> One solution is to use Solr, which integrates nicely with Nutch. Read data
>> off Solr using the SolrReader API.
>>
>> On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> I'd like to use Nutch to gather data to process with Mahout. Nutch
>>> creates parsed text for the pages it crawls. Nutch also has several
>>> command-line tools to turn the data into a text file (readseg, for
>>> instance). The tools I've found either create one big text file with
>>> markers in it for records, or allow you to get one record from the big
>>> text file. Mahout expects a sequence file or a directory full of text
>>> files, and includes at least one special-purpose reader for Wikipedia
>>> dump files.
>>>
>>> Does anyone have a simple way to turn the Nutch data into sequence
>>> files? I'd ideally like to preserve the URLs for use with named vectors
>>> later in the pipeline. It seems a simple tool to write, but maybe it's
>>> already there somewhere?
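One way to bridge the gap the thread describes is to split the single marker-delimited dump from `readseg -dump` into a directory of per-URL text files, which Mahout's `seqdirectory` tool can then turn into a SequenceFile whose keys (the file names) still identify the pages. The sketch below is a hypothetical helper, not part of Nutch or Mahout; the record markers (`Recno::`, `URL::`, `ParseText::`) are assumptions about the dump layout, which varies by Nutch version, so adjust them to match your actual output.

```python
import os
import re


def split_readseg_dump(dump_path, out_dir):
    """Split a Nutch readseg-style text dump into one file per record.

    Assumed layout (check against your Nutch version): each record starts
    with a `Recno::` line, followed by a `URL::` line and section headers
    such as `ParseText::`. Only the parsed text is kept; the URL is encoded
    into the output file name so it survives into Mahout's named vectors.
    """
    os.makedirs(out_dir, exist_ok=True)
    url, lines, in_text = None, [], False

    def flush():
        if url and lines:
            # Sanitize the URL into a safe file name; seqdirectory will use
            # the file name as the sequence-file key.
            name = re.sub(r'[^A-Za-z0-9._-]', '_', url)
            with open(os.path.join(out_dir, name + '.txt'), 'w') as f:
                f.write('\n'.join(lines).strip() + '\n')

    with open(dump_path) as f:
        for raw in f:
            line = raw.rstrip('\n')
            if line.startswith('Recno::'):
                flush()                      # finish the previous record
                url, lines, in_text = None, [], False
            elif line.startswith('URL::'):
                url = line.split('URL::', 1)[1].strip()
            elif line.startswith('ParseText::'):
                in_text = True               # start collecting parsed text
            elif line.endswith('::'):
                in_text = False              # some other section header
            elif in_text:
                lines.append(line)
    flush()                                  # don't drop the last record
```

After splitting, something like `mahout seqdirectory -i out_dir -o seq_out` should produce the sequence files, with each key carrying the (sanitized) URL for use with named vectors later in the pipeline.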
