Nutch has scalability limits that other crawlers avoid, so it isn't quite as
fashionable lately.

Ken Krugler's work with Common Crawl and Bixo is a bit more current.

On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel <[email protected]> wrote:

> Thanks I'll check that out.
>
> Actually it was pretty easy to write a custom
> SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
> using crawled data from Nutch already.
>
> On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
>
>> One solution is to use Solr, which integrates nicely with Nutch. Read data
>> off Solr using SolrReader API.
>>
>> On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel<[email protected]>
>>  wrote:
>>
>>> I'd like to use Nutch to gather data to process with Mahout. Nutch
>>> creates parsed text for the pages it crawls. Nutch also has several
>>> command-line tools to turn the data into a text file (readseg, for
>>> instance). The tools I've found either create one big text file with
>>> markers in it for records or let you pull one record out of that big
>>> text file. Mahout expects a sequence file or a directory full of text
>>> files, and includes at least one special-purpose reader for Wikipedia
>>> dump files.
>>>
>>> Does anyone have a simple way to turn the Nutch data into sequence
>>> files? I'd ideally like to preserve the URLs for use with named
>>> vectors later in the pipeline. It seems a simple tool to write, but
>>> maybe it's already there somewhere?
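Absent a ready-made tool, one stopgap for the conversion Pat describes is to split the output of Nutch's `readseg -dump` into one text file per record, named after the record's URL, and then point Mahout's `seqdirectory` tool at that directory. The sketch below is hypothetical, not anything shipped with Nutch or Mahout; it assumes the Nutch 1.x dump layout, where each record starts with a `Recno::` line followed by a `URL::` line. Adjust the markers if your dump differs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical splitter: turns one big readseg -dump text file into a
// directory of per-record text files that Mahout's seqdirectory tool can
// convert to a SequenceFile. Assumes "Recno::" marks a new record and
// "URL::" carries the page URL (Nutch 1.x dump format) -- a sketch only.
public class SplitNutchDump {

    // Derive a filesystem-safe file name from the URL. Keeping the URL in
    // the file name preserves it in the keys seqdirectory produces, which
    // helps with named vectors later in the pipeline.
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9._-]", "_") + ".txt";
    }

    // Write the accumulated record body (if any) and reset the buffer.
    static void flush(Path outDir, String url, StringBuilder body) throws IOException {
        if (url != null && body.length() > 0) {
            Files.write(outDir.resolve(fileNameFor(url)),
                        body.toString().getBytes(StandardCharsets.UTF_8));
        }
        body.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        Path dump = Paths.get(args[0]);    // the readseg -dump output file
        Path outDir = Paths.get(args[1]);  // directory of per-URL text files
        Files.createDirectories(outDir);

        String url = null;
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(dump, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("Recno::")) {        // a new record begins
                    flush(outDir, url, body);
                    url = null;
                } else if (line.startsWith("URL::")) {
                    url = line.substring("URL::".length()).trim();
                } else {
                    body.append(line).append('\n');
                }
            }
            flush(outDir, url, body);                    // write the last record
        }
    }
}
```

After splitting, `mahout seqdirectory -i <outDir> -o <seqDir>` would produce sequence files whose keys are the URL-derived file names. Writing a Hadoop `SequenceFile.Writer` directly against the segment's `parse_text` data would avoid the intermediate files, but needs the Hadoop and Nutch jars on the classpath.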
