I've heard about Bixo but didn't realize it had gotten such good
traction. I will definitely grab it and take a deeper look. Thanks
On 4/13/12 7:49 PM, Ted Dunning wrote:
Nutch has scalability limits that other crawlers avoid, so it
isn't quite as fashionable lately.
Ken Krugler's work with Common Crawl and Bixo is a bit more current.
On Fri, Apr 13, 2012 at 6:36 PM, Pat Ferrel<[email protected]> wrote:
Thanks, I'll check that out.
Actually it was pretty easy to write a custom
SequenceFilesFromDirectoryFilter. I'm just a little surprised no one is
using crawled data from Nutch already.
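For anyone following along, the core of such a filter is a single accept test. This is a sketch only: in the real class the test sits inside Hadoop's PathFilter.accept(Path), and the suffix rule here is an assumption about the dump layout, not the filter actually written above.

```java
// Sketch of the filename test a custom SequenceFilesFromDirectoryFilter
// might apply. The ".txt" suffix and leading-character rules are
// illustrative assumptions, not the actual filter from this thread.
public class ParsedTextFilterSketch {

  // Skip Hadoop bookkeeping files (e.g. "_SUCCESS", ".part-00000.crc")
  // and keep only plain-text dumps.
  public static boolean accept(String fileName) {
    return !fileName.startsWith("_")
        && !fileName.startsWith(".")
        && fileName.endsWith(".txt");
  }
}
```

In the real filter, Path.getName() would be passed to this check for each file under the input directory.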
On 4/13/12 4:22 PM, Peyman Mohajerian wrote:
One solution is to use Solr, which integrates nicely with Nutch, and then
read the data back off Solr using the SolrReader API.
On Fri, Apr 13, 2012 at 7:03 AM, Pat Ferrel<[email protected]>
wrote:
I'd like to use Nutch to gather data to process with Mahout. Nutch
creates parsed text for the pages it crawls, and it has several
command-line tools to turn that data into a text file (readseg, for
instance). The tools I've found either create one big text file with
record markers in it, or let you pull one record out of that big file.
Mahout expects a sequence file or a directory full of text files, and
includes at least one special-purpose reader for Wikipedia dump files.
Does anyone have a simple way to turn the Nutch data into sequence files?
I'd ideally like to preserve the URLs for use with named vectors later in
the pipeline. It seems a simple tool to write, but maybe it's already
there somewhere?
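For the record, the writing side of such a tool is small. Below is a minimal sketch, assuming you already have the URL and parsed text for each page in hand (for example, from parsing `readseg -dump` output; how you get those strings is left out). It uses Hadoop's SequenceFile.Writer with Text keys and values, keeping the URL as the key so it survives into named vectors; the trailing-slash normalization in docKey is an illustrative choice, not a Mahout requirement.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: write (url, parsedText) pairs into a Hadoop SequenceFile that
// Mahout's seq2sparse step can consume. Reading the Nutch segment data
// is not shown; assume one URL + parsed-text string per page.
public class NutchToSequenceFile {

  // Key convention (assumption): keep the raw URL, minus any trailing
  // slash, so it can serve as a named-vector name later. Pure helper so
  // the convention is easy to test.
  static String docKey(String url) {
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
  }

  public static void append(SequenceFile.Writer writer,
                            String url, String parsedText) throws IOException {
    writer.append(new Text(docKey(url)), new Text(parsedText));
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // One append per crawled page; values here are placeholders.
      append(writer, "http://example.com/page", "parsed text of the page ...");
    } finally {
      writer.close();
    }
  }
}
```

The resulting file can then go straight into seq2sparse, and the URL keys carry through to the vector names.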