Hi Max (& Ted),

On Nov 6, 2009, at 11:57am, Ted Dunning wrote:

The question that I don't see addressed is whether you choose to use a fully streaming approach as is done in Bixo, or whether you will use a document repository approach, as is more common in most search engines.

I think the issue here isn't streaming vs. document repository - all systems have elements of both. It's just that...

a. Bixo exposes this more explicitly, by focusing on the workflow aspects of web mining.

But Nutch also has sequences of map-reduce tasks that are run during a crawl (e.g. filter URLs, group them, then fetch & parse).

b. Bixo doesn't have a baked in URL database, or file-system scheme for saving content.

If you look at the SimpleCrawlTool example class in Bixo, you'll see that it (similar to Nutch) uses a SequenceFile to store the URL state, and sequence files in sub-directories for fetched content & parse results.

But Bixo just does the simple thing of propagating the URL state forward into successive crawl directories, versus updating a single URL database. Having a URL DB is what you'd want for large-scale web crawling.
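To make the "propagate forward" idea concrete, here's a minimal sketch in Python (not the Bixo API - the function and status names are illustrative). Each crawl generation reads the previous generation's URL state, fetches the unfetched URLs, and writes out a new merged state, instead of updating one mutable URL DB in place:

```python
# Conceptual sketch (not the Bixo/Cascading API): each crawl loop carries
# the prior generation's URL state forward into a new state, marking
# fetched URLs and adding newly discovered outlinks as unfetched.

def crawl_generation(prev_state, fetch):
    """prev_state maps url -> status ('fetched' or 'unfetched');
    fetch(url) fetches + parses a page and returns its outlinks."""
    new_state = dict(prev_state)  # carry all known URLs forward
    for url, status in prev_state.items():
        if status == "unfetched":
            outlinks = fetch(url)
            new_state[url] = "fetched"
            for link in outlinks:
                # only add URLs we haven't seen in any earlier generation
                new_state.setdefault(link, "unfetched")
    return new_state
```

In Bixo each generation's state lands in its own crawl directory, so a run is just a loop that feeds one generation's output directory in as the next generation's input.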

If you wanted to configure Bixo to use HBase to store the URL state and fetched/parsed content, you'd use an HBase tap (in Cascading-speak) versus the Hfs tap.

HBase is reputedly ready enough to serve as a document repository. Using such an approach would be very helpful for the incremental nature of web crawls.

I'd gotten the same input from Andrew Purtell, who's been able to stream lots of crawl data into HBase, after a bit of fiddling with configuration settings and also some patching on the writer side of things.

As far as pre-processing and feature extraction, both could be implemented as Cascading operations (that wind up mapping to Hadoop tasks).
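Here's a small Python sketch of what that pipeline shape looks like - chaining per-record operations, which is the role Cascading Functions play before being compiled down to Hadoop map tasks. The operation and field names are illustrative, not from Bixo or Cascading:

```python
# Conceptual sketch of chaining per-record operations (the role a
# Cascading Each/Function pipe plays); names here are illustrative.
import re

def tokenize(record):
    # pre-processing step: lowercase and split the raw text into tokens
    record["tokens"] = re.findall(r"[a-z0-9]+", record["text"].lower())
    return record

def term_counts(record):
    # feature-extraction step: turn tokens into a term-frequency map
    counts = {}
    for t in record["tokens"]:
        counts[t] = counts.get(t, 0) + 1
    record["features"] = counts
    return record

def run_pipe(records, *operations):
    # apply each operation to every record, like successive pipe stages
    for op in operations:
        records = [op(r) for r in records]
    return records

docs = run_pipe([{"text": "Web mining with web crawls"}],
                tokenize, term_counts)
```

In the real thing each stage would be a Cascading operation over tuples, and the planner maps the chain onto Hadoop jobs for you.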

As Ted noted, actually doing the named entity extraction and feature extraction will be the real challenge.

See this talk for an example of doing web mining using Bixo - 
http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit

-- Ken


On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <[email protected]>wrote:


This is obviously only a first draft of what we think would be a suitable overall architecture




--
Ted Dunning, CTO
DeepDyve

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



