Boilerpipe is exactly what I needed!
One approach I've used in the past is to calculate IDF per domain. The
purpose of the IDF/domain is to turn boilerplate into low-weight terms.
This works fairly well but has some drawbacks: some boilerplate, when it
appears in a document body, is actually quite meaningful. The benefit is
that it is really easy to do (except in Mahout?) and works with almost
any web domain regardless of the HTML structure.
Having run a couple of experiments with boilerpipe, it looks impressive.
I can see that I'll want to use it for a while without IDF/domain to see
how well it does. IDF/domain in Mahout seems like it might be a bit of
work.
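For reference, the IDF/domain idea can be sketched in a few lines: compute document frequency per term over the pages of a single domain, so terms that appear on nearly every page of that domain (the boilerplate) come out with an IDF near zero. This is a minimal stdlib-only sketch — the class and method names are mine, not anything from Mahout:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DomainIdf {

    /**
     * Compute IDF for each term over the documents of one domain:
     * idf(t) = log(N / df(t)). Terms present in every document of the
     * domain (typical boilerplate) get an IDF of exactly 0, so they
     * carry no weight downstream.
     */
    public static Map<String, Double> idf(List<Set<String>> domainDocs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : domainDocs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = domainDocs.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }
}
```

The same counts would have to be kept per domain (e.g. keyed by hostname) rather than over the whole corpus, which is where the extra work in a Mahout pipeline comes in.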
On 4/16/12 8:49 AM, Ken Krugler wrote:
On Apr 16, 2012, at 8:04am, Suneel Marthi wrote:
You may want to look at Tika's HtmlParser to strip out all the HTML tags and
return only the raw text content from the crawled pages. This could then be
written out to sequence files with the URL as the key and the raw text as the value.
If you use Tika, you might also want to use the BoilerpipeContentHandler to filter out
"chrome" that can generate noise during text analytics (headers, menus,
footers, etc.).
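A minimal sketch of that combination — Tika's HtmlParser with the boilerpipe wrapper (in Tika this is the BoilerpipeContentHandler class; tika-core and tika-parsers need to be on the classpath, so treat this as an illustration rather than tested code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ExtractMainText {

    public static String extract(InputStream html) throws Exception {
        // BodyContentHandler collects plain text; wrapping it in
        // BoilerpipeContentHandler drops headers, menus, footers and
        // other "chrome" before the text reaches it.
        BodyContentHandler body = new BodyContentHandler();
        ContentHandler handler = new BoilerpipeContentHandler(body);
        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

The extracted string is what would become the value in the sequence file, with the page URL as the key.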
-- Ken
From: Pat Ferrel<[email protected]>
To: "[email protected]"<[email protected]>
Sent: Thursday, April 12, 2012 7:06 PM
Subject: Recommended way to consume Nutch data in Mahout
I'd like to use Nutch to gather data to process with Mahout. Nutch creates
parsed text for the pages it crawls, and it also has several command-line tools
to turn the data into a text file (readseg, for instance). The tools I've found
either create one big text file with markers delimiting the records or let you
pull one record out of that big text file. Mahout expects a sequence file or a
directory full of text files, and includes at least one special-purpose reader,
for Wikipedia dump files.
Does anyone have a simple way to turn the Nutch data into sequence files? I'd
ideally like to preserve the URLs for use with named vectors later in the
pipeline. It seems like a simple tool to write, but maybe it's already there
somewhere?
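The output side of the tool being asked for is small. A sketch using Hadoop's SequenceFile.Writer with the URL as a Text key and the parsed text as a Text value — the layout Mahout's seq2sparse consumes; this assumes the parsed text is already in hand and hadoop-core is on the classpath, so it's an illustration rather than a finished Nutch segment reader:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class UrlTextSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);

        // URL as the key, parsed page text as the value. Keeping the
        // URL as the key is what lets it survive into named vectors
        // later in the pipeline.
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            // In a real tool this loop would iterate over the records
            // read out of the Nutch segment's parse_text data.
            writer.append(new Text("http://example.com/page"),
                          new Text("parsed text for the page goes here"));
        } finally {
            writer.close();
        }
    }
}
```

The missing piece is the reader for Nutch's segment data on the input side, which is what the question above is really asking for.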
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr