Boilerpipe is exactly what I needed!
One approach I've used in the past is to calculate IDF per domain. The
purpose of the IDF/domain is to turn boilerplate into low-weight terms.
This works fairly well but has some drawbacks: some boilerplate, when it
appears in a document body, is actually quite meaningful. The benefit is
that it is really easy to do (except in Mahout?) and works with almost
any web domain regardless of the HTML structure.
Having run a couple of experiments with boilerpipe, it looks impressive.
I can see that I'll want to use it for a while without IDF/domain to see
how well it does. IDF/domain in Mahout seems like it might be a bit of
work.
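For reference, the IDF/domain idea can be sketched in a few lines: compute document frequency per term over the pages of a single domain, so terms that appear on nearly every page of that domain (the boilerplate) come out with an IDF near zero. This is a minimal stdlib-only sketch — the class and method names are mine, not anything from Mahout:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DomainIdf {

    /**
     * Compute IDF for each term over the documents of one domain:
     * idf(t) = log(N / df(t)). Terms present in every document of the
     * domain (typical boilerplate) get an IDF of exactly 0, so they
     * carry no weight downstream.
     */
    public static Map<String, Double> idf(List<Set<String>> domainDocs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : domainDocs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = domainDocs.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) n / e.getValue()));
        }
        return idf;
    }
}
```

The same counts would have to be kept per domain (e.g. keyed by hostname) rather than over the whole corpus, which is where the extra work in a Mahout pipeline comes in.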
On 4/16/12 8:49 AM, Ken Krugler wrote:
On Apr 16, 2012, at 8:04am, Suneel Marthi wrote:
You may want to look at Tika's HtmlParser to strip out all the HTML tags and
return only the raw text content from the crawled pages. This could then be
written out to sequence files with the URL as the key and the raw text as the value.
If you use Tika, you might also want to use the BoilerpipeContentHandler to filter out
"chrome" that can generate noise during text analytics (headers, menus,
footers, etc.).
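A minimal sketch of that combination — Tika's HtmlParser with the boilerpipe wrapper (in Tika this is the BoilerpipeContentHandler class; tika-core and tika-parsers need to be on the classpath, so treat this as an illustration rather than tested code):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ExtractMainText {

    public static String extract(InputStream html) throws Exception {
        // BodyContentHandler collects plain text; wrapping it in
        // BoilerpipeContentHandler drops headers, menus, footers and
        // other "chrome" before the text reaches it.
        BodyContentHandler body = new BodyContentHandler();
        ContentHandler handler = new BoilerpipeContentHandler(body);
        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

The extracted string is what would become the value in the sequence file, with the page URL as the key.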
-- Ken
From: Pat Ferrel<[email protected]>
To: "[email protected]"<[email protected]>
Sent: Thursday, April 12, 2012 7:06 PM
Subject: Recommended way to consume Nutch data in Mahout
I'd like to use Nutch to gather data to process with Mahout. Nutch creates
parsed text for the pages it crawls, and it also has several command-line tools
to turn the data into a text file (readseg, for instance). The tools I've found
either create one big text file with markers delimiting the records or let you
pull one record out of that big text file. Mahout expects a sequence file or a
directory full of text files, and includes at least one special-purpose reader,
for Wikipedia dump files.
Does anyone have a simple way to turn the Nutch data into sequence files? I'd
ideally like to preserve the URLs for use with named vectors later in the
pipeline. It seems like a simple tool to write, but maybe it's already there
somewhere?
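The output side of the tool being asked for is small. A sketch using Hadoop's SequenceFile.Writer with the URL as a Text key and the parsed text as a Text value — the layout Mahout's seq2sparse consumes; this assumes the parsed text is already in hand and hadoop-core is on the classpath, so it's an illustration rather than a finished Nutch segment reader:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class UrlTextSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);

        // URL as the key, parsed page text as the value. Keeping the
        // URL as the key is what lets it survive into named vectors
        // later in the pipeline.
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            // In a real tool this loop would iterate over the records
            // read out of the Nutch segment's parse_text data.
            writer.append(new Text("http://example.com/page"),
                          new Text("parsed text for the page goes here"));
        } finally {
            writer.close();
        }
    }
}
```

The missing piece is the reader for Nutch's segment data on the input side, which is what the question above is really asking for.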
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr