Announcement: Boilerplate removal library

Christian Kohlschütter Thu, 03 Dec 2009 15:43:43 -0800

Dear all,

I think the following announcement is of interest to the Lucene community.


Today I have released Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
from HTML pages.
It is based upon my paper "Boilerplate Detection using Shallow Text Features"  
to be presented at WSDM 2010 -- The Third ACM International Conference on Web 
Search and Data Mining, 3-6 February 2010, New York City, NY USA.

The boilerpipe library provides algorithms to detect and remove the surplus 
"clutter" (boilerplate, templates) around the main textual content of a web 
page. It already provides specific strategies for common tasks (for example, 
news article extraction) and can easily be extended for individual problem 
settings. Extraction is very fast (milliseconds), needs only the input 
document (no global or site-level information is required), and is usually 
quite accurate.
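
To give a rough idea of what "shallow text features" can look like, here is a 
toy Java sketch. It is not Boilerpipe's API or its actual classifier: the class 
ShallowTextDemo, the Block type, and both thresholds are made up for 
illustration. It also assumes the page has already been split into text blocks 
and that the words inside links have been counted for each block.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowTextDemo {

    /** Hypothetical text block: its text plus the number of words inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated tokens in the block. */
        int wordCount() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }

        /** Fraction of the block's words that sit inside links. */
        double linkDensity() {
            int n = wordCount();
            return n == 0 ? 0.0 : (double) linkedWords / n;
        }
    }

    // Illustrative thresholds only; they are assumptions, not tuned values.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Treat a block as content if it is long enough and not mostly links. */
    static boolean isContent(Block b) {
        return b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Join the text of all blocks classified as content, one per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sports Contact", 4)); // navigation bar
        page.add(new Block("The city council voted on Tuesday to extend the "
                + "tram line to the northern suburbs by 2012.", 0)); // article text
        page.add(new Block("Copyright 2009 Example Media", 0)); // footer
        System.out.println(extract(page));
    }
}
```

Each decision in this sketch uses only the features of a single block, which 
is in line with the point above that no global or site-level information is 
needed; a real classifier would be considerably more careful.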

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license, and you are very welcome to 
use Boilerpipe for whatever you like. Please let me know if it helps you, or if 
you have questions about it, difficulties with it, or ideas for how to improve it.

Cheers,
Christian
-- 
Christian Kohlschütter
kohlschuet...@l3s.de

Forschungszentrum L3S
Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter/


