Dear all, I think the following announcement is of interest to the Lucene community.
Today I released Boilerpipe 1.0. Boilerpipe is a Java library for boilerplate removal and full-text extraction from HTML pages. It is based on my paper "Boilerplate Detection using Shallow Text Features", to be presented at WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining, 3-6 February 2010, New York City, NY, USA.

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a website. It already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.

Extracting content is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license, and you are very welcome to use Boilerpipe for whatever you like. Please let me know if it helps you, or if you have questions about it, difficulties with it, or ideas for how to improve it.

Cheers,
Christian

--
Christian Kohlschütter
kohlschuet...@l3s.de
Forschungszentrum L3S
Leibniz Universität Hannover
http://www.L3S.de/~kohlschuetter/
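
For readers wondering what "shallow text features" might look like in practice, here is a minimal, self-contained Java sketch. It is not Boilerpipe's API and not the algorithm from the paper. It only illustrates the general idea of classifying a text block from cheap per-block features, here word count and link density, without any site-level information. The class name ShallowFeatureSketch, the Block type, and both thresholds are hypothetical and chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration only: NOT Boilerpipe's API or its actual algorithm. */
public class ShallowFeatureSketch {

    /** A text block plus the number of its words that sit inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated tokens in the block. */
        int wordCount() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }

        /** Fraction of the block's words that are link text (0.0 for empty blocks). */
        double linkDensity() {
            int n = wordCount();
            return n == 0 ? 0.0 : (double) linkedWords / n;
        }
    }

    // Hypothetical thresholds, picked for the example rather than taken from the paper.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not dominated by links. */
    static boolean isContent(Block b) {
        return b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the text of all blocks classified as content, one block per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> page = Arrays.asList(
            new Block("Home News Sports Contact", 4),
            new Block("Boilerplate removal keeps the main article text and drops "
                    + "navigation menus and footers", 0),
            new Block("Read more about sports politics business culture travel "
                    + "weather science technology opinion", 12));
        System.out.println(extract(page));
    }
}
```

Each block is judged in isolation, which matches the point above that only the input document is needed. A real extractor would first have to segment the HTML into blocks and would typically use more features than these two.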