[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877047#action_12877047 ]
Christian Kohlschütter commented on TIKA-420:
---------------------------------------------

Lately I have been busy with other things, unfortunately. Here is a short update, given the recent presentation of Safari Reader (based upon arc90's Readability bookmarklet), which provides functionality similar to boilerpipe.

Using the L3S-GN1 test document collection (622 news articles, crawled via Google News; http://www.l3s.de/~kohlschuetter/boilerplate/ ), I found that Safari Reader in many cases essentially fails to produce any content (there is no "Reader" button available for 238 of the 622 pages), yielding a very low average F1 score and a significantly lower median score than boilerpipe's DefaultExtractor or ArticleExtractor.

I am currently reviewing the results from arc90's Readability code to see whether there are any fundamental differences between Apple's implementation and theirs.

To summarize, I think boilerpipe is a very efficient, effective and especially stable tool (read: consistent over a broad variety of sources) for removing boilerplate/clutter, and ahead of the competition :)

> [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext
> Extraction from HTML pages
> ----------------------------------------------------------------------------------------------
>
>          Key: TIKA-420
>          URL: https://issues.apache.org/jira/browse/TIKA-420
>      Project: Tika
>   Issue Type: New Feature
>   Components: parser
>     Reporter: Christian Kohlschütter
>     Assignee: Ken Krugler
>  Attachments: tika-app.patch, tika-parsers.patch
>
>
> Hi all,
>
> while Tika already provides a parser for HTML that extracts the plain text
> from it, the output generally contains all text portions, including the
> surplus "clutter" such as navigation menus, links to related pages etc.
> around the actual main content. This "boilerplate text" typically is not
> related to the main content and may deteriorate search precision.
> I think Tika should be able to automatically detect and remove the
> boilerplate text. I propose to use "boilerpipe" for this purpose, an Apache
> 2.0 licensed Java library written by me. Boilerpipe provides both generic and
> specific strategies for common tasks (for example: news article extraction)
> and may also be easily extended for individual problem settings.
>
> Extracting content is very fast (milliseconds), needs only the input document
> (no global or site-level information required) and is usually quite accurate.
> In fact, it outperformed the state-of-the-art approaches on several test
> collections.
>
> The algorithms used by the library are based on (and extend) some concepts
> of my paper "Boilerplate Detection using Shallow Text Features", presented at
> WSDM 2010 -- The Third ACM International Conference on Web Search and Data
> Mining, New York City, NY, USA (online at
> http://www.l3s.de/~kohlschuetter/boilerplate/ ).
>
> To use boilerpipe with Tika, I have developed a custom ContentHandler
> (BoilerpipeContentHandler; provided as a patch to tika-parsers) that can
> simply be passed to HtmlParser#parse. The BoilerpipeContentHandler can be
> configured in several ways, particularly which extraction strategy should be
> used and where the extracted content should go (into Metadata or to a
> Writer).
>
> I also provide a patch to TikaCLI, so that you can use boilerpipe via Tika
> from the command line (use a capital "-T" flag instead of "-t" to extract the
> main content only).
>
> I must note that boilerplate removal is considered a research problem:
> while one can always find clever rules to extract the main content from
> particular web pages with 100% accuracy, applying them to random, previously
> unseen pages on the web is non-trivial.
>
> In my paper, I have shown that on the Web (i.e., independent of a particular
> site owner, page layout etc.), textual content can apparently be grouped into
> two classes: long text (i.e., many consecutive words without markup -- most
> likely the actual content) and short text (i.e., a few words between two
> HTML tags -- most likely navigational boilerplate). Removing the words in the
> short-text class alone is already a good strategy for cleaning boilerplate,
> and using a combination of multiple shallow text features achieves almost
> perfect accuracy. To a large extent, the detection of boilerplate text
> requires neither inter-document knowledge (frequency of text blocks, common
> page layout etc.) nor any training at token level. The cost of detecting
> boilerplate is negligible, as it comes down simply to counting words.
>
> The algorithms described in my paper seem to work well in general, and
> especially for news article-like pages (for a Zipf-representative collection
> of English news pages crawled via Google News: 90-95% average F1, 95-98%
> median F1), well ahead of the competition (78-89% average F1, 87-95%
> median F1).
>
> Patches are attached, questions welcome.
>
> Best,
> Christian

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
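[Editorial sketch] The long-text/short-text idea quoted above -- classify each text block by word count alone and drop the short ones -- can be illustrated with a few lines of plain Java. This is a simplified demonstration of the concept, not boilerpipe's actual classifier (which combines several shallow features, such as link density, per the paper); the `TextBlock` type and the 10-word threshold are arbitrary choices for the demo.

```java
import java.util.List;

public class ShallowTextDemo {

    // A run of text found between two HTML tags (hypothetical demo type,
    // not a boilerpipe class).
    record TextBlock(String text) {
        int wordCount() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }
        // "Long text" class: many consecutive words without markup.
        // The threshold 10 is an arbitrary demo value.
        boolean isLikelyContent() {
            return wordCount() >= 10;
        }
    }

    // Keep only the blocks classified as main content.
    static String extract(List<TextBlock> blocks) {
        StringBuilder sb = new StringBuilder();
        for (TextBlock b : blocks) {
            if (b.isLikelyContent()) {
                sb.append(b.text().trim()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextBlock> blocks = List.of(
            new TextBlock("Home"),
            new TextBlock("News | Sports | Contact us"),
            new TextBlock("The actual article text usually consists of many "
                + "consecutive words without intervening markup, so a plain "
                + "word count already separates it from navigation links."));
        // Only the third block survives the word-count filter.
        System.out.print(extract(blocks));
    }
}
```

Even this crude filter shows why the cost is "negligible": the only per-block work is splitting on whitespace and counting tokens, with no site-level statistics or trained token model involved.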