Detection problem with RFC822 file with HTML content

2015-11-12 Thread Vjeran Marcinko
Hello, I saved 2 .eml files from Thunderbird; one of them contained plain text content, whereas the other contained rich HTML content. The plain text one was recognized by Tika as a "message/rfc822" file, but the other one was incorrectly recognized as "text/html" (and textual content being incorrectly extr

Re: Extraction table from HTML document in Tika

2015-11-12 Thread Chris Mattmann
Also take a look at Scrapy and the work that Hyperion Grey is doing with Splash and Avatar/HH. Cheers, Chris — Chris Mattmann chris.mattm...@gmail.com -Original Message- From: Ken Krugler Reply-To: Date: Thursday, November 12, 2015 at 10:58 AM To: Subject: RE: Extraction table fr

RE: Extraction table from HTML document in Tika

2015-11-12 Thread Ken Krugler
There's no (semi)automated method. For simple tables you could create a custom ContentHandler that triggers on the appropriate HTML tags. But a general-purpose extractor is a serious technical challenge. Companies like Factual have invested heavily in being able to find & extract this type of stru
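
As a rough illustration of the custom ContentHandler idea for simple tables, here is a minimal sketch. The class name `TableCollector` and the helper `parseXhtml` are hypothetical, not part of Tika. To stay self-contained, the sketch feeds the handler from the JDK's own SAX parser on well-formed XHTML; it assumes the real parser would deliver `tr`, `td` and `th` elements as SAX events in the same way, and it does not try to handle nested tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each <td>/<th> cell, grouped by <tr> row.
class TableCollector extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;   // non-null only while inside a <tr>
    private StringBuilder cell;        // non-null only while inside a <td>/<th>

    // Prefer the namespace-local name; fall back to the qualified name.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (n.equals("tr")) {
            currentRow = new ArrayList<>();
        } else if ((n.equals("td") || n.equals("th")) && currentRow != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (cell != null) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if ((n.equals("td") || n.equals("th")) && cell != null && currentRow != null) {
            currentRow.add(cell.toString().trim());
            cell = null;
        } else if (n.equals("tr") && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    List<List<String>> rows() {
        return rows;
    }

    // Demo driver (an assumption for this sketch): parses well-formed XHTML with the JDK SAX parser.
    static List<List<String>> parseXhtml(String xhtml) {
        try {
            TableCollector handler = new TableCollector();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.rows();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With Tika, a handler like this would presumably be passed as the ContentHandler argument of `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)` instead of going through `parseXhtml`; whether the table elements actually reach the handler depends on how Tika's HTML parser maps the input markup.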

Extraction table from HTML document in Tika

2015-11-12 Thread Sznajder ForMailingList
Hi, is there a way to extract tables from an HTML document using Tika? Thanks! Benjamin