Re: Extract HTML objects using TIKA

Ken Krugler Thu, 24 May 2018 13:09:51 -0700

Hi Jaya,

> On May 24, 2018, at 12:42 PM, Johnson, Jaya <[email protected]> wrote:
> 
>  
> I was wondering if it was possible to extract all tables from an HTML 
> document using TIKA. Is there anything out of the box, or would one have to 
> write something?


Tika will call the content handler you provide with the standard set of table 
elements. From DefaultHtmlMapper.java:

        ….
        put("TABLE", "table");
        put("THEAD", "thead");
        put("TBODY", "tbody");
        put("TR", "tr");
        put("TH", "th");
        put("TD", "td");
        ….
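
To make that concrete, here is a rough sketch of a handler that collects cell text per row and per table. The class name TableCollector and the parseXhtml helper are made up for illustration and are not part of Tika. It assumes the table elements arrive as lowercase SAX element names, as in the mapping above, and it does not try to handle nested tables. So that it runs without Tika on the classpath, the demo drives the handler with the JDK's own SAX parser on a well-formed snippet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of every th/td cell, grouped by row and table. */
public class TableCollector extends DefaultHandler {
    private final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table;   // table currently being read, or null
    private List<String> row;           // row currently being read, or null
    private StringBuilder cell;         // cell currently being read, or null

    public List<List<List<String>>> getTables() {
        return tables;
    }

    // Namespace-aware parsers fill in localName; others only supply qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (name(localName, qName)) {
            case "table": table = new ArrayList<>(); break;
            case "tr":    if (table != null) row = new ArrayList<>(); break;
            case "th":
            case "td":    if (row != null) cell = new StringBuilder(); break;
            default:      break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that appears inside a cell.
        if (cell != null) cell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (name(localName, qName)) {
            case "th":
            case "td":
                if (cell != null) { row.add(cell.toString().trim()); cell = null; }
                break;
            case "tr":
                if (row != null) { table.add(row); row = null; }
                break;
            case "table":
                if (table != null) { tables.add(table); table = null; }
                break;
            default:
                break;
        }
    }

    /** Demo driver only: feeds well-formed XHTML through the JDK SAX parser. */
    public static List<List<List<String>>> parseXhtml(String xhtml) throws Exception {
        TableCollector handler = new TableCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTables();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseXhtml(
                "<html><body><table><tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td>3</td></tr></table></body></html>"));
    }
}
```

With Tika you would skip parseXhtml and pass the same handler to the HTML parser instead, along the lines of new HtmlParser().parse(stream, handler, new Metadata(), new ParseContext()), then read the result from getTables().
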

But often when people ask about extracting tables, they’re actually interested 
in getting structured data (column names, data types, etc). And that’s 
something Tika doesn’t automagically do for you.

It would be interesting to create such a thing (similar to what we did for 
Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
