Hi Jaya,
> On May 24, 2018, at 12:42 PM, Johnson, Jaya <[email protected]> wrote:
>
>
> I was wondering if it was possible to extract all tables from an HTML
> document using TIKA is there anything out of the box or would one have to
> write something.
Tika will call the content handler you provide with the standard set of table
elements. From DefaultHtmlMapper.java:
….
put("TABLE", "table");
put("THEAD", "thead");
put("TBODY", "tbody");
put("TR", "tr");
put("TH", "th");
put("TD", "td");
….
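As a rough sketch of what such a content handler could look like (this is not code from Tika; TableCollector and getTables are made-up names), the class below uses only the JDK's SAX classes. It assumes the element names arrive in lower case, as in the mapping above, and that tables are not nested:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the cell text of every table it sees. */
class TableCollector extends DefaultHandler {
    // Layout: tables -> rows -> cell text.
    private final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table; // table currently open, or null
    private List<String> row;         // row currently open, or null
    private StringBuilder cell;       // cell currently open, or null

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        switch (name) {
            case "table":
                // Assumption: no nested tables; a nested one would replace the outer one.
                table = new ArrayList<>();
                break;
            case "tr":
                if (table != null) row = new ArrayList<>();
                break;
            case "th":
            case "td":
                if (row != null) cell = new StringBuilder();
                break;
            default:
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that appears inside an open cell.
        if (cell != null) cell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        switch (name) {
            case "th":
            case "td":
                if (cell != null && row != null) row.add(cell.toString().trim());
                cell = null;
                break;
            case "tr":
                if (row != null && table != null) table.add(row);
                row = null;
                cell = null;
                break;
            case "table":
                if (table != null) tables.add(table);
                table = null;
                break;
            default:
                break;
        }
    }

    /** Returns every completed table as a list of rows of trimmed cell text. */
    List<List<List<String>>> getTables() {
        return tables;
    }
}
```

With Tika on the classpath, you would pass an instance as the ContentHandler argument of Parser.parse(stream, handler, metadata, context), for example on an HtmlParser or AutoDetectParser, and then read getTables() once parsing finishes.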
But often when people ask about extracting tables, they’re actually interested
in getting structured data (column names, data types, etc). And that’s
something Tika doesn’t automagically do for you.
It would be interesting to create such a thing (similar to what we did for
Boilerpipe) for use with Tika. For example, see
https://github.com/seagatesoft/sde
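To make the gap concrete, here is a deliberately naive sketch of the kind of post-processing you would have to write yourself today. It is not part of Tika, and TableRecords and toRecords are hypothetical names. It assumes the first row of a table holds the column names, which real-world HTML often violates (no header row, multi-row headers, colspan/rowspan), and it makes no attempt at inferring data types:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Tika. */
final class TableRecords {
    private TableRecords() {}

    /**
     * Treats the first row as column names and maps every later row to
     * columnName -> cellText, keeping the column order of the header.
     */
    static List<Map<String, String>> toRecords(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) {
            return records;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.size(); i++) {
                // Missing trailing cells become empty strings; extra cells are ignored.
                record.put(header.get(i), i < row.size() ? row.get(i) : "");
            }
            records.add(record);
        }
        return records;
    }
}
```

The input is just one table as a list of rows of cell text, so it works with whatever your content handler collected.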
— Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra