Hi Benjamin,
It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get
transformed), and your own content handler, so that you get all of the tag
start/end SAX events. So something like...
Metadata metadata = new Metadata();
ParseContext parseContext = new P
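
To make the "own content handler" part more concrete, here is a minimal sketch of a handler that records headings as the SAX events arrive. The class name HeadingOutline and its outline() helper are made up for illustration. To keep it runnable without Tika on the classpath, the demo feeds it well-formed XHTML through the JDK's own SAX parser, which is an assumption of this example and not what Tika does internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records every h1..h6 element as "hN: text". */
class HeadingOutline extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder text;   // non-null only while inside a heading
    private String openHeading;   // name of the heading being collected

    // Prefer the namespace-aware local name, fall back to the qualified name.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    private static boolean isHeading(String n) {
        return n.length() == 2 && n.charAt(0) == 'h'
                && n.charAt(1) >= '1' && n.charAt(1) <= '6';
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        if (text == null && isHeading(n)) {
            openHeading = n;
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only collect text while a heading element is open.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && name(localName, qName).equals(openHeading)) {
            headings.add(openHeading + ": " + text.toString().trim());
            text = null;
        }
    }

    List<String> headings() {
        return headings;
    }

    /** Demo driver (assumption): plain JDK SAX parsing of well-formed XHTML. */
    static List<String> outline(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        HeadingOutline handler = new HeadingOutline();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.headings();
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><h1>Intro</h1><p>text</p><h2>Details</h2></body></html>";
        outline(doc).forEach(System.out::println);
    }
}
```

With Tika, the same handler would be passed to the HTML parser's parse(...) call instead of going through outline(). As far as I recall, the mapper is registered with parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE), but please check that against the Tika version you are using. Tracking which heading each run of body text follows would give you the sections themselves, not just their titles.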
Hi
I am a new user of Tika.
I am handling HTML documents, and I have managed to parse them into
a "clean" text string.
However, I would also like to get the structure of the documents: what
the different sections are, what the titles of these sections are, etc.
Is there a way to do that