RE: Extracting the structure of an HTML Document

2015-08-17 Thread Ken Krugler
Hi Benjamin, It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get transformed), and your own content handler, so that you get all of the tag start/end SAX events. So something like... Metadata metadata = new Metadata(); ParseContext parseContext = new P
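To illustrate the "own content handler" half of this suggestion, here is a minimal sketch of a SAX handler that records heading elements as tag start/end events arrive. It is not the code from the truncated message above. The class and method names (`HeadingOutline`, `HeadingHandler`, `outline`) are hypothetical. To stay runnable without Tika on the classpath, the sketch feeds well-formed XHTML through the JDK's own SAX parser. With Tika, the same handler would presumably be passed to the HTML parser instead, with `IdentityHtmlMapper` registered in the `ParseContext`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingOutline {

    /** Hypothetical handler: collects the text of h1..h6 elements from SAX events. */
    static class HeadingHandler extends DefaultHandler {
        final List<String> headings = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a heading
        private String currentTag;     // tag name of the heading being collected

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String name = qName.toLowerCase();
            if (current == null && name.matches("h[1-6]")) {
                current = new StringBuilder();
                currentTag = name;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text inside nested inline elements (e.g. <b>) is collected too.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && qName.toLowerCase().equals(currentTag)) {
                headings.add(currentTag + ": " + current.toString().trim());
                current = null;
                currentTag = null;
            }
        }
    }

    /** Assumption: the input is well-formed XHTML, since the JDK parser is a stand-in here. */
    static List<String> outline(String xhtml) throws Exception {
        HeadingHandler handler = new HeadingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.headings;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><h1>Intro</h1><p>text</p><h2>Details</h2></body></html>";
        outline(doc).forEach(System.out::println);
    }
}
```

Recording only the heading level and text keeps the handler small. A section tree could be built from the same events by tracking the current heading level, if that is the structure wanted.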

Extracting the structure of an HTML Document

2015-08-17 Thread Sznajder ForMailingList
Hi, I am a new user of Tika. I am handling HTML documents... I succeeded in parsing the HTML documents to a "clean" text string. However, I am interested in getting the structure of the documents: what the different sections are, what the titles of these sections are, etc... Is there a way to do that