Hi Benjamin,

It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get transformed) together with your own content handler, so that you get all of the tag start/end SAX events. So something like:

    // Classes come from org.apache.tika.metadata, org.apache.tika.parser
    // and org.apache.tika.parser.html
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
    new HtmlParser().parse(myInputStream, myContentHandler, metadata, parseContext);

Where myContentHandler is an instance of a custom class that extends org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika). This will get called with all of the SAX events, in particular startElement(), endElement(), and characters().

-- Ken

> From: Sznajder ForMailingList
> Sent: August 17, 2015 8:51:03am PDT
> To: user@tika.apache.org
> Subject: Extracting the structure of an HTML Document
>
> Hi
>
> I am a new user of Tika.
>
> I am handling HTML documents... I succeeded to parse the HTML documents to a
> "clean" text string.
>
> However, I am interested to get the structure of the documents : what are the
> different sections, what are the titles of these sections etc...
>
> Is there a way to do that with Tika?
>
> Thanks!
>
> Benjamin

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
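
For illustration, here is a minimal sketch of the kind of custom handler described above. The class name HeadingOutlineHandler and its getOutline() method are hypothetical, not part of Tika; the sketch assumes that section titles appear as h1 to h6 elements, which may not hold for every document. It uses only JDK SAX classes, and main() drives it with the JDK's own XML parser on a small well-formed sample so it can be tried without Tika on the classpath. With Tika, an instance would presumably be passed as myContentHandler in the parse() call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of h1..h6 elements in document order.
public class HeadingOutlineHandler extends DefaultHandler {
    private final List<String> outline = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading
    private int level;             // level of the heading currently open

    // Returns 1..6 for h1..h6, or 0 for any other element.
    // Falls back to qName when the parser is not namespace-aware.
    private static int headingLevel(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name != null && name.length() == 2
                && (name.charAt(0) == 'h' || name.charAt(0) == 'H')
                && name.charAt(1) >= '1' && name.charAt(1) <= '6') {
            return name.charAt(1) - '0';
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int l = headingLevel(localName, qName);
        if (l > 0) {
            level = l;
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && headingLevel(localName, qName) == level) {
            outline.add("h" + level + ": " + current.toString().trim());
            current = null;
        }
    }

    public List<String> getOutline() {
        return outline;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the events an HTML parser would emit; assumes well-formed markup.
        String xhtml = "<html><body><h1>Intro</h1><p>text</p>"
                + "<h2>Details</h2><p>more</p></body></html>";
        HeadingOutlineHandler handler = new HeadingOutlineHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        System.out.println(handler.getOutline()); // prints [h1: Intro, h2: Details]
    }
}
```

The same startElement()/characters()/endElement() pattern could be extended to track nesting, for example by keeping a stack of open headings to group the text that follows each one.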