Hi Benjamin,

It sounds like you want to use the IdentityHtmlMapper (so that no HTML elements
get transformed) together with your own content handler, so that you receive
all of the tag start/end SAX events. Something like this:

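        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        // Use the identity mapper so that HTML elements are passed through as-is.
        parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        // myInputStream and myContentHandler are supplied by your own code.
        new HtmlParser().parse(
                myInputStream,
                myContentHandler,
                metadata,
                parseContext);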

Here myContentHandler is an instance of a custom class that extends
org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika).
It will get called with all of the SAX events, in particular startElement(),
endElement(), and characters().
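
As a rough illustration of such a handler, here is a minimal sketch that records the text of heading elements (h1 to h6). HeadingCollector is a hypothetical class name, not something that ships with Tika, and the sketch assumes the heading elements arrive with names like "h1" or "H2"; it only depends on the SAX classes in the JDK. The main() method just fires a few events by hand to show the flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects the text of h1..h6 elements.
class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading
    private int level;             // level of the heading being collected

    // Returns 1..6 for h1..h6 (case-insensitive), or 0 for any other element.
    private static int headingLevel(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name == null) {
            return 0;
        }
        name = name.toLowerCase(Locale.ROOT);
        if (name.length() == 2 && name.charAt(0) == 'h') {
            char c = name.charAt(1);
            if (c >= '1' && c <= '6') {
                return c - '0';
            }
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int l = headingLevel(localName, qName);
        if (l > 0 && current == null) {
            current = new StringBuilder();
            level = l;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && headingLevel(localName, qName) == level) {
            headings.add("h" + level + ": " + current.toString().trim());
            current = null;
        }
    }

    public List<String> getHeadings() {
        return headings;
    }

    // Simulates the events a parser would send for <h2>Introduction</h2>.
    public static void main(String[] args) {
        HeadingCollector handler = new HeadingCollector();
        char[] text = "Introduction".toCharArray();
        handler.startElement("", "h2", "h2", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement("", "h2", "h2");
        System.out.println(handler.getHeadings());
    }
}
```

You would pass an instance of a class like this as myContentHandler in the parse() call and read getHeadings() afterwards. Tracking sections would follow the same pattern, keyed on whichever elements your documents actually use for them.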

-- Ken

> From: Sznajder ForMailingList
> Sent: August 17, 2015 8:51:03am PDT
> To: user@tika.apache.org
> Subject: Extracting the structure of an HTML Document
> 
> Hi
> 
> I am a new user of Tika.
> 
> I am handling HTML documents... I succeeded in parsing the HTML documents
> into a "clean" text string.
> 
> However, I am interested in getting the structure of the documents: what the
> different sections are, what the titles of these sections are, etc.
> 
> Is there a way to do that with Tika?
> 
> Thanks!
> 
> Benjamin

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




