RE: Extracting the structure of an HTML Document

Ken Krugler Mon, 17 Aug 2015 16:42:07 -0700

Hi Benjamin,

It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get 
transformed), and your own content handler, so that you get all of the tag 
start/end SAX events. So something like...


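        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.html.HtmlMapper;
        import org.apache.tika.parser.html.HtmlParser;
        import org.apache.tika.parser.html.IdentityHtmlMapper;

        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        // Keep every HTML element instead of Tika's default "safe" subset.
        parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

        new HtmlParser().parse(
                myInputStream,
                myContentHandler,
                metadata,
                parseContext);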

Here myContentHandler is an instance of a custom class that extends 
org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika). 
It will be called with all of the SAX events, in particular startElement(), 
endElement(), and characters().
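
As a rough illustration of such a handler, here is a minimal sketch that 
records the h1..h6 headings it sees. The class name HeadingOutlineHandler and 
its Heading type are made up for this example (they are not part of Tika), and 
it assumes that section titles are marked up as heading elements, which may 
not hold for every page. It uses only the JDK's SAX classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects h1..h6 headings from a stream of SAX events. */
class HeadingOutlineHandler extends DefaultHandler {

    /** One heading: its level (1-6) and its trimmed text. */
    static final class Heading {
        final int level;
        final String text;

        Heading(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    private final List<Heading> headings = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private int currentLevel = 0; // 0 means "not inside a heading"

    /** Returns the heading level for h1..h6, or 0 for any other element. */
    private static int headingLevel(String localName, String qName) {
        // Fall back to qName when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name == null) {
            return 0;
        }
        name = name.toLowerCase(Locale.ENGLISH);
        if (name.length() == 2 && name.charAt(0) == 'h'
                && name.charAt(1) >= '1' && name.charAt(1) <= '6') {
            return name.charAt(1) - '0';
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int level = headingLevel(localName, qName);
        if (level > 0) {
            // Entering a heading: start collecting its text from scratch.
            currentLevel = level;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (currentLevel > 0) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentLevel > 0 && headingLevel(localName, qName) == currentLevel) {
            headings.add(new Heading(currentLevel, buffer.toString().trim()));
            currentLevel = 0;
        }
    }

    List<Heading> getHeadings() {
        return headings;
    }
}
```

You would pass an instance of this class as myContentHandler and call 
getHeadings() once parse() returns. Text nested in inline elements inside a 
heading (for example an em or a) is still collected, because only heading 
start and end tags change the handler's state.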

-- Ken

> From: Sznajder ForMailingList
> Sent: August 17, 2015 8:51:03am PDT
> To: user@tika.apache.org
> Subject: Extracting the structure of an HTML Document
> 
> Hi
> 
> I am a new user of Tika.
> 
> I am handling HTML documents, and I have managed to parse them into a 
> "clean" text string.
> 
> However, I would like to get the structure of the documents: what the 
> different sections are, what the titles of these sections are, etc.
> 
> Is there a way to do that with Tika?
> 
> Thanks!
> 
> Benjamin

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
