Hi Andrea,

If you know that you're only parsing XHTML, and you want to use XPath 
expressions to extract specific elements, then I'm not sure Tika buys you much.

In those situations we typically just use TagSoup (or JSoup or NekoHTML) to 
first clean up the document, then parse the result into a DOM and run the 
XPath expressions against that.
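
As a rough sketch of the second half of that approach, here is what the DOM 
plus XPath step could look like using only the JDK's built-in javax.xml 
classes. It assumes the input is already well-formed XHTML without a default 
namespace (the TagSoup/JSoup/NekoHTML cleanup step is left out), and the class 
name SectionXPath and the helper select() are made up for illustration. The 
sample markup is the <section> element from your message.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Tika or any of the libraries above.
public class SectionXPath {

    // Parses well-formed XHTML into a DOM, evaluates the XPath expression,
    // and returns the text content of every matching node.
    static List<String> select(String xhtml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For an attribute node, the text content is the attribute value.
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: already cleaned up, no default namespace.
        String xhtml = "<html><body>"
            + "<section id=\"myID\">SECTION CONTENT</section>"
            + "<section id=\"other\">OTHER CONTENT</section>"
            + "</body></html>";

        // Attribute predicate: only the section whose id is 'myID'.
        System.out.println(select(xhtml, "//section[@id='myID']"));

        // Attribute selection: the id attribute of every section.
        System.out.println(select(xhtml, "//section/@id"));
    }
}
```

If the cleaned-up document declares the XHTML namespace, the factory would 
need setNamespaceAware(true) and the XPath object a NamespaceContext, which 
this sketch skips.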

-- Ken

> From: Andrea Asta
> Sent: June 3, 2015 1:08:29am PDT
> To: user@tika.apache.org
> Subject: XPath support for attributes
> 
> Hi,
> I've created a parser which extracts some sections from HTML documents.
> I'm producing a new form of XHTML for the content handler, with, for example, 
> <section id = "myID">SECTION CONTENT</section>. Which content handler is the 
> right one for capturing all the sections? Will Tika's XPath implementation 
> support expressions with @ATTRIBUTE and [@ATTRIBUTE='value']?
> 
> Thank you
> Andrea

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




