Hi

2008/9/24 Brian Levay <[EMAIL PROTECTED]>
> I'll submit the updates when I'm done (along with unit tests). I'm having a
> problem though. I sync'ed my tika baseline this morning and the Matcher
> stopped matching the <meta> tags. Any idea what may be causing this? I've
> tried many variations of the xpath expressions to match the <meta> tags.
> Right now my code in HTMLParser looks like this:
>
>     Matcher body = xpath.parse("/HTML/BODY//node()");
>     Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
>     Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
>     handler = new TeeContentHandler(
>         new MatchingContentHandler(getBodyHandler(xhtml), body),
>         new MatchingContentHandler(getTitleHandler(metadata), title),
>         new MatchingContentHandler(getMetaHandler(metadata), meta));
>
> The <meta> handler isn't being called. If I use /HTML/HEAD//node() the
> handler will get called for the <head> and <title> tags but it will skip
> right past the <meta> tags. I know the tika code is seeing the META tags
> because I see the tags trying to be matched in the startElement method of
> MatchingContentHandler. Any ideas?
>
> --Brian

I am using effectively the same thing in a local copy and have just re-based it against HEAD (shown in the diff below), and it appears to be working fine for me. What is your test XML like?
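
One aside on getMetaHandler in the diff: it reads the attributes by position (atts.getValue(0), atts.getValue(1)), which only works if name happens to come before content in the tag. This is probably unrelated to the matcher not firing, but a rough sketch of looking the attributes up by name instead might look like the following. It uses only the JDK's SAX parser, not Tika's handlers; the class MetaByName and the method collectMeta are made up for illustration, and the attribute names "name" and "content" are an assumption about the test documents.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MetaByName {

    // Hypothetical helper: collects <meta name="..." content="..."> pairs
    // from a well-formed XHTML string, independent of attribute order.
    static Map<String, String> collectMeta(String xhtml) throws Exception {
        final Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(
                    String uri, String local, String name, Attributes atts) {
                if (!"meta".equalsIgnoreCase(name)) {
                    return;
                }
                // Look attributes up by qualified name, not by index.
                String key = atts.getValue("name");
                String value = atts.getValue("content");
                // Skip tags such as <meta http-equiv=...> that lack a name.
                if (key != null && value != null) {
                    found.put(key, value);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><head><title>t</title>"
                + "<meta content=\"Dave\" name=\"author\"/>"
                + "<meta http-equiv=\"refresh\" content=\"5\"/>"
                + "</head><body/></html>";
        System.out.println(collectMeta(doc)); // prints {author=Dave}
    }
}
```

The same two getValue(String) calls could be dropped into the startElement override of getMetaHandler, with a null check before metadata.set.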
Cheers,
Dave

Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java (revision 698705)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java (working copy)
@@ -95,9 +95,11 @@
         XPathParser xpath = new XPathParser(null, "");
         Matcher body = xpath.parse("/HTML/BODY//node()");
         Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
+        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
         handler = new TeeContentHandler(
                 new MatchingContentHandler(getBodyHandler(xhtml), body),
-                new MatchingContentHandler(getTitleHandler(metadata), title));
+                new MatchingContentHandler(getTitleHandler(metadata), title),
+                new MatchingContentHandler(getMetaHandler(metadata), meta));

         // Parse the HTML document
         xhtml.startDocument();
@@ -116,6 +118,17 @@
         };
     }

+    private ContentHandler getMetaHandler(final Metadata metadata) {
+        return new WriteOutContentHandler() {
+            @Override
+            public void startElement(
+                    String uri, String local, String name, Attributes atts)
+                    throws SAXException {
+                metadata.set(atts.getValue(0), atts.getValue(1));
+            }
+        };
+    }
+
     private ContentHandler getBodyHandler(final XHTMLContentHandler xhtml) {
         return new TextContentHandler(xhtml) {