Hi

2008/9/24 Brian Levay <[EMAIL PROTECTED]>

> I'll submit the updates when I'm done (along with unit tests).  I'm having a
> problem though.  I sync'ed my tika baseline this morning and the Matcher
> stopped matching the <meta> tags.  Any idea what may be causing this?  I've
> tried many variations of the xpath expressions to match the <meta> tags.
> Right now my code in HTMLParser looks like this:
>
>        Matcher body = xpath.parse("/HTML/BODY//node()");
>        Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
>        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
>        handler = new TeeContentHandler(
>                new MatchingContentHandler(getBodyHandler(xhtml), body),
>                new MatchingContentHandler(getTitleHandler(metadata), title),
>                new MatchingContentHandler(getMetaHandler(metadata), meta));
>
> The <meta> handler isn't being called.  If I use /HTML/HEAD//node() the
> handler will get called for the <head> and <title> tags but it will skip
> right past the <meta> tags.  I know the tika code is seeing the META tags
> because I see the tags trying to be matched in the startElement method of
> MatchingContentHandler.  Any ideas?
>
> --Brian
>

I am using effectively the same thing in a local copy and have just re-based
it against HEAD (shown in the diff below), and it appears to be working fine
for me.

What does your test XML look like?

Cheers,
Dave


Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java    (revision 698705)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java    (working copy)
@@ -95,9 +95,11 @@
         XPathParser xpath = new XPathParser(null, "");
         Matcher body = xpath.parse("/HTML/BODY//node()");
         Matcher title = xpath.parse("/HTML/HEAD/TITLE//node()");
+        Matcher meta = xpath.parse("/HTML/HEAD/META//node()");
         handler = new TeeContentHandler(
                 new MatchingContentHandler(getBodyHandler(xhtml), body),
-                new MatchingContentHandler(getTitleHandler(metadata), title));
+                new MatchingContentHandler(getTitleHandler(metadata), title),
+                new MatchingContentHandler(getMetaHandler(metadata), meta));

         // Parse the HTML document
         xhtml.startDocument();
@@ -116,6 +118,17 @@
         };
     }

+    private ContentHandler getMetaHandler(final Metadata metadata) {
+        return new WriteOutContentHandler() {
+            @Override
+            public void startElement(
+                    String uri, String local, String name, Attributes atts)
+                    throws SAXException {
+                metadata.set(atts.getValue(0), atts.getValue(1));
+            }
+        };
+    }
+
     private ContentHandler getBodyHandler(final XHTMLContentHandler xhtml) {
         return new TextContentHandler(xhtml) {
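
One caveat about getMetaHandler in the patch above: it reads the attributes
by position (getValue(0) and getValue(1)), which only works when every
<meta> tag lists its name first and its content second. The following is a
minimal standalone sketch of the same idea using lookup by attribute name
instead. It is not Tika code: the class MetaByName, the method collectMeta,
and the use of a plain Map in place of Tika's Metadata are all hypothetical
stand-ins so that it runs with only the JDK's SAX parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a Map stands in for Tika's Metadata object.
public class MetaByName {

    // Collects name/content pairs from <meta> elements, looking the
    // attributes up by name rather than by position.
    static Map<String, String> collectMeta(String xml) throws Exception {
        final Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(
                    String uri, String local, String name, Attributes atts) {
                if (!"meta".equalsIgnoreCase(name)) {
                    return; // not a <meta> element
                }
                String key = atts.getValue("name");
                String value = atts.getValue("content");
                // Skip tags such as http-equiv that carry no name attribute.
                if (key != null && value != null) {
                    found.put(key, value);
                }
            }
        };
        // The default factory is not namespace-aware, so the element name
        // arrives in the qualified-name parameter ("name" above).
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>t</title>"
                + "<meta content=\"Brian\" name=\"author\"/>"
                + "<meta http-equiv=\"refresh\" content=\"5\"/>"
                + "</head><body/></html>";
        System.out.println(collectMeta(xml));
    }
}
```

In this sample the first <meta> tag has content before name, so a positional
lookup would store the pair reversed, and the second tag has no name
attribute at all. With lookup by name only the author entry is recorded.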
