Hi Jukka,

I've been looking into a failure in a Bixo test that uses BodyContentHandler
(wrapped by XHTMLContentHandler).

The issue is that BodyContentHandler uses MatchingContentHandler to find only 
text in nodes under the /html/body hierarchy.

MatchingContentHandler, in turn, winds up not matching the <html> element.

Looking at MatchingContentHandler's startElement() method (code below), the
issue I see is that the initial matcher.descend() call works, in that it
matches the html element, but the Matcher it returns is a NamedElementMatcher.
That matcher always returns false from its matchesElement() method, so you only
wind up actually descending if the html element has attributes, which it
doesn't in this case.
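
The behaviour described above can be sketched with simplified stand-in
classes. This is only an illustration: Matcher, NamedMatcher, FAIL and
wouldEmitStart() below are hypothetical names, not the library's actual
classes, and the html/body chain is an assumption about how the path is
parsed.

```java
// Stand-in types for illustration only; NOT the library's real classes.
final class MatcherSketch {

    /** Simplified matcher: descend into a child, report whether it matches. */
    interface Matcher {
        Matcher descend(String localName);
        boolean matchesElement();
    }

    /** Matches nothing, at any depth. */
    static final Matcher FAIL = new Matcher() {
        @Override public Matcher descend(String localName) { return this; }
        @Override public boolean matchesElement() { return false; }
    };

    /** Hypothetical named-element matcher: descends only into one element name. */
    static final class NamedMatcher implements Matcher {
        private final String name;
        private final Matcher then;

        NamedMatcher(String name, Matcher then) {
            this.name = name;
            this.then = then;
        }

        @Override public Matcher descend(String localName) {
            return name.equals(localName) ? then : FAIL;
        }

        // Always false, as described in the text above.
        @Override public boolean matchesElement() { return false; }
    }

    /** Mirrors the emit condition: descend first, then ask the RETURNED matcher. */
    static boolean wouldEmitStart(Matcher current, String localName,
                                  int matchedAttributes) {
        Matcher next = current.descend(localName);
        return next.matchesElement() || matchedAttributes > 0;
    }

    public static void main(String[] args) {
        // Assumed chain for a path like /html/body/...
        Matcher root = new NamedMatcher("html", new NamedMatcher("body", FAIL));
        System.out.println(wouldEmitStart(root, "html", 0)); // false
        System.out.println(wouldEmitStart(root, "html", 1)); // true
    }
}
```

With no matching attributes, the start of the html element is not emitted
even though descend() found it; with one matching attribute it is.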

It seems like NamedElementMatcher needs to record state (matched or not) when
descend() is called, that state needs to be returned by its matchesElement(),
and the initial matcher should be queried for matchesElement(), not the matcher
you descend into.
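
A minimal sketch of that idea, again with hypothetical stand-in classes rather
than the real ones (StatefulNamedMatcher and wouldEmitStart() are made-up
names). It only shows the shape of the change, not a tested fix:

```java
// Stand-in types for illustration only; NOT the library's real classes.
final class StatefulMatcherSketch {

    interface Matcher {
        Matcher descend(String localName);
        boolean matchesElement();
    }

    /** Matches nothing, at any depth. */
    static final Matcher FAIL = new Matcher() {
        @Override public Matcher descend(String localName) { return this; }
        @Override public boolean matchesElement() { return false; }
    };

    /** Hypothetical variant that remembers whether its last descend() matched. */
    static final class StatefulNamedMatcher implements Matcher {
        private final String name;
        private final Matcher then;
        private boolean matched; // set on every descend() call

        StatefulNamedMatcher(String name, Matcher then) {
            this.name = name;
            this.then = then;
        }

        @Override public Matcher descend(String localName) {
            matched = name.equals(localName);
            return matched ? then : FAIL;
        }

        // Reports the state recorded by the most recent descend().
        @Override public boolean matchesElement() { return matched; }
    }

    /** Descend, then query the INITIAL matcher rather than the returned one. */
    static boolean wouldEmitStart(Matcher current, String localName) {
        current.descend(localName); // result would become the new current matcher
        return current.matchesElement();
    }

    public static void main(String[] args) {
        Matcher root = new StatefulNamedMatcher("html", FAIL);
        System.out.println(wouldEmitStart(root, "html")); // true
        System.out.println(wouldEmitStart(root, "head")); // false
    }
}
```

One caveat with this shape: keeping mutable state in the matcher means an
instance can no longer be shared safely between parses, so that trade-off
would need checking against how the matchers are actually created and reused.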

But this whole Matcher/XPath parser stuff is pretty convoluted, so I didn't 
want to file an issue & try fixing it until I got some input from you.

Regards,

-- Ken



    public void startElement(
            String uri, String localName, String name, Attributes attributes)
            throws SAXException {
        matchers.addFirst(matcher);
        matcher = matcher.descend(uri, localName);

        AttributesImpl matches = new AttributesImpl();
        for (int i = 0; i < attributes.getLength(); i++) {
            String attributeURI = attributes.getURI(i);
            String attributeName = attributes.getLocalName(i);
            if (matcher.matchesAttribute(attributeURI, attributeName)) {
                matches.addAttribute(
                        attributeURI, attributeName, attributes.getQName(i),
                        attributes.getType(i), attributes.getValue(i));
            }
        }

        if (matcher.matchesElement() || matches.getLength() > 0) {
            super.startElement(uri, localName, name, matches);
            if (!matcher.matchesElement()) {
                // Force the matcher to match the current element, so the
                // endElement method knows to emit the correct event
                matcher =
                    new CompositeMatcher(matcher, ElementMatcher.INSTANCE);
            }
        }
    }

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



