Hi,

On Mon, Sep 3, 2012 at 7:50 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> No, the html _does_ match, which it needs to as it descends the DOM hierarchy.

Note that we're dealing with SAX events instead of DOM hierarchies
here. So the startElement() method does not (and is not meant to)
traverse the underlying subtree; it just decides whether to pass on
or filter out that specific SAX event.

The MatchingContentHandler class is essentially a state machine that
switches state by calling the descend() method of the current Matcher
object to get another Matcher object appropriate for matching or
filtering the SAX events that occur at that position of the event
stream. In the endElement() method the stack of Matcher objects is
unwound to restore the correct matching state at each level of the
tree.

> Note the pattern used by BodyContentHandler is:
>
>     private static final Matcher MATCHER =
>         PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");
>
> The problem I'm seeing (details in my previous email) is that once the
> /xhtml:html portion of the path has been matched, the code decides that
> it doesn't have a match, and if there are no attributes then it bails out.

As explained above, that startElement() call simply decides whether
that specific <html> start element should be passed on or filtered
out from the event stream. Since the pattern only matches nodes
inside the <body> element, it correctly infers that the <html>
element should be filtered out.

A simple example sequence of SAX events would be processed like this:

    startElement("html");  // no match, ignore
    startElement("head");  // no match, ignore
    endElement("head");    // no match, ignore
    startElement("body");  // no match, ignore
    startElement("p");     // match, call super.startElement("p")
    endElement("p");       // match, call super.endElement("p")
    endElement("body");    // no match, ignore
    endElement("html");    // no match, ignore

BR,

Jukka Zitting
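
PS. In case it helps, the matcher stack described above can be sketched in a few lines of standalone Java. This is a simplified illustration under my own assumptions, not the actual Tika code: the Matcher interface here, the FAIL and SUBTREE constants, and the child(), bodyDescendants() and filter() helpers are all made up for this example. Namespaces and attributes are left out, and events are plain strings ("p" for a start element, "/p" for an end element).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MatcherStackSketch {

    /** Hypothetical, simplified matcher: one state of the state machine. */
    interface Matcher {
        /** Should the element this matcher was created for be passed on? */
        boolean matchesElement();
        /** Returns the matcher to use for a child element with the given name. */
        Matcher descend(String name);
    }

    /** Matches nothing, at any depth. */
    static final Matcher FAIL = new Matcher() {
        public boolean matchesElement() { return false; }
        public Matcher descend(String name) { return this; }
    };

    /** Matches the current element and everything below it. */
    static final Matcher SUBTREE = new Matcher() {
        public boolean matchesElement() { return true; }
        public Matcher descend(String name) { return this; }
    };

    /** Does not match the current element; only follows the named child. */
    static Matcher child(String name, Matcher next) {
        return new Matcher() {
            public boolean matchesElement() { return false; }
            public Matcher descend(String n) { return n.equals(name) ? next : FAIL; }
        };
    }

    /** Rough equivalent of /html/body/descendant::node(), elements only. */
    static Matcher bodyDescendants() {
        // State for <body> itself: not a match, but every child subtree is.
        Matcher insideBody = new Matcher() {
            public boolean matchesElement() { return false; }
            public Matcher descend(String name) { return SUBTREE; }
        };
        return child("html", child("body", insideBody));
    }

    /** Filters a well-formed event sequence, keeping only matching events. */
    static List<String> filter(Matcher root, String[] events) {
        Deque<Matcher> stack = new ArrayDeque<>();
        Matcher current = root;
        List<String> passedOn = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("/")) {
                // End element: the decision uses the state entered at the
                // matching start element, then the previous state is restored.
                if (current.matchesElement()) {
                    passedOn.add("end:" + event.substring(1));
                }
                current = stack.pop();
            } else {
                // Start element: switch state, no subtree traversal involved.
                Matcher next = current.descend(event);
                if (next.matchesElement()) {
                    passedOn.add("start:" + event);
                }
                stack.push(current);
                current = next;
            }
        }
        return passedOn;
    }

    public static void main(String[] args) {
        String[] events = {"html", "head", "/head", "body", "p", "/p", "/body", "/html"};
        System.out.println(filter(bodyDescendants(), events)); // prints [start:p, end:p]
    }
}
```

The point of the sketch is that each start event only asks the current state for the next state, and each end event pops back to the previous one, so <html> and <body> are filtered out while still leading to a state in which <p> matches.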