On Sep 3, 2012, at 8:02am, Jukka Zitting wrote:

> Hi,
>
> On Thu, Aug 30, 2012 at 7:35 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>> The issue is that BodyContentHandler uses MatchingContentHandler to find only
>> text in nodes under the /html/body hierarchy.
>>
>> And this in turn winds up not matching the <html> element.
>
> That's as intended, as the BodyContentHandler is only interested in
> stuff inside the <body> element, not outside it.

No, the <html> element _does_ match, which it needs to as the matcher
descends the DOM hierarchy.

Note the pattern used by BodyContentHandler is:

    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

The problem I'm seeing (details in my previous email) is that once the
/xhtml:html portion of the path has been matched, the code decides that
it doesn't have a match, and if there are no attributes then it bails out.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
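
P.S. To make the distinction concrete, here is a toy sketch of the difference between "this node is not selected" and "this path can no longer match". It is not Tika's Matcher code: PathMatcherSketch and all of its methods are hypothetical names made up for illustration, and the descendant::node() step is modeled only as "everything below the full path".

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of path matching while descending an element tree.
 * NOT Tika's implementation; every name here is hypothetical.
 */
final class PathMatcherSketch {
    private final List<String> steps; // required element path, e.g. [html, body]
    private final int depth;          // how many steps have been matched so far
    private final boolean failed;     // true once an element left the path

    private PathMatcherSketch(List<String> steps, int depth, boolean failed) {
        this.steps = steps;
        this.depth = depth;
        this.failed = failed;
    }

    static PathMatcherSketch forPath(String... steps) {
        return new PathMatcherSketch(Arrays.asList(steps), 0, false);
    }

    /** Returns the matcher that applies to the content of the named element. */
    PathMatcherSketch descend(String name) {
        if (failed || depth >= steps.size()) {
            return this; // either dead, or already below the full path
        }
        boolean onPath = steps.get(depth).equals(name);
        return new PathMatcherSketch(steps, onPath ? depth + 1 : depth, !onPath);
    }

    /** True only for content below the full path (the descendant::node() part). */
    boolean matchesContent() {
        return !failed && depth >= steps.size();
    }

    /** True while the path could still be completed further down the tree. */
    boolean canStillMatch() {
        return !failed;
    }

    public static void main(String[] args) {
        PathMatcherSketch inHtml = forPath("html", "body").descend("html");
        PathMatcherSketch inBody = inHtml.descend("body");
        PathMatcherSketch inHead = inHtml.descend("head");

        // Inside <html>: nothing is selected yet, but the path is still alive.
        System.out.println(inHtml.matchesContent() + " " + inHtml.canStillMatch());
        // Inside <body>: content is selected.
        System.out.println(inBody.matchesContent() + " " + inBody.canStillMatch());
        // Inside <head>: the path is dead, so bailing out is safe here.
        System.out.println(inHead.matchesContent() + " " + inHead.canStillMatch());
    }
}
```

In this model, bailing out is only safe when canStillMatch() is false. Treating the intermediate state after <html> as a failure would mean <body> is never reached, which is the kind of behavior I am describing above.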