On Sep 3, 2012, at 8:02am, Jukka Zitting wrote:

> Hi,
> 
> On Thu, Aug 30, 2012 at 7:35 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>> The issue is that BodyContentHandler uses MatchingContentHandler to find only
>> text in nodes under the /html/body hierarchy.
>> 
>> And this in turn winds up not matching the <html> element.
> 
> That's as intended, as the BodyContentHandler is only interested in
> stuff inside the <body> element, not outside it.

No, the <html> element _does_ match, and it has to, because the matcher has
to get past it as it descends the DOM hierarchy.

Note the pattern used by BodyContentHandler is:

    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

The problem I'm seeing (details in my previous email) is that once the
/xhtml:html portion of the path has been matched, the code decides that it
doesn't have a match; if the element has no attributes, it then bails out.
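
To make the distinction concrete, here is a minimal sketch of the behavior I'd
expect. It is not Tika's Matcher code; the class, the integer state, and the
method names are all hypothetical, and namespaces are ignored. The point is
that "this element is not itself a match" and "nothing below this element can
match" are two separate questions, and only the second one justifies bailing
out.

```java
import java.util.List;

// Hypothetical stand-in for a path matcher handling a pattern shaped like
// /html/body/descendant::node(). This is NOT Tika's implementation.
public class PathMatchSketch {
    // Fixed leading steps of the path; everything below them is a descendant.
    static final List<String> STEPS = List.of("html", "body");
    // State meaning "the path has diverged, nothing below can match".
    static final int DEAD = -1;

    // Returns the matcher state after descending into a child element.
    static int descend(int state, String name) {
        if (state == DEAD) {
            return DEAD;
        }
        if (state < STEPS.size()) {
            // Still consuming the fixed prefix: the name must match the step.
            return STEPS.get(state).equals(name) ? state + 1 : DEAD;
        }
        // Already below <body>: any element is a descendant.
        return state + 1;
    }

    // True only for elements strictly below the fixed prefix.
    static boolean matchesElement(int state) {
        return state > STEPS.size();
    }

    // True while something at or below this element could still match.
    static boolean canStillMatch(int state) {
        return state != DEAD;
    }

    public static void main(String[] args) {
        int html = descend(0, "html");
        int body = descend(html, "body");
        int p = descend(body, "p");
        int head = descend(html, "head");
        System.out.println(matchesElement(html) + " " + canStillMatch(html)); // false true
        System.out.println(matchesElement(body) + " " + canStillMatch(body)); // false true
        System.out.println(matchesElement(p) + " " + canStillMatch(p));       // true true
        System.out.println(matchesElement(head) + " " + canStillMatch(head)); // false false
    }
}
```

In this sketch <html> is not reported as matching content, but the state
after descending into it is still live, so the handler keeps going whether or
not the element has attributes.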

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



