On Sep 3, 2012, at 8:02am, Jukka Zitting wrote:
> Hi,
>
> On Thu, Aug 30, 2012 at 7:35 PM, Ken Krugler
> <[email protected]> wrote:
>> The issue is that BodyContentHandler uses MatchingContentHandler to find only
>> text in nodes under the /html/body hierarchy.
>>
>> And this in turn winds up not matching the <html> element.
>
> That's as intended, as the BodyContentHandler is only interested in
> stuff inside the <body> element, not outside it.

No, the <html> element _does_ match, which it needs to as the matcher
descends the DOM hierarchy.

Note that the pattern used by BodyContentHandler is:

    private static final Matcher MATCHER =
        PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");

The problem I'm seeing (details in my previous email) is that once the
/xhtml:html portion of the path has been matched, the code decides that it
doesn't have a match, and if there are no attributes then it bails out.
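
To make the expected behaviour concrete, here is a minimal sketch of what
I'd expect a /html/body/descendant::node() match to do. This is not Tika's
MatchingContentHandler code; it uses only the JDK's SAX parser, ignores
namespaces, and the names BodyTextSketch, BodyOnlyHandler and
extractBodyText are made up for illustration. The point is that an <html>
start tag with no attributes must still be descended into, and only text
below <body> is kept.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextSketch {

    /** Hypothetical handler: keeps character data found below /html/body. */
    static class BodyOnlyHandler extends DefaultHandler {
        // Element names from the document root down to the current element.
        private final List<String> path = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Always descend, whether or not the element has attributes.
            path.add(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.remove(path.size() - 1);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep text only when the current path starts with html/body.
            if (path.size() >= 2
                    && "html".equals(path.get(0))
                    && "body".equals(path.get(1))) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns the text found under body. */
    static String extractBodyText(String xhtml) {
        try {
            BodyOnlyHandler handler = new BodyOnlyHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>T</title></head>"
                + "<body><p>Hello</p> world</body></html>";
        System.out.println(extractBodyText(doc));
    }
}
```

With this sketch the title text is skipped and the body text is returned,
even though <html> carries no attributes.
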
-- Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr