Re: earth-to-digester

James Strachan Wed, 01 Aug 2001 09:19:49 -0700
Hi Craig

From: "Craig R. McClanahan" <[EMAIL PROTECTED]>
> Remembering that Digester is based on SAX, any matching technology has to
> be able to function when the appropriate SAX events are fired.  In the
> current implementation, Digester calls getRules() to determine the ones
> that match in two different places:
>
> * startElement() so it can call the Rule.begin() method on all
>   matching rules, and
>
> * endElement() so it can call the Rule.body() method (if there was
>   any body text) and Rule.end() method of all matching rules
>
> So, with no changes, you could figure out a way to select matching rules
> based on XPath expressions (or anything else that depended on
> attributes) only in the startElement, where the attributes are available.
> Therefore, to implement something like XPath matching, we'd have to keep a
> stack of the actual elements (and attributes) representing the current
> nesting in the graph.
>
>
> While this probably isn't so complex, it does slow down performance pretty
> significantly on the "simple cases" that Digester was originally designed
> for.  Remember, we're doing *one pass* through the XML document, and the
> whole idea of SAX is to avoid building the entire document in memory like
> a DOM structure does.

Agreed. I was just suggesting we use 'XPath-like' syntax but without loading
a full DOM-ish structure into memory.


> It seems to me that changing getRules() to use regexp matching on the
> element names match expression would deal with most of the use cases that
> have been presented.  That way, you can do things like:
>
> * Match if my element is *anywhere* in the match list, or at the front
>   only, or at the back only ("a/*", "*/a/*", "*/a", and more complicated
>   combinations)
>
> * Match on an element that is nested inside another element, no matter
>   how far (a/*/b)
>
> * Match nested pairs of elements, no matter how deep they are
>   (a/b/*, */a/b/*, */a/b)
>
> What do you think?

I agree - it's what I meant, really: we could support XPath-like
expressions via regexp, without a DOM-ish tree or a real XPath
engine. Digester already maintains a String path for the current
element, so this could all be done today inside the getRules( path )
function.
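
As a rough sketch of the idea (not Digester's actual getRules() code), the
match expression could be translated into a regexp and tested against the
current path String. Everything here is an assumption for illustration: the
class and method names are made up, the path has no leading slash, "*" is
taken to mean "one or more whole element names", and it uses the
java.util.regex package rather than any particular regexp library.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Digester.
class PathMatchSketch {

    // Translate a match expression such as "a/*/b" into a regexp.
    // Assumption: "*" stands for one or more whole path segments.
    static Pattern compile(String matchExpression) {
        StringBuilder regex = new StringBuilder();
        for (String segment : matchExpression.split("/")) {
            if (regex.length() > 0) {
                regex.append('/');
            }
            if (segment.equals("*")) {
                // one segment, optionally followed by further segments
                regex.append("[^/]+(?:/[^/]+)*");
            } else {
                // literal element name, quoted so it is not read as regexp
                regex.append(Pattern.quote(segment));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole element path (e.g. "a/x/y/b") fits the expression.
    static boolean matches(String matchExpression, String path) {
        return compile(matchExpression).matcher(path).matches();
    }
}
```

Under those assumptions, matches("a/*/b", "a/x/y/b") is true and
matches("a/*/b", "a/b") is false. A real version would want to compile each
expression once when the rule is added, not on every SAX event.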

An alternative implementation could maintain the element path as a
List of Strings rather than a single String, so the path "/a/b/c/d" would be
a List like [ "a", "b", "c", "d" ]. This would avoid the cost of
String concatenation, construction and substring() on each startElement()
and endElement() call. If the ArrayList's initial capacity is large enough,
the updates in startElement() and endElement() would be about as cheap as
array index operations. That said, the cost being saved is fairly minor.
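
To make the List idea concrete, here is a minimal sketch. The class and
method names are hypothetical (nothing like this exists in Digester as far
as this thread shows), and the initial capacity of 32 is an arbitrary guess
at "big enough" for typical nesting depths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical element-path stack, not part of Digester.
class ElementPathSketch {

    // Pre-sized so that pushes rarely need to grow the backing array.
    private final List<String> segments = new ArrayList<>(32);

    // Would be called from startElement(): append the element name.
    void push(String name) {
        segments.add(name);
    }

    // Would be called from endElement(): drop the innermost element name.
    void pop() {
        segments.remove(segments.size() - 1);
    }

    // Current nesting depth.
    int depth() {
        return segments.size();
    }

    // Build the String form only when a rule lookup actually needs it.
    String asString() {
        return String.join("/", segments);
    }
}
```

Pushing "a", "b", "c" and calling asString() gives "a/b/c"; after one pop()
it gives "a/b". The String is only built on demand, so matching code that
can work on the List directly never pays for the concatenation at all.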

James

