Simone Tripodi wrote:
Hi all,
I'm very sorry that I don't appear on the ML very often, but since April
I've been working very hard for a client in Paris, which leaves me
no spare time to dedicate to OS projects.

Don't be sorry. We all have our own jobs/interests/duties that have driven us away from Cocoon. Glad to see you back!

I'm writing because I'm sure the XInclude transformer I submitted some
time ago could be optimized, so I'd like to ask you for a little help :)

The current state is that, when including an entire document, it is
processed efficiently through the SAX APIs; the problem comes when
processing a document referenced by xinclude+xpointer, which forces the
processor to extract a sub-document of the included document.

To do this, I parse the document into a DOM, then use XPath to
extract the sub-document to be included, then walk its elements and
convert them to SAX events. As you noticed, this takes time, too much
IMO, but I didn't find/don't know any better solution :(
Since you have experience with StAX, maybe you can suggest a fast
way to evaluate XPath against a document and emit SAX events, so that
I can provide a much better - and, above all, faster - solution.

Any hints? Every suggestion will be much appreciated.
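
For reference, the DOM-based approach described above can be sketched roughly as follows. This is not the code of the submitted transformer; it is a minimal illustration using only the standard JAXP APIs, and the class name `DomXPathToSax` and method `include` are made up for the example. It assumes a plain XPath expression rather than full XPointer syntax. Note that the identity transform emits startDocument/endDocument around each selected node, which a real transformer would have to filter out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class DomXPathToSax {

    /**
     * Loads the whole document into a DOM, selects nodes with XPath,
     * and replays each selected node as SAX events on the handler.
     * Returns the number of selected nodes.
     */
    public static int include(String xml, String xpath, ContentHandler handler) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // Costly step: the entire document is materialized in memory.
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        NodeList selected = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);

        // An identity transformer turns a DOM subtree back into SAX events.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        for (int i = 0; i < selected.getLength(); i++) {
            identity.transform(new DOMSource(selected.item(i)), new SAXResult(handler));
        }
        return selected.getLength();
    }

    public static void main(String[] args) throws Exception {
        final StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.append(qName).append(' ');
            }
        };
        int count = include("<doc><a><b/></a><c/></doc>", "/doc/a", handler);
        System.out.println(count + " node(s) selected, elements replayed: " + seen);
    }
}
```

The cost comes from the first step: the DOM is built for the full included document even when the expression selects a small fragment.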

The problem with XPath and XML streaming (be it SAX or StAX) is that XPath is a language that allows exploring the document tree in all directions and thus inherently expects having the whole document tree available, which is clearly not compatible with streaming.

There are different approaches to solving this:
- use a deferred-loading DOM implementation, which buffers events only when it needs them to traverse the tree. Axiom [1] provides this IIRC, along with an XPath implementation.
- restrict the XPointer expression to a subset of XPath that can easily be implemented on top of a stream. This means restricting selection to the current element, its attributes and its ancestors. There's an implementation of this approach in Tika [2].
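
To make the second approach concrete, here is a minimal sketch of a streaming matcher for a very small subset: absolute paths made of plain child steps, such as /doc/a. It is not Tika's implementation or API; the class `StreamingPathFilter` and its `filter` method are hypothetical names, and the sketch ignores namespaces and attribute selection. It only needs the names of the current element and its ancestors, so it works directly on SAX events without building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, for illustration only.
public class StreamingPathFilter extends DefaultHandler {
    private final List<String> steps;                    // e.g. [doc, a] for "/doc/a"
    private final ContentHandler target;                 // receives the matched subtrees
    private final List<String> path = new ArrayList<>(); // names from root to current element
    private int matchDepth = -1;                         // depth of the matched element, -1 outside a match

    public StreamingPathFilter(String simplePath, ContentHandler target) {
        this.steps = Arrays.asList(simplePath.substring(1).split("/"));
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        path.add(qName);
        // Only the ancestor chain is inspected, so no buffering is needed.
        if (matchDepth < 0 && path.equals(steps)) {
            matchDepth = path.size();
        }
        if (matchDepth >= 0) {
            target.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (matchDepth >= 0) {
            target.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (matchDepth >= 0) {
            target.endElement(uri, localName, qName);
        }
        if (path.size() == matchDepth) {
            matchDepth = -1; // leaving the matched element
        }
        path.remove(path.size() - 1);
    }

    /** Parses xml as a stream and forwards only the subtrees matching simplePath. */
    public static void filter(String xml, String simplePath, ContentHandler target) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new StreamingPathFilter(simplePath, target));
    }
}
```

Anything that needs siblings, descendants of other branches, or positional predicates cannot be answered this way, which is why the subset has to stay this narrow.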

The XInclude transformer can be smart enough to use the most efficient implementation for the given XPath expression, i.e. try to parse it with Tika's restricted subset, and fall back to something more costly, either Axiom or plain DOM.
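
A rough sketch of that selection logic, under loud assumptions: the regular expression below is only a stand-in for a real restricted-subset parser such as Tika's, and the names `XPointerStrategyChooser`, `Strategy` and `choose` are invented for the example. The idea is simply to test whether the expression fits the cheap subset and pick the costly path otherwise.

```java
import java.util.regex.Pattern;

// Hypothetical chooser, for illustration only.
public class XPointerStrategyChooser {

    public enum Strategy { STREAMING, DOM }

    // Assumed "cheap" subset: absolute paths of plain child steps, optionally
    // ending in an attribute step, e.g. /doc/section or /doc/section/@id.
    private static final Pattern SIMPLE =
            Pattern.compile("(/[A-Za-z_][\\w.-]*)+(/@[A-Za-z_][\\w.-]*)?");

    /** Returns STREAMING when the expression fits the subset, DOM otherwise. */
    public static Strategy choose(String xpath) {
        return SIMPLE.matcher(xpath).matches() ? Strategy.STREAMING : Strategy.DOM;
    }

    public static void main(String[] args) {
        String[] samples = { "/doc/a", "/doc/a/@id", "/doc/a[1]", "//a[following-sibling::b]" };
        for (String xpath : samples) {
            System.out.println(xpath + " -> " + choose(xpath));
        }
    }
}
```

In a real transformer the DOM branch could itself be replaced by a deferred-loading tree, so the fallback is still cheaper than full DOM parsing.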

Sylvain

[1] http://ws.apache.org/commons/axiom/
[2] https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/

--
Sylvain Wallez - http://bluxte.net
