Re: XInclude optimization

2009-12-10 Thread Simone Tripodi
Hi Reinhard
Much appreciated, thanks!!! :)
All the best, goodbye!
Simo





-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-12-10 Thread Reinhard Pötz
Simone Tripodi wrote:
> Hi Guys,
> do you have some spare time to review the last patch submitted on [1]?
> I know it requires time...
> Thanks in advance, best regards,

Unless somebody else is quicker than me, I will have a look at your
patch before I create the release.

-- 
Reinhard Pötz   Managing Director, {Indoqa} GmbH
 http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member  reinh...@apache.org



Re: XInclude optimization

2009-12-09 Thread Simone Tripodi
Hi Guys,
do you have some spare time to review the last patch submitted on [1]?
I know it requires time...
Thanks in advance, best regards,
Simone

[1] https://issues.apache.org/jira/browse/COCOON3-3




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-24 Thread Simone Tripodi
Hi all,
Thank you both; my question was about the legal issues, which you have now
clarified for me :)

Reinhard, no problem about the optional module: I do remember the
policy, but I appreciate the reminder :) BTW, after a quick look at
Tika, I was thinking about importing just the classes we need and
modifying them to suit our needs, so if you agree I'd add the XInclude
code to the cocoon-sax module... What do you think? Just
let me know!

See you guys and thanks a *lot* for your help :)
Best regards
Simo




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-24 Thread Reinhard Pötz
Simone Tripodi wrote:
> Hi Sylvain
> Sorry but I forgot to ask you a short question in the previous email:
> can the Tika code be imported/modified into Cocoon3? 

Do you really have to modify Tika code? If so, it would be best to
contribute your changes back to their project.

Since you have to include a library I strongly recommend that everything
goes into cocoon-optional in order to keep the number of required
libraries low for the pipeline API.

> AFAIK it should
> be allowed, but I don't know the conditions under which it can be
> done.

If your question is about licensing, then it's very simple: you don't
have to do anything because Tika is an ASF project.

-- 
Reinhard Pötz   Managing Director, {Indoqa} GmbH
 http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member  reinh...@apache.org



Re: XInclude optimization

2009-11-24 Thread Sylvain Wallez

Simone Tripodi wrote:

> Hi Sylvain
> Sorry but I forgot to ask you a short question in the previous email:
> can the Tika code be imported/modified into Cocoon3? AFAIK it should
> be allowed, but I don't know the conditions under which it can be
> done.


I don't really understand your question. Tika is an Apache project, so 
there's no license issue.


Now if the question is about how, technically, to include Tika into 
Cocoon, I admit I have no clue about that.


Sylvain

--
Sylvain Wallez - http://bluxte.net



Re: XInclude optimization

2009-11-24 Thread Simone Tripodi
Hi Sylvain
Sorry but I forgot to ask you a short question in the previous email:
can the Tika code be imported/modified into Cocoon3? AFAIK it should
be allowed, but I don't know the conditions under which it can be
done.
See you soon!!!
Simo




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-24 Thread Simone Tripodi
Hi Sylvain,
I can't thank you enough; this is very much appreciated, and I'll
follow your suggestions :)
See you soon
Simone




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-24 Thread Sylvain Wallez

Simone Tripodi wrote:

> Hi Sylvain and Simone,
> thanks a lot; the suggestions you provided are all very interesting, so
> I now wonder whether it is possible to build a processor that uses the
> Tika way when it recognizes certain kinds of paths and falls back to
> "XSL-on-the-fly" for more complex cases. What do you think?


As I suggested previously: first try to parse the XPath expression with 
Tika's parser, and if it fails because the expression doesn't match the 
subset it accepts, fall back to XSL-on-the-fly.


Looking at Tika's parser [1], it looks like you'll have to override the 
parse() method to fail hard by throwing an exception rather than 
returning Matcher.FAIL, so that you can detect XPath features outside of 
the subset it accepts.



> Sylvain, I still haven't read the Tika documentation; could you point
> me to the relevant docs on this topic?


There's no specific documentation on this particular feature, as it's 
more of an internal utility than a primary feature in Tika. That said, 
the code is pretty straightforward.

> Simo, have you already tried generating the XSLT on the fly? The most
> basic approach I can think of is generating the XSL string from a
> template and then passing it to the XSL parser, but I'm sure it could
> be implemented in a better way :P


Sounds like the way to go, but you should cache the resulting template 
object to avoid recreating and reparsing the XSL at every request. The 
same applies to Tika matcher objects.
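
A minimal sketch of the caching idea, assuming plain JAXP rather than any
Cocoon or Tika API. The class name XPathStylesheetCache, its methods, and
the shape of the generated stylesheet are hypothetical and only meant to
illustrate compiling the on-the-fly XSL once per XPath expression and
reusing the compiled Templates object afterwards:

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: one compiled stylesheet per XPath expression.
final class XPathStylesheetCache {
    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    // Builds a tiny stylesheet that copies whatever the expression selects.
    static String stylesheetFor(String xpath) {
        String select = xpath.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
        return "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output omit-xml-declaration=\"yes\"/>"
                + "<xsl:template match=\"/\">"
                + "<xsl:copy-of select=\"" + select + "\"/>"
                + "</xsl:template></xsl:stylesheet>";
    }

    // Parses and compiles the stylesheet only on the first request for a given expression.
    static Templates templatesFor(String xpath) {
        return CACHE.computeIfAbsent(xpath, key -> {
            try {
                // A fresh factory per compilation, since factories need not be thread-safe.
                return TransformerFactory.newInstance()
                        .newTemplates(new StreamSource(new StringReader(stylesheetFor(key))));
            } catch (TransformerException e) {
                throw new IllegalArgumentException("Cannot compile stylesheet for " + key, e);
            }
        });
    }

    // Templates is thread-safe; each call gets its own cheap Transformer instance.
    static String extract(String xml, String xpath) {
        try {
            StringWriter out = new StringWriter();
            templatesFor(xpath).newTransformer()
                    .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed for " + xpath, e);
        }
    }
}
```

The cache here is unbounded, which is only reasonable if the set of
distinct expressions is small; a real transformer would probably want
some eviction policy.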


Sylvain

[1] 
https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/XPathParser.java


--
Sylvain Wallez - http://bluxte.net



Re: XInclude optimization

2009-11-23 Thread Simone Tripodi
Hi Sylvain and Simone,
thanks a lot; the suggestions you provided are all very interesting, so
I now wonder whether it is possible to build a processor that uses the
Tika way when it recognizes certain kinds of paths and falls back to
"XSL-on-the-fly" for more complex cases. What do you think?

Sylvain, I still haven't read the Tika documentation; could you point
me to the relevant docs on this topic?

Simo, have you already tried generating the XSLT on the fly? The most
basic approach I can think of is generating the XSL string from a
template and then passing it to the XSL parser, but I'm sure it could
be implemented in a better way :P

Any suggestion will be much appreciated, thanks in advance

Best regards, have a nice evening!!!
Simone




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-23 Thread Sylvain Wallez

Simone Gianni wrote:

> Hi Simone and Sylvain,
> aren't XSLT transformers already SAX/XPath optimized? I mean, isn't an
> XSLT containing an XPath expression and used in a SAX context already
> able to resolve the XPath while keeping buffering to the minimum
> possible?
>
> I can clearly remember that there has been a lot of work on this in
> Xalan and other XSLT engines, and also how complex XPath expressions
> could change the performance of a transformation because of increased
> buffering.


Xalan has an optimized implementation of the document tree [1], more 
efficient than the standard DOM for read-only and selection operations. 
Xalan has an incremental processing mode, but IIRC it's more about being 
able to produce some output before the whole document has been read 
than about avoiding building parts of the document tree. So it will 
allow for faster processing, but won't change memory consumption.


> In that case, maybe, instead of reinventing it, it should be possible
> to delegate the "transformation" (extraction of a fragment from the
> entire XML stream) to an XSLT processor. The simplest way could be to
> generate an XSLT on the fly :) .. the correct way would be to use the
> [Xalan|Saxon|any other] internal APIs to perform the XPath resolution.
> In both cases, it will be faster than transforming to DOM.


Agreed. It may be easier to produce a small XSL transformation from the 
XPointer expression than to use Axiom. But still, for simple expressions, 
the pure streaming approach used by Tika would be way more efficient.


Sylvain

[1] http://xml.apache.org/xalan-j/dtm.html

--
Sylvain Wallez - http://bluxte.net



Re: XInclude optimization

2009-11-23 Thread Simone Gianni

Hi Simone and Sylvain,
aren't XSLT transformers already SAX/XPath optimized? I mean, isn't an 
XSLT containing an XPath expression and used in a SAX context already 
able to resolve the XPath while keeping buffering to the minimum possible?


I can clearly remember that there has been a lot of work on this in 
Xalan and other XSLT engines, and also how complex XPath expressions 
could change the performance of a transformation because of increased 
buffering.


In that case, maybe, instead of reinventing it, it should be possible to 
delegate the "transformation" (extraction of a fragment from the entire 
XML stream) to an XSLT processor. The simplest way could be to generate 
an XSLT on the fly :) .. the correct way would be to use the 
[Xalan|Saxon|any other] internal APIs to perform the XPath resolution. 
In both cases, it will be faster than transforming to DOM.


Simone





--
Simone Gianni            CEO Semeru s.r.l.   Apache Committer
http://www.simonegianni.it/



Re: XInclude optimization

2009-11-22 Thread Simone Tripodi
Hi Sylvain,
indeed, that's yet another exception I hadn't thought of; thanks for the
clarification!!!
Have a nice day, see you soon ;)
Simo




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-22 Thread Simone Tripodi
Hi Jos,
thanks for your reply. The XPath expression is already known before
parsing the document, since the XInclude processor catches the xpointer
reference before including the document.
I think your solution works, but I suspect only for a limited subset of
XPath expressions; the exceptions arise when an expression contains
sibling/parent references...
What do you think?
Best regards and thanks for your hint!
Simone

>



-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-22 Thread Sylvain Wallez

Jos Snellings wrote:

> Hmmm, I guess the XPath expression is known before the parsing begins?
> I remember doing a similar thing, where a chunk had to be isolated
> from a document that came in via a SAX stream, but there the XPath
> expression was something like: "/element1/elemen...@id=somenumber]".
>
> Theorem: any XPath expression can be evaluated with a SAX filter.
> Proof?
> Do you know some exceptions?


What about this one: //foo[bar[position() = 3]//baz], which finds all 
"foo" elements whose 3rd "bar" child has a "baz" descendant element.


This requires buffering the contents of every "foo" element in order to 
inspect its children's subtrees.
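
To make the counter-example concrete, here is a small sketch that
evaluates the expression against a fully built DOM using the standard
javax.xml.xpath API. The class name and the "id" attributes used to tell
the "foo" elements apart are hypothetical. Whether a given "foo" is
selected is only known once its third "bar" child has been read
completely, which is why a pure SAX filter would have to hold on to the
whole "foo" element in the meantime:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: tree-based evaluation of the counter-example.
final class BufferingCounterExample {
    static final String EXPRESSION = "//foo[bar[position() = 3]//baz]";

    // Returns the "id" attribute of every "foo" element selected by EXPRESSION.
    static List<String> matchingFooIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(((Element) hits.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Evaluation failed", e);
        }
    }
}
```

For a document with two "foo" elements where only the first one has a
"baz" somewhere below its third "bar" child, only the first is
selected, even though the second may contain a "baz" under an earlier
"bar".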


Sylvain

--
Sylvain Wallez - http://bluxte.net



Re: XInclude optimization

2009-11-22 Thread Sylvain Wallez

Simone Tripodi wrote:

> Hi Sylvain,
> thanks for your kind reply! I suspected the XPath limitations that you
> explained so well, but deep in my heart I was hoping for a solution
> I didn't know about yet, which is why I asked :P :P
>
> I'll take a look at both solutions, even if the first sounds to me
> more compliant with the XPointer recommendation and at the same time
> closer to what I already did - and to older Cocoon XInclude
> implementations.


Axiom is what will give you the best compliance, but it is a 
relatively heavyweight solution compared to pure streaming. This is why 
I was suggesting choosing the actual XPath implementation according to 
the given XPath expression, since the Tika approach is really pure 
streaming. But this adds some complexity.



> Thank you very much for your hints, much appreciated :)
> See you soon!
> Simone
>
> P.S. Off-topic: maybe I'm wrong, but I'm sure we met once in Toulouse; I
> was one of the Asemantics juniors involved in Joost :P


That's right! I had not made the connection! It's a small world ;-)

Sylvain

--
Sylvain Wallez - http://bluxte.net



Re: XInclude optimization

2009-11-22 Thread Jos Snellings
Hmmm, I guess the XPath expression is known before the parsing begins?
I remember doing a similar thing, where a chunk had to be isolated
from a document that came in via a SAX stream, but there the XPath
expression was something like: "/element1/elemen...@id=somenumber]".

Theorem: any XPath expression can be evaluated with a SAX filter.
Proof?
Do you know some exceptions?
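
For the simple case described above (an absolute child path ending in an
element with a given attribute value), a streaming extraction could look
roughly like the following sketch. It assumes a plain JAXP SAX parser;
the class name, the fixed "id" attribute and the element names used in
the example are hypothetical, and the captured chunk is rebuilt as a
string only to keep the example short (text is not re-escaped and
attributes are dropped):

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: captures the first element at the given path whose id matches.
final class StreamingChunkExtractor extends DefaultHandler {
    private final String[] path;      // element names from the root down to the target
    private final String idValue;     // required value of the target's "id" attribute
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder chunk = new StringBuilder();
    private int captureDepth = -1;    // depth of the element being captured, -1 if none
    private boolean done;             // true once a chunk has been captured

    StreamingChunkExtractor(String idValue, String... path) {
        this.idValue = idValue;
        this.path = path;
    }

    private boolean atTargetPath() {
        if (open.size() != path.length) return false;
        Iterator<String> names = open.iterator();   // iterates root first
        for (String expected : path) {
            if (!expected.equals(names.next())) return false;
        }
        return true;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        if (!done && captureDepth < 0 && atTargetPath() && idValue.equals(atts.getValue("id"))) {
            captureDepth = open.size();
        }
        if (captureDepth >= 0) chunk.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (captureDepth >= 0) chunk.append("</").append(qName).append('>');
        if (open.size() == captureDepth) {   // closing the captured element itself
            captureDepth = -1;
            done = true;
        }
        open.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (captureDepth >= 0) chunk.append(ch, start, length);
    }

    static String extract(String xml, String idValue, String... path) {
        try {
            StreamingChunkExtractor handler = new StreamingChunkExtractor(idValue, path);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.chunk.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }
}
```

Only the stack of open element names is kept in memory, so nothing
outside the matching chunk is buffered. This works because the decision
to capture depends solely on the current element, its attributes and its
ancestors.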

Jos



Re: XInclude optimization

2009-11-22 Thread Simone Tripodi
Hi Sylvain,
thanks for your kind reply! I suspected the XPath limitations that you
explained so well, but deep in my heart I was hoping for a solution
I didn't know about yet, which is why I asked :P :P

I'll take a look at both solutions, even if the first sounds to me
more compliant with the XPointer recommendation and at the same time
closer to what I already did - and to older Cocoon XInclude
implementations.

Thank you very much for your hints, much appreciated :)
See you soon!
Simone

P.S. Off-topic: maybe I'm wrong, but I'm sure we met once in Toulouse; I
was one of the Asemantics juniors involved in Joost :P




-- 
http://www.google.com/profiles/simone.tripodi


Re: XInclude optimization

2009-11-22 Thread Sylvain Wallez

Simone Tripodi wrote:

> Hi all,
> I'm very sorry that I don't appear on the ML more often, but since April
> I've been working very hard for a client in Paris, which hasn't left me
> any spare time to dedicate to OS projects.


Don't be sorry. We all have our own jobs/interest/duties that have 
driven us away from Cocoon. Glad to see you back!



> I'm writing because I'm sure the XInclude transformer I submitted some
> time ago could be optimized, so I'd like to ask you for a little help :)
>
> The current state is that, when including an entire document, it is
> processed efficiently through SAX APIs; the problem comes when
> processing a document referenced by xinclude+xpointer, which forces the
> processor to extract a sub-document of the included document.
>
> To do this, I parse the document into a DOM, use XPath to extract the
> sub-document to be included, and then walk its elements, converting
> them to SAX events. As you have noticed, this takes time, too much IMO,
> but I haven't found/don't know any better solution :(
> Since you have experience with StAX, maybe you can suggest a fast
> way to evaluate an XPath against a document and emit SAX events, so that
> I can provide a much better - and above all faster - solution.
>
> Any hint? Any suggestion will be much appreciated.
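
The DOM-based approach described in the quoted message (parse to DOM,
select with XPath, replay the selection as SAX events) can be sketched
with standard JAXP as follows. This is not the code of the submitted
transformer; the class and method names are hypothetical, and the replay
step uses an identity Transformer from DOMSource to SAXResult as one
possible way to turn DOM nodes back into SAX events:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the "DOM + XPath, then SAX" inclusion strategy.
final class DomXPathToSax {

    // Builds the whole tree, selects nodes, and replays each one as SAX events.
    static void include(String xml, String xpath, ContentHandler target) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList selected = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            // Identity transformer: copies its source to its result unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            for (int i = 0; i < selected.getLength(); i++) {
                // Each replay may be wrapped in startDocument/endDocument events,
                // which a real transformer would have to filter out.
                identity.transform(new DOMSource(selected.item(i)), new SAXResult(target));
            }
        } catch (Exception e) {
            throw new IllegalStateException("Inclusion failed for " + xpath, e);
        }
    }

    // Convenience for experiments: names of the elements that were replayed.
    static List<String> elementNames(String xml, String xpath) {
        List<String> names = new ArrayList<>();
        include(xml, xpath, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        return names;
    }
}
```

The cost mentioned in the message is visible in the first step: the
complete document is materialized as a tree before a single event is
emitted, regardless of how small the selected fragment is.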
  


The problem with XPath and XML streaming (be it SAX or StAX) is that 
XPath is a language that allows exploring the document tree in all 
directions and thus inherently expects having the whole document tree 
available, which is clearly not compatible with streaming.


There are different approaches to solving this:
- use a deferred-loading DOM implementation, which buffers events only 
when it needs them to traverse the tree. Axiom [1] provides this IIRC, 
along with an XPath implementation.
- restrict the XPointer expression to a subset of XPath that can easily 
be implemented on top of a stream. This means restricting selection to 
the current element, its attributes and its ancestors. There's an 
implementation of this approach in Tika [2].


The XInclude transformer can be smart enough to use the most efficient 
implementation for the given XPath expression, i.e. try to parse it with 
Tika's restricted subset, and fall back to something more costly, either 
Axiom or plain DOM.
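
As a rough illustration of that selection step, the sketch below decides
between a streaming and a tree-based implementation. It does not use
Tika's parser: a regular expression stands in for it, and the subset it
accepts (absolute paths made only of "/" or "//" steps naming an element
or "*", with no predicates, axes or functions) is an assumption made for
this example, not Tika's actual grammar. The class and enum names are
hypothetical as well:

```java
import java.util.regex.Pattern;

// Hypothetical chooser between a streaming and a tree-based XPath strategy.
final class XPointerStrategy {

    enum Kind { STREAMING, TREE }

    // Assumed "streamable" subset: /name, //name, /*, optionally prefixed names.
    private static final Pattern STREAMABLE = Pattern.compile(
            "(//?(\\*|[A-Za-z_][\\w.-]*(:[A-Za-z_][\\w.-]*)?))+");

    // Cheap check first; anything it does not recognize goes to the costly path.
    static Kind choose(String xpath) {
        return STREAMABLE.matcher(xpath.trim()).matches() ? Kind.STREAMING : Kind.TREE;
    }
}
```

With this stand-in, an expression such as /doc/section would be handled
by the streaming implementation, while anything containing a predicate
or a parent step would go to the Axiom or DOM fallback.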


Sylvain

[1] http://ws.apache.org/commons/axiom/
[2] 
https://svn.apache.org/repos/asf/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/sax/xpath/


--
Sylvain Wallez - http://bluxte.net



Re: XInclude

2003-08-27 Thread Niclas Hedhman
Bruno Dumon said:
> The XInclude transformer depends on the setDocumentLocator() SAX
> event for getting the base location in case there is no xml:base
> attribute. If you simply have a FileGenerator with after that the
> XInclude
> transformer, everything should work well. But maybe your problem
> is caused by using another generator, or by using a transformer
> before the XInclude transformer which doesn't let through the
> setDocumentLocator event.

Hmmm... It is a bit difficult, as Forrest adds a "forest" of stuff
that I don't follow.

The bottom line is:

1. Command-line executed Forrest correctly resolves the relative
href in XInclude.
2. Mounting Forrest in a web application, the relative href becomes
relative to the mount point, and not relative to the includer
document.

Now, perhaps you think I should deal with Forrest and not Cocoon,
but I am rather sure it is more Cocoon related.


I'll try to produce an "Out-of-the-Box" Forrest test case.


Cheers,
Niclas




Re: XInclude

2003-08-25 Thread Bruno Dumon
On Mon, 2003-08-25 at 13:15, Joerg Heinicke wrote:
> Bruno, who implemented the new XInclude stuff, is/was on vacation AFAIK. 

I'm back now ;-)

> Why not simply filing a bug?

agreed, and preferably with some more information, see below...

> 
> Niclas Hedhman wrote:
> > Niclas Hedhman said:
> > 
> >>
> >>
> >> does not behave correctly. The xml:base is set to the top-level
> >>directory, i.e. content/, and not to the same directory as the
> >>including document as the spec says.
> >>Setting the xml:base attribute to the current directory, absolute
> >>or relative to the content/, works but is not a solution.

The XInclude transformer depends on the setDocumentLocator() SAX event
for getting the base location in case there is no xml:base attribute. If
you simply have a FileGenerator followed by the XInclude transformer,
everything should work well. But maybe your problem is caused by using
another generator, or by using a transformer before the XInclude
transformer that doesn't pass the setDocumentLocator event through.
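
To illustrate the mechanism (this is a sketch with plain SAX, not the
Cocoon transformer's code): a handler can keep the Locator it receives in
setDocumentLocator() and use its system id as the base for relative
hrefs. The class name and the example URLs are hypothetical, and xml:base
is only honoured on the include element itself here, not on its
ancestors:

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: resolves xi:include hrefs against the document location.
final class IncludeBaseResolver extends DefaultHandler {
    private static final String XINCLUDE_NS = "http://www.w3.org/2001/XInclude";
    private static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    private final List<String> resolved = new ArrayList<>();
    private Locator locator;

    // If an upstream component swallows this event, the base location is lost.
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!XINCLUDE_NS.equals(uri) || !"include".equals(localName)) return;
        String href = atts.getValue("", "href");
        if (href == null) return;
        String documentBase = locator == null ? null : locator.getSystemId();
        URI base = documentBase == null ? null : URI.create(documentBase);
        String xmlBase = atts.getValue(XML_NS, "base");
        if (xmlBase != null) {
            // xml:base may itself be relative to the document location.
            base = base == null ? URI.create(xmlBase) : base.resolve(xmlBase);
        }
        resolved.add(base == null ? href : base.resolve(href).toString());
    }

    static List<String> resolveIncludes(String xml, String systemId) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            InputSource source = new InputSource(new StringReader(xml));
            source.setSystemId(systemId);   // what the Locator will report
            IncludeBaseResolver handler = new IncludeBaseResolver();
            factory.newSAXParser().parse(source, handler);
            return handler.resolved;
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }
}
```

With no Locator and no xml:base, the handler has nothing to resolve
against and the href stays relative to whatever the caller assumes,
which is consistent with the symptom of includes resolving against the
mount point instead of the including document.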

-- 
Bruno Dumon http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
[EMAIL PROTECTED]  [EMAIL PROTECTED]



Re: XInclude

2003-08-25 Thread Joerg Heinicke
Bruno, who implemented the new XInclude stuff, is/was on vacation AFAIK. 
Why not simply file a bug?

Joerg




Re: XInclude

2003-08-25 Thread Niclas Hedhman
Niclas Hedhman said:
> 
>
>  does not behave correctly. The xml:base is set to the top-level
> directory, i.e. content/, and not to the same directory as the
> including document as the spec says.
> Setting the xml:base attribute to the current directory, absolute
> or relative to the content/, works but is not a solution.


Maybe I wasn't clear.

1. If I have my own sub-sitemap in the default Forrest environment,
and running command-line tools, the above doesn't resolve correctly.

2. If I define the documents as Forrest types, and let Forrest
provide the resolution, it resolves OK from the command-line.

3. If I take the whole setup that works from the command-line and
deploy it live under Jetty, the XInclude again has the xml:base set
to the top level.

Something is utterly wrong. Am I the only one who likes XInclude?

Niclas