Romain has worked on a StAX expression iterator that can also split
big XML files, but using the JAXB/StAX APIs.
https://issues.apache.org/jira/browse/CAMEL-3966

This requires end users to have model classes with JAXB annotations,
which you then use as the matcher in the iterator.
So you would have Records and Record classes with JAXB annotations.
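
To give a feel for the StAX side, here is a minimal sketch using only the
JDK's javax.xml.stream API. It is not the code from CAMEL-3966; the class
and method names are made up for illustration, and the JAXB step is only
indicated in a comment since it needs the model classes.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Camel: walks the XML with a pull parser
// and reports the "id" attribute of every matching element.
class StaxRecordSketch {

    static void forEachRecordId(Reader source, String elementName, Consumer<String> handler)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(source);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // With JAXB model classes, this is the point where an
                    // Unmarshaller could unmarshal the current element instead.
                    handler.accept(reader.getAttributeValue(null, "id"));
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<records><record id=\"1\"></record><record id=\"2\"></record></records>";
        forEachRecordId(new StringReader(xml), "record", id -> System.out.println("record " + id));
    }
}
```

Only the current parse event is held at a time, so the whole document is
never built up in memory as a DOM.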

This would also be a solution, but it is of course purely XML based and
also requires model classes. However, I like this approach.
It could also be a base for the StAXBuilder that Christian Mueller has
proposed in https://issues.apache.org/jira/browse/CAMEL-3998



On Sun, Oct 30, 2011 at 10:50 AM, Claus Ibsen <claus.ib...@gmail.com> wrote:
> Hi
>
> I recently had a look at improving the XSLT, XQuery and XPath
> components in Camel.
>
> For example, the first two of these components now support StAX as Source,
> and prefer StAX/SAX over DOM. For StAX you need to enable it
> using the allowStAX option (to stay backwards compatible).
>
> The latter (XPath) does not support this, because its javax API is limited.
> Likewise the XPath engine in the JDK does not support streaming, so we
> end up loading the content into a DOM in memory.
>
> This means that when people try to split a big XML file with
> XPath in Camel, they either hit an OOME or end up with a solution that
> eats up memory and slows the system down.
>
> The solution is to build a custom expression that will iterate the
> file source in pieces and do the "XPath splitting" manually.
> So I have enhanced the tokenizer language in Camel so it can do this for you.
>
> See the sections:
> - stream based
> - streaming big XML payloads using Tokenizer language
> at http://camel.apache.org/splitter
>
> The idea is that you provide a start and end token, and then the
> tokenizer will chop the payload by grabbing the content between those
> tokens.
> All in a streamed fashion using the java.util.Scanner from the JDK.
>
> I added some unit tests to simulate big data and to output performance
> in camel-core
> - TokenPairIteratorSplitChoicePerformanceTest
> - XPathSplitChoicePerformanceTest
>
> In camel-saxon we have a unit test as well
> - XPathSplitChoicePerformanceTest
>
> I noticed Saxon is faster than the JDK XPath engine, but they both eat
> up memory. I looked at Saxon and they are starting to support
> streaming but only in their EE version (which you need to buy a
> license for) and the streaming seems to be XSLT specific at first.
> (Not XPath).
>
> I also added INFO logging to the XPathBuilder, so it logs once when
> it initializes the XPathFactory. This lets you see which factory
> is used:
> INFO  XPathBuilder                   - Created default XPathFactory
> com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f
> For example if you have Saxon on the classpath it may use that instead.
>
> For example to split 40.000 elements using the JDK XPath Engine
> - Processed file with 40000 elements in: 45.521 seconds   (uses about 98mb)
>
> And 40.000 elements with the tokenizer
> - Processed file with 40000 elements in: 47.291 seconds   (uses about 6mb)
>
> And 200.000 elements with the tokenizer
> - Processed file with 200000 elements in: 3 minutes   (uses about 14mb)
>
> I could not run the 200.000 elements with XPath as it hit OOME (unless
> I bump up the JVM memory allocations a lot)
>
> So it's not really about speed, but about memory usage. The tokenizer
> has very low memory usage, whereas XPath will just keep eating
> memory.
> If the XML data is very big, then only the tokenizer is able
> to split the file.
>
> The tokenizer is of course not using a real XPath expression, so you
> can only split by chopping out a "record" of your XML file.
> But if you structure your XML data as follows, then the tokenizer can
> handle it:
> <records>
>  <record id="1">
>  </record>
>  <record id="2">
>  </record>
>  <record id="3">
>  </record>
>   ....
>  <record id="N">
>  </record>
> </records>
>
> Also the tokenizer can support non XML as well, in case you have
> special START/END tokens for your records.
>
>
> What about other XPath libraries?
> Yes, there are a few out there. Some are not so actively maintained (I
> guess some of the XML hype is over now), and others have a GPL license
> or another kind of license that prevents us from using them at Apache:
> http://www.apache.org/legal/3party.html#define-thirdpartywork
>
>
>
> --
> Claus Ibsen
> -----------------
> FuseSource
> Email: cib...@fusesource.com
> Web: http://fusesource.com
> Twitter: davsclaus, fusenews
> Blog: http://davsclaus.blogspot.com/
> Author of Camel in Action: http://www.manning.com/ibsen/
>

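The start/end token idea from the quoted mail can be sketched with plain
JDK classes. This is not Camel's tokenizer implementation; the class and
method names are made up for illustration, and it assumes records are
well formed and not nested.

```java
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Camel: hands out one record at a time
// so only the current record has to be held in memory.
class TokenPairSplitSketch {

    static void split(Readable source, String startToken, String endToken,
                      Consumer<String> handler) {
        try (Scanner scanner = new Scanner(source)) {
            // Read up to each end token; quote it so it is matched literally.
            scanner.useDelimiter(Pattern.quote(endToken));
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                // Take the start token closest to the end token, so that
                // "<record" does not accidentally match "<records>".
                int start = chunk.lastIndexOf(startToken);
                if (start >= 0) {
                    // The delimiter is consumed by the Scanner, so add it back.
                    handler.accept(chunk.substring(start) + endToken);
                }
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\">a</record><record id=\"2\">b</record></records>";
        split(new StringReader(xml), "<record", "</record>", System.out::println);
    }
}
```

The handler is called per record rather than collecting a list, which is
what keeps memory usage flat regardless of how many records the file has.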


-- 
Claus Ibsen
-----------------
FuseSource
Email: cib...@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/
