Romain has worked on a StAX expression iterator that can also split big XML
files, but using the JAXB/StAX API.
https://issues.apache.org/jira/browse/CAMEL-3966

This requires end users to have model classes with JAXB annotations, which
you then use as the matcher in the iterator. So you would have Records and
Record classes with JAXB annotations.

This would also be a solution, but it is of course purely XML based and
requires model classes. However, I like this approach, and it could be a
base for a StAXBuilder that Christian Mueller has proposed in
https://issues.apache.org/jira/browse/CAMEL-3998

On Sun, Oct 30, 2011 at 10:50 AM, Claus Ibsen <[email protected]> wrote:
> Hi
>
> I recently had a look at improving the XSLT, XQuery and XPath
> components in Camel.
>
> For example, the first two of these components now support StAX as Source,
> and prefer StAX/SAX over DOM etc. For StAX you will need to enable it
> using the allowStAX option (to be backwards compatible).
>
> The latter (XPath) does not support this, because its javax API is limited.
> Likewise the XPath engine in the JDK does not support streaming, so we
> end up loading the content into a DOM in memory.
>
> So this means that when people are trying to split a big XML file with
> XPath in Camel, they hit OOME or have a solution that eats up memory
> and the system becomes slower.
>
> The solution is to build a custom expression that will iterate the
> file source in pieces and do the "XPath splitting" manually.
> So I have enhanced the tokenizer language in Camel so it can do this for you.
>
> See the sections:
> - stream based
> - streaming big XML payloads using Tokenizer language
> at http://camel.apache.org/splitter
>
> The idea is that you provide a start and end token, and then the
> tokenizer will chop the payload by grabbing the content between those
> tokens.
> All in a streamed fashion using the java.util.Scanner from the JDK.
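
To make the start/end token idea concrete, here is a minimal sketch in plain
Java of chopping a payload between two tokens with java.util.Scanner. It is
not the Camel implementation. The class name TokenPairSketch and the helper
splitAll are made up for illustration, and the trailing space in the
"<record " start token is an assumption so that the <records> root element
is not matched by accident.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical illustration only, not the class used inside Camel.
final class TokenPairSketch implements Iterator<String> {
    private final Scanner scanner;
    private final String startToken;
    private final String endToken;
    private String next;

    TokenPairSketch(InputStream in, String startToken, String endToken) {
        this.startToken = startToken;
        this.endToken = endToken;
        // Use the end token as the Scanner delimiter, so each chunk read
        // from the stream stops right before one end token.
        this.scanner = new Scanner(in, "UTF-8").useDelimiter(Pattern.quote(endToken));
        this.next = advance();
    }

    // Reads chunks until one contains the start token, or the input ends.
    private String advance() {
        while (scanner.hasNext()) {
            String chunk = scanner.next();
            int start = chunk.indexOf(startToken);
            if (start >= 0) {
                // Drop anything before the start token and put back the end
                // token that the Scanner consumed as a delimiter.
                // Limitation: a truncated last record (start token without
                // an end token) would also get an end token appended.
                return chunk.substring(start) + endToken;
            }
        }
        scanner.close();
        return null;
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        next = advance();
        return current;
    }

    // Convenience helper for small payloads; a real splitter would hand
    // each record to a consumer instead of collecting them in a list.
    static List<String> splitAll(String payload, String startToken, String endToken) {
        InputStream in = new ByteArrayInputStream(payload.getBytes(StandardCharsets.UTF_8));
        List<String> records = new ArrayList<>();
        TokenPairSketch it = new TokenPairSketch(in, startToken, endToken);
        while (it.hasNext()) {
            records.add(it.next());
        }
        return records;
    }
}
```

Calling TokenPairSketch.splitAll(xml, "<record ", "</record>") returns one
string per record. Only the current chunk is held while iterating, which is
why this style of splitting stays low on memory.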
>
> I added some unit tests to simulate big data and to output performance
> in camel-core
> - TokenPairIteratorSplitChoicePerformanceTest
> - XPathSplitChoicePerformanceTest
>
> In camel-saxon we have a unit test as well
> - XPathSplitChoicePerformanceTest
>
> I noticed Saxon is faster than the JDK XPath engine, but they both eat
> up memory. I looked at Saxon and they are starting to support
> streaming but only in their EE version (which you need to buy a
> license for) and the streaming seems to be XSLT specific at first
> (not XPath).
>
> I also added INFO logging in the XPathBuilder so it logs once when
> it initializes the XPathFactory. This allows you to know which factory
> is used
> INFO XPathBuilder - Created default XPathFactory
> com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f
> For example if you have Saxon on the classpath it may use that instead.
>
> For example to split 40.000 elements using the JDK XPath Engine
> - Processed file with 40000 elements in: 45.521 seconds (uses about 98mb)
>
> And 40.000 elements with the tokenizer
> - Processed file with 40000 elements in: 47.291 seconds (uses about 6mb)
>
> And 200.000 elements with the tokenizer
> - Processed file with 200000 elements in: 3 minutes (uses about 14mb)
>
> I could not run the 200.000 elements with XPath as it hit OOME (unless
> I bump up the JVM memory allocations a lot)
>
> So it's not really about speed, but about memory usage. The tokenizer
> has very low memory usage, whereas XPath will just keep eating
> memory.
> Now if the XML data was very big then only the tokenizer would be able
> to split the file.
>
> The tokenizer is of course not using a real XPath expression, so you
> can only split by chopping out a "record" of your XML file.
> But if you structure your XML data as follows, then the tokenizer can handle
> it:
> <records>
>   <record id="1">
>   </record>
>   <record id="2">
>   </record>
>   <record id="3">
>   </record>
>   ....
>   <record id="N">
>   </record>
> </records>
>
> Also the tokenizer can support non-XML as well, in case you have
> special START/END tokens for your records.
>
>
> What about other XPath libraries?
> Yes, there are a few out there. Some are not so actively maintained (I
> guess some of the XML hype is over now) and others have a GPL license or
> another kind of license that prevents us from using them at Apache
> http://www.apache.org/legal/3party.html#define-thirdpartywork
>
>
>
> --
> Claus Ibsen
> -----------------
> FuseSource
> Email: [email protected]
> Web: http://fusesource.com
> Twitter: davsclaus, fusenews
> Blog: http://davsclaus.blogspot.com/
> Author of Camel in Action: http://www.manning.com/ibsen/
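
For comparison, the StAX route mentioned at the top can be sketched with
only the JDK's javax.xml.stream API. This is a rough illustration under
stated assumptions, not the CAMEL-3966 iterator: it skips the JAXB
unmarshalling entirely and only reads the id attribute of each record.
StaxRecordSketch and forEachRecordId are hypothetical names.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: walks the XML as a stream of events and
// hands the "id" attribute of every matching element to a callback.
final class StaxRecordSketch {

    static void forEachRecordId(Reader xml, String elementName, Consumer<String> handler) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    // Only start tags with the wanted local name are of interest;
                    // everything else is skipped without building a DOM.
                    if (event == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // A null namespace URI means the namespace is not checked.
                        handler.accept(reader.getAttributeValue(null, "id"));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to deal with a checked exception.
            throw new IllegalStateException("Could not parse XML stream", e);
        }
    }
}
```

Unlike the tokenizer, this reads real XML events, so it copes with
attributes, whitespace and nesting, but it only works for XML payloads.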
