Hi

I recently had a look at improving the XSLT, XQuery and XPath
components in Camel.

For example, the first two of these components now support StAX as a Source,
and prefer StAX/SAX over DOM where possible. For StAX you need to enable it
with the allowStAX option (to stay backwards compatible).
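A minimal sketch of enabling this on an XSLT endpoint in a Spring XML route (the route endpoints and stylesheet name are made up; the allowStAX option is the one described above):

```
<route>
  <from uri="direct:start"/>
  <!-- allowStAX=true lets the component feed a StAX source to the transformer -->
  <to uri="xslt:transform.xsl?allowStAX=true"/>
</route>
```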

The latter (XPath) does not support this, because the javax.xml.xpath API is
limited. Likewise the XPath engine in the JDK does not support streaming, so
we end up loading the content into a DOM in memory.

So when people try to split a big XML file with XPath in Camel, they hit an
OOME, or end up with a solution that eats memory and slows the system down.

The solution is to build a custom expression that iterates the file source
in pieces and does the "XPath splitting" manually.
So I have enhanced the tokenizer language in Camel so it can do this for you.

See the sections:
- stream based
- streaming big XML payloads using Tokenizer language
at http://camel.apache.org/splitter

The idea is that you provide a start and end token, and then the
tokenizer will chop the payload by grabbing the content between those
tokens.
All in a streamed fashion using the java.util.Scanner from the JDK.
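To illustrate the idea (this is a simplified sketch, not the actual Camel code; the class and method names here are made up), a start/end token pair can be applied to a stream using java.util.Scanner by treating the end token as the delimiter:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class TokenPairSplitter {

    // Streams the source and emits each chunk found between startToken and
    // endToken. A simplified sketch of the idea; the real tokenizer language
    // in Camel has more options.
    static List<String> split(Readable source, String startToken, String endToken) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // use the end token as delimiter, so each scanner token ends one record
            scanner.useDelimiter(Pattern.quote(endToken));
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                int pos = chunk.indexOf(startToken);
                if (pos >= 0) {
                    // re-assemble a well-formed record including both tokens
                    records.add(chunk.substring(pos) + endToken);
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<records><record id=\"1\">A</record><record id=\"2\">B</record></records>";
        // "<record" (without '>') also matches records that carry attributes
        for (String record : split(new StringReader(xml), "<record", "</record>")) {
            System.out.println(record);
        }
    }
}
```

Because only the current chunk is held in memory, the payload never has to be loaded as a whole.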

I added some unit tests to simulate big data and to output performance
in camel-core
- TokenPairIteratorSplitChoicePerformanceTest
- XPathSplitChoicePerformanceTest

There is also a unit test in camel-saxon
- XPathSplitChoicePerformanceTest

I noticed Saxon is faster than the JDK XPath engine, but they both eat
up memory. I looked at Saxon and they are starting to support
streaming, but only in their EE version (which requires a paid license),
and the streaming seems to be XSLT specific at first (not XPath).

I also added INFO logging in the XPathBuilder so it logs once when it
initializes the XPathFactory. This lets you know which factory is used:
INFO  XPathBuilder                   - Created default XPathFactory
com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl@3749eb9f
For example if you have Saxon on the classpath it may use that instead.
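The same check can be done by hand with the standard javax API; this little snippet just prints the implementation class that JAXP resolves, mirroring what the new INFO log shows:

```java
import javax.xml.xpath.XPathFactory;

public class XPathFactoryCheck {
    public static void main(String[] args) {
        // JAXP factory lookup: if Saxon is on the classpath, its factory
        // implementation may be returned instead of the JDK default
        XPathFactory factory = XPathFactory.newInstance();
        System.out.println("Created default XPathFactory " + factory.getClass().getName());
    }
}
```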

For example to split 40.000 elements using the JDK XPath Engine
- Processed file with 40000 elements in: 45.521 seconds   (uses about 98mb)

And 40.000 elements with the tokenizer
- Processed file with 40000 elements in: 47.291 seconds   (uses about 6mb)

And 200.000 elements with the tokenizer
- Processed file with 200000 elements in: 3 minutes   (uses about 14mb)

I could not run the 200.000 elements test with XPath as it hit an OOME
(unless I bumped up the JVM memory allocation a lot).

So it's not really about speed, but about memory usage. The tokenizer
uses very little memory, whereas XPath just keeps eating memory.
If the XML data were very big, only the tokenizer would be able
to split the file.

The tokenizer is of course not using a real XPath expression, so you
can only split by chopping out a "record" of your XML file.
But if you structure your XML data as follows, then the tokenizer can handle it:
<records>
  <record id="1">
  </record>
  <record id="2">
  </record>
  <record id="3">
  </record>
   ....
  <record id="N">
  </record>
</records>
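Splitting such a file could then look like this in a Spring XML route (a sketch; the file endpoint and the processing endpoint are made up, and I am assuming the token/endToken attributes on the tokenize expression here):

```
<route>
  <from uri="file:inbox"/>
  <split streaming="true">
    <!-- grab everything between the start and end tokens, one record at a time -->
    <tokenize token="&lt;record" endToken="&lt;/record&gt;"/>
    <to uri="direct:processRecord"/>
  </split>
</route>
```

With streaming enabled on the splitter, only one record is held in memory at a time.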

The tokenizer also supports non-XML data, in case you have special
START/END tokens for your records.


What about other XPath libraries?
Yes, there are a few out there. Some are not actively maintained (I
guess some of the XML hype is over now) and others have a GPL license,
or another kind of license that prevents us from using them at Apache
http://www.apache.org/legal/3party.html#define-thirdpartywork



-- 
Claus Ibsen
-----------------
FuseSource
Email: cib...@fusesource.com
Web: http://fusesource.com
Twitter: davsclaus, fusenews
Blog: http://davsclaus.blogspot.com/
Author of Camel in Action: http://www.manning.com/ibsen/
