I developed the spark-xml-utils library because we have a large amount of XML 
in our big datasets, and I felt this data could be better served by some 
helpful XML utilities. These include the ability to filter documents based on 
an XPath/XQuery expression, return specific nodes for an XPath/XQuery 
expression, or transform documents using an XQuery expression or an XSLT 
stylesheet. By providing basic wrappers around Saxon-HE, the spark-xml-utils 
library exposes XPath, XSLT, and XQuery functionality that can readily be 
leveraged by any Spark application (including the spark-shell). We want to 
share this library with the community and are making it available under the 
Apache 2.0 license.
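To illustrate the core filtering idea, here is a minimal sketch of a per-document XPath predicate, the kind of function you might pass to a Spark filter over an RDD of XML strings. Note this is an assumption-laden illustration, not the library's actual API: spark-xml-utils wraps Saxon-HE, while this sketch uses the JDK's built-in XPath engine so it runs standalone; the class and method names are hypothetical.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathFilterSketch {

    // Hypothetical per-document filter: keep a document when the XPath
    // expression's effective boolean value is true. spark-xml-utils uses
    // Saxon-HE for this; the JDK engine here is only for illustration.
    static boolean matches(String xml, String expr) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        // Wrap the expression in boolean() so any node-set result
        // collapses to a true/false predicate.
        return (Boolean) xp.evaluate("boolean(" + expr + ")",
                new InputSource(new StringReader(xml)),
                XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        String doc = "<article><title>Spark and XML</title></article>";
        // A matching expression: the title contains 'XML'.
        System.out.println(matches(doc, "/article/title[contains(., 'XML')]"));
        // A non-matching expression: there is no abstract element.
        System.out.println(matches(doc, "/article/abstract"));
    }
}
```

In a Spark job, a predicate like this would be applied inside `filter` over an RDD of document strings, so each partition evaluates the expression locally without shuffling data.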
For a point of reference, I was able to parse and apply a fairly complex XPath 
expression against 2 million documents (130 GB total, averaging 75 KB per 
document) in less than 3 minutes on an AWS cluster (at spot price) costing 
less than $1/hr. When I have a chance, I will blog/write about some of my 
other investigations using spark-xml-utils.
More about the project is available on GitHub 
(https://github.com/elsevierlabs/spark-xml-utils). There are examples of 
usage from the spark-shell as well as from a Java application. Feel free 
to use, contribute, and/or let us know how this library can be improved. Let 
me know if you have any questions.
Darin.