I developed the spark-xml-utils library because we work with large datasets containing a great deal of XML, and I felt this data could be better served by some helpful XML utilities. These include the ability to filter documents based on an XPath/XQuery expression, return specific nodes for an XPath/XQuery expression, or transform documents using XQuery or an XSLT stylesheet. By providing basic wrappers around Saxon-HE, spark-xml-utils exposes XPath, XSLT, and XQuery functionality that can readily be leveraged by any Spark application (including the spark-shell). We want to share this library with the community and are making it available under the Apache 2.0 license.

As a point of reference, I was able to parse and apply a fairly complex XPath expression against 2 million documents (130 GB total, averaging 75 KB per document) in less than 3 minutes on an AWS cluster (at spot price) costing less than $1/hr. When I have a chance, I will blog about some of my other investigations using spark-xml-utils.

More about the project is available on GitHub (https://github.com/elsevierlabs/spark-xml-utils), with examples of usage from the spark-shell as well as from a Java application. Feel free to use it, contribute, and/or let us know how the library can be improved. Let me know if you have any questions.

Darin.
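To give a feel for the filtering capability described above, here is a minimal standalone sketch of XPath-based document filtering. Note this uses only the JDK's built-in `javax.xml.xpath` package rather than spark-xml-utils itself (which wraps Saxon-HE and has its own processor classes; see the GitHub README for the actual API), and the sample document and expression are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;

public class XPathFilterSketch {

    // Returns true if the XPath expression matches at least one node
    // in the given XML document string.
    static boolean matches(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // Wrap the expression in boolean(...) so any match yields true.
        return (Boolean) XPathFactory.newInstance()
                .newXPath()
                .evaluate("boolean(" + xpath + ")", doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document; real corpora would hold one XML string per record.
        String article = "<article><meta><subject>chemistry</subject></meta></article>";
        System.out.println(matches(article, "/article/meta[subject='chemistry']")); // true
        System.out.println(matches(article, "/article/meta[subject='physics']"));   // false
    }
}
```

In a Spark job, a predicate like `matches` would be applied inside a filter over an RDD of XML strings, keeping only the documents that satisfy the expression; spark-xml-utils packages that pattern (backed by Saxon-HE, which also brings XPath 2.0+ support) so it can be used directly from the spark-shell or a Java application.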