Would you mind sharing your code and some sample data? It should work with single-line XML, if I remember correctly.
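
For reference, here is a minimal sketch of the per-partition parser setup discussed further down the thread. It is only an illustration under stated assumptions, not code from spark-xml or spark-xml-utils: the object name PerPartitionXml, the helper parsePartition, and the <title> element are hypothetical, and it uses the JDK's DocumentBuilder rather than any Spark-specific parser. The only point is that the parser is built once per partition instead of once per record.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

object PerPartitionXml {
  // Hypothetical helper: parses every XML string of one partition with a
  // single DocumentBuilder and returns the text of the first <title>
  // element of each record ("" if the record has none).
  def parsePartition(records: Iterator[String]): Iterator[String] = {
    // Created once per partition, not once per record.
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    records.map { xml =>
      val doc = builder.parse(new InputSource(new StringReader(xml)))
      val titles = doc.getElementsByTagName("title")
      if (titles.getLength > 0) titles.item(0).getTextContent else ""
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the records of one partition, so this runs without Spark.
    val sample = Iterator(
      "<book><title>A</title></book>",
      "<book><title>B</title></book>"
    )
    parsePartition(sample).foreach(println)
  }
}
```

With a real RDD[String] called rdd (an assumed name), the call would be rdd.mapPartitions(PerPartitionXml.parsePartition). Because the builder is created inside the function, it never has to be serialized and shipped to the executors.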
2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>:

> Hi Darin,
>
> Are you using this utility to parse single-line XML?
>
> Sent from Samsung Mobile.
>
> -------- Original message --------
> From: Darin McBeath <ddmcbe...@yahoo.com>
> Date: 21/08/2016 17:44 (GMT+05:30)
> To: Hyukjin Kwon <gurwls...@gmail.com>, Jörn Franke <jornfra...@gmail.com>
> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org>
> Subject: Re: Best way to read XML data from RDD
>
> Another option would be to look at spark-xml-utils. We use this
> extensively in the manipulation of our XML content.
>
> https://github.com/elsevierlabs-os/spark-xml-utils
>
> There are quite a few examples. Depending on your preference (and what
> you want to do), you could use XPath, XQuery, or XSLT to transform,
> extract, or filter.
>
> As mentioned below, you want to initialize the parser in a mapPartitions
> call (one of the examples shows this).
>
> Hope this is helpful.
>
> Darin.
>
> ________________________________
> From: Hyukjin Kwon <gurwls...@gmail.com>
> To: Jörn Franke <jornfra...@gmail.com>
> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>; Felix Cheung <felixcheun...@hotmail.com>; user <user@spark.apache.org>
> Sent: Sunday, August 21, 2016 6:10 AM
> Subject: Re: Best way to read XML data from RDD
>
> Hi Diwakar,
>
> The Spark XML library can take an RDD as its source.
>
> ```
> val df = new XmlReader()
>   .withRowTag("book")
>   .xmlRdd(sqlContext, rdd)
> ```
>
> If performance is critical, I would also recommend taking care of the
> creation and destruction of the parser.
>
> If the parser is not serializable, you can create it once per
> partition within mapPartitions, just like
>
> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>
> I hope this is helpful.
>
> 2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:
>
> > I fear the issue is that this will create and destroy an XML parser object
> > 2 million times, which is very inefficient; it does not really look like a
> > parser performance issue. Can't you do something about the format choice?
> > Ask your supplier to deliver another format (ideally Avro or something
> > like it)?
> >
> > Otherwise you could just create one XML parser object per node, but sharing
> > it among the parallel tasks on the same node is tricky.
> >
> > The other possibility could simply be more hardware...
> >
> > On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> wrote:
> >
> > > Yes. It accepts an XML file as source, but not an RDD. The XML data is
> > > embedded inside JSON that is streamed from a Kafka cluster, so I can get
> > > it as an RDD.
> > >
> > > Right now I am using the scala.xml XML.loadString method inside an RDD map
> > > function, but I am not happy with the performance: it takes 4 minutes to
> > > parse the XML from 2 million messages in an environment of 3 nodes, each
> > > with 100G and 4 CPUs.
> > >
> > > Sent from Samsung Mobile.
> > >
> > > -------- Original message --------
> > > From: Felix Cheung <felixcheun...@hotmail.com>
> > > Date: 20/08/2016 09:49 (GMT+05:30)
> > > To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user <user@spark.apache.org>
> > > Cc:
> > > Subject: Re: Best way to read XML data from RDD
> > >
> > > Have you tried
> > >
> > > https://github.com/databricks/spark-xml
> > > ?
> > >
> > > On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanusk...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > There is an RDD with JSON data. I could read the JSON data using
> > > rdd.read.json. The JSON data has XML data in a couple of key-value pairs.
> > >
> > > Which is the best method to read and parse XML from an RDD? Are there any
> > > specific XML libraries for Spark? Could anyone help with this?
> > >
> > > Thanks.