Below is the source code for parsing an XML RDD in which each record holds single-line XML data.

import scala.xml.{Elem, Node, Text, XML}
import scala.collection.mutable.ArrayBuffer
var dataArray = new ArrayBuffer[String]()

def processNode(node: Node, fp1: String): Unit = node match {
  case Elem(prefix, label, attribs, scope, Text(text)) =>
    dataArray.prepend("Cust.001.001.03-" + fp1 + "," + text)
  case _ =>
    for (n <- node.child) {
      val fp = fp1 + "/" + n.label
      processNode(n, fp)
    }
}

val dataDF = xmlData
  .map { x =>
    val p = XML.loadString(x.get(0).toString)
    val xsd = utils.getXSD(p)
    println("xsd -- " + xsd)
    val f = "/" + p.label
    val msgId = (p \\ "Fnd" \ "Mesg" \ "Paid" \ "Record" \ "CustInit" \ "GroupFirst" \ "MesgId").text
    processNode(p, f)
    (msgId, dataArray, x.get(1).toString)
  }
  .flatMap { x =>
    val msgId = x._1
    x._2.toIterable.map { x1 =>
      (msgId, x1.split(",").apply(0), x1.split(",").apply(1), x._3)
    }
  }
  .toDF("key", "attribute", "value", "type")

On Mon, Aug 22, 2016 at 4:34 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Do you mind sharing your code and sample data? It should be okay with
> single-line XML if I remember this correctly.
>
> 2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>:
>
>> Hi Darin,
>>
>> Are you using this utility to parse single-line XML?
>>
>> Sent from Samsung Mobile.
>>
>> -------- Original message --------
>> From: Darin McBeath <ddmcbe...@yahoo.com>
>> Date: 21/08/2016 17:44 (GMT+05:30)
>> To: Hyukjin Kwon <gurwls...@gmail.com>, Jörn Franke <jornfra...@gmail.com>
>> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>, user <user@spark.apache.org>
>> Subject: Re: Best way to read XML data from RDD
>>
>> Another option would be to look at spark-xml-utils. We use this
>> extensively in the manipulation of our XML content.
>>
>> https://github.com/elsevierlabs-os/spark-xml-utils
>>
>> There are quite a few examples. Depending on your preference (and what
>> you want to do), you could use XPath, XQuery, or XSLT to transform,
>> extract, or filter.
>> Like mentioned below, you want to initialize the parser in a
>> mapPartitions call (one of the examples shows this).
>>
>> Hope this is helpful.
>>
>> Darin.
>>
>> ________________________________
>> From: Hyukjin Kwon <gurwls...@gmail.com>
>> To: Jörn Franke <jornfra...@gmail.com>
>> Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>; Felix Cheung <felixcheun...@hotmail.com>; user <user@spark.apache.org>
>> Sent: Sunday, August 21, 2016 6:10 AM
>> Subject: Re: Best way to read XML data from RDD
>>
>> Hi Diwakar,
>>
>> The Spark XML library can take an RDD as the source:
>>
>> ```
>> val df = new XmlReader()
>>   .withRowTag("book")
>>   .xmlRdd(sqlContext, rdd)
>> ```
>>
>> If performance is critical, I would also recommend taking care of the
>> creation and destruction of the parser.
>>
>> If the parser is not serializable, then you can do the creation for each
>> partition within mapPartitions, just like:
>>
>> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>>
>> I hope this is helpful.
>>
>> 2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:
>>
>> I fear the issue is that this will create and destroy an XML parser object
>> 2 million times, which is very inefficient - it does not really look like a
>> parser performance issue. Can't you do something about the format choice?
>> Ask your supplier to deliver another format (ideally Avro or something like
>> that)?
>> > Otherwise you could just create one XML parser object per node, but
>> > sharing this among the parallel tasks on the same node is tricky.
>> > The other possibility could be simply more hardware ...
>> >
>> > On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> wrote:
>> >
>> > Yes. It accepts an XML file as source but not an RDD. The XML data
>> > embedded inside JSON is streamed from a Kafka cluster, so I could get
>> > it as an RDD.
>> >> Right now I am using the scala.xml XML.loadString method inside an RDD
>> >> map function, but performance-wise I am not happy, as it takes 4 minutes
>> >> to parse the XML from 2 million messages in a 3-node environment with
>> >> 100G and 4 CPUs each.
>> >>
>> >> Sent from Samsung Mobile.
>> >>
>> >> -------- Original message --------
>> >> From: Felix Cheung <felixcheun...@hotmail.com>
>> >> Date: 20/08/2016 09:49 (GMT+05:30)
>> >> To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user <user@spark.apache.org>
>> >> Cc:
>> >> Subject: Re: Best way to read XML data from RDD
>> >>
>> >> Have you tried
>> >>
>> >> https://github.com/databricks/spark-xml
>> >> ?
>> >>
>> >> On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanusk...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> There is an RDD with JSON data. I could read the JSON data using
>> >> rdd.read.json. The JSON data has XML data in a couple of key-value pairs.
>> >>
>> >> Which is the best method to read and parse XML from an RDD? Are there any
>> >> specific XML libraries for Spark? Could anyone help on this?
>> >>
>> >> Thanks.
>> >
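To make the mapPartitions advice in this thread concrete, here is a minimal plain-Scala sketch of the pattern: the parser is built once for the whole iterator (as it would be once per Spark partition), then reused for every record, instead of constructing a new parser 2 million times inside map. This uses only the JDK's javax.xml DOM parser; the function name and sample records are my own illustration, not anything from the code above.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

// Sketch of a per-partition parse: the DocumentBuilder is created once per
// iterator (i.e. once per partition), then reused sequentially for each record.
def parsePartition(records: Iterator[String]): Iterator[String] = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  records.map { record =>
    val doc = builder.parse(new ByteArrayInputStream(record.getBytes("UTF-8")))
    // Illustrative extraction: just return the root element's tag name.
    doc.getDocumentElement.getTagName
  }
}

// Local demonstration with two single-line XML records.
val sample = Iterator("<MesgId>42</MesgId>", "<Record/>")
println(parsePartition(sample).toList)
```

In Spark this would be wired up as `xmlRdd.mapPartitions(parsePartition)`, so the parser's cost is paid once per partition rather than once per message.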