Re: Best way to read XML data from RDD
Below is the source code I use to parse an XML RDD in which each record holds single-line XML data.

```
import scala.collection.mutable.ArrayBuffer
import scala.xml.{Elem, Node, Text, XML}

// Collect "path,text" for every element whose only child is a text node.
// The buffer is passed in so that each record gets its own results.
def processNode(node: Node, fp1: String, out: ArrayBuffer[String]): Unit = node match {
  case Elem(_, _, _, _, Text(text)) =>
    out.+=:("Cust.001.001.03-" + fp1 + "," + text)
  case _ =>
    for (n <- node.child) {
      val fp = fp1 + "/" + n.label
      processNode(n, fp, out)
    }
}

// xmlData and utils.getXSD are defined elsewhere and are not shown in this post.
// toDF needs the SQL implicits in scope (import sqlContext.implicits._).
val dataDF = xmlData
  .map { x =>
    val p = XML.loadString(x.get(0).toString)
    val xsd = utils.getXSD(p)
    println("xsd -- " + xsd)
    val f = "/" + p.label
    val msgId = (p \\ "Fnd" \ "Mesg" \ "Paid" \ "Record" \ "CustInit" \ "GroupFirst" \ "MesgId").text
    val dataArray = new ArrayBuffer[String]()
    processNode(p, f, dataArray)
    (msgId, dataArray, x.get(1).toString)
  }
  .flatMap { x =>
    val msgId = x._1
    x._2.map { x1 =>
      // Split on the first comma only, so commas inside the text are kept.
      val parts = x1.split(",", 2)
      (msgId, parts(0), parts(1), x._3)
    }
  }
  .toDF("key", "attribute", "value", "type")
```

On Mon, Aug 22, 2016 at 4:34 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Do you mind sharing your code and sample data? It should be okay with
> single-line XML, if I remember this correctly.
Re: Best way to read XML data from RDD
Yes, you can use it for single-line XML or even multi-line XML.

In our typical mode of operation, we have sequence files (where the value is the XML). We then run operations over the XML to extract certain values or to transform the XML into another format (such as JSON).

If I understand your question, your content is in JSON, and some of the values within this JSON are XML strings. You should be able to use spark-xml-utils to parse such a string and filter on, or evaluate the result of, an XPath expression (or XQuery/XSLT).

One limitation of spark-xml-utils when using the evaluate operation is that it returns a string. So you have to be a little creative when returning multiple values (such as delimiting the values with a special character and then splitting on this delimiter).

Darin.

From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Sent: Monday, August 22, 2016 6:53 AM
Subject: Re: Best way to read XML data from RDD

> Hi Darin,
>
> Are you using this utility to parse single-line XML?
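As a rough illustration of the delimiter workaround described above, here is a minimal Python sketch. It does not use spark-xml-utils; the standard library's ElementTree stands in for the evaluate call, and the names `evaluate_to_string`, `split_values`, `DELIM`, and the sample XML are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed delimiter: a control character unlikely to occur in the XML text.
DELIM = "\x1f"

def evaluate_to_string(xml_string, paths):
    """Stand-in for an evaluate call that can only return one string:
    look up each path and join the results with DELIM."""
    root = ET.fromstring(xml_string)
    # findtext returns the default ("") when a path has no match
    return DELIM.join(root.findtext(p, default="") for p in paths)

def split_values(result):
    """Recover the individual values on the caller's side."""
    return result.split(DELIM)

sample = "<book><title>Spark</title><year>2016</year></book>"
joined = evaluate_to_string(sample, ["title", "year", "isbn"])
values = split_values(joined)
print(values)
```

The delimiter has to be a character that cannot appear in the extracted values, otherwise the split produces too many fields.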
RE: Best way to read XML data from RDD
I was building a small app to stream messages from Kafka via Spark. Each message is a new XML document. I wrote a simple app to do so (it expects each XML message to be on a single line):

```
from __future__ import print_function

import xml.etree.ElementTree as ET

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


## This is where you parse the XML
def create_dict(rt, parent_tag=None, out=None):
    """Map each top-level tag to the text of the last leaf found under it."""
    if out is None:
        out = {}  # fresh dictionary per message, nothing carried over
    for child in rt:
        # Nested elements keep the tag of their top-level ancestor
        tag = child.tag if parent_tag is None else parent_tag
        if len(child):
            create_dict(child, tag, out)
        else:
            # if child.tag in dict.keys():
            #     tag = tag + child.tag
            # else:
            #     tag = child.tag
            out[tag] = child.text
    return out


def parse_xml_to_row(xmlString):
    root = ET.fromstring(xmlString.encode('utf-8'))
    dct = create_dict(root)
    return Row(**dct)


def toCSVLine(data):
    return ','.join(str(d) for d in data)
## Parsing code part ends here


# sc.stop()
# Configure Spark
conf = SparkConf().setAppName("PythonStreamingKafkaWordCount")
conf = conf.setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)

zkQuorum, topic = 'localhost:2182', 'topic-name'
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
lines = kvs.map(lambda x: x[1]).map(parse_xml_to_row).map(toCSVLine)
# lines.pprint()
lines.saveAsTextFiles('where you want to write the file ')

ssc.start()
ssc.awaitTerminationOrTimeout(50)
ssc.stop()
```

Hope this is helpful.

Puneet

From: Hyukjin Kwon [mailto:gurwls...@gmail.com]
Sent: Monday, August 22, 2016 4:34 PM
Subject: Re: Best way to read XML data from RDD

> Do you mind sharing your code and sample data? It should be okay with
> single-line XML, if I remember this correctly.
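The commented-out lines in create_dict hint at tags colliding when the same tag appears in more than one place. One possible variation, offered only as a sketch and not as the code from the message above, is to key every leaf by its full path. The function name `flatten` and the sample message are made up for illustration.

```python
import xml.etree.ElementTree as ET

def flatten(elem, path=""):
    """Yield (path, text) for every leaf element at or below elem."""
    here = path + "/" + elem.tag
    children = list(elem)
    if not children:
        yield here, (elem.text or "")
    for child in children:
        # Recurse with the extended path so nested tags stay distinct
        for pair in flatten(child, here):
            yield pair

msg = "<order><id>7</id><customer><id>42</id><name>Ann</name></customer></order>"
pairs = dict(flatten(ET.fromstring(msg)))
print(pairs)
```

Both `id` elements survive because their paths differ. Repeated siblings with the same tag would still overwrite each other once the pairs are put into a dictionary, so keep the pairs as a list if that matters.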
Re: Best way to read XML data from RDD
Do you mind sharing your code and sample data? It should be okay with single-line XML, if I remember this correctly.

2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>:

> Hi Darin,
>
> Are you using this utility to parse single-line XML?
Re: Best way to read XML data from RDD
Hi Darin,

Are you using this utility to parse single-line XML?

-------- Original message --------
From: Darin McBeath <ddmcbe...@yahoo.com>
Date: 21/08/2016 17:44 (GMT+05:30)
Subject: Re: Best way to read XML data from RDD

> Another option would be to look at spark-xml-utils. We use this
> extensively in the manipulation of our XML content.
>
> https://github.com/elsevierlabs-os/spark-xml-utils
>
> There are quite a few examples. Depending on your preference (and what
> you want to do), you could use XPath, XQuery, or XSLT to transform,
> extract, or filter.
>
> As mentioned below, you want to initialize the parser in a mapPartitions
> call (one of the examples shows this).
>
> Hope this is helpful.
>
> Darin.
Re: Best way to read XML data from RDD
Hi Franke,

The source format cannot be changed as of now, as it is a pretty standard format that has been working for years. Yes, creating a single parser is something I can try out.

-------- Original message --------
From: Jörn Franke <jornfra...@gmail.com>
Date: 20/08/2016 11:40 (GMT+05:30)
Subject: Re: Best way to read XML data from RDD

> I fear the issue is that this will create and destroy an XML parser object
> 2 million times, which is very inefficient; it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally Avro or something
> like it)?
> Otherwise you could just create one XML parser object per node, but sharing
> this among the parallel tasks on the same node is tricky.
> The other possibility could be simply more hardware ...
Re: Best way to read XML data from RDD
Hi Kwon,

I was trying out the Spark XML library. I keep getting errors when it infers the schema. It looks like it cannot infer the schema of single-line XML data.

-------- Original message --------
From: Hyukjin Kwon <gurwls...@gmail.com>
Date: 21/08/2016 15:40 (GMT+05:30)
Subject: Re: Best way to read XML data from RDD

> Hi Diwakar,
>
> The Spark XML library can take an RDD as its source.
>
> ```
> val df = new XmlReader()
>   .withRowTag("book")
>   .xmlRdd(sqlContext, rdd)
> ```
>
> If performance is critical, I would also recommend taking care of the
> creation and destruction of the parser.
>
> If the parser is not serializable, then you can create it once per
> partition within mapPartitions, just like
>
> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>
> I hope this is helpful.
Re: Best way to read XML data from RDD
Another option would be to look at spark-xml-utils. We use this extensively in the manipulation of our XML content.

https://github.com/elsevierlabs-os/spark-xml-utils

There are quite a few examples. Depending on your preference (and what you want to do), you could use XPath, XQuery, or XSLT to transform, extract, or filter.

As mentioned below, you want to initialize the parser in a mapPartitions call (one of the examples shows this).

Hope this is helpful.

Darin.

From: Hyukjin Kwon <gurwls...@gmail.com>
Sent: Sunday, August 21, 2016 6:10 AM
Subject: Re: Best way to read XML data from RDD

> Hi Diwakar,
>
> The Spark XML library can take an RDD as its source.
>
> ```
> val df = new XmlReader()
>   .withRowTag("book")
>   .xmlRdd(sqlContext, rdd)
> ```
>
> If performance is critical, I would also recommend taking care of the
> creation and destruction of the parser.
>
> If the parser is not serializable, then you can create it once per
> partition within mapPartitions, just like
>
> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
>
> I hope this is helpful.
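The per-partition initialization mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions: `RecordParser` is a hypothetical stand-in for a parser that is costly to construct, its `created` counter exists only to make the effect visible, and two lists stand in for the partitions of an RDD.

```python
import xml.etree.ElementTree as ET

class RecordParser(object):
    """Hypothetical stand-in for a parser that is costly to construct."""
    created = 0  # counts constructions, only to make the effect visible

    def __init__(self):
        RecordParser.created += 1

    def parse(self, xml_string):
        return ET.fromstring(xml_string).tag

def parse_partition(records):
    """Takes an iterator over one partition's records."""
    parser = RecordParser()        # built once per partition
    for rec in records:
        yield parser.parse(rec)    # reused for every record

# Two lists stand in for two partitions of an RDD of XML strings.
partitions = [["<a/>", "<b/>", "<c/>"], ["<d/>", "<e/>"]]
results = [list(parse_partition(iter(p))) for p in partitions]
print(results, RecordParser.created)
```

With PySpark, a function of this shape would be passed as `rdd.mapPartitions(parse_partition)`, so the construction cost is paid once per partition instead of once per record.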
Re: Best way to read XML data from RDD
Hi Diwakar,

The Spark XML library can take an RDD as its source.

```
import com.databricks.spark.xml.XmlReader

val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```

If performance is critical, I would also recommend taking care of the creation and destruction of the parser.

If the parser is not serializable, then you can create it once per partition within mapPartitions, just like

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325

I hope this is helpful.

2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:

> I fear the issue is that this will create and destroy an XML parser object
> 2 million times, which is very inefficient; it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally Avro or something
> like it)?
> Otherwise you could just create one XML parser object per node, but sharing
> this among the parallel tasks on the same node is tricky.
> The other possibility could be simply more hardware ...
Re: Best way to read XML data from RDD
I fear the issue is that this will create and destroy an XML parser object 2 million times, which is very inefficient; it does not really look like a parser performance issue. Can't you do something about the format choice? Ask your supplier to deliver another format (ideally Avro or something like it)?

Otherwise you could just create one XML parser object per node, but sharing this among the parallel tasks on the same node is tricky.

The other possibility could be simply more hardware ...

> On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> wrote:
>
> Yes. It accepts an XML file as source but not an RDD. The XML data embedded
> inside the JSON is streamed from a Kafka cluster, so I can get it as an RDD.
> Right now I am using the scala.xml XML.loadString method inside an RDD map
> function, but performance-wise I am not happy, as it takes 4 minutes to
> parse XML from 2 million messages in a 3-node environment (100G, 4 CPUs each).
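One common way around the sharing problem is to avoid sharing altogether and keep one parser per worker thread, created lazily and reused across records. The Python sketch below only illustrates that idea. `RecordParser` is a hypothetical parser-like object, and plain threads stand in for parallel tasks. In Spark's JVM executors tasks run as threads, so a thread-local holder is the analogous tool there, whereas PySpark workers are separate processes.

```python
import threading
import xml.etree.ElementTree as ET

class RecordParser(object):
    """Hypothetical parser-like object that should not be shared across threads."""
    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with RecordParser._lock:      # counter is only here to show the effect
            RecordParser.created += 1

    def parse(self, xml_string):
        return ET.fromstring(xml_string).tag

_local = threading.local()

def get_parser():
    """Return this thread's parser, creating it on first use."""
    if not hasattr(_local, "parser"):
        _local.parser = RecordParser()
    return _local.parser

def run_task(records, out):
    for rec in records:
        out.append(get_parser().parse(rec))

inputs = [["<a/>", "<b/>"], ["<c/>", "<d/>", "<e/>"]]
outputs = [[], []]
threads = [threading.Thread(target=run_task, args=(inputs[i], outputs[i]))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(outputs, RecordParser.created)
```

Five records are parsed, but only two parsers are built, one per thread, and no parser is ever touched by two threads.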
Re: Best way to read XML data from RDD
Ah. Have you tried Jackson?

https://github.com/FasterXML/jackson-dataformat-xml/blob/master/README.md

From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
Sent: Friday, August 19, 2016 9:41 PM
Subject: Re: Best way to read XML data from RDD

> Yes. It accepts an XML file as source but not an RDD. The XML data embedded
> inside the JSON is streamed from a Kafka cluster, so I can get it as an RDD.
> Right now I am using the scala.xml XML.loadString method inside an RDD map
> function, but performance-wise I am not happy, as it takes 4 minutes to
> parse XML from 2 million messages in a 3-node environment (100G, 4 CPUs each).
Re: Best way to read XML data from RDD
Yes. It accepts an XML file as source but not an RDD. The XML data embedded inside the JSON is streamed from a Kafka cluster, so I can get it as an RDD. Right now I am using the scala.xml XML.loadString method inside an RDD map function, but performance-wise I am not happy, as it takes 4 minutes to parse XML from 2 million messages in a 3-node environment (100G, 4 CPUs each).

-------- Original message --------
From: Felix Cheung <felixcheun...@hotmail.com>
Date: 20/08/2016 09:49 (GMT+05:30)
Subject: Re: Best way to read XML data from RDD

> Have you tried
>
> https://github.com/databricks/spark-xml
> ?
Re: Best way to read XML data from RDD
Have you tried

https://github.com/databricks/spark-xml
?

On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanusk...@gmail.com> wrote:

> Hi,
>
> There is an RDD with JSON data. I can read the JSON data using rdd.read.json.
> The JSON data has XML data in a couple of key-value pairs.
>
> Which is the best method to read and parse XML from an RDD? Are there any
> specific XML libraries for Spark? Could anyone help with this?
>
> Thanks.
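For readers who want to see the shape of the problem in the original question, here is a minimal Python sketch using only the standard library. The record layout, the key name `payload`, and the helper `extract_xml_fields` are assumptions for illustration; the real messages in this thread are not shown anywhere.

```python
import json
import xml.etree.ElementTree as ET

def extract_xml_fields(json_line, xml_keys):
    """Parse one JSON record, then parse the XML strings stored under xml_keys."""
    record = json.loads(json_line)
    parsed = {}
    for key in xml_keys:
        raw = record.get(key)
        if raw:  # skip keys that are absent or empty
            parsed[key] = ET.fromstring(raw)
    return record, parsed

# Assumed sample record: one JSON object whose "payload" value is an XML string.
line = json.dumps({"id": 1, "payload": "<msg><to>Ann</to><body>hi</body></msg>"})
record, parsed = extract_xml_fields(line, ["payload", "missing"])
to_value = parsed["payload"].findtext("to")
print(record["id"], to_value)
```

In a Spark job a function like this would run inside a map or mapPartitions over the RDD of JSON strings, with the per-partition setup concerns discussed elsewhere in the thread.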