Yes, you can use it for single-line or multi-line XML.
In our typical mode of operation, we have sequence files (where the value is 
the XML).  We then run operations over the XML to extract certain values or to 
transform the XML into another format (such as JSON).
If I understand your question, your content is in JSON, and some of the values 
within this JSON are XML strings.  You should be able to use spark-xml-utils to 
parse such a string and filter on, or evaluate, an XPath expression (or 
XQuery/XSLT) against it.
One limitation of spark-xml-utils when using the evaluate operation is that it 
returns a string.  So you have to be a little creative when returning multiple 
values (such as delimiting the values with a special character and then 
splitting on this delimiter).
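
A minimal sketch of the delimiter workaround described above. It uses the JDK's javax.xml.xpath as a stand-in rather than spark-xml-utils itself, since the spark-xml-utils calls are not shown in this thread. The object name, the sample XML, and the "|" delimiter are assumptions made for illustration.

```scala
import java.io.StringReader
import java.util.regex.Pattern
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object DelimitedXPath {
  // Hypothetical delimiter: choose a character that cannot occur in the values.
  val Delim = "|"

  // Evaluate one XPath expression against an XML string; the result is a single string.
  def evaluate(xml: String, expr: String): String = {
    val xpath = XPathFactory.newInstance().newXPath()
    xpath.evaluate(expr, new InputSource(new StringReader(xml)))
  }

  def main(args: Array[String]): Unit = {
    val xml = "<book><title>Learning Spark</title><year>2015</year></book>"
    // concat() packs several values into the one string that comes back.
    val expr = s"concat(/book/title, '$Delim', /book/year)"
    // Quote the delimiter so split treats it literally, not as a regex.
    val Array(title, year) = evaluate(xml, expr).split(Pattern.quote(Delim), -1)
    println(s"$title / $year")
  }
}
```

The same pack-then-split idea should carry over to any evaluate call that can only return one string, as long as the delimiter never appears in the data.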
Darin.

From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>
To: Darin McBeath <ddmcbe...@yahoo.com>; Hyukjin Kwon <gurwls...@gmail.com>; 
Jörn Franke <jornfra...@gmail.com> 
Cc: Felix Cheung <felixcheun...@hotmail.com>; user <user@spark.apache.org>
Sent: Monday, August 22, 2016 6:53 AM
Subject: Re: Best way to read XML data from RDD

Hi Darin, 
Are you using this utility to parse single-line XML?


-------- Original message --------
From: Darin McBeath <ddmcbe...@yahoo.com>
Date: 21/08/2016 17:44 (GMT+05:30)
To: Hyukjin Kwon <gurwls...@gmail.com>, Jörn Franke <jornfra...@gmail.com>
Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung 
<felixcheun...@hotmail.com>, user <user@spark.apache.org>
Subject: Re: Best way to read XML data from RDD

Another option would be to look at spark-xml-utils.  We use this extensively in 
the manipulation of our XML content.

https://github.com/elsevierlabs-os/spark-xml-utils



There are quite a few examples.  Depending on your preference (and what you 
want to do), you could use xpath, xquery, or xslt to transform, extract, or 
filter.

As mentioned below, you will want to initialize the parser in a mapPartitions 
call (one of the examples shows this).

Hope this is helpful.

Darin.





________________________________
From: Hyukjin Kwon <gurwls...@gmail.com>
To: Jörn Franke <jornfra...@gmail.com> 
Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>; Felix Cheung 
<felixcheun...@hotmail.com>; user <user@spark.apache.org>
Sent: Sunday, August 21, 2016 6:10 AM
Subject: Re: Best way to read XML data from RDD



Hi Diwakar,

The Spark XML library can take an RDD as its source.

```
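import com.databricks.spark.xml.XmlReader

// rdd is an RDD[String] in which each element holds XML text
val df = new XmlReader()
  .withRowTag("book")          // element that maps to one DataFrame row
  .xmlRdd(sqlContext, rdd)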
```

If performance is critical, I would also recommend paying attention to the 
creation and destruction of the parser.

If the parser is not serializable, you can create it once per partition 
within mapPartitions, just like

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
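
A rough sketch of that per-partition pattern, assuming the JDK's DocumentBuilder as the parser. The function below has the Iterator[String] => Iterator[String] shape that mapPartitions expects, so it can be tried locally without Spark. The names PartitionParsing, rootNames, and xmlRdd are hypothetical.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

object PartitionParsing {
  // The DocumentBuilder is created once per call, i.e. once per partition,
  // on the executor. It is never serialized and is not rebuilt per record.
  def rootNames(records: Iterator[String]): Iterator[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    records.map { xml =>
      val doc = builder.parse(new InputSource(new StringReader(xml)))
      doc.getDocumentElement.getNodeName
    }
  }

  def main(args: Array[String]): Unit = {
    // Local stand-in for one partition. With Spark this would be
    // xmlRdd.mapPartitions(PartitionParsing.rootNames), where xmlRdd is a
    // hypothetical RDD[String] of XML documents.
    val partition = Iterator("<book><title>A</title></book>", "<article/>")
    rootNames(partition).foreach(println)
  }
}
```

With 2 million records, this creates one parser per partition instead of one per message, which is the saving discussed in this thread.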


I hope this is helpful.




2016-08-20 15:10 GMT+09:00 Jörn Franke <jornfra...@gmail.com>:

I fear the issue is that this will create and destroy an XML parser object 2 
million times, which is very inefficient - it does not really look like a parser 
performance issue. Can't you do something about the format choice? Ask your 
supplier to deliver another format (ideally Avro or something like it)?
>Otherwise you could just create one XML parser object per node, but sharing this 
>among the parallel tasks on the same node is tricky.
>The other possibility could be simply more hardware ...
>
>On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> 
>wrote:
>
>
>Yes. It accepts an XML file as a source but not an RDD. The XML data embedded 
>inside JSON is streamed from a Kafka cluster, so I can get it as an RDD. 
>>Right now I am using the scala.xml XML.loadString method inside an RDD map 
>>function, but performance-wise I am not happy, as it takes 4 minutes to 
>>parse XML from 2 million messages in a 3-node environment (100G and 4 CPUs each). 
>>
>>
>>
>>
>>
>>
>>-------- Original message --------
>>From: Felix Cheung <felixcheun...@hotmail.com> 
>>Date:20/08/2016  09:49  (GMT+05:30) 
>>To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> , user 
>><user@spark.apache.org> 
>>Cc: 
>>Subject: Re: Best way to read XML data from RDD 
>>
>>
>>Have you tried
>>
>>https://github.com/databricks/spark-xml
>>?
>>
>>
>>
>>
>>
>>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" 
>><diwakar.dhanusk...@gmail.com> wrote:
>>
>>
>>Hi,  
>>
>>
>>There is an RDD with JSON data. I can read the JSON data using rdd.read.json. 
>>The JSON data has XML data in a couple of key-value pairs. 
>>
>>
>>What is the best method to read and parse XML from an RDD? Are there any 
>>specific XML libraries for Spark? Could anyone help with this?
>>
>>
>>Thanks.


  
