Yes. It accepts an XML file as a source, but not an RDD. The XML data is embedded inside JSON that is streamed from a Kafka cluster, so I can get it as an RDD. Right now I am using the scala.xml XML.loadString method inside an RDD map function, but I am not happy with the performance: it takes 4 minutes to parse the XML from 2 million messages in a 3-node environment (100G, 4 CPUs each).
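One alternative to a per-record map is to decode the JSON and parse the embedded XML in a single pass per partition. The following is only a minimal sketch in Python using the standard library, not the spark-xml API and not a measured improvement: the function name parse_partition and the field name "payload" are hypothetical, and it assumes each message is a JSON string whose XML fields have a flat set of child elements.

```python
import json
import xml.etree.ElementTree as ET


def parse_partition(messages, xml_keys=("payload",)):
    """Yield one dict per JSON message, with its XML fields flattened.

    `messages` is an iterator of JSON strings. `xml_keys` lists the keys
    whose values hold XML text ("payload" is a hypothetical field name).
    """
    for raw in messages:
        record = json.loads(raw)
        for key in xml_keys:
            xml_text = record.get(key)
            if not xml_text:
                continue
            try:
                root = ET.fromstring(xml_text)
            except ET.ParseError:
                # Keep the record but mark the malformed XML as missing.
                record[key] = None
                continue
            # Flatten the root's direct children into tag -> text.
            record[key] = {child.tag: child.text for child in root}
        yield record


# Local check with two made-up messages; the second has broken XML.
sample = [
    '{"id": 1, "payload": "<order><sku>A1</sku><qty>2</qty></order>"}',
    '{"id": 2, "payload": "<order><sku"}',
]
parsed = list(parse_partition(iter(sample)))
```

On a cluster, the same function could be passed to rdd.mapPartitions(parse_partition), where rdd holds the JSON strings read from Kafka. Whether this beats the current 4 minutes would have to be measured on the real data.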

-------- Original message --------
From: Felix Cheung <felixcheun...@hotmail.com>
Date: 20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user <user@spark.apache.org>
Subject: Re: Best way to read XML data from RDD

Have you tried https://github.com/databricks/spark-xml ?

On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <diwakar.dhanusk...@gmail.com> wrote:

Hi,

There is an RDD with JSON data. I could read the JSON data using rdd.read.json. The JSON data has XML data in a couple of key-value pairs. Which is the best method to read and parse XML from an RDD? Are there any specific XML libraries for Spark? Could anyone help with this?

Thanks.