I can't say this is the best way to do it, but my first thought is as below:
Create two DataFrames:

  sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<emplist>")
  sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</emplist>")
  sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")

  // raw XML text, one row per <emplist> element
  val strXmlDf = sc.newAPIHadoopFile(carsFile,
      classOf[XmlInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map { pair => new String(pair._2.getBytes, 0, pair._2.getLength) }
    .toDF("XML")

  // parsed XML with the schema inferred by spark-xml
  val xmlDf = sqlContext.read.format("xml")
    .option("rowTag", "emplist")
    .load(path)

then zip those two, maybe like this: https://github.com/apache/spark/pull/7474
and then start filtering on emp.id or emp.name (a rough sketch of that step is
below the quoted message).

2016-08-22 5:31 GMT+09:00 <srikanth.je...@gmail.com>:

> Hello Experts,
>
> I’m using the spark-xml package, which is automatically inferring my schema
> and creating a DataFrame.
>
> I’m extracting a few fields, like id and name (which are unique), from the
> XML below, but my requirement is to store the entire XML in one of the
> columns as well. I’m writing this data to an AVRO Hive table. Can anyone
> tell me how to achieve this?
>
> Example XML and expected output are given below.
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
> Expected output:
>
> id, name, XML
> 1, foo, <emplist> ….</emplist>
>
> Thanks,
> Sreekanth Jella
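
PS: a rough, untested sketch of the zip-and-select step I meant above. The
column paths emp.manager.id / emp.manager.name are only my guess at the schema
spark-xml infers for your sample (please check xmlDf.printSchema(); it will
differ if <emp> is inferred as an array, for instance), and RDD.zip pairs rows
positionally, so it needs the same partitioning and row counts on both sides,
which should hold here since both DataFrames read the same file:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Pull just the fields we need from the parsed side so the zipped rows are flat.
  // These column paths are an assumption about the inferred schema.
  val idNameDf = xmlDf.selectExpr("emp.manager.id as id", "emp.manager.name as name")

  // Pair each parsed row with the corresponding raw XML string, positionally.
  val zipped = idNameDf.rdd.zip(strXmlDf.rdd).map { case (idName, raw) =>
    Row.fromSeq(idName.toSeq :+ raw.getString(0))
  }

  // Rebuild a DataFrame: the selected columns plus the whole XML as a string column.
  val outDf = sqlContext.createDataFrame(
    zipped,
    StructType(idNameDf.schema.fields :+ StructField("XML", StringType)))

  outDf.show()

  // From here it could be written out, e.g. via the spark-avro package
  // (format and table name below are just placeholders):
  // outDf.write.format("com.databricks.spark.avro").saveAsTable("emp_xml_avro")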