I can't say this is the best way to do it, but my first thought is as
below:


Create two DataFrames:

// Read each <emplist>…</emplist> block from the file as one raw XML string per row.
import org.apache.hadoop.io.{LongWritable, Text}
import com.databricks.spark.xml.XmlInputFormat
import sqlContext.implicits._

sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, "<emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, "</emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")

val strXmlDf = sc.newAPIHadoopFile(path,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { pair =>
    new String(pair._2.getBytes, 0, pair._2.getLength)
  }.toDF("XML")

// Let spark-xml infer the schema and parse the same file into columns.
val xmlDf = sqlContext.read.format("xml")
  .option("rowTag", "emplist")
  .load(path)


Then zip those two DataFrames, maybe as in this PR: https://github.com/apache/spark/pull/7474


and then filter on emp.id or emp.name.
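The zip-and-filter idea might look roughly like this, illustrated with plain Scala collections standing in for the two DataFrames (a minimal sketch; the Spark version would use RDD.zip, which requires both sides to have identical partitioning and row counts — that is what the PR above deals with). The Emp case class and sample values are hypothetical:

```scala
// Hypothetical row type standing in for the schema spark-xml would infer.
case class Emp(id: Int, name: String)

val parsed = Seq(Emp(1, "foo"), Emp(2, "bar"))                   // like xmlDf rows
val raw    = Seq("<emplist>…</emplist>", "<emplist>…</emplist>") // like strXmlDf rows

// zip pairs row i of one side with row i of the other, so row order must match.
val zipped = parsed.zip(raw).map { case (e, xml) => (e.id, e.name, xml) }

// Then filter on id (or name) while keeping the raw XML column alongside.
val hits = zipped.filter { case (id, _, _) => id == 1 }
```

The resulting tuples carry id, name, and the full XML string, which matches the (id, name, XML) output asked for below.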



2016-08-22 5:31 GMT+09:00 <srikanth.je...@gmail.com>:

> Hello Experts,
>
>
>
> I’m using the spark-xml package, which automatically infers my schema and
> creates a DataFrame.
>
>
>
> I’m extracting a few fields like id and name (which are unique) from the XML
> below, but my requirement is to store the entire XML in one of the columns as
> well. I’m writing this data to an AVRO Hive table. Can anyone tell me how to
> achieve this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
>
>
> Expected output:
>
> id, name, XML
>
> 1, foo, <emplist> ….</emplist>
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>
