I can't say this is the best way to do it, but my immediate thought is as
below:

Create two DataFrames:
sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")
val strXmlDf = sc.newAPIHadoopFile(carsFile,
    classOf[XmlInputFormat],
    classOf[LongWritable],
    classOf[Text]).map { pair =>
  new String(pair._2.getBytes, 0, pair._2.getLength)
}.toDF("XML")
val xmlDf = sqlContext.read.format("xml")
.option("rowTag", "emplist")
.load(path)
zip those two, maybe like this: https://github.com/apache/spark/pull/7474,
and then start to filter with emp.id or emp.name.
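To make the zip step concrete, here is a rough sketch of one way to line the two DataFrames up by record position via RDD.zip, assuming both were read from the same file so their rows come out in the same order, and assuming rowTag is set so that each parsed row is one emp record. The variable and column names here are only illustrative, not part of spark-xml:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch only: pair each parsed row with the raw XML string of the same
// record. RDD.zip requires both RDDs to have the same number of partitions
// and the same number of elements per partition, which should hold when
// both are produced by scanning the same file in the same order.
val zipped = xmlDf.rdd.zip(strXmlDf.rdd).map {
  case (parsed, raw) => Row.fromSeq(parsed.toSeq :+ raw.getString(0))
}

// Extend the inferred schema with the extra string column for the raw XML.
val schema = StructType(xmlDf.schema.fields :+ StructField("XML", StringType))
val combined = sqlContext.createDataFrame(zipped, schema)

// Then filter on the parsed columns, e.g. id or name:
combined.filter("id = 1").show()
```

The combined DataFrame can then be written out to the AVRO hive table as usual.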
2016-08-22 5:31 GMT+09:00 :
> Hello Experts,
>
>
>
> I’m using the spark-xml package, which is automatically inferring my schema
> and creating a DataFrame.
>
>
>
> I’m extracting a few fields like id and name (which are unique) from the XML
> below, but my requirement is to store the entire XML in one of the columns
> as well. I’m writing this data to an AVRO hive table. Can anyone tell me how
> to achieve this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <id>1</id>
>     <name>foo</name>
>   </emp>
>   <emp>
>     <id>1</id>
>     <name>foo</name>
>   </emp>
>   <emp>
>     <id>1</id>
>     <name>foo</name>
>   </emp>
> </emplist>
>
>
> Expected output:
>
> id, name, XML
>
> 1, foo, ….
>
>
>
> Thanks,
>
> Sreekanth Jella