Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do it, but my first thought is the
following:


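Create two DataFrames from the same file: one holding each record as a raw
XML string, and one parsed by spark-xml.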

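import com.databricks.spark.xml.XmlInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import sqlContext.implicits._  // needed for .toDF on an RDD

// DataFrame 1: each record as a raw XML string.
// The tag values were lost from the original message; "<emplist>" and
// "</emplist>" are assumed here so that they match the rowTag used below.
sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, "<emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, "</emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")
val strXmlDf = sc.newAPIHadoopFile(path,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { pair =>
    // Text reuses its backing buffer, so copy only the valid bytes.
    new String(pair._2.getBytes, 0, pair._2.getLength, "UTF-8")
  }.toDF("XML")

// DataFrame 2: the same file (path) parsed by spark-xml with schema inference.
val xmlDf = sqlContext.read.format("xml")
  .option("rowTag", "emplist")
  .load(path)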

Then zip those two row by row, maybe like this:
https://github.com/apache/spark/pull/7474

After that you can filter on emp.id or emp.name.
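
Spark DataFrames have no built-in zip, so the zip step has to go through the
underlying RDDs. The following is only a minimal sketch of one way to do it;
zipDataFrames is a hypothetical helper name, not a Spark or spark-xml API, and
it assumes both DataFrames were read from the same file with the same record
boundaries, so that row i of one corresponds to row i of the other.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: pairs the i-th row of `left` with the i-th row of `right`.
def zipDataFrames(left: DataFrame, right: DataFrame): DataFrame = {
  // RDD.zip requires both RDDs to have the same number of partitions and the
  // same number of elements in each partition; otherwise it fails at runtime.
  val rows = left.rdd.zip(right.rdd).map { case (l, r) =>
    Row.fromSeq(l.toSeq ++ r.toSeq)
  }
  // The combined schema is the left columns followed by the right columns.
  val schema = StructType(left.schema.fields ++ right.schema.fields)
  left.sqlContext.createDataFrame(rows, schema)
}

// Usage with the two DataFrames created above (assumed to be in scope):
//   val combined = zipDataFrames(xmlDf, strXmlDf)
//   combined.printSchema()  // inferred columns plus the trailing "XML" column
```

If the partitioning of the two DataFrames cannot be guaranteed to match, a
more defensive variant would be to call zipWithIndex on both RDDs and join on
the index, at the cost of a shuffle.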



2016-08-22 5:31 GMT+09:00 :

> Hello Experts,
>
> I’m using the spark-xml package, which automatically infers my schema and
> creates a DataFrame.
>
> I’m extracting a few fields like id and name (which are unique) from the XML
> below, but my requirement is to store the entire XML in one of the columns as
> well. I’m writing this data to an Avro Hive table. Can anyone tell me how to
> achieve this?
>
> The example XML and expected output are given below.
>
> Sample XML:
>
> 
>
> 
>
>
>
>1
>
>foo
>
> 
>
>   
>
> 1
>
> foo
>
>   
>
>   
>
> 1
>
> foo
>
>   
>
> 
>
>
>
> 
>
> 
>
>
>
> Expected output:
>
> id, name, XML
>
> 1, foo,  ….
>
> Thanks,
>
> Sreekanth Jella


Entire XML data as one of the column in DataFrame

2016-08-21 Thread srikanth.jella
Hello Experts,

I’m using the spark-xml package, which automatically infers my schema and
creates a DataFrame.

I’m extracting a few fields like id and name (which are unique) from the XML
below, but my requirement is to store the entire XML in one of the columns as
well. I’m writing this data to an Avro Hive table. Can anyone tell me how to
achieve this?

The example XML and expected output are given below.

Sample XML:


   
   1
   foo
    
  
    1
    foo
  
  
    1
    foo
  
    
   


 
Expected output:
id, name, XML
1, foo,  ….
 
Thanks,
Sreekanth Jella