Hi Amol,
I'm not sure I completely understand your question, but the SQL function
"explode" may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark
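For context, explode turns each element of an array-valued column into its own row, repeating the other columns. A minimal pure-Python sketch of that behavior (illustrative only, not Spark code; the explode helper and sample rows here are made up for this example):

```python
# Pure-Python illustration of what pyspark.sql.functions.explode does:
# each element of an array column becomes its own row, with the other
# columns repeated alongside it.
def explode(rows, array_col):
    out = []
    for row in rows:
        for element in row[array_col]:
            new_row = dict(row)          # copy the other columns
            new_row[array_col] = element # one element per output row
            out.append(new_row)
    return out

rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]
print(explode(rows, "tags"))
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 2, 'tags': 'c'}]
```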
HTH,
Judit
On 29/06/17 09:05, ayan guha wrote:
Hi
Not sure I follow your issue. Can you please post the output of
books_inexp.show()?
On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol <amol.ta...@capgemini.com> wrote:
Hi:
We are trying to parse XML data to get the output below from the given
input sample. Can someone suggest a way to pass one DataFrame's output
into the load() function, or any other alternative that produces this
output?
Input Data from Oracle Table XMLBlob:

SequenceID | Name    | City      | XMLComment
-----------+---------+-----------+-----------
1          | Amol    | Kolhapur  | <books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description></Comment><Comment><Title>Title1.2</Title><Description>Description_1.2</Description></Comment><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books>
2          | Suresh  | Mumbai    | <books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books>
3          | Vishal  | Delhi     | <books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books>
4          | Swastik | Bangalore | <books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books>
Output Data Expected using Spark SQL:

SequenceID | Name    | City      | Title    | Description
-----------+---------+-----------+----------+----------------
1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
2          | Suresh  | Mumbai    | Title2   | Description_2
3          | Vishal  | Delhi     | Title3   | Description_3
4          | Swastik | Bangalore | Title4   | Description_4
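The transformation above amounts to extracting every (Title, Description) pair from one XMLComment blob. A small sketch of that per-row extraction using Python's standard library (parse_comments is a hypothetical helper, shown here outside Spark; the blob is row 2 of the sample input):

```python
import xml.etree.ElementTree as ET

def parse_comments(xml_blob):
    """Extract (Title, Description) pairs from one XMLComment blob."""
    root = ET.fromstring(xml_blob)
    # iter("Comment") walks every <Comment> element under <books>
    return [(c.findtext("Title"), c.findtext("Description"))
            for c in root.iter("Comment")]

blob = ("<books><Comments><Comment><Title>Title2</Title>"
        "<Description>Description_2</Description></Comment>"
        "</Comments></books>")
print(parse_comments(blob))  # [('Title2', 'Description_2')]
```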
I am able to parse a single XML file in spark-shell using the approach
in the example below, but how do we apply the same parsing to every row?
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
val dfX = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")
dfX.registerTempTable("books")
dfX.printSchema()
val books_inexp = sqlContext.sql("select title, author from books where price < 10")
books_inexp.show()
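One possible direction (a sketch under assumptions, not the definitive Spark answer): rather than calling load() per row, parse the XMLComment column row by row, e.g. with a flatMap over the rows, or a UDF that returns an array of structs followed by explode. The core flattening logic, written in plain Python with xml.etree standing in for a Scala XML parser (flatten_row and the sample rows are hypothetical):

```python
import xml.etree.ElementTree as ET

def flatten_row(row):
    # row = (sequence_id, name, city, xml_blob); emit one output row
    # per <Comment> element, mimicking what a flatMap over an RDD
    # (or a UDF returning an array, followed by explode) would do.
    seq, name, city, blob = row
    root = ET.fromstring(blob)
    for c in root.iter("Comment"):
        yield (seq, name, city, c.findtext("Title"), c.findtext("Description"))

rows = [
    (2, "Suresh", "Mumbai",
     "<books><Comments><Comment><Title>Title2</Title>"
     "<Description>Description_2</Description></Comment></Comments></books>"),
]
flat = [out for row in rows for out in flatten_row(row)]
print(flat)  # [(2, 'Suresh', 'Mumbai', 'Title2', 'Description_2')]
```

In Spark the same shape of logic would sit inside rdd.flatMap (or a UDF), so each input row fans out into as many output rows as it has Comment elements.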
Regards,
Amol
This message contains information that may be privileged or
confidential and is the property of the Capgemini Group. It is
intended only for the person to whom it is addressed. If you are
not the intended recipient, you are not authorized to read, print,
retain, copy, disseminate, distribute, or use this message or any
part thereof. If you receive this message in error, please notify
the sender immediately and delete all copies of this message.
--
Best Regards,
Ayan Guha