Re: using spark-xml_2.10 to extract data from XML file

Carlo . Allocca Wed, 15 Feb 2017 01:29:54 -0800

Hi Hyukjin,

Thank you very much for this.


Sure, I will do it today and include the sample data and the Java code.

Many Thanks for the support.
Best Regards,
Carlo


On 15 Feb 2017, at 00:22, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Hi Carlo,


There was a bug in earlier versions of the library when accessing nested 
values.

If that is not the cause, I suspect a separate issue with parsing malformed XML.

Could you open an issue at https://github.com/databricks/spark-xml/issues 
with your sample data?

I will stick with it until it is solved.
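
One way to rule out the malformed-XML possibility mentioned above is to run the file through the JDK's own XML parser before handing it to Spark. The following is only a minimal sketch: the class name XmlWellFormedCheck and the method firstError are hypothetical helpers, not part of spark-xml, and the check says nothing about how spark-xml itself tokenizes the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of spark-xml): checks well-formedness with the JDK parser.
class XmlWellFormedCheck {

    /** Returns null if the XML parses, otherwise a short description of the first fatal error. */
    static String firstError(String xml) {
        try {
            // The default factory is not namespace-aware, so prefixes such as
            // "xocs:" or "ce:" are accepted even without xmlns declarations.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints fatal errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // A closed fragment parses; a fragment with a missing end tag does not.
        System.out.println(firstError("<xocs:doc><ce:para>text</ce:para></xocs:doc>"));
        System.out.println(firstError("<xocs:doc><ce:para>text</xocs:doc>"));
    }
}
```

If firstError returns a message for the real file, the null values may come from the data rather than from the schema or the library version.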


Thanks.



2017-02-15 5:04 GMT+09:00 Carlo.Allocca <carlo.allo...@open.ac.uk>:
More specifically, given an XML file with the following structure:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)


CASE 1:

String rowTag = "abstracts";
Dataset<Row> df = (new XmlReader())
        .withAttributePrefix("_")
        .withRowTag(rowTag)
        .xmlFile(sqlContext, localxml);
df.select(df.col("abstract.ce:para"), df.col("abstract._original"), df.col("abstract._lang")).show();

I got the right values.

CASE 2:

String rowTag = "xocs:doc";
Dataset<Row> df = (new XmlReader())
        .withAttributePrefix("_")
        .withRowTag(rowTag)
        .xmlFile(sqlContext, localxml);
df.select(df.col("xocs:item.item.bibrecord.head.abstracts.abstract.ce:para")).show();

I got null values.


My question is: how can I use String rowTag = "xocs:doc"; and still get the 
right values for ….abstract.ce:para, etc.? What am I doing wrong?

Many Thanks in advance.
Best Regards,
Carlo



On 14 Feb 2017, at 17:35, carlo allocca <ca6...@open.ac.uk> wrote:

Dear All,

I would like to ask for your help with the following issue when using 
spark-xml_2.10:

Given an XML file with the following structure:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)

Using the code below to extract all the information from the abstract:

1) I get null for each of the three values (_original, _lang and ce:para) when 
I use rowTag = "xocs:doc".
2) I get the right values when I use rowTag = "abstracts".

Of course, I need to write a general parser that works at the xocs:doc level.
I have been reading the documentation at 
https://github.com/databricks/spark-xml but it did not help me much to solve the 
above issue.

Am I doing something wrong, or could it be related to a bug in the library I am 
using?

Could you please advise?

Many Thanks,
Best Regards,
carlo





=== Code:
    public static void main(String arr[]) {

        // xocs:item/item/bibrecord/head/abstracts  section
        StructType _abstract = new StructType(new StructField[]{
            new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
            new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
        });
        StructType _abstracts = new StructType(new StructField[]{
            new StructField("abstract", _abstract, true, Metadata.empty())
        });

        StructType _head = new StructType(new StructField[]{
            new StructField("abstracts", _abstracts, true, Metadata.empty())
        });

        StructType _bibrecord = new StructType(new StructField[]{
            new StructField("head", _head, true, Metadata.empty())

        });

        StructType _item = new StructType(new StructField[]{
            new StructField("bibrecord", _bibrecord, true, Metadata.empty())
        });

        StructType _xocs_item = new StructType(new StructField[]{
            new StructField("item", _item, true, Metadata.empty())
        });

        StructType rexploreXMLDataSchema = new StructType(new StructField[]{
            new StructField("xocs:item", _xocs_item, true, Metadata.empty())
        });

        String localxml = "..filename.xml";

        SparkSession spark = SparkSession
                .builder()
                .master("local[2]")
                .appName("DatasetForCaseNew")
                .getOrCreate();

        String rowTag = "xocs:doc";



        SQLContext sqlContext = new SQLContext(spark);
        Dataset<Row> df = sqlContext.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", rowTag)
                .option("attributePrefix", "_")
                .schema(rexploreXMLDataSchema)
                .load(localxml);

        df.printSchema();

        df.select(
                df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getField("_original"),
                df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getItem("_lang"),
                df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getField("ce:para")
        ).show();

//                df.select(
//                df.col("_original"),
//                df.col("_lang"),
//                df.col("ce:para")
//
//        ).show();

//                df.select(
//                df.col("abstract").getField("_original"),
//                df.col("abstract").getField("_lang"),
//                df.col("abstract").getField("ce:para")
//
//        ).show();


//                df.select(
//                df.col("head").getField("abstracts").getField("abstract").getField("_original"),
//                df.col("head").getField("abstracts").getField("abstract").getField("_lang"),
//                df.col("head").getField("abstracts").getField("abstract").getField("ce:para")
//
//        ).show();




    }




On 13 Feb 2017, at 18:17, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:

Dear All,

I am using spark-xml_2.10 to parse and extract some data from XML files.

My problem is that I get null values even though the XML file actually 
contains values.

+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
|xocs:item.bibrecord.head.abstracts.abstract._original|xocs:item.bibrecord.head.abstracts.abstract._lang|xocs:item.bibrecord.head.abstracts.abstract.ce:para|
+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
|                                                 null|                                             null|                                               null|
+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+

Below, I include an example of the XML that I am processing and the code I am 
using to parse it.

What am I doing wrong?

Any help on this would be much appreciated.

Many Thanks in advance,
Best Regards,
Carlo




===== An example


SPARK prints the following schema:

root
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)



==== XML file example

XML file structure:

<xocs:doc>
  <xocs:item>
    <item>
      <bibrecord>
        <head>
          <abstracts>
            <abstract original="y" xml:lang="eng">
              <ce:para>
                This paper bla bla bla...
              </ce:para>
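
To compare the element nesting above with the nesting of a hand-written StructType, it can help to list the element paths that actually occur in a file. The following is only a sketch using the JDK's DOM parser: the class name XmlPathLister and the method leafPaths are hypothetical, and the closing tags in the sample string are an assumption, since the fragment above is not closed. With rowTag set to xocs:doc, the levels below xocs:doc are the ones a schema would have to mirror.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of spark-xml): lists the path of every leaf element.
class XmlPathLister {

    /** Returns one slash-separated path per element that has no child elements. */
    static List<String> leafPaths(String xml) {
        try {
            // The default factory is not namespace-aware, so undeclared prefixes are accepted.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> paths = new ArrayList<>();
            collect(doc.getDocumentElement(), "", paths);
            return paths;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Depth-first walk that extends the path with each element name.
    private static void collect(Element element, String parentPath, List<String> out) {
        String name = element.getNodeName();
        String path = parentPath.isEmpty() ? name : parentPath + "/" + name;
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, path, out);
            }
        }
        if (!hasChildElement) {
            out.add(path);
        }
    }

    public static void main(String[] args) {
        // Sample mirrors the structure above; the closing tags are assumed.
        String sample = "<xocs:doc><xocs:item><item><bibrecord><head><abstracts>"
                + "<abstract original=\"y\" xml:lang=\"eng\">"
                + "<ce:para>This paper bla bla bla...</ce:para>"
                + "</abstract></abstracts></head></bibrecord></item></xocs:item></xocs:doc>";
        for (String path : leafPaths(sample)) {
            System.out.println(path);
        }
    }
}
```

For the sample string this lists a single path that includes an item level between xocs:item and bibrecord, which is worth checking against the schema passed to the reader.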


=== SPARK code:

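import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class XMLParser {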

    public static void main(String arr[]) {

        // xocs:item/item/bibrecord/head/abstracts  section
        StructType _abstract = new StructType(new StructField[]{
            new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
            new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
        });
        StructType _abstracts = new StructType(new StructField[]{
            new StructField("abstract", _abstract, true, Metadata.empty())
        });

        StructType _head = new StructType(new StructField[]{
            new StructField("abstracts", _abstracts, true, Metadata.empty())
        });

        StructType _bibrecord = new StructType(new StructField[]{
            new StructField("head", _head, true, Metadata.empty())

        });

        StructType _xocs_item = new StructType(new StructField[]{
            new StructField("bibrecord", _bibrecord, true, Metadata.empty())
        });

        StructType rexploreXMLDataSchema = new StructType(new StructField[]{
            new StructField("xocs:item", _xocs_item, true, Metadata.empty())
        });


        String localxml = "/Users/carloallocca/Desktop/Spark Material/francesco2.xml";

        SparkSession spark = SparkSession
                .builder()
                .master("local[2]")
                .appName("DatasetForCaseNew")
                .getOrCreate();

        String rowTag = "xocs:doc";

        SQLContext sqlContext = new SQLContext(spark);
        Dataset<Row> df = sqlContext.read()
                .format("com.databricks.spark.xml")
                .option("rowTag", rowTag)
                .option("attributePrefix", "_")
                .schema(rexploreXMLDataSchema)
                .load(localxml);

        df.printSchema();

        df.select(
                df.col("xocs:item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getField("_original"),
                df.col("xocs:item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getItem("_lang"),
                df.col("xocs:item").getField("bibrecord").getItem("head")
                        .getField("abstracts").getField("abstract").getField("ce:para")
        ).show();

    }

}


