Re: using spark-xml_2.10 to extract data from XML file

Hyukjin Kwon Tue, 14 Feb 2017 16:23:12 -0800

Hi Carlo,


Earlier versions of the library had a bug when accessing nested values.

If that is not the cause, I suspect a separate issue with parsing malformed XML.
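
To rule out the malformed-XML case before filing an issue, a quick well-formedness check with the JDK's built-in SAX parser may help. The sketch below is only an illustration: the class name WellFormedCheck and the helper firstError are made up for this example and are not part of spark-xml. It switches namespace awareness off so that prefixes such as xocs: and ce: are accepted even when the sample was cut out of a larger document without its namespace declarations. That is an assumption about the sample file, and the check says nothing about how spark-xml itself reads the input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for this thread; not part of spark-xml.
class WellFormedCheck {

    /** Returns null when the input is well-formed, otherwise the first parse error with its position. */
    static String firstError(InputStream in) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Assumption: treat "xocs:" style prefixes as plain name characters,
            // so undeclared namespace prefixes in a trimmed sample are not reported.
            factory.setNamespaceAware(false);
            // DefaultHandler rethrows fatal errors, so parse() stops at the first one.
            factory.newSAXParser().parse(in, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Anything else (I/O failure, parser setup) is not a well-formedness result.
            throw new IllegalStateException(e);
        }
    }

    /** Convenience overload for checking an in-memory snippet. */
    static String firstError(String xml) {
        return firstError(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Check the file given on the command line.
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                String error = firstError(in);
                System.out.println(error == null ? "well-formed" : error);
            }
            return;
        }
        // Without arguments, run two small in-memory samples.
        System.out.println(firstError("<a><b>text</b></a>") == null); // closed properly
        System.out.println(firstError("<a><b>text</a>") == null);     // <b> is never closed
    }
}
```

Running it as `java WellFormedCheck yourfile.xml` reports either "well-formed" or the position of the first error. If the file passes, malformed input is less likely to be the cause, and the sample is worth attaching to the issue as it is.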

Could you open an issue at
https://github.com/databricks/spark-xml/issues with your sample data?

I will stay on it until it is solved.


Thanks.



2017-02-15 5:04 GMT+09:00 Carlo.Allocca <carlo.allo...@open.ac.uk>:

> More specifically, given the following XML data structure:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>  |    |    |    |    |    |-- ce:para: string (nullable = true)
>
>
>
> CASE 1:
>
> String rowTag = "abstracts";
> Dataset<Row> df = (new XmlReader()).withAttributePrefix("_")
>         .withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("abstract.ce:para"),
>         df.col("abstract._original"),
>         df.col("abstract._lang")).show();
>
> *I got the right values.*
>
> CASE 2:
>
> String rowTag = "xocs:doc";
> Dataset<Row> df = (new XmlReader()).withAttributePrefix("_")
>         .withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("xocs:item.item.bibrecord.head.abstracts.abstract.ce:para")).show();
>
> *I got null values.*
>
>
> My question is: how can I use String rowTag = "xocs:doc"; and still get the
> right values for ….abstract.ce:para, etc.? What am I doing wrong?
>
> Many Thanks in advance.
> Best Regards,
> Carlo
>
>
>
> On 14 Feb 2017, at 17:35, carlo allocca <ca6...@open.ac.uk> wrote:
>
> Dear All,
>
> I would like to ask for your help with the following issue when using
> spark-xml_2.10:
>
> Given an XML file with the following structure:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>  |    |    |    |    |    |-- ce:para: string (nullable = true)
>
> Using the code below to extract all the information from the abstract:
>
> 1) I get null for each of the three values (_original, _lang and ce:para)
> when I use rowTag = "xocs:doc".
> 2) I get the right values when I use rowTag = "abstracts".
>
> Of course, I need to write a general parser that works at the xocs:doc level.
> I have been reading the documentation at
> https://github.com/databricks/spark-xml but it did not help me much in
> solving the above issue.
>
> Am I doing something wrong, or might this be related to a bug in the library
> I am using?
>
> Could you please advise?
>
> Many Thanks,
> Best Regards,
> carlo
>
>
>
>
>
> === Code:
>     public static void main(String arr[]) {
>
>         // xocs:item/item/bibrecord/head/abstracts  section
>         StructType _abstract = new StructType(new StructField[]{
>             new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
>         });
>
>         StructType _abstracts = new StructType(new StructField[]{
>             new StructField("abstract", _abstract, true, Metadata.empty())
>         });
>
>         StructType _head = new StructType(new StructField[]{
>             new StructField("abstracts", _abstracts, true, Metadata.empty())
>         });
>
>         StructType _bibrecord = new StructType(new StructField[]{
>             new StructField("head", _head, true, Metadata.empty())
>         });
>
>         StructType _item = new StructType(new StructField[]{
>             new StructField("bibrecord", _bibrecord, true, Metadata.empty())
>         });
>
>         StructType _xocs_item = new StructType(new StructField[]{
>             new StructField("item", _item, true, Metadata.empty())
>         });
>
>         StructType rexploreXMLDataSchema = new StructType(new StructField[]{
>             new StructField("xocs:item", _xocs_item, true, Metadata.empty())
>         });
>
>         String localxml = "..filename.xml";
>
>         SparkSession spark = SparkSession
>                 .builder()
>                 .master("local[2]")
>                 .appName("DatasetForCaseNew")
>                 .getOrCreate();
>
>         String rowTag = "xocs:doc";
>
>
>
>         SQLContext sqlContext = new SQLContext(spark);
>         Dataset<Row> df = sqlContext.read()
>                 .format("com.databricks.spark.xml")
>                 .option("rowTag", rowTag)
>                 .option("attributePrefix", "_")
>                 .schema(rexploreXMLDataSchema)
>                 .load(localxml);
>
>         df.printSchema();
>
>         df.select(
>                 df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getField("_original"),
>                 df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getItem("_lang"),
>                 df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getField("ce:para")
>         ).show();
>
> //                df.select(
> //                df.col("_original"),
> //                df.col("_lang"),
> //                df.col("ce:para")
> //
> //        ).show();
>
> //                df.select(
> //                df.col("abstract").getField("_original"),
> //                df.col("abstract").getField("_lang"),
> //                df.col("abstract").getField("ce:para")
> //
> //        ).show();
>
>
> //                df.select(
> //                df.col("head").getField("abstracts").getField("abstract").getField("_original"),
> //                df.col("head").getField("abstracts").getField("abstract").getField("_lang"),
> //                df.col("head").getField("abstracts").getField("abstract").getField("ce:para")
> //
> //        ).show();
>
>
>
>
>     }
>
>
>
>
> On 13 Feb 2017, at 18:17, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
>
> Dear All,
>
> I am using spark-xml_2.10 to parse and extract some data from XML files.
>
> I am getting null values even though the XML file actually contains
> values.
>
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
> |xocs:item.bibrecord.head.abstracts.abstract._original|xocs:item.bibrecord.head.abstracts.abstract._lang|xocs:item.bibrecord.head.abstracts.abstract.ce:para|
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
> |                                                 null|                                             null|                                               null|
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
>
> Below, I report an example of the XML that I am processing and the code I am
> using to parse it.
>
> What am I doing wrong?
>
> Any help on this would be much appreciated.
>
> Many Thanks in advance,
> Best Regards,
> Carlo
>
>
>
>
> ===== An example
>
>
> SPARK prints the following schema:
>
> root
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>
>
>
> ==== XML file example
>
> XML file structure:
>
> <xocs:doc
>   <xocs:item>
>      <item>
>        <bibrecord>
>          <head>
>            <abstracts>
>              <abstract original="y" xml:lang="eng">
>                <ce:para>
>                  This paper bla bla bla...
>                </ce:para>
>
>
> === SPARK code:
>
> public class XMLParser {
>
>     public static void main(String arr[]) {
>
>         // xocs:item/item/bibrecord/head/abstracts  section
>         StructType _abstract = new StructType(new StructField[]{
>             new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
>         });
>
>         StructType _abstracts = new StructType(new StructField[]{
>             new StructField("abstract", _abstract, true, Metadata.empty())
>         });
>
>         StructType _head = new StructType(new StructField[]{
>             new StructField("abstracts", _abstracts, true, Metadata.empty())
>         });
>
>         StructType _bibrecord = new StructType(new StructField[]{
>             new StructField("head", _head, true, Metadata.empty())
>         });
>
>         StructType _xocs_item = new StructType(new StructField[]{
>             new StructField("bibrecord", _bibrecord, true, Metadata.empty())
>         });
>
>         StructType rexploreXMLDataSchema = new StructType(new StructField[]{
>             new StructField("xocs:item", _xocs_item, true, Metadata.empty())
>         });
>
>         String localxml = "/Users/carloallocca/Desktop/Spark Material/francesco2.xml";
>
>         SparkSession spark = SparkSession
>                 .builder()
>                 .master("local[2]")
>                 .appName("DatasetForCaseNew")
>                 .getOrCreate();
>
>         String rowTag = "xocs:doc";
>
>         SQLContext sqlContext = new SQLContext(spark);
>         Dataset<Row> df = sqlContext.read()
>                 .format("com.databricks.spark.xml")
>                 .option("rowTag", rowTag)
>                 .option("attributePrefix", "_")
>                 .schema(rexploreXMLDataSchema)
>                 .load(localxml);
>
>         df.printSchema();
>
>         df.select(
>                 df.col("xocs:item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getField("_original"),
>                 df.col("xocs:item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getItem("_lang"),
>                 df.col("xocs:item").getField("bibrecord").getItem("head")
>                         .getField("abstracts").getField("abstract").getField("ce:para")
>         ).show();
>
>     }
>
> }
>
>
> -- The Open University is incorporated by Royal Charter (RC 000391), an
> exempt charity in England & Wales and a charity registered in Scotland (SC
> 038302). The Open University is authorised and regulated by the Financial
> Conduct Authority.
>
>
>
>
