Re: using spark-xml_2.10 to extract data from XML file
Hi Hyukjin,

Thank you very much for this. Sure I am going to do it today based on data + java code.

Many Thanks for the support.

Best Regards,
Carlo

On 15 Feb 2017, at 00:22, Hyukjin Kwon wrote:
> Hi Carlo,
>
> There was a bug in earlier versions of the library when accessing nested values.
> Otherwise, I suspect a separate issue with parsing malformed XML.
>
> Could you open an issue at https://github.com/databricks/spark-xml/issues
> with your sample data?
>
> I will stick with it until it is solved.
>
> Thanks.
Re: using spark-xml_2.10 to extract data from XML file
Hi Carlo,

There was a bug in earlier versions of the library when accessing nested values. Otherwise, I suspect a separate issue with parsing malformed XML.

Could you open an issue at https://github.com/databricks/spark-xml/issues with your sample data?

I will stick with it until it is solved.

Thanks.

2017-02-15 5:04 GMT+09:00 Carlo.Allocca:
> My question is: How can I get it right to use String rowTag="xocs:doc";
> and get the right values for ….abstract.ce:para, etc? What am I doing
> wrong?
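On the malformed-XML suspicion: one way to rule it out before filing the issue is to run the file through a plain XML parser and see whether it reports a fatal error. The following is a minimal sketch that assumes only the JDK's built-in SAX parser; the class name `WellFormedCheck` and its methods are hypothetical helpers, not part of spark-xml. Passing this check does not prove that spark-xml reads the file the same way.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of spark-xml): reports the first fatal
// well-formedness error found by the JDK's SAX parser, if any.
final class WellFormedCheck {

    // Returns null when the input is well-formed, otherwise a short
    // description of the first fatal parse error.
    static String firstError(InputSource source) {
        try {
            // The factory is not namespace-aware by default, so prefixed
            // names such as xocs:doc need no namespace declaration here.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(source, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // I/O or parser configuration problems.
            return e.toString();
        }
    }

    // Convenience overload for checking an in-memory document.
    static String firstErrorInString(String xml) {
        return firstError(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) {
        // Usage: java WellFormedCheck path/to/file.xml
        String error = firstError(new InputSource(new File(args[0]).toURI().toString()));
        System.out.println(error == null ? "well-formed" : error);
    }
}
```

If the check reports an error, the line and column it prints are a reasonable place to start looking in the sample data attached to the issue.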
Re: using spark-xml_2.10 to extract data from XML file
more specifically:

This is the structure of the XML file:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)

CASE 1:

    String rowTag = "abstracts";
    Dataset<Row> df = (new XmlReader()).withAttributePrefix("_").withRowTag(rowTag).xmlFile(sqlContext, localxml);
    df.select(df.col("abstract.ce:para"), df.col("abstract._original"), df.col("abstract._lang")).show();

I got the right values.

CASE 2:

    String rowTag = "xocs:doc";
    Dataset<Row> df = (new XmlReader()).withAttributePrefix("_").withRowTag(rowTag).xmlFile(sqlContext, localxml);
    df.select(df.col("xocs:item.item.bibrecord.head.abstracts.abstract.ce:para")).show();

I got null values.

My question is: how can I use String rowTag = "xocs:doc"; and still get the right values for ….abstract.ce:para, etc.? What am I doing wrong?

Many thanks in advance.

Best Regards,
Carlo

On 14 Feb 2017, at 17:35, carlo allocca wrote:
> 1) I got null for each of the three values (_original, _lang and ce:para)
> when I use rowTag = "xocs:doc".
> 2) I got the right values when I use rowTag = "abstracts".
>
> Of course, I need to write a general parser that works at xocs:doc level.
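One thing worth checking outside Spark is whether the column path really mirrors the element nesting below the row tag. The schema tree in this thread lists bibrecord directly under xocs:item, while the CASE 2 path (and the Java schema) put an extra item level in between. Whether that mismatch exists in the real file, and whether it explains the nulls, is unconfirmed. The sketch below assumes only the JDK's DOM parser. The class `NestedPathProbe` and the sample document in `main` are hypothetical. It reports how far an element path resolves from the root:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of spark-xml): reports how much of an
// element path actually exists in a document, starting at the root element.
final class NestedPathProbe {

    // Returns the longest resolvable prefix of the path, joined with '/'.
    // An empty string means even the root element name did not match.
    static String resolvedPrefix(String xml, String... path) {
        try {
            // Not namespace-aware by default: "xocs:item" is matched as a plain name.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element current = builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            if (path.length == 0 || !current.getNodeName().equals(path[0])) {
                return "";
            }
            StringBuilder resolved = new StringBuilder(path[0]);
            for (int i = 1; i < path.length; i++) {
                Element child = firstChild(current, path[i]);
                if (child == null) {
                    break; // the path stops resolving here
                }
                resolved.append('/').append(path[i]);
                current = child;
            }
            return resolved.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML: " + e.getMessage(), e);
        }
    }

    // First direct child element with the given qualified name, or null.
    private static Element firstChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample that follows the printed tree (no <item> level).
        String sample = "<xocs:doc><xocs:item><bibrecord><head><abstracts><abstract>"
                + "<ce:para>text</ce:para></abstract></abstracts></head></bibrecord>"
                + "</xocs:item></xocs:doc>";
        System.out.println(resolvedPrefix(sample,
                "xocs:doc", "xocs:item", "item", "bibrecord", "head",
                "abstracts", "abstract", "ce:para"));
    }
}
```

If the reported prefix stops short of ce:para for a real record, the path or the hand-written schema probably needs the same adjustment. If it resolves fully, the nesting is not the problem and the library version or the data itself is the more likely suspect.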
Re: using spark-xml_2.10 to extract data from XML file
Dear All,

I would like to ask for your help with the following issue when using spark-xml_2.10.

Given an XML file with the following structure:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)

Using the code below to extract all the info from the abstract:

1) I got null for each of the three values (_original, _lang and ce:para) when I use rowTag = "xocs:doc".
2) I got the right values when I use rowTag = "abstracts".

Of course, I need to write a general parser that works at the xocs:doc level. I have been reading the documentation at https://github.com/databricks/spark-xml but it did not help me much to solve the above issue.

Am I doing something wrong? Or may it be related to a bug in the library I am using?

Please, could you advise?

Many Thanks,
Best Regards,
carlo

=== Code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public static void main(String arr[]) {

        // Schema for the xocs:item/item/bibrecord/head/abstracts section,
        // built from the innermost struct outwards.
        StructType _abstract = new StructType(new StructField[]{
            new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
            new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
        });

        StructType _abstracts = new StructType(new StructField[]{
            new StructField("abstract", _abstract, true, Metadata.empty())
        });

        StructType _head = new StructType(new StructField[]{
            new StructField("abstracts", _abstracts, true, Metadata.empty())
        });

        StructType _bibrecord = new StructType(new StructField[]{
            new StructField("head", _head, true, Metadata.empty())
        });

        StructType _item = new StructType(new StructField[]{
            new StructField("bibrecord", _bibrecord, true, Metadata.empty())
        });

        StructType _xocs_item = new StructType(new StructField[]{
            new StructField("item", _item, true, Metadata.empty())
        });

        StructType rexploreXMLDataSchema = new StructType(new StructField[]{
            new StructField("xocs:item", _xocs_item, true, Metadata.empty())
        });

        String localxml = "..filename.xml";

        SparkSession spark = SparkSession
            .builder()
            .master("local[2]")
            .appName("DatasetForCaseNew")
            .getOrCreate();

        String rowTag = "xocs:doc";

        // Read one row per <xocs:doc> element, using the explicit schema above.
        SQLContext sqlContext = new SQLContext(spark);
        Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.xml")
            .option("rowTag", rowTag)
            .option("attributePrefix", "_")
            .schema(rexploreXMLDataSchema)
            .load(localxml);

        df.printSchema();

        // Select the three abstract fields through the full nested path.
        df.select(
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("_original"),
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("_lang"),
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("ce:para")
        ).show();

        // Earlier attempts with other row tags:
        //df.select(
        //    df.col("_original"),
        //    df.col("_lang"),
        //    df.col("ce:para")
        //).show();

        //df.select(
        //    df.col("abstract").getField("_original"),
        //    df.col("abstract").getField("_lang"),
        //    df.col("abstract").getField("ce:para")
        //).show();

        //df.select(
        //    df.col("head").getField("abstracts").getField("abstract").getField("_original"),
        //    df.col("head").getField("abstracts").getField("abstract").getField("_lang"),
        //    df.col("head").getField("abstracts").getField("abstract").getField("ce:para")
        //).show();
    }

On 13 Feb 2017, at 18:17, Carlo.Allocca wrote:
> Dear All,
>
> I am using spark-xml_2.10 to parse and extract some data from XML files.
>
> I am getting null values even though the XML file actually contains values.
>
> ++--++
> |xocs:item.bibrecord.head.abstracts.abstract._original