Hi Carlo,
Earlier versions of the library had a bug when accessing nested values. If that is not the cause, I suspect a separate issue with parsing malformed XML. Could you open an issue at https://github.com/databricks/spark-xml/issues with your sample data? I will stick with it until it is solved.

Thanks.

2017-02-15 5:04 GMT+09:00 Carlo.Allocca <carlo.allo...@open.ac.uk>:

> More specifically:
>
> Given the following XML data structure:
>
> This is the structure of the XML file:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>  |    |    |    |    |    |-- ce:para: string (nullable = true)
>
> CASE 1:
>
> String rowTag = "abstracts";
> Dataset<Row> df = (new XmlReader()).withAttributePrefix("_")
>         .withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("abstract.ce:para"),
>         df.col("abstract._original"), df.col("abstract._lang")).show();
>
> *I got the right values.*
>
> CASE 2:
>
> String rowTag = "xocs:doc";
> Dataset<Row> df = (new XmlReader()).withAttributePrefix("_")
>         .withRowTag(rowTag).xmlFile(sqlContext, localxml);
> df.select(df.col("xocs:item.item.bibrecord.head.abstracts.abstract.ce:para")).show();
>
> *I got null values.*
>
> My question is: how can I use String rowTag = "xocs:doc"; and still get
> the right values for ….abstract.ce:para, etc.? What am I doing wrong?
>
> Many Thanks in advance.
> Best Regards,
> Carlo
>
> On 14 Feb 2017, at 17:35, carlo allocca <ca6...@open.ac.uk> wrote:
>
> Dear All,
>
> I would like to ask for your help with the following issue when using
> spark-xml_2.10:
>
> Given an XML file with the following structure:
>
> xocs:doc
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>  |    |    |    |    |    |-- ce:para: string (nullable = true)
>
> Using the code below to extract all the info from the abstract:
>
> 1) I got "null" for each of the three values (_original, _lang and
>    ce:para) when I use rowTag = "xocs:doc".
> 2) I got the right values when I use rowTag = "abstracts".
>
> Of course, I need to write a general parser that works at the xocs:doc
> level. I have been reading the documentation at
> https://github.com/databricks/spark-xml but it did not help me much to
> solve the above issue.
>
> Am I doing something wrong, or may it be related to a bug in the library
> I am using?
>
> Please, could you advise?
>
> Many Thanks,
> Best Regards,
> carlo
>
> === Code:
> public static void main(String arr[]) {
>
>     // xocs:item/item/bibrecord/head/abstracts section
>     StructType _abstract = new StructType(new StructField[]{
>         new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
>     });
>     StructType _abstracts = new StructType(new StructField[]{
>         new StructField("abstract", _abstract, true, Metadata.empty())
>     });
>
>     StructType _head = new StructType(new StructField[]{
>         new StructField("abstracts", _abstracts, true, Metadata.empty())
>     });
>
>     StructType _bibrecord = new StructType(new StructField[]{
>         new StructField("head", _head, true, Metadata.empty())
>     });
>
>     StructType _item = new StructType(new StructField[]{
>         new StructField("bibrecord", _bibrecord, true, Metadata.empty())
>     });
>
>     StructType _xocs_item = new StructType(new StructField[]{
>         new StructField("item", _item, true, Metadata.empty()),
>     });
>
>     StructType rexploreXMLDataSchema = new StructType(new StructField[]{
>         new StructField("xocs:item", _xocs_item, true, Metadata.empty()),
>     });
>
>     String localxml = "..filename.xml";
>
>     SparkSession spark = SparkSession
>         .builder()
>         .master("local[2]")
>         .appName("DatasetForCaseNew")
>         .getOrCreate();
>
>     String rowTag = "xocs:doc";
>
>     SQLContext sqlContext = new SQLContext(spark);
>     Dataset<Row> df = sqlContext.read()
>         .format("com.databricks.spark.xml")
>         .option("rowTag", rowTag)
>         .option("attributePrefix", "_")
>         .schema(rexploreXMLDataSchema)
>         .load(localxml);
>
>     df.printSchema();
>
>     df.select(
>         df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>             .getField("abstracts").getField("abstract").getField("_original"),
>         df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>             .getField("abstracts").getField("abstract").getItem("_lang"),
>         df.col("xocs:item").getField("item").getField("bibrecord").getItem("head")
>             .getField("abstracts").getField("abstract").getField("ce:para")
>     ).show();
>
>     // df.select(
>     //     df.col("_original"),
>     //     df.col("_lang"),
>     //     df.col("ce:para")
>     // ).show();
>
>     // df.select(
>     //     df.col("abstract").getField("_original"),
>     //     df.col("abstract").getField("_lang"),
>     //     df.col("abstract").getField("ce:para")
>     // ).show();
>
>     // df.select(
>     //     df.col("head").getField("abstracts").getField("abstract").getField("_original"),
>     //     df.col("head").getField("abstracts").getField("abstract").getField("_lang"),
>     //     df.col("head").getField("abstracts").getField("abstract").getField("ce:para")
>     // ).show();
>
> }
>
> On 13 Feb 2017, at 18:17, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
>
> Dear All,
>
> I am using spark-xml_2.10 to parse and extract some data from XML files.
>
> I am getting null values even though the XML file actually contains
> values.
>
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
> |xocs:item.bibrecord.head.abstracts.abstract._original|xocs:item.bibrecord.head.abstracts.abstract._lang|xocs:item.bibrecord.head.abstracts.abstract.ce:para|
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
> |                                                 null|                                             null|                                               null|
> +-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
>
> Below, I report an example of the XML that I am processing and the code
> I am using to parse it.
>
> What am I doing wrong?
>
> Please, any help on this would be very much appreciated.
>
> Many Thanks in advance,
> Best Regards,
> Carlo
>
> ===== An example
>
> SPARK prints the following schema:
>
> root
>  |-- xocs:item: struct (nullable = true)
>  |    |-- bibrecord: struct (nullable = true)
>  |    |    |-- head: struct (nullable = true)
>  |    |    |    |-- abstracts: struct (nullable = true)
>  |    |    |    |    |-- abstract: struct (nullable = true)
>  |    |    |    |    |    |-- _original: string (nullable = true)
>  |    |    |    |    |    |-- _lang: string (nullable = true)
>
> ==== XML file example
>
> XML file structure:
>
> <xocs:doc
>   <xocs:item>
>     <item>
>       <bibrecord>
>         <head>
>           <abstracts>
>             <abstract original="y" xml:lang="eng">
>               <ce:para>
>                 This paper bla bla bla...
>               </ce:para>
>
> === SPARK code:
>
> public class XMLParser {
>
>     public static void main(String arr[]) {
>
>         // xocs:item/item/bibrecord/head/abstracts section
>         StructType _abstract = new StructType(new StructField[]{
>             new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
>             new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
>         });
>         StructType _abstracts = new StructType(new StructField[]{
>             new StructField("abstract", _abstract, true, Metadata.empty())
>         });
>
>         StructType _head = new StructType(new StructField[]{
>             new StructField("abstracts", _abstracts, true, Metadata.empty())
>         });
>
>         StructType _bibrecord = new StructType(new StructField[]{
>             new StructField("head", _head, true, Metadata.empty())
>         });
>
>         StructType _xocs_item = new StructType(new StructField[]{
>             new StructField("bibrecord", _bibrecord, true, Metadata.empty())
>         });
>
>         StructType rexploreXMLDataSchema = new StructType(new StructField[]{
>             new StructField("xocs:item", _xocs_item, true, Metadata.empty()),
>         });
>
>         String localxml = "/Users/carloallocca/Desktop/Spark Material/francesco2.xml";
>
>         SparkSession spark = SparkSession
>             .builder()
>             .master("local[2]")
>             .appName("DatasetForCaseNew")
>             .getOrCreate();
>
>         String rowTag = "xocs:doc";
>
>         SQLContext sqlContext = new SQLContext(spark);
>         Dataset<Row> df = sqlContext.read()
>             .format("com.databricks.spark.xml")
>             .option("rowTag", rowTag)
>             .option("attributePrefix", "_")
>             .schema(rexploreXMLDataSchema)
>             .load(localxml);
>
>         df.printSchema();
>
>         df.select(
>             df.col("xocs:item").getField("bibrecord").getItem("head")
>                 .getField("abstracts").getField("abstract").getField("_original"),
>             df.col("xocs:item").getField("bibrecord").getItem("head")
>                 .getField("abstracts").getField("abstract").getItem("_lang"),
>             df.col("xocs:item").getField("bibrecord").getItem("head")
>                 .getField("abstracts").getField("abstract").getField("ce:para")
>         ).show();
>
>     }
>
> }
>
> -- The Open University is incorporated by Royal Charter (RC 000391), an
> exempt charity in England & Wales and a charity registered in Scotland
> (SC 038302). The Open University is authorised and regulated by the
> Financial Conduct Authority.
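
On the malformed-XML possibility raised at the top of this thread: before filing the issue, it may help to confirm that the sample file is well-formed at all. The following is only a minimal sketch using the JDK's own javax.xml.parsers, not anything from spark-xml. The class name XmlWellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration, and the check assumes the XML is available as a string. The factory is left at its default (not namespace-aware), so undeclared prefixes such as xocs: or ce: are not reported as errors. A passing check only means the markup is well-formed; it says nothing about whether spark-xml maps it onto the given schema.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of spark-xml): checks XML well-formedness
// with the JDK's built-in DOM parser.
class XmlWellFormedCheck {

    // Returns true if the given string parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            // Default factory: not namespace-aware, no validation.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler
            // print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A start tag that is never closed with '>' (as in the abbreviated
        // sample in this thread) is rejected by the parser.
        System.out.println(isWellFormed("<doc><item>text</item></doc>"));
        System.out.println(isWellFormed("<doc <item>text</item></doc>"));
    }
}
```

If the real file fails a check like this, attaching the parser's error position to the GitHub issue would likely make the report easier to act on.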