Dear All, I am using spark-xml_2.10 to parse and extract some data from XML files.
I am getting null values even though the XML file actually contains values:

+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
|xocs:item.bibrecord.head.abstracts.abstract._original|xocs:item.bibrecord.head.abstracts.abstract._lang|xocs:item.bibrecord.head.abstracts.abstract.ce:para|
+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+
|                                                 null|                                             null|                                               null|
+-----------------------------------------------------+-------------------------------------------------+---------------------------------------------------+

Below is an example of the XML that I am processing and the code I am using to parse it. What am I doing wrong? Any help on this would be much appreciated.

Many thanks in advance,
Best Regards,
Carlo

===== An example

Spark prints the following schema:

root
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)

==== XML file example

XML file structure:

<xocs:doc>
  <xocs:item>
    <item>
      <bibrecord>
        <head>
          <abstracts>
            <abstract original="y" xml:lang="eng">
              <ce:para>
                This paper bla bla bla...
              </ce:para>

=== Spark code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class XMLParser {

    public static void main(String[] args) {

        // xocs:item/item/bibrecord/head/abstracts section
        // Attributes are addressed with the "_" prefix (see attributePrefix below).
        StructType _abstract = new StructType(new StructField[]{
            new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
            new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
        });
        StructType _abstracts = new StructType(new StructField[]{
            new StructField("abstract", _abstract, true, Metadata.empty())
        });
        StructType _head = new StructType(new StructField[]{
            new StructField("abstracts", _abstracts, true, Metadata.empty())
        });
        StructType _bibrecord = new StructType(new StructField[]{
            new StructField("head", _head, true, Metadata.empty())
        });
        StructType _xocs_item = new StructType(new StructField[]{
            new StructField("bibrecord", _bibrecord, true, Metadata.empty())
        });
        // Top-level schema of one row, i.e. of one <xocs:doc> element
        StructType rexploreXMLDataSchema = new StructType(new StructField[]{
            new StructField("xocs:item", _xocs_item, true, Metadata.empty())
        });

        String localxml = "/Users/carloallocca/Desktop/Spark Material/francesco2.xml";

        SparkSession spark = SparkSession
            .builder()
            .master("local[2]")
            .appName("DatasetForCaseNew")
            .getOrCreate();

        // Each <xocs:doc> element becomes one row of the Dataset
        String rowTag = "xocs:doc";

        SQLContext sqlContext = new SQLContext(spark);
        Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.xml")
            .option("rowTag", rowTag)
            .option("attributePrefix", "_")
            .schema(rexploreXMLDataSchema)
            .load(localxml);

        df.printSchema();

        // Select the two attributes and the text of <ce:para>
        df.select(
            df.col("xocs:item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("_original"),
            df.col("xocs:item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("_lang"),
            df.col("xocs:item").getField("bibrecord").getField("head").getField("abstracts").getField("abstract").getField("ce:para")
        ).show();
    }
}
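
In case it helps with reproducing the problem, the element nesting of the sample can be listed without Spark. The following is only a minimal sketch using the JDK's DOM parser. The class name XmlPathLister, the helper elementPaths, and the inline SAMPLE string (the XML fragment above with the closing tags added by hand) are assumptions made for illustration and are not part of my original program.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XmlPathLister {

    // Hypothetical sample: the fragment from this message, closed by hand
    // so that it is well-formed. The real file may contain more elements.
    public static final String SAMPLE =
        "<xocs:doc><xocs:item><item><bibrecord><head><abstracts>"
      + "<abstract original=\"y\" xml:lang=\"eng\">"
      + "<ce:para>This paper bla bla bla...</ce:para>"
      + "</abstract></abstracts></head></bibrecord></item></xocs:item></xocs:doc>";

    // Returns the slash-separated path of every element, in document order.
    public static List<String> elementPaths(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Not namespace-aware, so undeclared prefixes such as "xocs:" are
        // kept as part of the plain tag name instead of causing an error.
        factory.setNamespaceAware(false);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> paths = new ArrayList<>();
        collect(doc.getDocumentElement(), "", paths);
        return paths;
    }

    // Depth-first walk that records the path of each element node.
    private static void collect(Element element, String parentPath, List<String> out) {
        String path = parentPath.isEmpty()
            ? element.getTagName()
            : parentPath + "/" + element.getTagName();
        out.add(path);
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                collect((Element) child, path, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        for (String path : elementPaths(SAMPLE)) {
            System.out.println(path);
        }
    }
}
```

Comparing the printed paths level by level with the nesting declared in the StructType schema may show whether the two line up.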