Re: using spark-xml_2.10 to extract data from XML file

2017-02-15 Thread Carlo . Allocca
Hi Hyukjin,

Thank you very much for this.

Sure, I will open the issue today with the sample data and the Java code.

Many thanks for the support.
Best Regards,
Carlo



Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo,


Earlier versions of the library had a bug when accessing nested values.

If that is not the cause, I suspect a separate issue with parsing malformed XML.

Could you open an issue at
https://github.com/databricks/spark-xml/issues with your sample data?

I will stick with it until it is solved.


Thanks.
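
Since malformed XML is one of the suspected causes, a quick check outside Spark can help rule it out before filing the issue. The following is a minimal sketch using only the JDK's built-in parser; the class and method names are hypothetical, and it only tests well-formedness, so it says nothing about how spark-xml itself treats the file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of spark-xml.
final class XmlWellFormedCheck {

    /** Returns true if the JDK parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            // The default factory is not namespace-aware, so prefixed names
            // such as xocs:doc are treated as plain element names.
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // A fatal parse error (for example a mismatched tag) ends up here.
            return false;
        }
    }
}
```

For a file on disk, the same parser can be pointed at a java.io.File instead of a string.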




Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Carlo . Allocca
More specifically, this is the structure of the XML file:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)


CASE 1:

String rowTag = "abstracts";
Dataset<Row> df = new XmlReader()
        .withAttributePrefix("_")
        .withRowTag(rowTag)
        .xmlFile(sqlContext, localxml);
df.select(
        df.col("abstract.ce:para"),
        df.col("abstract._original"),
        df.col("abstract._lang")).show();

I got the right values.

CASE 2:

String rowTag = "xocs:doc";
Dataset<Row> df = new XmlReader()
        .withAttributePrefix("_")
        .withRowTag(rowTag)
        .xmlFile(sqlContext, localxml);
df.select(
        df.col("xocs:item.item.bibrecord.head.abstracts.abstract.ce:para")).show();

I got null values.


My question is: how can I use rowTag = "xocs:doc" and still get the right
values for ….abstract.ce:para, etc.? What am I doing wrong?

Many thanks in advance.
Best Regards,
Carlo
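
The structure listing above shows bibrecord directly under xocs:item, while the select path in CASE 2 goes through an extra item level. One way to check which nesting a given file really has, independently of spark-xml, is to walk the elements with the JDK's DOM parser. The sketch below is only an illustration: the class name, the helper and the sample document (including its namespace URIs and values) are made up and are not the poster's data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of spark-xml.
final class NestedPathProbe {

    /**
     * Walks child elements by tag name, starting below the root element.
     * Returns the text of the last element, or null if any step is missing.
     */
    static String textAt(String xml, String... path) {
        Element current;
        try {
            current = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        for (String name : path) {
            Element next = null;
            for (Node n = current.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && n.getNodeName().equals(name)) {
                    next = (Element) n;
                    break;
                }
            }
            if (next == null) {
                return null; // this step does not exist under the current element
            }
            current = next;
        }
        return current.getTextContent();
    }

    // Made-up sample that assumes the extra <item> level is present.
    static final String SAMPLE =
            "<xocs:doc xmlns:xocs=\"urn:example:xocs\" xmlns:ce=\"urn:example:ce\">"
            + "<xocs:item><item><bibrecord><head><abstracts>"
            + "<abstract original=\"y\" lang=\"eng\"><ce:para>Some text</ce:para></abstract>"
            + "</abstracts></head></bibrecord></item></xocs:item></xocs:doc>";
}
```

With this sample, the path that includes item reaches ce:para, and the path that skips it returns null. Running the same probe against the real file would show which of the two paths matches its nesting.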




Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Carlo . Allocca
Dear All,

I would like to ask for your help with the following issue when using
spark-xml_2.10:

Given an XML file with the following structure:

xocs:doc
 |-- xocs:item: struct (nullable = true)
 |    |-- bibrecord: struct (nullable = true)
 |    |    |-- head: struct (nullable = true)
 |    |    |    |-- abstracts: struct (nullable = true)
 |    |    |    |    |-- abstract: struct (nullable = true)
 |    |    |    |    |    |-- _original: string (nullable = true)
 |    |    |    |    |    |-- _lang: string (nullable = true)
 |    |    |    |    |    |-- ce:para: string (nullable = true)

Using the code below to extract all the info from the abstract:

1) I get null for each of the three values (_original, _lang and ce:para) when
I use rowTag = "xocs:doc".
2) I get the right values when I use rowTag = "abstracts".

Of course, I need to write a general parser that works at the xocs:doc level.
I have been reading the documentation at
https://github.com/databricks/spark-xml, but it did not help me much with the
above issue.

Am I doing something wrong, or might it be related to a bug in the library I am
using?

Could you please advise?

Many Thanks,
Best Regards,
carlo





=== Code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// The class name is assumed here; the original post shows only main().
public class DatasetForCaseNew {

    public static void main(String[] arr) {

        // xocs:item/item/bibrecord/head/abstracts section
        StructType _abstract = new StructType(new StructField[]{
            new StructField("_original", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_lang", DataTypes.StringType, true, Metadata.empty()),
            new StructField("ce:para", DataTypes.StringType, true, Metadata.empty())
        });

        StructType _abstracts = new StructType(new StructField[]{
            new StructField("abstract", _abstract, true, Metadata.empty())
        });

        StructType _head = new StructType(new StructField[]{
            new StructField("abstracts", _abstracts, true, Metadata.empty())
        });

        StructType _bibrecord = new StructType(new StructField[]{
            new StructField("head", _head, true, Metadata.empty())
        });

        StructType _item = new StructType(new StructField[]{
            new StructField("bibrecord", _bibrecord, true, Metadata.empty())
        });

        StructType _xocs_item = new StructType(new StructField[]{
            new StructField("item", _item, true, Metadata.empty())
        });

        // Top-level schema for one xocs:doc row
        StructType rexploreXMLDataSchema = new StructType(new StructField[]{
            new StructField("xocs:item", _xocs_item, true, Metadata.empty())
        });

        String localxml = "..filename.xml";

        SparkSession spark = SparkSession
            .builder()
            .master("local[2]")
            .appName("DatasetForCaseNew")
            .getOrCreate();

        String rowTag = "xocs:doc";

        SQLContext sqlContext = spark.sqlContext();
        Dataset<Row> df = sqlContext.read()
            .format("com.databricks.spark.xml")
            .option("rowTag", rowTag)
            .option("attributePrefix", "_")
            .schema(rexploreXMLDataSchema)
            .load(localxml);

        df.printSchema();

        // Select the three abstract values through the full nested path
        df.select(
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head")
                .getField("abstracts").getField("abstract").getField("_original"),
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head")
                .getField("abstracts").getField("abstract").getField("_lang"),
            df.col("xocs:item").getField("item").getField("bibrecord").getField("head")
                .getField("abstracts").getField("abstract").getField("ce:para")
        ).show();

        // Earlier attempts, left commented out:

        //df.select(
        //    df.col("_original"),
        //    df.col("_lang"),
        //    df.col("ce:para")
        //).show();

        //df.select(
        //    df.col("abstract").getField("_original"),
        //    df.col("abstract").getField("_lang"),
        //    df.col("abstract").getField("ce:para")
        //).show();

        //df.select(
        //    df.col("head").getField("abstracts").getField("abstract").getField("_original"),
        //    df.col("head").getField("abstracts").getField("abstract").getField("_lang"),
        //    df.col("head").getField("abstracts").getField("abstract").getField("ce:para")
        //).show();
    }
}




On 13 Feb 2017, at 18:17, Carlo.Allocca wrote:

Dear All,

I am using spark-xml_2.10 to parse and extract some data from XML files.

I am getting null values even though the XML file actually contains values.

++--++
|xocs:item.bibrecord.head.abstracts.abstract._original