[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content

Giuseppe Ceravolo (Jira) Wed, 04 Oct 2023 06:34:06 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Giuseppe Ceravolo updated SPARK-45414:
--------------------------------------
    Description: 
h1. Intro

Hi all! Please expect some degree of incompleteness in this issue as this is 
the very first one I post, and feel free to edit it as you like - I welcome 
your feedback.

My goal is to provide you with as many details and indications as I can on this 
issue that I am currently facing with a Client of mine on its Production 
environment (we use Azure Databricks DBR 11.3 LTS).

I was told by [Sean Owen|[srowen (Sean Owen) 
(github.com)|https://github.com/srowen]], who maintains the spark-xml maven 
repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an 
issue here because "This code has been ported to Apache Spark now anyway so 
won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).]
h1. Issue

When I write a DataFrame into xml format via the spark-xml library either (1) I 
get an error if empty string columns are in between non-string nested ones or 
(2) if I put all string columns at the end then I get a wrong xml where the 
content of string tags are misplaced into the following ones.
h1. Code to reproduce the issue

Please find below the end-to-end code snippet that results into the error
h2. CASE (1): ERROR

When empty strings are in between non-string nested ones, the write fails. 
Please find attached the full trace of the error.

```py

fake_file_df = spark \
    .sql(
        """SELECT
            CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
            CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: 
STRING, _VALUE: STRING>) AS UPC,
            CAST('' AS STRING) AS _SerialNumberFlag,
            CAST('MyDescription' AS STRING) AS Description,
            CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS 
ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
            CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS 
ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
            CAST('' AS STRING) AS Color,
            CAST('' AS STRING) AS IntendedIndustry,
            CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
Manufacturer,
            CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS 
Marketing,
            CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS 
BrandOwner,
            CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) 
AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
ItemAttribute_culinary,
            CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS 
`AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS 
ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, 
AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
            CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS 
`Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, 
STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: 
STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, 
Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements,
            CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
'1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
TaxAmount: STRING>) AS TaxInformation,
            CAST('' AS STRING) AS ItemImageUrl,
            CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, 
NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, 
_franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
            CAST('Add' AS STRING) AS _Action
        ;"""
    )

# fake_file_df.display()

fake_file_df \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)

```

I noticed that it works if I try to write all columns up to "Color" (excluded), 
namely:
fake_file_df \
.select( "ItemID", "UPC", "_SerialNumberFlag", "Description", 
"MerchandiseHierarchy", "ItemPrice" ) \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)
h2. CASE (2): MISPLACED XML

When I put all string columns at the end of the 1-row DataFrame it mistakenly 
writes the content of one column into the tag right after it.
fake_file_df = spark \
.sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS `_Type`, 
'123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, 
CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: 
STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS 
`_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, 
_Value: STRING>>) AS ItemPrice, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: 
STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: 
STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) 
AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS 
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, 
ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS 
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: 
ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS 
ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS 
`_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS 
`Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, 
STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS 
STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: 
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: 
STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: 
STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS 
`TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, 
TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, 
CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS 
`_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: 
STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('' AS STRING) AS 
_SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST('' AS 
STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST('' AS STRING) AS 
ItemImageUrl, CAST('Add' AS STRING) AS _Action ;""" )fake_file_df \
.coalesce(1) \
.write \
.format('com.databricks.spark.xml') \
.option('declaration', 'version="1.0" encoding="UTF-8"') \
.option("nullValue", "") \
.option('rootTag', "root_tag") \
.option('rowTag', "row_tag") \
.mode('overwrite') \
.save(xml_folder_path)
The output is a wrong xml where "MyDescription" is written inside the "Color" 
tag instead of the "Description" tag (but if you display the "fake_file_df" 
DataFrame it looks good as "MyDescription" is under the "Description" column 
(see image below):).
<?xml version="1.0" encoding="UTF-8"?>
<root_tag>
<row_tag SerialNumberFlag="" Action="Add">
<ItemID Type="ItemId">123</ItemID>
<UPC Type="UPC">123</UPC>
<MerchandiseHierarchy ID="" Level=""/>
<ItemPrice ValueTypeCode="" Value=""/>
<Manufacturer>
<Name></Name>
</Manufacturer>
<Marketing>
<Season></Season>
</Marketing>
<BrandOwner Name=""/>
<ItemAttribute_culinary Name="Attribute1">
<AttributeValue>Value1</AttributeValue>
</ItemAttribute_culinary>
<ItemAttribute_noculinary Name="">
<item>
<AttributeCode></AttributeCode>
<AttributeValue></AttributeValue>
</item>
</ItemAttribute_noculinary>
<ItemMeasurements>
<Depth UnitOfMeasure=""></Depth>
<Height UnitOfMeasure=""></Height>
<Width UnitOfMeasure=""></Width>
<Diameter UnitOfMeasure=""></Diameter>
</ItemMeasurements>
<TaxInformation>
<TaxGroupID>GroupA</TaxGroupID>
<TaxExemptCode>CodeA</TaxExemptCode>
<TaxAmount>1</TaxAmount>
</TaxInformation>
<ItemFranchisees>
<item action="" franchiseeId="" franchiseeName=""/>
</ItemFranchisees>
<Description></Description>
<Color>{color:#ff0000}MyDescription{color}</Color>
<IntendedIndustry></IntendedIndustry>
<ItemImageUrl></ItemImageUrl>
</row_tag>
</root_tag>
 

Thanks!  Giuseppe Ceravolo

 

  was:
h1. Intro

Hi all! Please expect some degree of incompleteness in this issue as this is 
the very first I post, and feel free to edit it as you like - I welcome your 
feedback.

My goal is to provide you with as many details and indications as I can on this 
issue that I am currently facing with a Client of mine on its Production 
environment (we use Azure Databricks DBR 11.3 LTS).

I was told by [Sean Owen|[srowen (Sean Owen) 
(github.com)|https://github.com/srowen]], who maintains the spark-xml maven 
repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an 
issue here because "This code has been ported to Apache Spark now anyway so 
won't be updated here" (refer to his comment 
[here|[https://github.com/databricks/spark-xml/issues/431#issuecomment-1744792958]).]
h1. Issue

When I write a DataFrame into xml format via the spark-xml library either (1) I 
get an error if empty string columns are in between non-string nested ones or 
(2) if I put all string columns at the end then I get a wrong xml where the 
content of string tags are misplaced into the following ones.
h1. Code to reproduce the issue

Please find below the end-to-end code snippet that results into the error
h2. CASE (1): ERROR

When empty strings are in between non-string nested ones, the write fails. 
Please find attached the full trace of the error.
fake_file_df = spark \
    .sql(        """SELECT            CAST(STRUCT('ItemId' AS `_Type`, '123' AS 
`_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,            
CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, 
_VALUE: STRING>) AS UPC,            CAST('' AS STRING) AS _SerialNumberFlag,    
        CAST('MyDescription' AS STRING) AS Description,            
CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: 
STRING, _Level: STRING>>) AS MerchandiseHierarchy,            
CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS 
ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,            
CAST('' AS STRING) AS Color,            CAST('' AS STRING) AS IntendedIndustry, 
           CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
Manufacturer,            CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: 
STRING>) AS Marketing,            CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: 
STRING>) AS BrandOwner,            CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 
'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) 
AS ItemAttribute_culinary,            CAST(ARRAY(STRUCT(NULL AS `_Name`, 
ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS 
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: 
ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS 
ItemAttribute_noculinary,            CAST(STRUCT(STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: 
STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, 
Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, 
           CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
'1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
TaxAmount: STRING>) AS TaxInformation,            CAST('' AS STRING) AS 
ItemImageUrl,            CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS 
`_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: 
STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,  
          CAST('Add' AS STRING) AS _Action        ;"""    )# 
fake_file_df.display()fake_file_df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
I noticed that it works if I try to write all columns up to "Color" (excluded), 
namely:
fake_file_df \
    .select(        "ItemID",        "UPC",        "_SerialNumberFlag",        
"Description",        "MerchandiseHierarchy",        "ItemPrice"    ) \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
h2. CASE (2): MISPLACED XML

When I put all string columns at the end of the 1-row DataFrame it mistakenly 
writes the content of one column into the tag right after it.
fake_file_df = spark \
    .sql(        """SELECT            CAST(STRUCT('ItemId' AS `_Type`, '123' AS 
`_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,            
CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, 
_VALUE: STRING>) AS UPC,            CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS 
`_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS 
MerchandiseHierarchy,            CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, 
NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS 
ItemPrice,            CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
Manufacturer,            CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: 
STRING>) AS Marketing,            CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: 
STRING>) AS BrandOwner,            CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 
'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) 
AS ItemAttribute_culinary,            CAST(ARRAY(STRUCT(NULL AS `_Name`, 
ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS 
`_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: 
ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS 
ItemAttribute_noculinary,            CAST(STRUCT(STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
`_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: 
STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, 
Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, 
           CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
'1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
TaxAmount: STRING>) AS TaxInformation,            CAST(ARRAY(ARRAY(STRUCT(NULL 
AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS 
ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: 
STRING>>>) AS ItemFranchisees,            CAST('' AS STRING) AS 
_SerialNumberFlag,            CAST('MyDescription' AS STRING) AS Description,   
         CAST('' AS STRING) AS Color,            CAST('' AS STRING) AS 
IntendedIndustry,            CAST('' AS STRING) AS ItemImageUrl,            
CAST('Add' AS STRING) AS _Action        ;"""    )fake_file_df \
    .coalesce(1) \
    .write \
    .format('com.databricks.spark.xml') \
    .option('declaration', 'version="1.0" encoding="UTF-8"') \
    .option("nullValue", "") \
    .option('rootTag', "root_tag") \
    .option('rowTag', "row_tag") \
    .mode('overwrite') \
    .save(xml_folder_path)
The output is a wrong xml where "MyDescription" is written inside the "Color" 
tag instead of the "Description" tag (but if you display the "fake_file_df" 
DataFrame it looks good as "MyDescription" is under the "Description" column 
(see image below):).
<?xml version="1.0" encoding="UTF-8"?>
<root_tag>
    <row_tag SerialNumberFlag="" Action="Add">
        <ItemID Type="ItemId">123</ItemID>
        <UPC Type="UPC">123</UPC>
        <MerchandiseHierarchy ID="" Level=""/>
        <ItemPrice ValueTypeCode="" Value=""/>
        <Manufacturer>
            <Name></Name>
        </Manufacturer>
        <Marketing>
            <Season></Season>
        </Marketing>
        <BrandOwner Name=""/>
        <ItemAttribute_culinary Name="Attribute1">
            <AttributeValue>Value1</AttributeValue>
        </ItemAttribute_culinary>
        <ItemAttribute_noculinary Name="">
            <item>
                <AttributeCode></AttributeCode>
                <AttributeValue></AttributeValue>
            </item>
        </ItemAttribute_noculinary>
        <ItemMeasurements>
            <Depth UnitOfMeasure=""></Depth>
            <Height UnitOfMeasure=""></Height>
            <Width UnitOfMeasure=""></Width>
            <Diameter UnitOfMeasure=""></Diameter>
        </ItemMeasurements>
        <TaxInformation>
            <TaxGroupID>GroupA</TaxGroupID>
            <TaxExemptCode>CodeA</TaxExemptCode>
            <TaxAmount>1</TaxAmount>
        </TaxInformation>
        <ItemFranchisees>
            <item action="" franchiseeId="" franchiseeName=""/>
        </ItemFranchisees>
        <Description></Description>
        <Color>{color:#FF0000}MyDescription{color}</Color>
        <IntendedIndustry></IntendedIndustry>
        <ItemImageUrl></ItemImageUrl>
    </row_tag>
</root_tag>
 

Thanks!  Giuseppe Ceravolo

 


> spark-xml misplaces string tag content
> --------------------------------------
>
>                 Key: SPARK-45414
>                 URL: https://issues.apache.org/jira/browse/SPARK-45414
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Giuseppe Ceravolo
>            Priority: Critical
>         Attachments: IllegalArgumentException.txt
>
>
> h1. Intro
> Hi all! Please expect some degree of incompleteness in this issue as this is 
> the very first one I post, and feel free to edit it as you like - I welcome 
> your feedback.
> My goal is to provide you with as many details and indications as I can on 
> this issue that I am currently facing with a Client of mine on its Production 
> environment (we use Azure Databricks DBR 11.3 LTS).
> I was told by [Sean Owen|[srowen (Sean Owen) 
> (github.com)|https://github.com/srowen]], who maintains the spark-xml maven 
> repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an 
> issue here because "This code has been ported to Apache Spark now anyway so 
> won't be updated here" (refer to his comment 
> [here|#issuecomment-1744792958]).]
> h1. Issue
> When I write a DataFrame into xml format via the spark-xml library either (1) 
> I get an error if empty string columns are in between non-string nested ones 
> or (2) if I put all string columns at the end then I get a wrong xml where 
> the content of string tags are misplaced into the following ones.
> h1. Code to reproduce the issue
> Please find below the end-to-end code snippet that results into the error
> h2. CASE (1): ERROR
> When empty strings are in between non-string nested ones, the write fails. 
> Please find attached the full trace of the error.
> ```py
> fake_file_df = spark \
>     .sql(
>         """SELECT
>             CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
> STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID,
>             CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: 
> STRING, _VALUE: STRING>) AS UPC,
>             CAST('' AS STRING) AS _SerialNumberFlag,
>             CAST('MyDescription' AS STRING) AS Description,
>             CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS 
> ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy,
>             CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS 
> ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice,
>             CAST('' AS STRING) AS Color,
>             CAST('' AS STRING) AS IntendedIndustry,
>             CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS 
> Manufacturer,
>             CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS 
> Marketing,
>             CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS 
> BrandOwner,
>             CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) 
> AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
> ItemAttribute_culinary,
>             CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS 
> `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS 
> ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, 
> AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary,
>             CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS 
> `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, 
> STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS 
> `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: 
> STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: 
> STRING>>) AS ItemMeasurements,
>             CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, 
> '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, 
> TaxAmount: STRING>) AS TaxInformation,
>             CAST('' AS STRING) AS ItemImageUrl,
>             CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS 
> `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: 
> STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees,
>             CAST('Add' AS STRING) AS _Action
>         ;"""
>     )
> # fake_file_df.display()
> fake_file_df \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path)
> ```
> I noticed that it works if I try to write all columns up to "Color" 
> (excluded), namely:
> fake_file_df \
> .select( "ItemID", "UPC", "_SerialNumberFlag", "Description", 
> "MerchandiseHierarchy", "ItemPrice" ) \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path)
> h2. CASE (2): MISPLACED XML
> When I put all string columns at the end of the 1-row DataFrame it mistakenly 
> writes the content of one column into the tag right after it.
> fake_file_df = spark \
> .sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS 
> STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS 
> `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, 
> CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: 
> STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS 
> `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, 
> _Value: STRING>>) AS ItemPrice, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: 
> STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: 
> STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) 
> AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS 
> `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS 
> ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, 
> ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS 
> `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: 
> ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS 
> ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL 
> AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) 
> AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, 
> STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS 
> STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: 
> STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: 
> STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: 
> STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' 
> AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, 
> TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, 
> CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS 
> `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: 
> STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('' AS STRING) AS 
> _SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST('' AS 
> STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST('' AS STRING) 
> AS ItemImageUrl, CAST('Add' AS STRING) AS _Action ;""" )fake_file_df \
> .coalesce(1) \
> .write \
> .format('com.databricks.spark.xml') \
> .option('declaration', 'version="1.0" encoding="UTF-8"') \
> .option("nullValue", "") \
> .option('rootTag', "root_tag") \
> .option('rowTag', "row_tag") \
> .mode('overwrite') \
> .save(xml_folder_path)
> The output is a wrong xml where "MyDescription" is written inside the "Color" 
> tag instead of the "Description" tag (but if you display the "fake_file_df" 
> DataFrame it looks good as "MyDescription" is under the "Description" column 
> (see image below):).
> <?xml version="1.0" encoding="UTF-8"?>
> <root_tag>
> <row_tag SerialNumberFlag="" Action="Add">
> <ItemID Type="ItemId">123</ItemID>
> <UPC Type="UPC">123</UPC>
> <MerchandiseHierarchy ID="" Level=""/>
> <ItemPrice ValueTypeCode="" Value=""/>
> <Manufacturer>
> <Name></Name>
> </Manufacturer>
> <Marketing>
> <Season></Season>
> </Marketing>
> <BrandOwner Name=""/>
> <ItemAttribute_culinary Name="Attribute1">
> <AttributeValue>Value1</AttributeValue>
> </ItemAttribute_culinary>
> <ItemAttribute_noculinary Name="">
> <item>
> <AttributeCode></AttributeCode>
> <AttributeValue></AttributeValue>
> </item>
> </ItemAttribute_noculinary>
> <ItemMeasurements>
> <Depth UnitOfMeasure=""></Depth>
> <Height UnitOfMeasure=""></Height>
> <Width UnitOfMeasure=""></Width>
> <Diameter UnitOfMeasure=""></Diameter>
> </ItemMeasurements>
> <TaxInformation>
> <TaxGroupID>GroupA</TaxGroupID>
> <TaxExemptCode>CodeA</TaxExemptCode>
> <TaxAmount>1</TaxAmount>
> </TaxInformation>
> <ItemFranchisees>
> <item action="" franchiseeId="" franchiseeName=""/>
> </ItemFranchisees>
> <Description></Description>
> <Color>{color:#ff0000}MyDescription{color}</Color>
> <IntendedIndustry></IntendedIndustry>
> <ItemImageUrl></ItemImageUrl>
> </row_tag>
> </root_tag>
>  
> Thanks!  Giuseppe Ceravolo
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-45414) spark-xml misplaces string tag content

Reply via email to