[ https://issues.apache.org/jira/browse/SPARK-45414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Giuseppe Ceravolo updated SPARK-45414: -------------------------------------- Description: h1. Intro Hi all! Please expect some degree of incompleteness in this issue as this is the very first one I post, and feel free to edit it as you like - I welcome your feedback. My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS). I was told by [Sean Owen|[srowen (Sean Owen) (github.com)|https://github.com/srowen]], who maintains the spark-xml maven repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).] h1. Issue When I write a DataFrame into xml format via the spark-xml library either (1) I get an error if empty string columns are in between non-string nested ones or (2) if I put all string columns at the end then I get a wrong xml where the content of string tags are misplaced into the following ones. h1. Code to reproduce the issue Please find below the end-to-end code snippet that results into the error h2. CASE (1): ERROR When empty strings are in between non-string nested ones, the write fails with the following error. _Caused by: java.lang.IllegalArgumentException: Failed to convert value MyDescription (class of class java.lang.String) in type ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) to XML._ Please find attached the full trace of the error. {code:python} fake_file_df = spark \ .sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, CAST('' AS STRING) AS _SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, CAST('' AS STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, CAST('' AS STRING) AS ItemImageUrl, CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('Add' AS STRING) AS _Action ;""" ) # fake_file_df.display() fake_file_df \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path) {code} I noticed that it works if I try to write all columns up to "Color" (excluded), namely: {code:python} fake_file_df \ .select( "ItemID", "UPC", "_SerialNumberFlag", "Description", "MerchandiseHierarchy", "ItemPrice" ) \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path){code} h2. CASE (2): MISPLACED XML When I put all string columns at the end of the 1-row DataFrame it mistakenly writes the content of one column into the tag right after it. {code:python} fake_file_df = spark \ .sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('' AS STRING) AS _SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST('' AS STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST('' AS STRING) AS ItemImageUrl, CAST('Add' AS STRING) AS _Action ;""" ) fake_file_df \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path) {code} The output is a wrong xml where "MyDescription" is written inside the "Color" tag instead of the "Description" tag (but if you display the "fake_file_df" DataFrame it looks good as "MyDescription" is under the "Description" column). {code:xml} <?xml version="1.0" encoding="UTF-8"?> <root_tag> <row_tag SerialNumberFlag="" Action="Add"> <ItemID Type="ItemId">123</ItemID> <UPC Type="UPC">123</UPC> <MerchandiseHierarchy ID="" Level=""/> <ItemPrice ValueTypeCode="" Value=""/> <Manufacturer> <Name></Name> </Manufacturer> <Marketing> <Season></Season> </Marketing> <BrandOwner Name=""/> <ItemAttribute_culinary Name="Attribute1"> <AttributeValue>Value1</AttributeValue> </ItemAttribute_culinary> <ItemAttribute_noculinary Name=""> <item> <AttributeCode></AttributeCode> <AttributeValue></AttributeValue> </item> </ItemAttribute_noculinary> <ItemMeasurements> <Depth UnitOfMeasure=""></Depth> <Height UnitOfMeasure=""></Height> <Width UnitOfMeasure=""></Width> <Diameter UnitOfMeasure=""></Diameter> </ItemMeasurements> <TaxInformation> <TaxGroupID>GroupA</TaxGroupID> <TaxExemptCode>CodeA</TaxExemptCode> <TaxAmount>1</TaxAmount> </TaxInformation> <ItemFranchisees> <item action="" franchiseeId="" franchiseeName=""/> </ItemFranchisees> <Description></Description> <Color>MyDescription</Color> <IntendedIndustry></IntendedIndustry> <ItemImageUrl></ItemImageUrl> </row_tag> </root_tag> {code} Thanks! Giuseppe Ceravolo was: h1. Intro Hi all! Please expect some degree of incompleteness in this issue as this is the very first one I post, and feel free to edit it as you like - I welcome your feedback. My goal is to provide you with as many details and indications as I can on this issue that I am currently facing with a Client of mine on its Production environment (we use Azure Databricks DBR 11.3 LTS). I was told by [Sean Owen|[srowen (Sean Owen) (github.com)|https://github.com/srowen]], who maintains the spark-xml maven repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an issue here because "This code has been ported to Apache Spark now anyway so won't be updated here" (refer to his comment [here|#issuecomment-1744792958]).] h1. Issue When I write a DataFrame into xml format via the spark-xml library either (1) I get an error if empty string columns are in between non-string nested ones or (2) if I put all string columns at the end then I get a wrong xml where the content of string tags are misplaced into the following ones. h1. Code to reproduce the issue Please find below the end-to-end code snippet that results into the error h2. CASE (1): ERROR When empty strings are in between non-string nested ones, the write fails with the following error. _Caused by: java.lang.IllegalArgumentException: Failed to convert value MyDescription (class of class java.lang.String) in type ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) to XML._ Please find attached the full trace of the error. {code:python} fake_file_df = spark \ .sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, CAST('' AS STRING) AS _SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, CAST('' AS STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, CAST('' AS STRING) AS ItemImageUrl, CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('Add' AS STRING) AS _Action ;""" ) # fake_file_df.display() fake_file_df \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path) {code} I noticed that it works if I try to write all columns up to "Color" (excluded), namely: {code:python} fake_file_df \ .select( "ItemID", "UPC", "_SerialNumberFlag", "Description", "MerchandiseHierarchy", "ItemPrice" ) \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path){code} h2. CASE (2): MISPLACED XML When I put all string columns at the end of the 1-row DataFrame it mistakenly writes the content of one column into the tag right after it. {code:python} fake_file_df = spark \ .sql( """SELECT CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: STRING, _VALUE: STRING>) AS UPC, CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS Manufacturer, CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS Marketing, CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS BrandOwner, CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS ItemAttribute_culinary, CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>>) AS ItemMeasurements, CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, TaxAmount: STRING>) AS TaxInformation, CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, CAST('' AS STRING) AS _SerialNumberFlag, CAST('MyDescription' AS STRING) AS Description, CAST('' AS STRING) AS Color, CAST('' AS STRING) AS IntendedIndustry, CAST('' AS STRING) AS ItemImageUrl, CAST('Add' AS STRING) AS _Action ;""" ) fake_file_df \ .coalesce(1) \ .write \ .format('com.databricks.spark.xml') \ .option('declaration', 'version="1.0" encoding="UTF-8"') \ .option("nullValue", "") \ .option('rootTag', "root_tag") \ .option('rowTag', "row_tag") \ .mode('overwrite') \ .save(xml_folder_path) {code} The output is a wrong xml where "MyDescription" is written inside the "Color" tag instead of the "Description" tag (but if you display the "fake_file_df" DataFrame it looks good as "MyDescription" is under the "Description" column). {code:xml} <?xml version="1.0" encoding="UTF-8"?> <root_tag> <row_tag SerialNumberFlag="" Action="Add"> <ItemID Type="ItemId">123</ItemID> <UPC Type="UPC">123</UPC> <MerchandiseHierarchy ID="" Level=""/> <ItemPrice ValueTypeCode="" Value=""/> <Manufacturer> <Name></Name> </Manufacturer> <Marketing> <Season></Season> </Marketing> <BrandOwner Name=""/> <ItemAttribute_culinary Name="Attribute1"> <AttributeValue>Value1</AttributeValue> </ItemAttribute_culinary> <ItemAttribute_noculinary Name=""> <item> <AttributeCode></AttributeCode> <AttributeValue></AttributeValue> </item> </ItemAttribute_noculinary> <ItemMeasurements> <Depth UnitOfMeasure=""></Depth> <Height UnitOfMeasure=""></Height> <Width UnitOfMeasure=""></Width> <Diameter UnitOfMeasure=""></Diameter> </ItemMeasurements> <TaxInformation> <TaxGroupID>GroupA</TaxGroupID> <TaxExemptCode>CodeA</TaxExemptCode> <TaxAmount>1</TaxAmount> </TaxInformation> <ItemFranchisees> <item action="" franchiseeId="" franchiseeName=""/> </ItemFranchisees> <Description></Description> <Color>MyDescription</Color> <IntendedIndustry></IntendedIndustry> <ItemImageUrl></ItemImageUrl> </row_tag> </root_tag> {code} Thanks! Giuseppe Ceravolo > spark-xml misplaces string tag content > -------------------------------------- > > Key: SPARK-45414 > URL: https://issues.apache.org/jira/browse/SPARK-45414 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core > Affects Versions: 3.3.0 > Reporter: Giuseppe Ceravolo > Priority: Critical > Attachments: IllegalArgumentException.txt > > > h1. Intro > Hi all! Please expect some degree of incompleteness in this issue as this is > the very first one I post, and feel free to edit it as you like - I welcome > your feedback. > My goal is to provide you with as many details and indications as I can on > this issue that I am currently facing with a Client of mine on its Production > environment (we use Azure Databricks DBR 11.3 LTS). > I was told by [Sean Owen|[srowen (Sean Owen) > (github.com)|https://github.com/srowen]], who maintains the spark-xml maven > repository on GitHub [here|[https://github.com/srowen/spark-xml],] to post an > issue here because "This code has been ported to Apache Spark now anyway so > won't be updated here" (refer to his comment > [here|#issuecomment-1744792958]).] > h1. Issue > When I write a DataFrame into xml format via the spark-xml library either (1) > I get an error if empty string columns are in between non-string nested ones > or (2) if I put all string columns at the end then I get a wrong xml where > the content of string tags are misplaced into the following ones. > h1. Code to reproduce the issue > Please find below the end-to-end code snippet that results into the error > h2. CASE (1): ERROR > When empty strings are in between non-string nested ones, the write fails > with the following error. > _Caused by: java.lang.IllegalArgumentException: Failed to convert value > MyDescription (class of class java.lang.String) in type > ArrayType(StructType(StructField(_ID,StringType,true),StructField(_Level,StringType,true)),true) > to XML._ > Please find attached the full trace of the error. > {code:python} > fake_file_df = spark \ > .sql( > """SELECT > CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS > STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, > CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: > STRING, _VALUE: STRING>) AS UPC, > CAST('' AS STRING) AS _SerialNumberFlag, > CAST('MyDescription' AS STRING) AS Description, > CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS > ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, > CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS > ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, > CAST('' AS STRING) AS Color, > CAST('' AS STRING) AS IntendedIndustry, > CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS > Manufacturer, > CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS > Marketing, > CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS > BrandOwner, > CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) > AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS > ItemAttribute_culinary, > CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS > `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS > ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, > AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, > CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS > `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, > STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS > `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: > STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: > STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: > STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: > STRING>>) AS ItemMeasurements, > CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, > '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, > TaxAmount: STRING>) AS TaxInformation, > CAST('' AS STRING) AS ItemImageUrl, > CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS > `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: > STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, > CAST('Add' AS STRING) AS _Action > ;""" > ) > # fake_file_df.display() > fake_file_df \ > .coalesce(1) \ > .write \ > .format('com.databricks.spark.xml') \ > .option('declaration', 'version="1.0" encoding="UTF-8"') \ > .option("nullValue", "") \ > .option('rootTag', "root_tag") \ > .option('rowTag', "row_tag") \ > .mode('overwrite') \ > .save(xml_folder_path) {code} > I noticed that it works if I try to write all columns up to "Color" > (excluded), namely: > {code:python} > fake_file_df \ > .select( > "ItemID", > "UPC", > "_SerialNumberFlag", > "Description", > "MerchandiseHierarchy", > "ItemPrice" > ) \ > .coalesce(1) \ > .write \ > .format('com.databricks.spark.xml') \ > .option('declaration', 'version="1.0" encoding="UTF-8"') \ > .option("nullValue", "") \ > .option('rootTag', "root_tag") \ > .option('rowTag', "row_tag") \ > .mode('overwrite') \ > .save(xml_folder_path){code} > h2. CASE (2): MISPLACED XML > When I put all string columns at the end of the 1-row DataFrame it mistakenly > writes the content of one column into the tag right after it. > {code:python} > fake_file_df = spark \ > .sql( > """SELECT > CAST(STRUCT('ItemId' AS `_Type`, '123' AS `_VALUE`) AS > STRUCT<_Type: STRING, _VALUE: STRING>) AS ItemID, > CAST(STRUCT('UPC' AS `_Type`, '123' AS `_VALUE`) AS STRUCT<_Type: > STRING, _VALUE: STRING>) AS UPC, > CAST(ARRAY(STRUCT(NULL AS `_ID`, NULL AS `_Level`)) AS > ARRAY<STRUCT<_ID: STRING, _Level: STRING>>) AS MerchandiseHierarchy, > CAST(ARRAY(STRUCT(NULL AS `_ValueTypeCode`, NULL AS `_VALUE`)) AS > ARRAY<STRUCT<_ValueTypeCode: STRING, _Value: STRING>>) AS ItemPrice, > CAST(STRUCT(NULL AS `Name`) AS STRUCT<Name: STRING>) AS > Manufacturer, > CAST(STRUCT(NULL AS `Season`) AS STRUCT<Season: STRING>) AS > Marketing, > CAST(STRUCT(NULL AS `_Name`) AS STRUCT<_Name: STRING>) AS > BrandOwner, > CAST(ARRAY(STRUCT('Attribute1' AS `_Name`, 'Value1' AS `_VALUE`)) > AS ARRAY<STRUCT<_Name: STRING, AttributeValue: STRING>>) AS > ItemAttribute_culinary, > CAST(ARRAY(STRUCT(NULL AS `_Name`, ARRAY(ARRAY(STRUCT(NULL AS > `AttributeCode`, NULL AS `AttributeValue`))) AS `_VALUE`)) AS > ARRAY<STRUCT<_Name: STRING, _VALUE: ARRAY<ARRAY<STRUCT<AttributeCode: STRING, > AttributeValue: STRING>>>>>) AS ItemAttribute_noculinary, > CAST(STRUCT(STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS > `Depth`, STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Height`, > STRUCT(NULL AS `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Width`, STRUCT(NULL AS > `_UnitOfMeasure`, NULL AS `_VALUE`) AS `Diameter`) AS STRUCT<Depth: > STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Height: > STRUCT<_UnitOfMeasure: STRING, _VALUE: STRING>, Width: STRUCT<_UnitOfMeasure: > STRING, _VALUE: STRING>, Diameter: STRUCT<_UnitOfMeasure: STRING, _VALUE: > STRING>>) AS ItemMeasurements, > CAST(STRUCT('GroupA' AS `TaxGroupID`, 'CodeA' AS `TaxExemptCode`, > '1' AS `TaxAmount`) AS STRUCT<TaxGroupID: STRING, TaxExemptCode: STRING, > TaxAmount: STRING>) AS TaxInformation, > CAST(ARRAY(ARRAY(STRUCT(NULL AS `_action`, NULL AS > `_franchiseeId`, NULL AS `_franchiseeName`))) AS ARRAY<ARRAY<STRUCT<_action: > STRING, _franchiseeId: STRING, _franchiseeName: STRING>>>) AS ItemFranchisees, > CAST('' AS STRING) AS _SerialNumberFlag, > CAST('MyDescription' AS STRING) AS Description, > CAST('' AS STRING) AS Color, > CAST('' AS STRING) AS IntendedIndustry, > CAST('' AS STRING) AS ItemImageUrl, > CAST('Add' AS STRING) AS _Action > ;""" > ) > fake_file_df \ > .coalesce(1) \ > .write \ > .format('com.databricks.spark.xml') \ > .option('declaration', 'version="1.0" encoding="UTF-8"') \ > .option("nullValue", "") \ > .option('rootTag', "root_tag") \ > .option('rowTag', "row_tag") \ > .mode('overwrite') \ > .save(xml_folder_path) {code} > The output is a wrong xml where "MyDescription" is written inside the "Color" > tag instead of the "Description" tag (but if you display the "fake_file_df" > DataFrame it looks good as "MyDescription" is under the "Description" column). > {code:xml} > <?xml version="1.0" encoding="UTF-8"?> > <root_tag> > <row_tag SerialNumberFlag="" Action="Add"> > <ItemID Type="ItemId">123</ItemID> > <UPC Type="UPC">123</UPC> > <MerchandiseHierarchy ID="" Level=""/> > <ItemPrice ValueTypeCode="" Value=""/> > <Manufacturer> > <Name></Name> > </Manufacturer> > <Marketing> > <Season></Season> > </Marketing> > <BrandOwner Name=""/> > <ItemAttribute_culinary Name="Attribute1"> > <AttributeValue>Value1</AttributeValue> > </ItemAttribute_culinary> > <ItemAttribute_noculinary Name=""> > <item> > <AttributeCode></AttributeCode> > <AttributeValue></AttributeValue> > </item> > </ItemAttribute_noculinary> > <ItemMeasurements> > <Depth UnitOfMeasure=""></Depth> > <Height UnitOfMeasure=""></Height> > <Width UnitOfMeasure=""></Width> > <Diameter UnitOfMeasure=""></Diameter> > </ItemMeasurements> > <TaxInformation> > <TaxGroupID>GroupA</TaxGroupID> > <TaxExemptCode>CodeA</TaxExemptCode> > <TaxAmount>1</TaxAmount> > </TaxInformation> > <ItemFranchisees> > <item action="" franchiseeId="" franchiseeName=""/> > </ItemFranchisees> > <Description></Description> > <Color>MyDescription</Color> > <IntendedIndustry></IntendedIndustry> > <ItemImageUrl></ItemImageUrl> > </row_tag> > </root_tag> {code} > > Thanks! Giuseppe Ceravolo > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org