[ 
https://issues.apache.org/jira/browse/SPARK-56486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawel Matejko updated SPARK-56486:
----------------------------------
    Summary: Add XSDToSchema.extendDecimalPrecision() to prevent silent 
truncation from common XSD totalDigits/fractionDigits   (was: Add 
XSDToSchema.extendDecimalPrecision() to prevent silent truncation from common 
XSDtotalDigits/fractionDigits misuse)

> Add XSDToSchema.extendDecimalPrecision() to prevent silent truncation from 
> common XSD totalDigits/fractionDigits 
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56486
>                 URL: https://issues.apache.org/jira/browse/SPARK-56486
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.0, 4.0.1, 3.5.8
>         Environment: Databricks
>            Reporter: Pawel Matejko
>            Priority: Major
>              Labels: data-loss
>
> h3. Problem
> Decimal values from XML files can be silently truncated when the XSD's
> decimal precision settings don't provide enough integer-digit capacity.
>  
> Users currently must write custom code (DecimalExtender) to extend precision
> after parsing the schema. This should be built into Spark.
> h4. The Issue
> ||Business Requirement||Common XSD Setup||Actual Integer Capacity||
> |"20 integer + 5 fractional digits"|`totalDigits=20, fractionDigits=5`|Only *15* integer digits (20 - 5)|
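> When loading the value `"9999999999999999.12345"` (16 integer digits + 5
> fractional digits), the integer part exceeds the 15-digit capacity and the
> value is silently altered. Reported example:
>  *  "9999999999999999999.12345" -> "99999999999999999999.00000"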
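
A minimal sketch of the capacity arithmetic above, in plain Python with the standard `decimal` module (no Spark required). The helper names `integer_capacity` and `fits` are hypothetical. The check only models how many integer digits a DecimalType(precision, scale) can hold; it does not reproduce what Spark does with a value that does not fit.

```python
from decimal import Decimal


def integer_capacity(total_digits: int, fraction_digits: int) -> int:
    """Integer digits available in DecimalType(total_digits, fraction_digits)."""
    return total_digits - fraction_digits


def fits(value: str, total_digits: int, fraction_digits: int) -> bool:
    """Return True if the integer part of `value` fits in the declared type."""
    parts = Decimal(value).as_tuple()
    # Digits left of the decimal point; the exponent is negative for fractions.
    integer_digits = max(len(parts.digits) + parts.exponent, 0)
    return integer_digits <= integer_capacity(total_digits, fraction_digits)


value = "9" * 16 + ".12345"      # 16 integer digits + 5 fractional digits
print(integer_capacity(20, 5))   # 15
print(fits(value, 20, 5))        # False: 16 integer digits > 15
print(fits(value, 25, 5))        # True: capacity is 25 - 5 = 20
```

The last call shows why widening the precision to totalDigits + fractionDigits matches the "20 integer + 5 fractional" intent.
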
> h4. User Intent:
> {code:xml}
> <xs:restriction base="xs:decimal">
>   <xs:totalDigits value="20"/>   <!-- User intends: 20 INTEGER -->
>   <xs:fractionDigits value="5"/> <!-- Plus 5 fractional -->
> </xs:restriction>
> {code}
> h3. Note
> Spark's current mapping DecimalType(totalDigits, fractionDigits) is
> correct per the W3C XML Schema spec. The issue is the widespread mismatch
> between user intent and the spec.
> h3. Why This Is a Data-Loss Bug
> According to Spark guidelines, issues that can cause silent data loss or 
> affect data correctness are classified as very serious.
> In practice, when an XSD uses totalDigits and fractionDigits, the schema 
> inferred by XSDToSchema.read() can silently truncate the integer part of 
> decimal values (because DecimalType(precision, scale) allocates only 
> precision - scale digits for the integer portion). No error or warning is 
> raised at runtime, so the problem is often discovered only weeks later when 
> business reports show discrepancies.
> As a result, users must *post-process* the returned StructType in every 
> PySpark job:
>  # Call XSDToSchema.read() to get the initial schema.
>  # Adjust every DecimalType (e.g. using a custom DecimalExtender).
>  # Proceed with the rest of the pipeline.
> This extra adjustment step is repetitive and error-prone. A built-in utility 
> would remove the need for this custom code and make the workflow cleaner and 
> safer.
> h3. Proposed Solution: Add Utility Method to XSDToSchema
> Include the proven DecimalExtender logic as an official Spark utility:
> {code:scala}
> object XSDToSchema {
>   def read(path: Path): StructType = { ... }
>   
>   /**
>    * Extends decimal precision in schema to prevent data truncation.
>    * For each DecimalType(p, s), increases precision to min(maxPrecision, p + s).
>    */
>   def extendDecimalPrecision(schema: StructType, maxPrecision: Int = 38): StructType = { ... }
> }
> {code}
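
To make the proposed rule concrete, here is a small worked sketch in plain Python. The function name `extended_precision` is hypothetical and only mirrors the formula min(maxPrecision, p + s) from the doc comment above; it is not part of any Spark API.

```python
def extended_precision(precision: int, scale: int, max_precision: int = 38) -> int:
    """New precision under the proposed rule: min(max_precision, p + s)."""
    return min(max_precision, precision + scale)


for p, s in [(20, 5), (30, 10), (38, 10)]:
    new_p = extended_precision(p, s)
    print(
        f"DecimalType({p}, {s}) -> DecimalType({new_p}, {s}); "
        f"integer digits {p - s} -> {new_p - s}"
    )
# DecimalType(20, 5) -> DecimalType(25, 5); integer digits 15 -> 20
# DecimalType(30, 10) -> DecimalType(38, 10); integer digits 20 -> 28
# DecimalType(38, 10) -> DecimalType(38, 10); integer digits 28 -> 28
```

When p + s exceeds the cap, the integer capacity ends up below p, so under this rule very wide declarations are only partially extended.
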
> h4. Reference Implementation (Python Workaround)
> {code:python}
> from pyspark.sql.types import DecimalType
>
> class DecimalExtender:
>     def _generate_new_decimal_data_type(self, decimal_type):
>         # Add `scale` digits to the precision so the integer part can hold up
>         # to the original `precision` digits, capped at Spark's limit of 38.
>         new_precision = min(38, decimal_type.precision + decimal_type.scale)
>         return DecimalType(new_precision, decimal_type.scale)
> {code}
> This handles:
>  * Top-level decimals
>  * Nested struct decimals (recursive)
>  * Array element decimals
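
The three cases listed above could be covered by one recursive walk. The sketch below is an assumption about how that might look, not the reporter's DecimalExtender. To stay runnable without Spark it uses stand-in dataclasses that only mimic the shape of the types in `pyspark.sql.types`. The real classes use attribute names such as `elementType` and `dataType`, and the function name `extend` is hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple


# Stand-ins for pyspark.sql.types (illustration only, not the real classes).
@dataclass(frozen=True)
class DecimalType:
    precision: int
    scale: int


@dataclass(frozen=True)
class StringType:
    pass


@dataclass(frozen=True)
class ArrayType:
    element_type: object


@dataclass(frozen=True)
class StructField:
    name: str
    data_type: object


@dataclass(frozen=True)
class StructType:
    fields: Tuple[StructField, ...]


def extend(data_type, max_precision: int = 38):
    """Return a copy of data_type with every decimal widened to min(max, p + s)."""
    if isinstance(data_type, DecimalType):
        new_precision = min(max_precision, data_type.precision + data_type.scale)
        return DecimalType(new_precision, data_type.scale)
    if isinstance(data_type, ArrayType):
        # Array element decimals
        return ArrayType(extend(data_type.element_type, max_precision))
    if isinstance(data_type, StructType):
        # Top-level and nested struct decimals (recursive)
        return StructType(tuple(
            replace(f, data_type=extend(f.data_type, max_precision))
            for f in data_type.fields
        ))
    return data_type  # all other types are left unchanged


schema = StructType((
    StructField("amount", DecimalType(20, 5)),
    StructField("name", StringType()),
    StructField("lines", ArrayType(StructType((
        StructField("price", DecimalType(30, 10)),
    )))),
))
extended = extend(schema)
print(extended.fields[0].data_type)  # DecimalType(precision=25, scale=5)
```

Returning new objects instead of mutating the input keeps the schema from the initial read available for comparison.
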
> h3. Impact
> ||Before||After||
> |Users write custom workaround code|Official utility method available|
> |Silent data loss common|Easy prevention built-in|
> |No standard solution|Tested, documented approach|
> h3. Docs Text (Release Notes)
> {noformat}
> Added extendDecimalPrecision() utility method to XSDToSchema for preventing
> silent data truncation when XSD totalDigits settings cause insufficient
> integer digit capacity.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
