Pawel Matejko created SPARK-56486:
-------------------------------------

             Summary: Add XSDToSchema.extendDecimalPrecision() to prevent 
silent truncation from common XSD totalDigits/fractionDigits misuse
                 Key: SPARK-56486
                 URL: https://issues.apache.org/jira/browse/SPARK-56486
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.8, 4.0.1, 4.1.0
         Environment: Databricks
            Reporter: Pawel Matejko


h3. Problem

Decimal values read from XML files can be silently truncated when the XSD's 
totalDigits/fractionDigits settings leave too few digits for the integer part.

Users currently have to write custom code (a DecimalExtender) to extend 
precision after parsing the schema. This should be built into Spark.
h4. The Issue
||Business Requirement||Common XSD Setup||Actual Integer Capacity||
|"20 integer + 5 fractional digits"|`totalDigits=20, fractionDigits=5`|Only *15* integer digits (20 - 5)|

When loading value `"9999999999999999.12345"` (16 integer digits + 5 
fractional):
 *  "9999999999999999999.12345" -> "99999999999999999999.00000" 
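
The capacity arithmetic in the table can be checked with a short, Spark-free sketch. The helper names `integer_capacity` and `fits_integer_part` are illustrative and not part of Spark. The check covers only the integer part and makes no claim about what Spark does with a value that does not fit.

```python
from decimal import Decimal


def integer_capacity(total_digits: int, fraction_digits: int) -> int:
    """Integer digits left once fraction_digits are reserved for the scale."""
    return total_digits - fraction_digits


def fits_integer_part(value: str, precision: int, scale: int) -> bool:
    """True if the integer part of value fits in a decimal(precision, scale)."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    # Digits left of the decimal point; values below 1 count as zero digits.
    integer_digits = max(len(digits) + exponent, 0)
    return integer_digits <= integer_capacity(precision, scale)


value = "9999999999999999.12345"            # 16 integer digits + 5 fractional
print(integer_capacity(20, 5))              # 15
print(fits_integer_part(value, 20, 5))      # False: 16 > 15
print(fits_integer_part(value, 25, 5))      # True: 16 <= 20
```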

h4. User Intent:
{code:xml}
<xs:restriction base="xs:decimal">
  <xs:totalDigits value="20"/>   <!-- User intends: 20 INTEGER -->
  <xs:fractionDigits value="5"/> <!-- Plus 5 fractional -->
</xs:restriction>
{code}
h3. Note

Spark's current mapping, DecimalType(totalDigits, fractionDigits), is 
correct per the W3C XML Schema spec. The issue is the widespread mismatch 
between user intent and the spec.
h3. Why This Is a Data-Loss Bug

According to Spark guidelines, issues that can cause silent data loss or affect 
data correctness are classified as very serious.

In practice, when an XSD uses totalDigits and fractionDigits, reading data 
with the schema returned by XSDToSchema.read() can silently truncate the 
integer part of decimal values, because DecimalType(precision, scale) 
allocates only precision - scale digits for the integer portion. No error or 
warning is raised at runtime, so the problem is often discovered only weeks 
later, when business reports show discrepancies.

As a result, users must *post-process* the returned StructType in every PySpark 
job:
 # Call XSDToSchema.read() to get the initial schema.
 # Adjust every DecimalType (e.g. using a custom DecimalExtender).
 # Proceed with the rest of the pipeline.

This extra adjustment step is repetitive and error-prone. A built-in utility 
would remove the need for this custom code and make the workflow cleaner and 
safer.
h3. Proposed Solution: Add Utility Method to XSDToSchema

Include the proven DecimalExtender logic as an official Spark utility:
{code:scala}
object XSDToSchema {
  def read(path: Path): StructType = { ... }
  
  /**
   * Extends decimal precision in schema to prevent data truncation.
   * For each DecimalType(p, s), increases precision to
   * min(maxPrecision, p + s).
   */
  def extendDecimalPrecision(schema: StructType, maxPrecision: Int = 38): StructType = { ... }
}
{code}
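
To make the proposed rule concrete, here is a small worked sketch of `min(maxPrecision, p + s)`. It assumes the scale stays unchanged, as in the Python workaround in this issue, and the function names are hypothetical. It also shows a limitation worth keeping in mind: once the cap of 38 applies, the integer capacity is still smaller than the original precision.

```python
def extended_precision(precision: int, scale: int, max_precision: int = 38) -> int:
    """Precision after applying the proposed rule min(max_precision, p + s)."""
    return min(max_precision, precision + scale)


def integer_capacity_after(precision: int, scale: int, max_precision: int = 38) -> int:
    """Integer digits available after extension, with the scale unchanged."""
    return extended_precision(precision, scale, max_precision) - scale


# decimal(20,5): precision becomes 25, so 20 integer digits fit.
print(extended_precision(20, 5), integer_capacity_after(20, 5))      # 25 20

# decimal(30,10): 30 + 10 = 40 is capped at 38, leaving 28 integer digits,
# which is still short of 30.
print(extended_precision(30, 10), integer_capacity_after(30, 10))    # 38 28
```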
h4. Reference Implementation (Python Workaround)
{code:python}
from pyspark.sql.types import DecimalType


class DecimalExtender:
    def _generate_new_decimal_data_type(self, decimal_type):
        # Adds `scale` digits to the precision, so the integer capacity grows from
        # (precision - scale) to precision unless Spark's cap of 38 applies.
        new_precision = min(38, decimal_type.precision + decimal_type.scale)
        return DecimalType(new_precision, decimal_type.scale)
{code}
This handles:
 * Top-level decimals
 * Nested struct decimals (recursive)
 * Array element decimals
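
The traversal over these three cases can be sketched without a Spark installation by working on the JSON form of a schema. In PySpark that form is what `StructType.jsonValue()` returns and `StructType.fromJson()` accepts. This is an illustrative sketch, not the reporter's DecimalExtender: the function name `extend_decimals` is hypothetical, and map types are deliberately left unchanged here.

```python
import re

_DECIMAL = re.compile(r"^decimal\((\d+),\s*(\d+)\)$")


def extend_decimals(data_type, max_precision=38):
    """Return a copy of a JSON-form Spark data type with widened decimals."""
    if isinstance(data_type, str):
        match = _DECIMAL.match(data_type)
        if match is None:
            return data_type  # non-decimal atomic type, e.g. "string"
        precision, scale = int(match.group(1)), int(match.group(2))
        return f"decimal({min(max_precision, precision + scale)},{scale})"
    kind = data_type.get("type")
    if kind == "struct":  # recurse into every field of a (nested) struct
        fields = [
            {**field, "type": extend_decimals(field["type"], max_precision)}
            for field in data_type["fields"]
        ]
        return {**data_type, "fields": fields}
    if kind == "array":  # recurse into the element type
        element = extend_decimals(data_type["elementType"], max_precision)
        return {**data_type, "elementType": element}
    return data_type  # other complex types are not handled in this sketch


schema = {
    "type": "struct",
    "fields": [
        {"name": "amount", "type": "decimal(20,5)", "nullable": True, "metadata": {}},
        {"name": "detail", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [{"name": "fee", "type": "decimal(30,10)",
                        "nullable": True, "metadata": {}}],
        }},
        {"name": "rates", "nullable": True, "metadata": {}, "type": {
            "type": "array", "elementType": "decimal(10,2)", "containsNull": True,
        }},
    ],
}
widened = extend_decimals(schema)
```

With PySpark available, the round trip would be `StructType.fromJson(extend_decimals(schema.jsonValue()))`, which leaves the original schema object untouched.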

h3. Impact
||Before||After||
|Users write custom workaround code|Official utility method available|
|Silent data loss common|Easy prevention built-in|
|No standard solution|Tested, documented approach|
h3. Docs Text (Release Notes)
{code:java}
Added extendDecimalPrecision() utility method to XSDToSchema for preventing 
silent data truncation when XSD totalDigits settings cause insufficient integer 
digit capacity.
{code}



