[
https://issues.apache.org/jira/browse/SPARK-56486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pawel Matejko updated SPARK-56486:
----------------------------------
Summary: Add XSDToSchema.extendDecimalPrecision() to prevent silent
truncation from common XSD totalDigits/fractionDigits (was: Add
XSDToSchema.extendDecimalPrecision() to prevent silent truncation from common
XSDtotalDigits/fractionDigits misuse)
> Add XSDToSchema.extendDecimalPrecision() to prevent silent truncation from
> common XSD totalDigits/fractionDigits
> -----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-56486
> URL: https://issues.apache.org/jira/browse/SPARK-56486
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.1.0, 4.0.1, 3.5.8
> Environment: Databricks
> Reporter: Pawel Matejko
> Priority: Major
> Labels: data-loss
>
> h3. Problem
> Decimal values from XML files can be silently truncated when the XSD's decimal
> precision settings don't provide enough integer digit capacity.
>
> Users currently must write custom code (DecimalExtender) to extend precision
> after parsing the schema - this should be built into Spark.
> h4. The Issue
> ||Business Requirement||Common XSD Setup||Actual Integer Capacity||
> |"20 integer + 5 fractional digits"|`totalDigits=20, fractionDigits=5`|Only *15* integer digits (20-5)|
> When loading the value `"9999999999999999.12345"` (16 integer digits + 5
> fractional), the integer part needs one more digit than the 15 available:
> * "9999999999999999999.12345" -> "99999999999999999999.00000"
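
The capacity arithmetic in the table above can be checked with a few lines of standard-library Python. This is a minimal sketch rather than Spark code: `integer_capacity` and `integer_part_fits` are hypothetical helper names introduced for illustration, and the check only counts integer digits. It does not reproduce what Spark does with a value that does not fit.

```python
from decimal import Decimal


def integer_capacity(total_digits: int, fraction_digits: int) -> int:
    """Integer digits left after the fractional digits are reserved (precision - scale)."""
    return total_digits - fraction_digits


def integer_part_fits(value: str, total_digits: int, fraction_digits: int) -> bool:
    """True if the integer part of `value` fits in a decimal(total_digits, fraction_digits)."""
    integer_part = abs(int(Decimal(value)))  # int() truncates toward zero
    # A zero integer part needs no integer digits at all.
    needed = len(str(integer_part)) if integer_part else 0
    return needed <= integer_capacity(total_digits, fraction_digits)


value = "9999999999999999.12345"  # 16 integer digits + 5 fractional

print(integer_capacity(20, 5))          # 15
print(integer_part_fits(value, 20, 5))  # False: 16 digits needed, 15 available
print(integer_part_fits(value, 25, 5))  # True: precision extended to 20 + 5
```

Running a check like this against sample data before a load is one way to find out whether an XSD was written with the "20 integer digits" reading in mind.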
> h4. User Intent:
> {code:xml}
> <xs:restriction base="xs:decimal">
> <xs:totalDigits value="20"/> <!-- User intends: 20 INTEGER -->
> <xs:fractionDigits value="5"/> <!-- Plus 5 fractional -->
> </xs:restriction>
> {code}
> h3. +Note+
> Spark's current mapping DecimalType(totalDigits, fractionDigits) is
> correct per the W3C XML Schema spec. The issue is the widespread mismatch
> between user intent and the spec.
> h3. Why This Is a Data-Loss Bug
> According to Spark guidelines, issues that can cause silent data loss or
> affect data correctness are classified as very serious.
> In practice, when an XSD uses totalDigits and fractionDigits, the schema
> inferred by XSDToSchema.read() can silently truncate the integer part of
> decimal values (because DecimalType(precision, scale) allocates only
> precision - scale digits for the integer portion). No error or warning is
> raised at runtime, so the problem is often discovered only weeks later when
> business reports show discrepancies.
> As a result, users must *post-process* the returned StructType in every
> PySpark job:
> # Call XSDToSchema.read() to get the initial schema
> # Adjust every DecimalType (e.g. using a custom DecimalExtender)
> # Proceed with the rest of the pipeline
> This extra adjustment step is repetitive and error-prone. A built-in utility
> would remove the need for this custom code and make the workflow cleaner and
> safer.
> h3. Proposed Solution: Add Utility Method to XSDToSchema
> Include the proven DecimalExtender logic as an official Spark utility:
> {code:scala}
> object XSDToSchema {
>   def read(path: Path): StructType = { ... }
>
>   /**
>    * Extends decimal precision in schema to prevent data truncation.
>    * For each DecimalType(p, s), increases precision to min(maxPrecision, p + s).
>    */
>   def extendDecimalPrecision(schema: StructType, maxPrecision: Int = 38): StructType = { ... }
> }
> {code}
> h4. Reference Implementation (Python Workaround)
> {code:python}
> from pyspark.sql.types import DecimalType
>
> class DecimalExtender:
>     def _generate_new_decimal_data_type(self, decimal_type):
>         # Adds `scale` digits of integer capacity, capped at Spark's maximum precision of 38
>         new_precision = min(38, decimal_type.precision + decimal_type.scale)
>         return DecimalType(new_precision, decimal_type.scale)
> {code}
> Applied across a whole schema, this handles:
> * Top-level decimals
> * Nested struct decimals (recursive)
> * Array element decimals
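
The three cases listed above can be sketched without a Spark session by walking the schema's JSON form, the dictionary that PySpark's `StructType.jsonValue()` returns. This is an illustrative sketch under stated assumptions, not the reporter's DecimalExtender and not a proposed Spark API: `extend_decimals` is a hypothetical name, only struct and array containers are traversed, and map types and decimals with a negative scale are left untouched.

```python
import re

# Matches Spark's JSON spelling of a decimal type, e.g. "decimal(20,5)".
# Negative scales are deliberately not matched, so such types pass through unchanged.
_DECIMAL = re.compile(r"decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def extend_decimals(node, max_precision=38):
    """Return a copy of a schema JSON node with each decimal(p, s)
    widened to decimal(min(max_precision, p + s), s)."""
    if isinstance(node, str):
        match = _DECIMAL.fullmatch(node)
        if match is None:
            return node  # non-decimal primitive such as "string"
        precision, scale = int(match.group(1)), int(match.group(2))
        return f"decimal({min(max_precision, precision + scale)},{scale})"
    if isinstance(node, dict):
        kind = node.get("type")
        if kind == "struct":  # top-level and nested structs
            fields = [
                {**field, "type": extend_decimals(field["type"], max_precision)}
                for field in node["fields"]
            ]
            return {**node, "fields": fields}
        if kind == "array":  # array element types
            element = extend_decimals(node["elementType"], max_precision)
            return {**node, "elementType": element}
    return node  # anything else (e.g. map types) is returned as-is


schema_json = {
    "type": "struct",
    "fields": [
        {"name": "amount", "type": "decimal(20,5)", "nullable": True, "metadata": {}},
        {
            "name": "lines",
            "type": {"type": "array", "elementType": "decimal(36,4)", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
}

extended = extend_decimals(schema_json)
print(extended["fields"][0]["type"])                 # decimal(25,5)
print(extended["fields"][1]["type"]["elementType"])  # decimal(38,4), capped at 38
```

In a PySpark job the result could be turned back into a schema with `StructType.fromJson(extended)`. Note the second field: when p + s exceeds the maximum precision, the cap means the integer capacity is still smaller than the "p integer digits" reading would require.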
> h3. Impact
> ||Before||After||
> |Users write custom workaround code|Official utility method available|
> |Silent data loss common|Easy prevention built-in|
> |No standard solution|Tested, documented approach|
> h3. Docs Text (Release Notes)
> {code:java}
> Added extendDecimalPrecision() utility method to XSDToSchema for preventing
> silent data truncation when XSD totalDigits settings cause insufficient
> integer digit capacity.
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)