Pawel Matejko created SPARK-56486:
-------------------------------------
Summary: Add XSDToSchema.extendDecimalPrecision() to prevent
silent truncation from common XSD totalDigits/fractionDigits misuse
Key: SPARK-56486
URL: https://issues.apache.org/jira/browse/SPARK-56486
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.5.8, 4.0.1, 4.1.0
Environment: Databricks
Reporter: Pawel Matejko
h3. Problem
Decimal values read from XML files can be silently truncated when the XSD's
totalDigits and fractionDigits settings leave too few integer digits.
Users currently must write custom code (DecimalExtender) to extend the precision
after parsing the schema; this should be built into Spark.
h4. The Issue
||Business Requirement||Common XSD Setup||Actual Integer Capacity||
|"20 integer + 5 fractional digits"|`totalDigits=20, fractionDigits=5`|Only *15* integer digits (20 - 5)|
When loading the value `"9999999999999999.12345"` (16 integer digits + 5 fractional digits):
* "9999999999999999999.12345" -> "99999999999999999999.00000"
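The capacity arithmetic in the table can be checked without Spark. The following is a minimal sketch using only Python's standard `decimal` module. The helper names `integer_capacity`, `integer_digits`, and `fits` are hypothetical, and the code only models the digit budget of DecimalType(precision, scale); it does not reproduce what Spark's XML reader does with a value that does not fit.

```python
from decimal import Decimal


def integer_capacity(precision: int, scale: int) -> int:
    """Digits left for the integer part of DecimalType(precision, scale)."""
    return precision - scale


def integer_digits(value: str) -> int:
    """Count the integer digits of a finite decimal literal (sign ignored)."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    # len(digits) + exponent is the number of digits left of the decimal point;
    # it is zero or negative for values below 1.
    return max(len(digits) + exponent, 0)


def fits(value: str, precision: int, scale: int) -> bool:
    """True if the integer part of `value` fits the declared digit budget."""
    return integer_digits(value) <= integer_capacity(precision, scale)


value = "9999999999999999.12345"      # 16 integer digits + 5 fractional
print(integer_capacity(20, 5))        # 15
print(fits(value, 20, 5))             # False: 16 digits exceed the 15 available
print(fits(value, 25, 5))             # True: 20 integer digits are available
```

With totalDigits=20 and fractionDigits=5 the value needs one more integer digit than the type provides, which is the mismatch this issue describes.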
h4. User Intent:
{code:xml}
<xs:restriction base="xs:decimal">
<xs:totalDigits value="20"/> <!-- User intends: 20 INTEGER -->
<xs:fractionDigits value="5"/> <!-- Plus 5 fractional -->
</xs:restriction>
{code}
h3. Note
Spark's current mapping DecimalType(totalDigits, fractionDigits) is
correct per the W3C XML Schema spec. The issue is the widespread mismatch
between user intent and the spec.
h3. Why This Is a Data-Loss Bug
According to Spark guidelines, issues that can cause silent data loss or affect
data correctness are classified as very serious.
In practice, when an XSD uses totalDigits and fractionDigits, reading data with
the schema returned by XSDToSchema.read() can silently truncate the integer part
of decimal values, because DecimalType(precision, scale) allocates only
precision - scale digits to the integer portion. No error or warning is raised
at runtime, so the problem is often discovered only weeks later, when business
reports show discrepancies.
As a result, users must *post-process* the returned StructType in every PySpark
job:
# Call XSDToSchema.read() to get the initial schema.
# Adjust every DecimalType (e.g. using a custom DecimalExtender).
# Proceed with the rest of the pipeline.
This extra adjustment step is repetitive and error-prone. A built-in utility
would remove the need for this custom code and make the workflow cleaner and
safer.
h3. Proposed Solution: Add Utility Method to XSDToSchema
Include the proven DecimalExtender logic as an official Spark utility:
{code:scala}
object XSDToSchema {
  def read(path: Path): StructType = { ... }

  /**
   * Extends decimal precision in schema to prevent data truncation.
   * For each DecimalType(p, s), increases precision to min(maxPrecision, p + s).
   */
  def extendDecimalPrecision(schema: StructType, maxPrecision: Int = 38): StructType = { ... }
}
{code}
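To make the min(maxPrecision, p + s) rule concrete, here is a small worked sketch in plain Python. The function name `extended_precision` is hypothetical and simply restates the formula from the doc comment above; it is not part of Spark.

```python
def extended_precision(precision: int, scale: int, max_precision: int = 38) -> int:
    """New precision for DecimalType(precision, scale) under the proposed rule."""
    return min(max_precision, precision + scale)


for p, s in [(20, 5), (30, 10)]:
    new_p = extended_precision(p, s)
    print(f"DecimalType({p}, {s}) -> DecimalType({new_p}, {s}), "
          f"integer digits {p - s} -> {new_p - s}")
# DecimalType(20, 5) -> DecimalType(25, 5), integer digits 15 -> 20
# DecimalType(30, 10) -> DecimalType(38, 10), integer digits 20 -> 28
```

When p + s stays within the cap, the integer part gains exactly s digits and can hold the original totalDigits. When the cap of 38 applies, as in the second row, the integer capacity still grows but stops short of the p digits the user intended.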
h4. Reference Implementation (Python Workaround)
{code:python}
from pyspark.sql.types import DecimalType

class DecimalExtender:
    def _generate_new_decimal_data_type(self, decimal_type):
        # Add `scale` extra digits of precision so the integer part can hold
        # the original `precision` digits, capped at Spark's maximum of 38.
        new_precision = min(38, decimal_type.precision + decimal_type.scale)
        return DecimalType(new_precision, decimal_type.scale)
{code}
This handles:
* Top-level decimals
* Nested struct decimals (recursive)
* Array element decimals
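The recursive traversal behind these three cases can be sketched without a Spark installation by working on the JSON form of a schema. This is an illustrative sketch, not the reporter's DecimalExtender: the function `extend_decimals` and the sample schema are hypothetical, and it assumes the schema uses Spark's JSON layout with decimal types written as `decimal(p,s)`. It also walks map key and value types, which goes beyond the list above.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")
# Only these keys can hold a data type (or a list of fields) in the schema JSON;
# field names and metadata are deliberately left untouched.
_WALKED_KEYS = {"type", "elementType", "keyType", "valueType", "fields"}


def extend_decimals(node, max_precision=38):
    """Return a schema JSON tree with widened decimals; the input is not modified."""
    if isinstance(node, str):
        match = _DECIMAL.fullmatch(node)
        if match is None:
            return node  # some other primitive type, e.g. "string"
        p, s = int(match.group(1)), int(match.group(2))
        return f"decimal({min(max_precision, p + s)},{s})"
    if isinstance(node, list):
        return [extend_decimals(item, max_precision) for item in node]
    if isinstance(node, dict):
        return {
            key: extend_decimals(value, max_precision) if key in _WALKED_KEYS else value
            for key, value in node.items()
        }
    return node


def field(name, data_type):
    """Build one struct field entry in Spark's schema JSON layout."""
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}


schema = {"type": "struct", "fields": [
    field("amount", "decimal(20,5)"),                                   # top level
    field("order", {"type": "struct",
                    "fields": [field("tax", "decimal(10,2)")]}),        # nested struct
    field("prices", {"type": "array", "elementType": "decimal(30,10)",
                     "containsNull": True}),                            # array element
]}

extended = extend_decimals(schema)
print(extended["fields"][0]["type"])                       # decimal(25,5)
print(extended["fields"][1]["type"]["fields"][0]["type"])  # decimal(12,2)
print(extended["fields"][2]["type"]["elementType"])        # decimal(38,10)
```

In PySpark, such a tree can be obtained with `schema.jsonValue()` and converted back with `StructType.fromJson(...)`, so the same walk could stand in for a custom extender until a built-in utility exists.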
h3. Impact
||Before||After||
|Users write custom workaround code|Official utility method available|
|Silent data loss common|Easy prevention built-in|
|No standard solution|Tested, documented approach|
h3. Docs Text (Release Notes)
{noformat}
Added extendDecimalPrecision() utility method to XSDToSchema for preventing
silent data truncation when XSD totalDigits settings cause insufficient integer
digit capacity.
{noformat}