[
https://issues.apache.org/jira/browse/SPARK-51062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937955#comment-17937955
]
amoghsantarkar edited comment on SPARK-51062 at 3/24/25 5:11 PM:
-----------------------------------------------------------------
Hi [~pscheurig]/team, I have committed code in [PR|https://github.com/apache/spark/pull/50365] and added a unit test for this as well. Please take a look when you get a chance!
Thanks,
Amogh Antarkar
> assertSchemaEqual Does Not Compare Decimal Precision and Scale
> --------------------------------------------------------------
>
> Key: SPARK-51062
> URL: https://issues.apache.org/jira/browse/SPARK-51062
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0, 3.5.1, 3.5.2, 3.5.3, 3.5.4
> Reporter: pscheurig
> Priority: Major
> Labels: pull-request-available
>
> h1. Summary
> The {{assertSchemaEqual}} function in PySpark's testing utilities does not
> properly compare DecimalType fields, as it only checks the base type name
> (e.g. "decimal") without comparing precision and scale parameters. This
> significantly reduces the utility of the function for schemas containing
> decimal fields.
> h2. Version
> * Apache Spark Version: >=3.5.0
> * Component: PySpark Testing Utils
> * Function: {{pyspark.testing.assertSchemaEqual}}
> h2. Description
> When comparing two schemas containing DecimalType fields with different
> precision and scale parameters, {{assertSchemaEqual}} incorrectly reports
> them as equal because it only compares the base type name ("decimal") without
> considering the precision and scale parameters.
> h3. Current Behavior
> {code:python}
> from pyspark.sql.types import StructType, StructField, DecimalType
> from pyspark.testing import assertSchemaEqual
>
> s1 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 2), True),
>         StructField("price_80", DecimalType(8, 0), True),
>     ]
> )
> s2 = StructType(
>     [
>         StructField("price_102", DecimalType(10, 4), True),  # Different scale
>         StructField("price_80", DecimalType(10, 2), True),  # Different precision and scale
>     ]
> )
>
> # This passes when it should fail
> assertSchemaEqual(s1, s2)
> {code}
> h3. Expected Behavior
> The function should compare both precision and scale parameters of
> DecimalType fields and raise a PySparkAssertionError when they differ,
> similar to how it handles other type mismatches. The error message should
> indicate which fields have mismatched decimal parameters.
> h2. Impact
> This issue affects data quality validation and testing scenarios where
> precise decimal specifications are crucial, such as:
> * Financial data processing where decimal precision and scale are critical
> * ETL validation where source and target schemas must match exactly
> h2. Suggested Fix
> The {{compare_datatypes_ignore_nullable}} function in
> {{pyspark/testing/utils.py}} should be enhanced to compare precision and
> scale parameters when dealing with decimal types:
> {code:python}
> def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any):
>     if dt1.typeName() == dt2.typeName():
>         if dt1.typeName() == "decimal":
>             return dt1.precision == dt2.precision and dt1.scale == dt2.scale
>         elif dt1.typeName() == "array":
>             return compare_datatypes_ignore_nullable(dt1.elementType, dt2.elementType)
>         elif dt1.typeName() == "struct":
>             return compare_schemas_ignore_nullable(dt1, dt2)
>         else:
>             return True
>     else:
>         return False
> {code}
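> The core of the suggested fix can be exercised without a Spark installation. The sketch below is an assumption-laden illustration, not the actual PySpark code: it uses a minimal stand-in for {{DecimalType}} (only {{precision}}, {{scale}}, and {{typeName()}}) so the precision/scale comparison logic can be checked in isolation.

```python
# Minimal sketch of the proposed decimal comparison. The DecimalType class
# here is a stand-in for pyspark.sql.types.DecimalType, reduced to the three
# members the comparison needs; it is NOT the real PySpark class.
from dataclasses import dataclass


@dataclass
class DecimalType:
    precision: int = 10
    scale: int = 0

    def typeName(self) -> str:
        return "decimal"


def compare_datatypes_ignore_nullable(dt1, dt2) -> bool:
    if dt1.typeName() != dt2.typeName():
        return False
    if dt1.typeName() == "decimal":
        # Compare the type parameters, not just the base type name.
        return dt1.precision == dt2.precision and dt1.scale == dt2.scale
    return True


# Same precision/scale -> equal; different parameters -> unequal.
print(compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 2)))
print(compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 4)))
```

> With the stand-in, {{DecimalType(10, 2)}} vs {{DecimalType(10, 4)}} compares unequal, which is the behavior the bug report asks {{assertSchemaEqual}} to surface as a {{PySparkAssertionError}}.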
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]