Yicong-Huang opened a new pull request, #55522:
URL: https://github.com/apache/spark/pull/55522
### What changes were proposed in this pull request?
When `assertDataFrameEqual` is called with `checkRowOrder=False` (the
default) and the two inputs have **different row counts**, a single
missing/extra row cascades into a mismatch on every subsequent row. This
inflates both the diff count and the reported mismatch percentage.
Root cause: after sorting both lists by `str(row)`, `assert_rows_equal`
pairs rows positionally with `zip_longest`. When one side is shorter, every
row after the gap is paired with the wrong partner and reported as a diff.
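
A minimal sketch of the cascade, using plain strings in place of `Row`
objects (this is an illustration, not the code in `assert_rows_equal`):

```python
from itertools import zip_longest

# Plain strings stand in for Row objects; both lists are already sorted.
actual = ["1", "2", "3", "4", "5"]
expected = ["1", "2", "4", "5"]  # "3" is missing

# Positional pairing: after the gap, every later pair is misaligned.
pairs = list(zip_longest(actual, expected))
mismatches = [(a, e) for a, e in pairs if a != e]

print(mismatches)  # [('3', '4'), ('4', '5'), ('5', None)]
print(f"{100 * len(mismatches) / len(pairs):.5f} %")  # 60.00000 %
```
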
Fix: switch to a merge-walk over the sorted lists **only when their lengths
differ**. Equal lengths keep `zip_longest` so that field-level diffs continue
to be reported as paired rows (preserving existing docstrings and tests that
rely on "B vs X" style pairing).
The merge-walk uses `compare_rows` for equality (honoring `rtol`/`atol`) and
`str(r)` for ordering decisions (consistent with how the lists were sorted).
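
The merge-walk can be sketched roughly as follows. This is a simplified
illustration, not the patch itself: `merge_walk_diff` and its `rows_equal`
parameter are hypothetical names, and `rows_equal` merely stands in for
`compare_rows` (plain `==` by default, so no tolerance handling).

```python
from typing import Any, Callable, List, Optional, Tuple

Pair = Tuple[Optional[Any], Optional[Any]]


def merge_walk_diff(
    actual: List[Any],
    expected: List[Any],
    rows_equal: Callable[[Any, Any], bool] = lambda a, b: a == b,
) -> List[Pair]:
    """Return (actual_row, expected_row) diffs; None marks the absent side.

    Both inputs are assumed to be sorted by str(row) already.
    """
    diffs: List[Pair] = []
    i = j = 0
    while i < len(actual) and j < len(expected):
        a, e = actual[i], expected[j]
        if rows_equal(a, e):
            # Rows match: advance both sides.
            i += 1
            j += 1
        elif str(a) < str(e):
            # `a` sorts first, so it has no partner in `expected`: extra row.
            diffs.append((a, None))
            i += 1
        else:
            # `e` sorts first (or ties), so it is absent from `actual`.
            diffs.append((None, e))
            j += 1
    # Whatever is left on either side is unmatched.
    diffs.extend((a, None) for a in actual[i:])
    diffs.extend((None, e) for e in expected[j:])
    return diffs


actual = sorted(["1", "2", "3", "4", "5"], key=str)
expected = sorted(["1", "2", "4", "5"], key=str)
print(merge_walk_diff(actual, expected))  # [('3', None)]
```

Using the same `str(row)` key for ordering decisions as for the initial sort
keeps the walk consistent with the list order, while equality stays a
separate, pluggable check.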
### Why are the changes needed?
Reproducer:
```python
from pyspark.testing.utils import assertDataFrameEqual
from pyspark.sql import Row
actual = [Row(id='1'), Row(id='2'), Row(id='3'), Row(id='4'), Row(id='5')]
expected = [Row(id='1'), Row(id='2'), Row(id='4'), Row(id='5')]
assertDataFrameEqual(actual, expected)
```
Before this fix: `Results do not match: ( 60.00000 % )` (3 of 5 rows
reported as different).
After this fix: `Results do not match: ( 20.00000 % )` (only `Row(id='3')`
is reported).
A larger example from the JIRA: rows1 has 5 rows and rows2 contains only 3
of them, with the other 2 missing from the middle. The old code reports 80%
mismatch; the new code reports 40% (2 of 5 rows), matching what a user would
expect from a sorted-set comparison.
### Does this PR introduce _any_ user-facing change?
Yes, but only in the error message / reported data when
`assertDataFrameEqual(checkRowOrder=False)` is given inputs whose row counts
differ:
- The reported mismatch percentage is no longer inflated by positional
shifting.
- `includeDiffRows=True` now returns `(row, None)` tuples for extras and
`(None, row)` tuples for missing rows, rather than shifted `(row, row)` pairs.
Behavior when row counts are equal (including all existing docstring
examples) is unchanged.
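
As an illustration of how a caller might consume the new tuple shapes, here
is a small sketch. `split_diff_rows` is a hypothetical helper, not part of
PySpark, and the sample list is hand-written rather than captured from a
real run; it assumes the `(actual, expected)` ordering described above.

```python
from typing import Any, Dict, List, Optional, Tuple

Pair = Tuple[Optional[Any], Optional[Any]]


def split_diff_rows(diff_rows: List[Pair]) -> Dict[str, list]:
    """Group diff tuples into extra, missing, and changed rows."""
    buckets: Dict[str, list] = {"extra": [], "missing": [], "changed": []}
    for actual_row, expected_row in diff_rows:
        if expected_row is None:
            # (row, None): present in actual only.
            buckets["extra"].append(actual_row)
        elif actual_row is None:
            # (None, row): present in expected only.
            buckets["missing"].append(expected_row)
        else:
            # (row, row): paired rows whose fields differ.
            buckets["changed"].append((actual_row, expected_row))
    return buckets


# Hand-written sample in the shape described above.
sample = [("3", None), (None, "7")]
print(split_diff_rows(sample))
# {'extra': ['3'], 'missing': ['7'], 'changed': []}
```
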
### How was this patch tested?
- Added five targeted tests in `python/pyspark/sql/tests/test_utils.py`:
  - `test_different_row_count_middle_missing_no_cascading_diff`
  - `test_different_row_count_multiple_missing`
  - `test_different_row_count_includeDiffRows`
  - `test_different_row_count_mixed_extra_and_missing`
  - `test_different_row_count_extras_at_end`
- Ran the full `test_utils.py` suite: 81 passed.
- Verified the docstring examples (`maxErrors`, `showOnlyDiff`,
`includeDiffRows`) still produce their documented output.
### Was this patch authored or co-authored using generative AI tooling?
No.