Yicong-Huang opened a new pull request, #55522:
URL: https://github.com/apache/spark/pull/55522

   ### What changes were proposed in this pull request?
   
   When `assertDataFrameEqual` is called with `checkRowOrder=False` (the 
default) and the two inputs have **different row counts**, a single 
missing/extra row cascades into a mismatch on every subsequent row. This 
inflates both the diff count and the reported mismatch percentage.
   
   Root cause: after sorting both lists by `str(row)`, `assert_rows_equal` 
pairs rows positionally with `zip_longest`. When one side is shorter, every 
row after the gap is paired with the wrong partner.
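
   The cascade can be reproduced without Spark. The sketch below is not the 
code in `assert_rows_equal`; it uses plain strings in place of `Row` objects, 
a hypothetical helper `positional_diffs`, and assumes the mismatch percentage 
is the number of differing pairs divided by the number of pairs.

```python
from itertools import zip_longest

# Plain strings stand in for rows (assumption: real rows are sorted by str(row)).
actual = ["1", "2", "3", "4", "5"]
expected = ["1", "2", "4", "5"]


def positional_diffs(actual, expected):
    """Hypothetical helper: pair sorted rows by position and keep the unequal pairs."""
    pairs = zip_longest(sorted(actual, key=str), sorted(expected, key=str))
    return [(a, e) for a, e in pairs if a != e]


diffs = positional_diffs(actual, expected)
# Every pair after the gap is misaligned:
# [('3', '4'), ('4', '5'), ('5', None)]
print(diffs)

# Assumed denominator: the number of pairs, i.e. the longer list's length.
percent = 100.0 * len(diffs) / max(len(actual), len(expected))
print(percent)
```

   Only one row is actually missing, yet three pairs are flagged.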
   
   Fix: switch to a merge-walk over the sorted lists **only when their lengths 
differ**. Equal lengths keep `zip_longest` so that field-level diffs continue 
to be reported as paired rows (preserving existing docstrings and tests that 
rely on "B vs X" style pairing).
   
   The merge-walk uses `compare_rows` for equality (honoring `rtol`/`atol`) and 
`str(r)` for ordering decisions (consistent with how the lists were sorted).
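
   A minimal sketch of such a merge-walk follows. It is an illustration 
under stated assumptions, not the patch itself: `rows_match` is a hypothetical 
stand-in for `compare_rows` (here built on `math.isclose`, which may not use 
the same tolerance formula), `merge_walk` is a hypothetical name, and rows are 
modeled as plain tuples.

```python
import math


def rows_match(a, b, rtol=1e-5, atol=1e-8):
    """Hypothetical stand-in for compare_rows: tolerant on floats, exact otherwise."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rtol, abs_tol=atol):
                return False
        elif x != y:
            return False
    return True


def merge_walk(actual, expected, match=rows_match):
    """Walk two str-sorted lists in step and return only the unmatched rows.

    Extras come back as (row, None), missing rows as (None, row).
    """
    actual = sorted(actual, key=str)
    expected = sorted(expected, key=str)
    i = j = 0
    diffs = []
    while i < len(actual) and j < len(expected):
        a, e = actual[i], expected[j]
        if match(a, e):
            # Equal within tolerance: consume both sides.
            i += 1
            j += 1
        elif str(a) < str(e):
            # `a` sorts first, so it has no partner in `expected`.
            diffs.append((a, None))
            i += 1
        else:
            # `e` sorts first (or ties on str), so it is absent from `actual`.
            diffs.append((None, e))
            j += 1
    # Whatever is left on either side is unmatched.
    diffs.extend((a, None) for a in actual[i:])
    diffs.extend((None, e) for e in expected[j:])
    return diffs
```

   Checking equality before comparing `str` values means two rows that 
differ only by floating-point noise are still treated as a match even though 
their string forms differ.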
   
   ### Why are the changes needed?
   
   Reproducer:
   
   ```python
   from pyspark.testing.utils import assertDataFrameEqual
   from pyspark.sql import Row
   
   actual   = [Row(id='1'), Row(id='2'), Row(id='3'), Row(id='4'), Row(id='5')]
   expected = [Row(id='1'), Row(id='2'),              Row(id='4'), Row(id='5')]
   assertDataFrameEqual(actual, expected)
   ```
   
   Before this fix: `Results do not match: ( 60.00000 % )` (3 of 5 rows 
reported as different).
   After this fix:  `Results do not match: ( 20.00000 % )` (only `Row(id='3')` 
is reported).
   
   A larger example from the JIRA: `rows1` has 5 rows and `rows2` contains only 
3 of them, with the other 2 missing from the middle. The old code reports 80% 
mismatch; the new code reports 40%, matching what a user would expect from a 
sorted-set comparison.
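
   The 80% and 40% figures can be checked by hand. The rows below are 
hypothetical (the JIRA's actual data is not reproduced here), a `Counter` 
difference stands in for the "sorted-set comparison", and the denominator is 
assumed to be the longer list's length.

```python
from collections import Counter
from itertools import zip_longest

# Hypothetical data: 5 rows, and 3 of them with two missing from the middle.
rows1 = ["a", "b", "c", "d", "e"]
rows2 = ["a", "d", "e"]
total = max(len(rows1), len(rows2))

# Positional pairing: everything after the first gap is misaligned.
positional = [(x, y) for x, y in zip_longest(rows1, rows2) if x != y]
old_percent = 100.0 * len(positional) / total

# Multiset comparison: count only rows with no partner on the other side.
c1, c2 = Counter(rows1), Counter(rows2)
unmatched = sum(((c1 - c2) + (c2 - c1)).values())
new_percent = 100.0 * unmatched / total

print(old_percent, new_percent)
```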
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but only in the error message / reported data when 
`assertDataFrameEqual(checkRowOrder=False)` is given inputs whose row counts 
differ:
   - The reported mismatch percentage is no longer inflated by positional 
shifting.
   - `includeDiffRows=True` now returns `(row, None)` tuples for extras and 
`(None, row)` tuples for missing rows, rather than shifted `(row, row)` pairs.
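
   With that tuple shape, a caller can separate the three kinds of 
difference mechanically. The helper below is a hypothetical illustration that 
operates on a plain list of `(actual, expected)` tuples; how that list is 
obtained from the assertion error is outside the scope of this sketch.

```python
def split_diff_rows(diff_rows):
    """Hypothetical helper: split (actual, expected) tuples into three groups."""
    extras = [a for a, e in diff_rows if e is None]
    missing = [e for a, e in diff_rows if a is None]
    changed = [(a, e) for a, e in diff_rows if a is not None and e is not None]
    return extras, missing, changed


# Example input in the shape described above (values are illustrative only).
sample = [("row3", None), (None, "row9"), ("rowB", "rowX")]
extras, missing, changed = split_diff_rows(sample)
print(extras, missing, changed)
```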
   
   Behavior when row counts are equal (including all existing docstring 
examples) is unchanged.
   
   ### How was this patch tested?
   
   - Added five targeted tests in `python/pyspark/sql/tests/test_utils.py`:
     - `test_different_row_count_middle_missing_no_cascading_diff`
     - `test_different_row_count_multiple_missing`
     - `test_different_row_count_includeDiffRows`
     - `test_different_row_count_mixed_extra_and_missing`
     - `test_different_row_count_extras_at_end`
   - Ran the full `test_utils.py` suite: 81 passed.
   - Verified the docstring examples (`maxErrors`, `showOnlyDiff`, 
`includeDiffRows`) still produce their documented output.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

