Aimilios Tsouvelekakis created SPARK-54090:
----------------------------------------------
Summary: AssertDataframeEqual carries rows when showing differences
Key: SPARK-54090
URL: https://issues.apache.org/jira/browse/SPARK-54090
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.0.1
Reporter: Aimilios Tsouvelekakis
When we call assertDataFrameEqual on two DataFrames that do not have the same
number of rows, the reported differences cascade from the first mismatch to the
end of the output. This is why it happens:
{code:java}
def assert_rows_equal(
    rows1: List[Row], rows2: List[Row], maxErrors: int = None,
    showOnlyDiff: bool = False
):
    __tracebackhide__ = True
    zipped = list(zip_longest(rows1, rows2))
    diff_rows_cnt = 0
    diff_rows = []
    has_diff_rows = False

    rows_str1 = ""
    rows_str2 = ""

    # count different rows
    for r1, r2 in zipped:
        if not compare_rows(r1, r2):
            diff_rows_cnt += 1
            has_diff_rows = True
            if includeDiffRows:
                diff_rows.append((r1, r2))
            rows_str1 += str(r1) + "\n"
            rows_str2 += str(r2) + "\n"
            if maxErrors is not None and diff_rows_cnt >= maxErrors:
                break
        elif not showOnlyDiff:
            rows_str1 += str(r1) + "\n"
            rows_str2 += str(r2) + "\n"

    generated_diff = _context_diff(
        actual=rows_str1.splitlines(), expected=rows_str2.splitlines(),
        n=len(zipped)
    )

    if has_diff_rows:
        error_msg = "Results do not match: "
        percent_diff = (diff_rows_cnt / len(zipped)) * 100
        error_msg += "( %.5f %% )" % percent_diff
        error_msg += "\n" + "\n".join(generated_diff)
        data = diff_rows if includeDiffRows else None
        raise PySparkAssertionError(
            errorClass="DIFFERENT_ROWS", messageParameters={"error_msg": error_msg}, data=data
        )
{code}
The problem lies in the way the rows are zipped:
{code:java}
zipped = list(zip_longest(rows1, rows2))
{code}
With zip_longest we assume that the rows are in order and compare them position
by position, but this does not work well with checkRowOrder, which defaults to
False.
If there is a one-row difference in a 100-row DataFrame, the reported percentage
will not be 1%. Instead it reflects the number of rows that cascade after that
difference, because every row following the mismatch is compared against the
wrong partner.
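
To make the cascade concrete, here is a minimal sketch that mimics only the positional counting described above. It is not the PySpark implementation: plain tuples stand in for Row objects, rows are compared with ==, and positional_diff_percent is a hypothetical helper name.

```python
from itertools import zip_longest


def positional_diff_percent(rows1, rows2):
    """Percentage of positions whose rows differ (hypothetical helper).

    Mimics the position-by-position counting done over zip_longest;
    plain equality stands in for compare_rows.
    """
    zipped = list(zip_longest(rows1, rows2))
    diff_rows_cnt = sum(1 for r1, r2 in zipped if r1 != r2)
    return 100 * diff_rows_cnt / len(zipped)


# 100 sorted rows; the "actual" side is missing only the row with id 10.
expected = [(i,) for i in range(100)]
actual = [row for row in expected if row != (10,)]

percent = positional_diff_percent(actual, expected)
print(f"{percent:.1f}% of positions differ")  # prints "90.0% of positions differ"
```

Only one row is missing, yet every position from index 10 onwards is paired with a shifted neighbour, so 90 of the 100 positions are counted as different.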
The best solution here would be either a set-based comparison
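
One way a set-based comparison might look is sketched below, using collections.Counter as a multiset so that duplicate rows are still counted. This is only an illustration under several assumptions: multiset_diff is a hypothetical name, plain tuples stand in for Row objects, rows must be hashable, and exact equality replaces any tolerance-based comparison of floating-point values. The way the percentage is defined here is also just one possible choice.

```python
from collections import Counter


def multiset_diff(actual_rows, expected_rows):
    """Order-independent row comparison (hypothetical sketch).

    Returns the rows missing from actual, the rows unexpected in actual,
    and the share of differing rows as a percentage.
    """
    actual_counts = Counter(actual_rows)
    expected_counts = Counter(expected_rows)

    # Counter subtraction keeps only positive counts.
    missing = expected_counts - actual_counts
    unexpected = actual_counts - expected_counts

    diff_rows_cnt = sum(missing.values()) + sum(unexpected.values())
    total = max(len(actual_rows), len(expected_rows))
    percent = 100 * diff_rows_cnt / total if total else 0.0
    return missing, unexpected, percent


# Same scenario: 100 rows expected, one row (id 10) missing from actual.
expected = [(i,) for i in range(100)]
actual = [row for row in expected if row != (10,)]

missing, unexpected, percent = multiset_diff(actual, expected)
print(dict(missing), dict(unexpected), f"{percent:.1f}%")
# prints "{(10,): 1} {} 1.0%"
```

With this approach a single missing row is reported as a single difference, regardless of where it would fall in sorted order.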