[
https://issues.apache.org/jira/browse/SPARK-54090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aimilios Tsouvelekakis updated SPARK-54090:
-------------------------------------------
Affects Version/s: 4.1.0
> AssertDataframeEqual carries rows when showing differences
> ----------------------------------------------------------
>
> Key: SPARK-54090
> URL: https://issues.apache.org/jira/browse/SPARK-54090
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.1.0, 4.0.1
> Reporter: Aimilios Tsouvelekakis
> Priority: Major
>
> When we try to do assertDataFrameEqual, on two dataframes that have not the
> same amount of rows, the output gets cascading from the difference till the
> end. Why this is happening:
> [https://github.com/apache/spark/blob/067969ff946712eeabf47040415f25000837cd87/python/pyspark/testing/utils.py#L1036]
> {code:java}
> def assert_rows_equal(
> rows1: List[Row], rows2: List[Row], maxErrors: int = None,
> showOnlyDiff: bool = False
> ):
> __tracebackhide__ = True
> zipped = list(zip_longest(rows1, rows2))
> diff_rows_cnt = 0
> diff_rows = []
> has_diff_rows = False
> rows_str1 = ""
> rows_str2 = ""
>
> # count different rows
> for r1, r2 in zipped:
> if not compare_rows(r1, r2):
> diff_rows_cnt += 1
> has_diff_rows = True
> if includeDiffRows:
> diff_rows.append((r1, r2))
> rows_str1 += str(r1) + "\n"
> rows_str2 += str(r2) + "\n"
> if maxErrors is not None and diff_rows_cnt >= maxErrors:
> break
> elif not showOnlyDiff:
> rows_str1 += str(r1) + "\n"
> rows_str2 += str(r2) + "\n"
> generated_diff = _context_diff(
> actual=rows_str1.splitlines(), expected=rows_str2.splitlines(),
> n=len(zipped)
> )
> if has_diff_rows:
> error_msg = "Results do not match: "
> percent_diff = (diff_rows_cnt / len(zipped)) * 100
> error_msg += "( %.5f %% )" % percent_diff
> error_msg += "\n" + "\n".join(generated_diff)
> data = diff_rows if includeDiffRows else None
> raise PySparkAssertionError(
> errorClass="DIFFERENT_ROWS", messageParameters={"error_msg":
> error_msg}, data=data
> ){code}
> The problem lies in the way that we zip the lines
> {code:java}
> zipped = list(zip_longest(rows1, rows2)){code}
> With zip longest we assume that the rows are in order and we do position by
> position comparison but it does not work well with checkRowOrder which
> defaults to False.
> If I have 1 line difference in 100 line dataframe the result percentage won't
> be 1% but the amount of rows that cascade towards on from that difference.
> The best solution here would be to have a set based comparison and return the
> percentage and the rows over that.
> A sample of what zip_longest is doing:
> {code:java}
> from itertools import zip_longest
> rows1 = [ 'A', 'B', 'C', 'D', 'E']
> rows2 = [ 'A', 'C', 'D',]
> zipped = list(zip_longest(rows1, rows2))zipped {code}
> Result:
> {code:java}
> [('A', 'A'), ('B', 'C'), ('C', 'D'), ('D', None), ('E', None)]{code}
> So in this case we would have 80% rows failure.
> This comes directly with the implementation of CheckRowOrder, when it is True
> we do not sort and it makes sense to use the zip_longest, when it is False it
> makes sense to use set based comparison since we have already sorted.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]