[ https://issues.apache.org/jira/browse/SPARK-54090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aimilios Tsouvelekakis updated SPARK-54090:
-------------------------------------------
    Description: 
When we call assertDataFrameEqual on two DataFrames that do not have the same number of rows, the reported differences cascade from the first mismatch all the way to the end of the output. Here is why this happens:

[https://github.com/apache/spark/blob/067969ff946712eeabf47040415f25000837cd87/python/pyspark/testing/utils.py#L1036]
{code:python}
    def assert_rows_equal(
        rows1: List[Row], rows2: List[Row], maxErrors: int = None, showOnlyDiff: bool = False
    ):
        __tracebackhide__ = True
        zipped = list(zip_longest(rows1, rows2))
        diff_rows_cnt = 0
        diff_rows = []
        has_diff_rows = False
        rows_str1 = ""
        rows_str2 = ""

        # count different rows
        for r1, r2 in zipped:
            if not compare_rows(r1, r2):
                diff_rows_cnt += 1
                has_diff_rows = True
                if includeDiffRows:
                    diff_rows.append((r1, r2))
                rows_str1 += str(r1) + "\n"
                rows_str2 += str(r2) + "\n"
                if maxErrors is not None and diff_rows_cnt >= maxErrors:
                    break
            elif not showOnlyDiff:
                rows_str1 += str(r1) + "\n"
                rows_str2 += str(r2) + "\n"
        generated_diff = _context_diff(
            actual=rows_str1.splitlines(), expected=rows_str2.splitlines(), n=len(zipped)
        )
        if has_diff_rows:
            error_msg = "Results do not match: "
            percent_diff = (diff_rows_cnt / len(zipped)) * 100
            error_msg += "( %.5f %% )" % percent_diff
            error_msg += "\n" + "\n".join(generated_diff)
            data = diff_rows if includeDiffRows else None
            raise PySparkAssertionError(
                errorClass="DIFFERENT_ROWS", messageParameters={"error_msg": error_msg}, data=data
            )
{code}
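For context, the failure is reachable through the public assertDataFrameEqual API. A minimal sketch (assuming an active SparkSession; the exact message text varies by version):
{code:python}
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

df_expected = spark.createDataFrame([(i,) for i in range(100)], ["id"])
df_actual = df_expected.where("id != 3")  # drop a single row

# Raises PySparkAssertionError with errorClass DIFFERENT_ROWS; although
# only one row is missing, the reported percentage covers every row
# after the first mismatch rather than ~1%.
assertDataFrameEqual(df_actual, df_expected)
{code}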
The problem lies in the way the rows are zipped:
{code:python}
zipped = list(zip_longest(rows1, rows2)){code}
With zip_longest we assume the rows are in order and compare them position by position, but this does not work well with checkRowOrder, which defaults to False.

If there is a single differing row in a 100-row DataFrame, the reported percentage will not be 1%: it counts every row that cascades on from that difference, because each subsequent row is shifted out of alignment with its positional counterpart.
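The cascade can be reproduced with plain Python, using zip_longest the same way the helper does (illustrative strings standing in for Row objects):
{code:python}
from itertools import zip_longest

expected = [f"Row(id={i})" for i in range(100)]
actual = [r for r in expected if r != "Row(id=3)"]  # one row missing

# Every pair from index 3 onward is misaligned, plus the final
# (None, "Row(id=99)") pair produced by zip_longest's fill value.
mismatches = sum(1 for r1, r2 in zip_longest(actual, expected) if r1 != r2)
print(f"{mismatches}/{len(expected)} pairs differ")  # 97/100, not 1/100
{code}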

The best solution here would be a set-based comparison, computing the reported percentage and the reported rows from the rows that actually differ.
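A minimal sketch of such a comparison, using collections.Counter as a multiset over the rows' string representations (count_row_diffs is a hypothetical helper, not the proposed PySpark implementation; a real fix would need to reuse the compare_rows semantics, e.g. approximate float equality):
{code:python}
from collections import Counter

def count_row_diffs(rows1, rows2):
    # Hypothetical helper: compare rows as multisets, so a single
    # missing row counts as exactly one difference instead of
    # cascading to the end of the zipped output.
    c1 = Counter(str(r) for r in rows1)
    c2 = Counter(str(r) for r in rows2)
    only_in_1 = list((c1 - c2).elements())  # rows present only in rows1
    only_in_2 = list((c2 - c1).elements())  # rows present only in rows2
    return max(len(only_in_1), len(only_in_2)), only_in_1, only_in_2

expected = [f"Row(id={i})" for i in range(100)]
actual = [r for r in expected if r != "Row(id=3)"]
diff_cnt, extra, missing = count_row_diffs(actual, expected)
print("( %.5f %% )" % (diff_cnt / len(expected) * 100))  # ( 1.00000 % )
{code}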


> AssertDataframeEqual carries rows when showing differences
> ----------------------------------------------------------
>
>                 Key: SPARK-54090
>                 URL: https://issues.apache.org/jira/browse/SPARK-54090
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.0.1
>            Reporter: Aimilios Tsouvelekakis
>            Priority: Major



