[ 
https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537423#comment-17537423
 ] 

Eyal Allweil commented on DATAFU-159:
-------------------------------------

I've discovered the 
[spark-extension|https://github.com/G-Research/spark-extension] library, which 
contains a 
[diff|https://github.com/G-Research/spark-extension/blob/master/DIFF.md] method 
that seems to do exactly this. The only caveat is that this library is 
provided for Spark 3.x, whereas DataFu targets Spark 2.x.
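
To illustrate what a keyed diff adds over _except_, here is a minimal sketch in plain Python. It is neither Spark nor spark-extension code. The N/C/D/I labels (no change, changed, deleted, inserted) are borrowed from the convention described in the library's DIFF.md; the function name keyed_diff and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def keyed_diff(left: List[Row], right: List[Row], key: str) -> List[Tuple[str, object, List[str]]]:
    """Compare two row lists by a key column.

    Returns (label, key value, changed columns) per key, where the label is
    N (no change), C (changed), D (only in left), or I (only in right).
    Assumes key values are unique within each list and sortable.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    result = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l, r = left_by_key.get(k), right_by_key.get(k)
        if r is None:
            result.append(("D", k, []))
        elif l is None:
            result.append(("I", k, []))
        else:
            # Columns whose values differ between the two versions of the row.
            changed = sorted(c for c in l.keys() | r.keys() if l.get(c) != r.get(c))
            result.append(("C" if changed else "N", k, changed))
    return result


expected = [
    {"id": 1, "value": "one"},
    {"id": 2, "value": "two"},
    {"id": 3, "value": "three"},
]
actual = [
    {"id": 1, "value": "one"},
    {"id": 2, "value": "Two"},
    {"id": 4, "value": "four"},
]

# An except-style comparison only lists rows of `expected` missing from `actual`.
# It cannot tell the changed row (id 2) from the deleted one (id 3), and it
# never reports the inserted row (id 4).
only_in_expected = [row for row in expected if row not in actual]

# The keyed diff classifies every key and names the columns that changed.
diff_rows = keyed_diff(expected, actual, "id")
```

The except-style result contains the rows for ids 2 and 3 with no indication of why they differ, whereas the keyed diff marks id 1 as N, id 2 as C (column "value"), id 3 as D, and id 4 as I.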

In light of this, I'm inclined to close this issue. Does anyone disagree? 
Alternatively, we could copy the code (with attribution) so that people on 
the Spark 2.x line can use it until they upgrade.

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand 
> the differences between two datasets - for example, when comparing expected 
> and actual results in regression testing.
> Spark provides the _except_ functionality, but this is often not enough for 
> that purpose - for example, see [this question on Stack 
> Overflow.|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala]
> Datafu-pig had a macro for doing this, and it could be a useful addition to 
> datafu-spark.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
