[ 
https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926354#comment-17926354
 ] 

Anna O commented on DATAFU-159:
-------------------------------

[~eyal] I've reviewed the [diff 
library|https://github.com/G-Research/spark-extension/blob/master/DIFF.md] 
and it seems more focused on providing diff results than diff statistics:
{{--statistics Only output statistics on how many rows exist per diff action (see 'Diffing options' section)}}

It may still be valuable to add an easy-to-use DataFrame comparison method that 
returns relative difference statistics per column. 
{code:java}
/**
 * Compares two DataFrames (`df1` and `df2`) and returns a DataFrame containing
 * comparison statistics.
 *
 * This function can perform comparisons based on provided keys or attempt to
 * infer keys if none are given.
 * It calculates various metrics to quantify the differences between the
 * DataFrames, including:
 *
 * - **When Keys Are Provided:**
 *   - `min_diff`, `max_diff`, `mean_diff`, `stddev_diff`: Minimum, maximum,
 *     mean, and standard deviation of the percentage difference for numerical
 *     columns.
 *   - `one_sided_null_percent`: Percentage of rows where a numerical column is
 *     null in one DataFrame but not in the other.
 *   - `under_1%_diff_percent`, `under_5%_diff_percent`,
 *     `under_10%_diff_percent`: Percentage of rows where the numerical
 *     difference is under 1%, 5%, and 10% respectively.
 *   - `non_numeric_diff_percent`: Percentage of rows where non-numeric columns
 *     differ.
 *   - `df1_non_matched_keys_percent`, `df2_non_matched_keys_percent`:
 *     Percentage of rows with keys present only in `df1` or `df2` respectively.
 *
 * - **When NO Keys Are Provided:**
 *   - `df1_version_count`, `df2_version_count`: Row counts of `df1` and `df2`
 *     respectively.
 *   - `only_in_df1`, `only_in_df2`: Number of rows unique to `df1` and `df2`.
 *
 * The returned DataFrame has the following schema:
 * - `column`: The name of the compared column or "general" for overall
 *   statistics.
 * - `metric`: The name of the calculated metric.
 * - `value`: The value of the metric (as a Float).
 *
 * @param df1 The first DataFrame to compare.
 * @param df2 The second DataFrame to compare.
 * @param keys An optional list of column names to use as keys for the
 *             comparison. If `None`, the function will attempt to infer keys
 *             based on distinct counts up to a limited number of columns
 *             (8 currently).
 * @param orderedSeq A boolean flag indicating whether to treat array columns
 *                   as ordered during comparison. Defaults to `false` (treat
 *                   arrays as unordered sets).
 * @return A DataFrame containing comparison statistics.
 */
def compareDFs(
    df1: DataFrame,
    df2: DataFrame,
    keys: Option[List[String]] = None,
    orderedSeq: Boolean = false): DataFrame
{code}
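
To make the keyed metrics above more concrete, here is a rough sketch in plain Scala collections rather than Spark, covering a single numeric column. Everything in it is an assumption made for illustration: the `CompareSketch` object, its `pctDiff` and `compare` helpers, the percentage-difference formula (relative to the `df1` value), and the denominators used for each percentage. The proposal does not define any of these. `stddev_diff` and `non_numeric_diff_percent` are left out to keep it short.

```scala
// Hypothetical illustration only: NOT the proposed compareDFs implementation.
// Each "DataFrame" is modelled as Map[key -> nullable numeric value].
object CompareSketch {
  // Assumption: percentage difference is measured relative to the df1 value.
  // A zero df1 value with a non-zero df2 value yields Infinity.
  def pctDiff(a: Double, b: Double): Double =
    if (a == b) 0.0 else 100.0 * math.abs(a - b) / math.abs(a)

  // Percentage helper that avoids dividing by zero on empty inputs.
  private def pct(n: Int, total: Int): Double =
    if (total == 0) 0.0 else 100.0 * n / total

  // Returns metric name -> value for one numeric column.
  def compare(df1: Map[String, Option[Double]],
              df2: Map[String, Option[Double]]): Map[String, Double] = {
    val matched = df1.keySet.intersect(df2.keySet)
    val pairs = matched.toSeq.map(k => (df1(k), df2(k)))

    // Only rows that are non-null on both sides feed the numeric statistics.
    val diffs = pairs.collect { case (Some(a), Some(b)) => pctDiff(a, b) }
    // Rows where exactly one side is null.
    val oneSidedNulls = pairs.count { case (a, b) => a.isDefined != b.isDefined }

    val numeric: Map[String, Double] =
      if (diffs.isEmpty) Map.empty
      else Map(
        "min_diff" -> diffs.min,
        "max_diff" -> diffs.max,
        "mean_diff" -> diffs.sum / diffs.size,
        // Assumption: "under N%" is a strict comparison over non-null pairs.
        "under_1%_diff_percent" -> pct(diffs.count(_ < 1.0), diffs.size),
        "under_5%_diff_percent" -> pct(diffs.count(_ < 5.0), diffs.size),
        "under_10%_diff_percent" -> pct(diffs.count(_ < 10.0), diffs.size)
      )

    numeric ++ Map(
      // Assumption: denominators are matched rows and each side's row count.
      "one_sided_null_percent" -> pct(oneSidedNulls, matched.size),
      "df1_non_matched_keys_percent" -> pct((df1.keySet -- df2.keySet).size, df1.size),
      "df2_non_matched_keys_percent" -> pct((df2.keySet -- df1.keySet).size, df2.size)
    )
  }
}
```

Calling `CompareSketch.compare(df1, df2)` on two small maps returns one value per metric name, which corresponds to the `metric` and `value` columns of the proposed result for a single `column`. A Spark version would presumably obtain the same quantities from a full outer join on the keys followed by aggregations.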

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand 
> the differences between two datasets - for example, doing regressions between 
> expected and actual results.
> Spark provides the _except_ functionality, but it is often not enough for 
> this purpose - for example, see [this question on Stack 
> Overflow.|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala]
> Datafu-pig had a macro for doing this, and this could be a useful addition to 
> datafu-spark.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
