[
https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926354#comment-17926354
]
Anna O commented on DATAFU-159:
-------------------------------
[~eyal] I've reviewed the [diff
library|https://github.com/G-Research/spark-extension/blob/master/DIFF.md]
and it seems more focused on providing diff results than diff statistics:
{{--statistics Only output statistics on how many rows exist per diff action (see 'Diffing options' section)}}
It may still be valuable to add an easy-to-use DataFrame comparison method that
returns relative difference statistics per column.
{code:scala}
/**
 * Compares two DataFrames (`df1` and `df2`) and returns a DataFrame containing
 * comparison statistics.
 *
 * This function can perform comparisons based on provided keys or attempt to
 * infer keys if none are given.
 * It calculates various metrics to quantify the differences between the
 * DataFrames, including:
 *
 * - **When Keys Are Provided:**
 *   - `min_diff`, `max_diff`, `mean_diff`, `stddev_diff`: Minimum, maximum,
 *     mean, and standard deviation of the percentage difference for numerical
 *     columns.
 *   - `one_sided_null_percent`: Percentage of rows where a numerical column is
 *     null in one DataFrame but not in the other.
 *   - `under_1%_diff_percent`, `under_5%_diff_percent`,
 *     `under_10%_diff_percent`: Percentage of rows where the numerical
 *     difference is under 1%, 5%, and 10% respectively.
 *   - `non_numeric_diff_percent`: Percentage of rows where non-numeric columns
 *     differ.
 *   - `df1_non_matched_keys_percent`, `df2_non_matched_keys_percent`:
 *     Percentage of rows with keys present only in `df1` or `df2` respectively.
 *
 * - **When No Keys Are Provided:**
 *   - `df1_version_count`, `df2_version_count`: Row counts of `df1` and `df2`
 *     respectively.
 *   - `only_in_df1`, `only_in_df2`: Number of rows unique to `df1` and `df2`.
 *
 * The returned DataFrame has the following schema:
 * - `column`: The name of the compared column or "general" for overall
 *   statistics.
 * - `metric`: The name of the calculated metric.
 * - `value`: The value of the metric (as a Float).
 *
 * @param df1 The first DataFrame to compare.
 * @param df2 The second DataFrame to compare.
 * @param keys An optional list of column names to use as keys for the
 *             comparison. If `None`, the function will attempt to infer keys
 *             based on distinct counts up to a limited number of columns
 *             (8 currently).
 * @param orderedSeq A boolean flag indicating whether to treat array columns
 *                   as ordered during comparison. Defaults to `false` (treat
 *                   arrays as unordered sets).
 * @return A DataFrame containing comparison statistics.
 */
def compareDFs(
    df1: DataFrame,
    df2: DataFrame,
    keys: Option[List[String]] = None,
    orderedSeq: Boolean = false): DataFrame
{code}
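To make the keyed metrics above more concrete, here is a minimal, Spark-free sketch for a single numeric column. It is only an illustration: the object name {{DiffStatsSketch}}, the {{Map[String, Option[Double]]}} representation of a keyed column, the symmetric percentage-difference formula, the sample standard deviation, and the choice of denominators (matched rows for the null and threshold metrics, each side's own row count for the non-matched keys) are all assumptions, not the proposed implementation.

```scala
// Spark-free sketch: one numeric column, rows keyed by a String key.
// All names, formulas, and denominators here are illustrative assumptions.
object DiffStatsSketch {

  // Symmetric percentage difference (assumption): |a - b| / max(|a|, |b|) * 100.
  // Two zeros are treated as identical.
  def pctDiff(a: Double, b: Double): Double = {
    val scale = math.max(math.abs(a), math.abs(b))
    if (scale == 0.0) 0.0 else math.abs(a - b) / scale * 100.0
  }

  // Share of `n` in `total` as a percentage; 0 when `total` is 0.
  private def percent(n: Int, total: Int): Double =
    if (total == 0) 0.0 else n * 100.0 / total

  // Returns (metric, value) pairs using the metric names from the Scaladoc above.
  def compareColumn(
      df1: Map[String, Option[Double]],
      df2: Map[String, Option[Double]]): Seq[(String, Double)] = {
    // Rows whose key exists on both sides.
    val matchedKeys = df1.keySet.intersect(df2.keySet)
    val pairs = matchedKeys.toSeq.map(k => (df1(k), df2(k)))

    // Percentage differences where both values are non-null.
    val diffs = pairs.collect { case (Some(a), Some(b)) => pctDiff(a, b) }
    // Matched rows where exactly one side is null.
    val oneSidedNulls = pairs.count { case (a, b) => a.isDefined != b.isDefined }

    val n = diffs.size
    val mean = if (n == 0) Double.NaN else diffs.sum / n
    // Sample standard deviation (n - 1 denominator); undefined for fewer than 2 values.
    val stddev =
      if (n < 2) Double.NaN
      else math.sqrt(diffs.map(d => (d - mean) * (d - mean)).sum / (n - 1))

    // Share of matched rows whose difference is strictly below `limit` percent.
    def under(limit: Double): Double = percent(diffs.count(_ < limit), pairs.size)

    Seq(
      "min_diff" -> (if (n == 0) Double.NaN else diffs.min),
      "max_diff" -> (if (n == 0) Double.NaN else diffs.max),
      "mean_diff" -> mean,
      "stddev_diff" -> stddev,
      "one_sided_null_percent" -> percent(oneSidedNulls, pairs.size),
      "under_1%_diff_percent" -> under(1.0),
      "under_5%_diff_percent" -> under(5.0),
      "under_10%_diff_percent" -> under(10.0),
      "df1_non_matched_keys_percent" ->
        percent(df1.keySet.diff(df2.keySet).size, df1.size),
      "df2_non_matched_keys_percent" ->
        percent(df2.keySet.diff(df1.keySet).size, df2.size)
    )
  }
}
```

In a Spark version, the same quantities would presumably come from a join on the key columns followed by aggregations, with one output row per column and metric as in the schema described above.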
> Add diff functionality to datafu-spark
> --------------------------------------
>
> Key: DATAFU-159
> URL: https://issues.apache.org/jira/browse/DATAFU-159
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Priority: Major
>
> A useful feature when examining results is the ability to clearly understand
> the differences between two datasets - for example, doing regressions between
> expected and actual results.
> Spark provides the _except_ functionality, but it is often not enough for
> this purpose - for example, see [this question on Stack
> Overflow|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala].
> Datafu-pig had a macro for doing this, and this could be a useful addition to
> datafu-spark.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)