[ https://issues.apache.org/jira/browse/DATAFU-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17926354#comment-17926354 ]
Anna O commented on DATAFU-159:
-------------------------------

[~eyal] I've reviewed the [diff library|https://github.com/G-Research/spark-extension/blob/master/DIFF.md], and it seems more focused on providing diff results than on diff statistics:

{{--statistics Only output statistics on how many rows exist per diff action (see 'Diffing options' section)}}

It may still be valuable to add an easy-to-use DataFrame comparison method that returns relative difference statistics per column.

{code:java}
/**
 * Compares two DataFrames (`df1` and `df2`) and returns a DataFrame containing comparison statistics.
 *
 * This function can perform comparisons based on provided keys or attempt to infer keys if none are given.
 * It calculates various metrics to quantify the differences between the DataFrames, including:
 *
 * - **When Keys Are Provided:**
 *   - `min_diff`, `max_diff`, `mean_diff`, `stddev_diff`: Minimum, maximum, mean, and standard deviation of the percentage difference for numerical columns.
 *   - `one_sided_null_percent`: Percentage of rows where a numerical column is null in one DataFrame but not in the other.
 *   - `under_1%_diff_percent`, `under_5%_diff_percent`, `under_10%_diff_percent`: Percentage of rows where the numerical difference is under 1%, 5%, and 10% respectively.
 *   - `non_numeric_diff_percent`: Percentage of rows where non-numeric columns differ.
 *   - `df1_non_matched_keys_percent`, `df2_non_matched_keys_percent`: Percentage of rows with keys present only in `df1` or `df2` respectively.
 *
 * - **When NO Keys Are Provided:**
 *   - `df1_version_count`, `df2_version_count`: Row counts of `df1` and `df2` respectively.
 *   - `only_in_df1`, `only_in_df2`: Number of rows unique to `df1` and `df2`.
 *
 * The returned DataFrame has the following schema:
 * - `column`: The name of the compared column or "general" for overall statistics.
 * - `metric`: The name of the calculated metric.
 * - `value`: The value of the metric (as a Float).
 *
 * @param df1 The first DataFrame to compare.
 * @param df2 The second DataFrame to compare.
 * @param keys An optional list of column names to use as keys for the comparison. If `None`, the function will attempt to infer keys based on distinct counts up to a limited number of columns (8 currently).
 * @param orderedSeq A boolean flag indicating whether to treat array columns as ordered during comparison. Defaults to `false` (treat arrays as unordered sets).
 * @return A DataFrame containing comparison statistics.
 */
def compareDFs(df1: DataFrame, df2: DataFrame, keys: Option[List[String]] = None, orderedSeq: Boolean = false): DataFrame
{code}

> Add diff functionality to datafu-spark
> --------------------------------------
>
>                 Key: DATAFU-159
>                 URL: https://issues.apache.org/jira/browse/DATAFU-159
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Priority: Major
>
> A useful feature when examining results is the ability to clearly understand
> the differences between two datasets - for example, doing regressions between
> expected and actual results.
> Spark provides the _except_ functionality, but this is often not enough for
> this - for example, see [this question on Stack
> Overflow.|https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala]
> Datafu-pig had a macro for doing this, and this could be a useful addition to
> datafu-spark.
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)