kachayev opened a new pull request #27916: [SPARK-30532] DataFrameStatFunctions to work with TABLE.COLUMN syntax
URL: https://github.com/apache/spark/pull/27916

### What changes were proposed in this pull request?

`DataFrameStatFunctions` now works correctly with fully qualified column names (Table.Column syntax) by properly resolving the name instead of relying on field names from the schema. The affected functions are:

* `approxQuantile`
* `freqItems`
* `cov`
* `corr`

(Other functions from `DataFrameStatFunctions` already work correctly.) See the code examples below.

### Why are the changes needed?

With the current implementation, some stat functions are impossible to use after joining datasets that share column names.

### Does this PR introduce any user-facing change?

Yes. Before the change, the following code would fail with `AnalysisException`. With the change, it runs as shown:

```scala
scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1")
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2")
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> val dfx = df2.crossJoin(df1)
dfx: org.apache.spark.sql.DataFrame = [num: int, num: int]

scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
res0: Array[Double] = Array(1.0)

scala> dfx.stat.corr("table1.num", "table2.num")
res1: Double = 1.0

scala> dfx.stat.cov("table1.num", "table2.num")
res2: Double = 11.0

scala> dfx.stat.freqItems(Array("table1.num", "table2.num"))
res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, table2.num_freqItems: array<int>]
```

### How was this patch tested?

Corresponding unit tests were added to `DataFrameStatSuite.scala` (marked as "SPARK-30532").
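For readers who want to reproduce the scenario outside the shell, a standalone check along the following lines may help. It is only a sketch and not the tests added to `DataFrameStatSuite.scala`: the object name `QualifiedStatColumnsSketch`, the local session settings, and the particular assertions are assumptions made for illustration. It only checks that the qualified-name calls complete and return results of the expected shape, and it assumes a Spark build that includes this change.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone check; not the suite code from the PR.
object QualifiedStatColumnsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real suite would reuse a shared test session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-30532-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Two aliased datasets that share the column name `num`.
      val df1 = (0 to 10).toDF("num").as("table1")
      val df2 = (0 to 10).toDF("num").as("table2")
      val dfx = df2.crossJoin(df1)

      // Each call passes a fully qualified name. Per the description above,
      // these calls fail with AnalysisException without the change.
      val quantiles = dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
      assert(quantiles.length == 1) // one result per requested probability

      val corr = dfx.stat.corr("table1.num", "table2.num")
      val cov = dfx.stat.cov("table1.num", "table2.num")
      assert(!corr.isNaN && !cov.isNaN)

      val freq = dfx.stat.freqItems(Array("table1.num", "table2.num"))
      assert(freq.columns.length == 2) // one freqItems column per input column
    } finally {
      spark.stop()
    }
  }
}
```

The assertions deliberately avoid pinning exact statistic values, so the sketch exercises name resolution only.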