kachayev opened a new pull request #27916: [SPARK-30532] DataFrameStatFunctions 
to work with TABLE.COLUMN syntax
URL: https://github.com/apache/spark/pull/27916
 
 
   ### What changes were proposed in this pull request?
   `DataFrameStatFunctions` now works correctly with fully qualified column 
names (`Table.Column` syntax) by properly resolving each name instead of relying 
on the field names from the schema, notably:
   * `approxQuantile`
   * `freqItems`
   * `cov`
   * `corr`
   
   (other functions from `DataFrameStatFunctions` already work correctly).
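
   To illustrate the distinction between looking a name up in the schema and 
resolving it, here is a minimal sketch. It is not the code from this patch; the 
object name and the local `SparkSession` setup are assumptions made only so the 
example runs on its own. It relies on `Dataset.schema` and `Dataset.col`, both 
part of the public API.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical standalone driver; names here are illustrative only.
   object QualifiedNameResolutionSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("spark-30532-sketch")
         .getOrCreate()
       import spark.implicits._

       val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
       val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
       val dfx = df2.crossJoin(df1)

       // The schema holds only bare field names ("num", "num"),
       // so a qualified name never matches any of them.
       val foundInSchema = dfx.schema.fieldNames.contains("table1.num")

       // Dataset.col resolves the name against the analyzed plan,
       // which takes the "table1" alias into account.
       val resolved = dfx.col("table1.num")

       println(s"found in schema: $foundInSchema")
       println(s"resolved column: $resolved")

       spark.stop()
     }
   }
   ```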
   
   See code examples below.
   
   ### Why are the changes needed?
   With the current implementation, some stat functions cannot be used on a 
join of datasets whose columns share the same name.
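
   For context, one way to sidestep the limitation without this change is to 
give the ambiguous columns unique names before calling the stat function. The 
sketch below is an assumption about how a user might do that, not something 
taken from this pull request; the aliases `t1_num` and `t2_num` and the object 
name are hypothetical.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   // Hypothetical standalone driver; names here are illustrative only.
   object RenameWorkaroundSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]")
         .appName("spark-30532-workaround")
         .getOrCreate()
       import spark.implicits._

       val df1 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table1")
       val df2 = spark.sparkContext.parallelize(0 to 10).toDF("num").as("table2")
       val dfx = df2.crossJoin(df1)

       // select() resolves qualified names, so the two "num" columns
       // can be projected under unambiguous aliases first.
       val renamed = dfx.select(
         col("table1.num").as("t1_num"),
         col("table2.num").as("t2_num"))

       // The stat function now sees a plain, unique field name.
       val quantiles = renamed.stat.approxQuantile("t1_num", Array(0.1), 0.0)
       println(quantiles.mkString(", "))

       spark.stop()
     }
   }
   ```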
   
   ### Does this PR introduce any user-facing change?
   Yes. Before the change, the following calls would fail with 
`AnalysisException`; with the change they return results as shown.
   
   ```scala
   scala> val df1 = sc.parallelize(0 to 10).toDF("num").as("table1")
   df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]
   
   scala> val df2 = sc.parallelize(0 to 10).toDF("num").as("table2")
   df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]
   
   scala> val dfx = df2.crossJoin(df1)
   dfx: org.apache.spark.sql.DataFrame = [num: int, num: int]
   
   scala> dfx.stat.approxQuantile("table1.num", Array(0.1), 0.0)
   res0: Array[Double] = Array(1.0)
   
   scala> dfx.stat.corr("table1.num", "table2.num")
   res1: Double = 1.0
   
   scala> dfx.stat.cov("table1.num", "table2.num")
   res2: Double = 11.0
   
   scala> dfx.stat.freqItems(Array("table1.num", "table2.num"))
   res3: org.apache.spark.sql.DataFrame = [table1.num_freqItems: array<int>, 
table2.num_freqItems: array<int>]
   ```
   
   ### How was this patch tested?
   Corresponding unit tests are added to `DataFrameStatSuite.scala` (marked as 
"SPARK-30532").
   
