Maxim Gekk created SPARK-25381: ---------------------------------- Summary: Stratified sampling by Column argument Key: SPARK-25381 URL: https://issues.apache.org/jira/browse/SPARK-25381 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk
Currently the sampleBy method accepts the first argument of string type only. Need to provide overloaded method which accepts Column type too. So, it will allow sampling by multiple columns , for example: {code:scala} import org.apache.spark.sql.Row import org.apache.spark.sql.functions.struct val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))).toDF("name", "age") val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0) df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show() +-----+---+ | name|age| +-----+---+ | Nico| 8| |Alice| 10| +-----+---+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org