[ https://issues.apache.org/jira/browse/SPARK-25381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li reassigned SPARK-25381:
-------------------------------

    Assignee: Maxim Gekk

> Stratified sampling by Column argument
> --------------------------------------
>
>                 Key: SPARK-25381
>                 URL: https://issues.apache.org/jira/browse/SPARK-25381
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> Currently the sampleBy method accepts only a string (a column name) as its
> first argument. An overloaded method that also accepts a Column is needed.
> This would allow sampling by multiple columns, for example:
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.struct
> // Assumes a spark-shell session, where spark and the $"..." syntax are in scope.
> val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8),
>   ("Bob", 17), ("Alice", 10))).toDF("name", "age")
> // Sampling fraction per (name, age) stratum; a stratum with no entry is treated as fraction 0.
> val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
> // Combine both columns into a single struct column that serves as the stratum key.
> df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
> +-----+---+
> | name|age|
> +-----+---+
> | Nico|  8|
> |Alice| 10|
> +-----+---+
> {code}
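
To illustrate what sampling by a multi-column stratum means, here is a minimal plain-Scala sketch that does not use Spark. It is not Spark's implementation. The object `StratifiedSampleSketch` and its `sampleBy` helper are hypothetical names introduced only for this illustration. The sketch assumes each row is kept independently with the probability given for its stratum, and that a stratum with no entry has fraction 0. Because it uses its own random number generator, it will not reproduce the exact rows shown in the issue for seed 36.

```scala
import scala.util.Random

// Hypothetical illustration only; not part of Spark.
object StratifiedSampleSketch {

  // Keep each row independently with probability fractions(key(row)).
  // A stratum that has no entry in `fractions` is treated as fraction 0.0.
  def sampleBy[A, K](rows: Seq[A], key: A => K, fractions: Map[K, Double], seed: Long): Seq[A] = {
    val rng = new Random(seed)
    // nextDouble() is in [0.0, 1.0), so fraction 1.0 always keeps and 0.0 never keeps.
    rows.filter(row => rng.nextDouble() < fractions.getOrElse(key(row), 0.0))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))

    // The stratum key is the whole (name, age) pair, mirroring struct($"name", $"age").
    val fractions = Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0)

    val sample = sampleBy(rows, (r: (String, Int)) => r, fractions, 36L)
    sample.foreach(println)
  }
}
```

With these fractions, ("Nico", 8) is always kept, the ("Bob", 17) rows are always dropped, and each ("Alice", 10) row is kept with probability 0.3.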