[jira] [Created] (SPARK-25381) Stratified sampling by Column argument

Maxim Gekk (JIRA) Sat, 08 Sep 2018 09:07:25 -0700

Maxim Gekk created SPARK-25381:
----------------------------------

             Summary: Stratified sampling by Column argument
                 Key: SPARK-25381
                 URL: https://issues.apache.org/jira/browse/SPARK-25381
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk



Currently, the sampleBy method accepts only a string (a column name) as its
first argument. An overloaded method that also accepts a Column should be
provided. This would allow sampling by multiple columns, for example:
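{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct
import spark.implicits._ // enables the $"..." column syntax (already in scope in spark-shell)

val df = spark.createDataFrame(Seq(
  ("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10)
)).toDF("name", "age")

// Sampling fraction per (name, age) stratum; a stratum that is not listed
// is treated as having fraction 0.0.
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)

// Proposed overload: the first argument is a Column instead of a String.
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
// +-----+---+
// | name|age|
// +-----+---+
// | Nico|  8|
// |Alice| 10|
// +-----+---+
{code}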
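
To make the intended semantics of the example above concrete, here is a minimal sketch in plain Scala (no Spark dependency) of stratified sampling keyed by a composite value. The object StratifiedSampleSketch and its sampleBy helper are hypothetical names used only for illustration; this is not Spark's implementation. It assumes each row is kept independently with the probability configured for its stratum, and that a stratum missing from the map has fraction 0.0.

```scala
import scala.util.Random

object StratifiedSampleSketch {

  // Hypothetical helper (not a Spark API): keeps each row independently
  // with the probability configured for its stratum.
  def sampleBy[A, K](
      rows: Seq[A],
      key: A => K,
      fractions: Map[K, Double],
      seed: Long): Seq[A] = {
    require(
      fractions.values.forall(f => f >= 0.0 && f <= 1.0),
      "Fractions must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter { row =>
      // One uniform draw in [0.0, 1.0) per row; strata without an entry
      // fall back to 0.0 and are therefore never selected.
      val draw = rng.nextDouble()
      draw < fractions.getOrElse(key(row), 0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))
    // The stratum key is the whole (name, age) pair, i.e. two "columns" at once.
    val fractions = Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0)
    sampleBy(people, (p: (String, Int)) => p, fractions, 36L).foreach(println)
  }
}
```

Under these assumptions, rows in a stratum with fraction 1.0 are always kept and rows in an unlisted stratum are always dropped; which "Alice" rows survive depends on the seed and the random generator, so it need not match the output shown in the Spark example.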



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
