[jira] [Updated] (SPARK-16686) Dataset.sample with seed: result seems to depend on downstream usage

Joseph K. Bradley (JIRA) Fri, 22 Jul 2016 11:00:54 -0700

     [ https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joseph K. Bradley updated SPARK-16686:
--------------------------------------
    Affects Version/s: 1.6.2
          Environment: 
Spark 1.6.2 and Spark 2.0 - RC4
Standalone
Single-worker cluster

  was:
Spark 2.0 - RC4
Standalone
Single-worker cluster


> Dataset.sample with seed: result seems to depend on downstream usage
> --------------------------------------------------------------------
>
>                 Key: SPARK-16686
>                 URL: https://issues.apache.org/jira/browse/SPARK-16686
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Spark 1.6.2 and Spark 2.0 - RC4
> Standalone
> Single-worker cluster
>            Reporter: Joseph K. Bradley
>         Attachments: DataFrame.sample bug - 2.0.html
>
>
> Summary to reproduce bug:
> * Create a DataFrame DF, and sample it with a fixed seed.
> * Collect that DataFrame -> result1
> * Call a particular UDF on that DataFrame -> result2
> You would expect result1 and result2 to be built from the same sampled rows 
> of DF, but they appear not to be.
> Note: result1 and result2 are both deterministic.
> See the attached notebook for details.  Cells in the notebook were executed 
> in order.
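
The steps in the summary can be sketched roughly as follows in PySpark. This is not the code from the attached notebook: the input DataFrame, the `add_one` UDF, the fraction and the seed are all hypothetical, and the report says only "a particular UDF" shows the problem, so this sketch may not trigger the discrepancy. It only shows the shape of the comparison between result1 and result2.

```python
def sampled_ids(rows):
    """Return the sorted ids of the given rows so two results can be compared."""
    return sorted(row["id"] for row in rows)


def reproduce(spark, fraction=0.5, seed=42):
    """Follow the three steps of the summary on a hypothetical DataFrame.

    `spark` is assumed to be a SparkSession (2.0) or SQLContext (1.6).
    """
    # Imported lazily so sampled_ids() can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    # Step 1: create a DataFrame (hypothetical data) and sample it with a fixed seed.
    df = spark.range(0, 100)
    sampled = df.sample(False, fraction, seed)

    # Step 2: collect the sampled DataFrame -> result1.
    result1 = sampled.collect()

    # Step 3: call a UDF on the same sampled DataFrame -> result2.
    # add_one is a stand-in; the UDF used in the notebook is not shown in this message.
    add_one = udf(lambda x: x + 1, LongType())
    result2 = sampled.withColumn("out", add_one(sampled["id"])).collect()

    return sampled_ids(result1), sampled_ids(result2)
```

If the sample were independent of downstream usage, the two lists returned by `reproduce(spark)` would be equal; the report describes a case where the corresponding rows differ.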



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

