[ https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-16686:
--------------------------------------
    Affects Version/s: 1.6.2
          Environment:
Spark 1.6.2 and Spark 2.0 - RC4
Standalone
Single-worker cluster

  was:
Spark 2.0 - RC4
Standalone
Single-worker cluster

> Dataset.sample with seed: result seems to depend on downstream usage
> --------------------------------------------------------------------
>
>                 Key: SPARK-16686
>                 URL: https://issues.apache.org/jira/browse/SPARK-16686
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Spark 1.6.2 and Spark 2.0 - RC4
> Standalone
> Single-worker cluster
>            Reporter: Joseph K. Bradley
>         Attachments: DataFrame.sample bug - 2.0.html
>
>
> Summary to reproduce bug:
> * Create a DataFrame DF, and sample it with a fixed seed.
> * Collect that DataFrame -> result1
> * Call a particular UDF on that DataFrame -> result2
> You would expect results 1 and 2 to use the same rows from DF, but they
> appear not to.
> Note: result1 and result2 are both deterministic.
> See the attached notebook for details.  Cells in the notebook were executed
> in order.