Joseph K. Bradley created SPARK-16686:
-----------------------------------------

             Summary: Dataset.sample with seed: result seems to depend on downstream usage
                 Key: SPARK-16686
                 URL: https://issues.apache.org/jira/browse/SPARK-16686
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
         Environment: Spark 2.0 - RC4
                      Standalone
                      Single-worker cluster
            Reporter: Joseph K. Bradley


Summary to reproduce bug:
* Create a DataFrame DF, and sample it with a fixed seed.
* Collect that DataFrame -> result1
* Call a particular UDF on that DataFrame -> result2

You would expect results 1 and 2 to use the same rows from DF, but they appear not to.
Note: result1 and result2 are both deterministic.

See the attached notebook for details. Cells in the notebook were executed in order.
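
For readers without access to the attached notebook, the three reproduction steps might be sketched in PySpark roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the input data (`spark.range`), the fraction, the seed, and the `plus_one` UDF are all hypothetical stand-ins, and `sampled_ids`, `same_rows`, and `reproduce` are names invented here. Whether this particular UDF triggers the mismatch on an affected build has not been verified.

```python
def sampled_ids(rows):
    """Return the sorted `id` values from a list of rows (dicts or Spark Rows)."""
    return sorted(row["id"] for row in rows)


def same_rows(result1, result2):
    """True when both results were built from the same sampled rows."""
    return sampled_ids(result1) == sampled_ids(result2)


def reproduce(spark, fraction=0.5, seed=42):
    """Run the three steps from the report against a live SparkSession.

    `fraction` and `seed` are arbitrary values chosen for this sketch.
    """
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    # Step 1: create a DataFrame DF and sample it with a fixed seed.
    df = spark.range(0, 1000)
    sampled = df.sample(False, fraction, seed)

    # Step 2: collect the sampled DataFrame directly -> result1.
    result1 = sampled.collect()

    # Step 3: call a UDF on the same sampled DataFrame -> result2.
    # Hypothetical stand-in for the "particular UDF" in the notebook.
    plus_one = udf(lambda x: x + 1, LongType())
    result2 = sampled.select(
        "id", plus_one("id").alias("id_plus_one")
    ).collect()

    # Expected: True. The report suggests this can be False on 2.0.0 RC4.
    return same_rows(result1, result2)
```

Calling `reproduce(spark)` with an active `SparkSession` returns whether both results were drawn from the same rows of DF. Comparing `id` values rather than whole rows keeps the check independent of the extra column that the UDF adds.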