[ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259977#comment-14259977 ]
Apache Spark commented on SPARK-4963:
-------------------------------------

User 'yanbohappy' has created a pull request for this issue:
https://github.com/apache/spark/pull/3827

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>
> The following {{sbt/sbt hive/console}} session reproduces the issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true`: "sampled" holds only even keys, so the odd-key query must be empty
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when fraction is less than 0.4, {{GapSamplingIterator}} is used
> to do the sampling. My guess is that this has something to do with the
> underlying mutable row objects used in {{HiveTableScan}}, but I haven't
> figured out the root cause yet.
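
The guess about mutable row objects can be illustrated without Spark. The sketch below is not the actual {{GapSamplingIterator}} or {{HiveTableScan}} code; {{MutableRow}}, {{scan}}, {{SkipAfterNext}} and {{sampledKeys}} are hypothetical names made up for this example. It assumes a scan that reuses one row object for every record, and a sampler that takes an element and then skips ahead before returning it. Under those assumptions, the returned object has already been overwritten by the skipped records, and copying each row before sampling avoids that.

```scala
object MutableRowDemo {
  // Hypothetical stand-in for a row object that a scan reuses for every record.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // Yields the SAME MutableRow instance for every key, overwriting it in place.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Deterministic stand-in for a gap-style sampler (an assumption, not Spark's
  // implementation): take one element, then skip up to `gap` elements before
  // handing the taken one back to the caller.
  final class SkipAfterNext[T](data: Iterator[T], gap: Int) extends Iterator[T] {
    def hasNext: Boolean = data.hasNext
    def next(): T = {
      val taken = data.next()
      var skipped = 0
      while (skipped < gap && data.hasNext) { data.next(); skipped += 1 }
      taken // with a shared row, this now holds the last skipped record's key
    }
  }

  // Samples keys 0..9 with a fixed gap of 2, with or without copying each row.
  def sampledKeys(copyRows: Boolean): List[Int] = {
    val rows = if (copyRows) scan(0 until 10).map(_.copy()) else scan(0 until 10)
    new SkipAfterNext(rows, gap = 2).map(_.key).toList
  }

  def main(args: Array[String]): Unit = {
    println(sampledKeys(copyRows = false)) // List(2, 5, 8, 9)
    println(sampledKeys(copyRows = true))  // List(0, 3, 6, 9)
  }
}
```

Without the copy, every returned key except the last belongs to a record that the sampler skipped. Whether this is the actual root cause of the behaviour reported here is not established by this sketch.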