[ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260782#comment-14260782 ]
Michael Armbrust commented on SPARK-4963:
-----------------------------------------

We could create a new operator, but the problem here is that we sometimes use operator-specific logic to decide when to copy. For example, we do this in exchange to avoid copies when the shuffle is going to be hash-based. For that reason I think it might be okay to just do .map(_.copy()) before calling Spark's sample method.

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
> Key: SPARK-4963
> URL: https://issues.apache.org/jira/browse/SPARK-4963
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Cheng Lian
> Assignee: Yanbo Liang
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when fraction is less than 0.4, {{GapSamplingIterator}} is used
> to do the sampling. My guess is that it has something to do with the
> underlying mutable row objects used in {{HiveTableScan}}, but I haven't
> figured out the root cause.
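
For illustration only, here is a minimal sketch of the general hazard that the suggested .map(_.copy()) guards against: an iterator that hands out the same mutable row object for every record. It does not use Spark and is not claimed to be the root cause of this issue. MutableRow, RowReuseSketch, and scanReusingRow are hypothetical stand-ins, not Spark classes.

```scala
// Hypothetical stand-in for a mutable row; this is NOT Spark's actual class.
final class MutableRow(var key: Int) {
  // Returns an independent snapshot of the current contents.
  def copy(): MutableRow = new MutableRow(key)
}

object RowReuseSketch {
  // Simulates a scan that reuses one row object for every record it emits.
  def scanReusingRow(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Retains references without copying: every element is the same object,
  // so all of them show whatever value was written last.
  def bufferedWithoutCopy(keys: Seq[Int]): Seq[Int] =
    scanReusingRow(keys).toVector.map(_.key)

  // Copies each row as it is produced, before anything retains it.
  def bufferedWithCopy(keys: Seq[Int]): Seq[Int] =
    scanReusingRow(keys).map(_.copy()).toVector.map(_.key)
}
```

With the input Seq(2, 4, 6), bufferedWithoutCopy yields 6, 6, 6, while bufferedWithCopy yields 2, 4, 6. The trade-off mentioned in the comment still applies: copying every row costs an allocation per record, which is why some operators decide case by case whether to copy.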