[ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260790#comment-14260790 ]
Cheng Lian edited comment on SPARK-4963 at 12/30/14 4:49 AM:
-------------------------------------------------------------

OK, I agree. However, the {{_.copy()}} call should be added in the SQL {{Sample}} operator rather than in Spark's {{RDD.sample}} method, since the element type of an arbitrary RDD is not necessarily a {{Product}}, in which case {{_.copy()}} is not available.

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>            Assignee: Yanbo Liang
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when fraction is less than 0.4, {{GapSamplingIterator}} is used
> to do the sampling. My guess is that this has something to do with the
> underlying mutable row objects used in {{HiveTableScan}}, but I haven't figured
> out the root cause.
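
To illustrate the suspicion about mutable rows, here is a minimal sketch in plain Scala with no Spark dependency. It is one plausible way the problem could arise, not a confirmed root cause. {{MutableRow}}, {{scan}}, and {{gapSample}} are hypothetical stand-ins invented for this example; they are not Spark's {{Row}}, {{HiveTableScan}}, or {{GapSamplingIterator}}. The assumption is that the sampler holds a reference to the element it wants to keep while it advances the underlying iterator past the skipped ones.

```scala
// Hypothetical stand-in for a mutable row; not Spark's Row class.
final class MutableRow(var key: Int) {
  def copy(): MutableRow = new MutableRow(key)
}

object RowReuseSketch {
  // Mimics a scan that reuses one mutable row object for every record.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Simplified gap sampling (an assumption, not Spark's implementation):
  // keep one element, then skip up to `gap` elements before returning it.
  def gapSample[T](underlying: Iterator[T], gap: Int): Iterator[T] =
    new Iterator[T] {
      def hasNext: Boolean = underlying.hasNext
      def next(): T = {
        val kept = underlying.next()
        var skipped = 0
        while (skipped < gap && underlying.hasNext) {
          underlying.next() // with a shared row, this overwrites `kept`
          skipped += 1
        }
        kept
      }
    }

  def main(args: Array[String]): Unit = {
    val keys = Seq(0, 1, 2, 3, 4, 5)

    // Without copying: the kept row is overwritten by the skipped record.
    println(gapSample(scan(keys), 1).map(_.key).toList)
    // prints List(1, 3, 5)

    // Copying each row before sampling preserves the kept values.
    println(gapSample(scan(keys).map(_.copy()), 1).map(_.key).toList)
    // prints List(0, 2, 4)
  }
}
```

In this sketch the copy happens on the row iterator before sampling, which corresponds to placing {{_.copy()}} in the operator that knows its elements are rows. The generic {{gapSample}} cannot do this itself because its element type {{T}} has no {{copy()}} method.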