Cheng Lian created SPARK-4963:
---------------------------------

             Summary: SchemaRDD.sample may return wrong results
                 Key: SPARK-4963
                 URL: https://issues.apache.org/jira/browse/SPARK-4963
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.0
            Reporter: Cheng Lian


The following {{sbt/sbt hive/console}} session reproduces the issue:
{code}
sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05).
  registerTempTable("sampled")

println(table("sampled").queryExecution)

val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
println(query.queryExecution)

// Should print `true`
println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
{code}
Note that when the fraction is less than 0.4, {{GapSamplingIterator}} is used to 
do the sampling. My guess is that this has something to do with the underlying 
mutable row objects used in {{HiveTableScan}}, but I haven't figured out the root 
cause yet.
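
To make the guess above concrete, here is a minimal standalone sketch of how a reused mutable row could interact badly with a sampler that skips ahead. It does not use any Spark classes: {{MutableRow}}, {{scan}}, {{gapSample}} and {{evenKeys}} are hypothetical stand-ins invented for illustration, and the sampler skips a fixed gap rather than a random one. It shows only one plausible mechanism, not a confirmed root cause.

```scala
// Hypothetical stand-in for a mutable row; NOT Spark's row class.
final class MutableRow(var key: Int) {
  def copy(): MutableRow = new MutableRow(key)
}

object MutableRowSamplingSketch {
  // Emits keys 0 until n while reusing ONE row object for every element,
  // as a table scan might do to avoid per-row allocation (assumption).
  def scan(n: Int): Iterator[MutableRow] = {
    val shared = new MutableRow(-1)
    Iterator.range(0, n).map { k => shared.key = k; shared }
  }

  // Deterministic "gap" sampler (hypothetical): keeps one element, then
  // skips up to `gap` elements. It advances the underlying iterator
  // BEFORE handing the kept element back to the caller.
  def gapSample[T](data: Iterator[T], gap: Int): Iterator[T] = new Iterator[T] {
    def hasNext: Boolean = data.hasNext
    def next(): T = {
      val kept = data.next()
      var i = 0
      while (i < gap && data.hasNext) { data.next(); i += 1 }
      kept // if `kept` is the shared row, it has been overwritten by now
    }
  }

  // Mirrors the `key % 2 = 0` predicate from the reproduction above.
  def evenKeys(rows: Iterator[MutableRow]): Iterator[MutableRow] =
    rows.filter(_.key % 2 == 0)

  // Sampling the shared row directly: the keys read afterwards belong to
  // whatever row the scan produced last, which can fail the filter.
  val unsafe: List[Int] =
    gapSample(evenKeys(scan(18)), gap = 1).map(_.key).toList

  // Copying each row before sampling keeps the sampled values stable.
  val safe: List[Int] =
    gapSample(evenKeys(scan(18)).map(_.copy()), gap = 1).map(_.key).toList

  def main(args: Array[String]): Unit = {
    println(s"unsafe sample: $unsafe")
    println(s"safe sample:   $safe")
  }
}
```

In this sketch the unsafe variant can return keys that never passed the filter, because both the skip loop and the look-ahead inside {{Iterator.filter}} overwrite the shared row before its key is read. If the real problem is of this kind, copying rows before they reach the sampler would be one possible workaround, at the cost of extra allocations.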


