[ 
https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260355#comment-14260355
 ] 

Xiangrui Meng commented on SPARK-4963:
--------------------------------------

[~yanboliang] Thanks for looking into this issue! I've assigned the JIRA to you.

1. What's the overhead of making HiveTableScan copy each mutable row before 
returning it?
2. Does this issue also apply to operations other than sample? For example, 
a user may operate directly on a SchemaRDD.
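
To make question 2 concrete, here is a minimal sketch of the aliasing hazard 
in plain Scala. It does not reproduce the sampling bug itself and does not use 
any Spark classes: {{MutableRow}} and {{scan}} are hypothetical stand-ins, and 
the assumption that the scan reuses one row object per partition comes from 
the guess in the issue description, not from a confirmed root cause.

```scala
object RowReuseDemo {
  // Hypothetical stand-in for a mutable row; not Spark's actual row class.
  final class MutableRow(var key: Int) {
    def copy(): MutableRow = new MutableRow(key)
  }

  // Hypothetical scan that reuses a single row object for every record
  // (assumed behavior, mirroring what the issue suspects of HiveTableScan).
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val reused = new MutableRow(0)
    keys.iterator.map { k => reused.key = k; reused }
  }

  val keys: Seq[Int] = Seq(0, 2, 4, 6)

  // Retaining references across next() calls: every element is the same
  // object, so all of them show the last value written.
  val aliased: List[Int] = scan(keys).toList.map(_.key)

  // Copying each row before it is retained preserves per-record values.
  val copied: List[Int] = scan(keys).map(_.copy()).toList.map(_.key)

  def main(args: Array[String]): Unit = {
    println(aliased) // List(6, 6, 6, 6)
    println(copied)  // List(0, 2, 4, 6)
  }
}
```

If the scan does behave this way, any user code that retains rows from a 
SchemaRDD without copying them could be affected, not just sample.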

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>            Assignee: Yanbo Liang
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when fraction is less than 0.4, {{GapSamplingIterator}} is used 
> to do the sampling. My guess is that this has something to do with the 
> underlying mutable row objects used in {{HiveTableScan}}, but I haven't 
> figured out the root cause.



