[jira] [Commented] (SPARK-4963) SchemaRDD.sample may return wrong results

Michael Armbrust (JIRA) Tue, 30 Dec 2014 10:35:06 -0800

    [ https://issues.apache.org/jira/browse/SPARK-4963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261331#comment-14261331 ]


Michael Armbrust commented on SPARK-4963:
-----------------------------------------

Row mutability is an internal optimization, and we always copy rows at the 
boundaries where data is exposed to the user.  We should not remove it from the 
Parquet or Hive table scans, because it greatly improves performance.
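
To make the copy-at-boundaries point concrete, here is a minimal sketch in plain 
Scala.  It does not use Spark's actual classes: {{MutableRow}}, {{scan}} and 
{{takeThenSkip}} are hypothetical names invented for illustration, and the 
"fetch one row, then skip ahead" consumer is only an assumed example of an 
operator that holds a row reference while the scan keeps advancing.  It is not a 
confirmed diagnosis of this issue.

```scala
// Hypothetical stand-in for a mutable row; NOT Spark's actual Row class.
final class MutableRow(var key: Int) {
  def copy(): MutableRow = new MutableRow(key)
}

object MutableRowDemo {
  // A scan that reuses ONE row object for every record to avoid allocations.
  def scan(keys: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    keys.iterator.map { k => shared.key = k; shared }
  }

  // Assumed consumer: fetches a row, then skips up to `gap` rows before
  // returning the fetched one. Skipping overwrites a shared row in place.
  def takeThenSkip(rows: Iterator[MutableRow], gap: Int): Iterator[MutableRow] =
    new Iterator[MutableRow] {
      def hasNext: Boolean = rows.hasNext
      def next(): MutableRow = {
        val r = rows.next()
        var skipped = 0
        while (skipped < gap && rows.hasNext) { rows.next(); skipped += 1 }
        r
      }
    }

  def main(args: Array[String]): Unit = {
    val keys = Seq(0, 1, 2, 3, 4, 5)
    // No copy: each returned row was already overwritten by the skipped record.
    val unsafe = takeThenSkip(scan(keys), gap = 1).map(_.key).toList
    // Copy at the boundary: each returned row keeps its own value.
    val safe = takeThenSkip(scan(keys).map(_.copy()), gap = 1).map(_.key).toList
    println(s"without copy: $unsafe") // without copy: List(1, 3, 5)
    println(s"with copy:    $safe")   // with copy:    List(0, 2, 4)
  }
}
```

The scan itself stays allocation-free in both runs; only the consumer that 
retains a row across a call to {{next()}} pays for the copy, which is the 
trade-off described above.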

> SchemaRDD.sample may return wrong results
> -----------------------------------------
>
>                 Key: SPARK-4963
>                 URL: https://issues.apache.org/jira/browse/SPARK-4963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Cheng Lian
>            Assignee: Yanbo Liang
>
> This {{sbt/sbt hive/console}} session can easily reproduce this issue:
> {code}
> sql("SELECT * FROM src WHERE key % 2 = 0").
>   sample(withReplacement = false, fraction = 0.05).
>   registerTempTable("sampled")
> println(table("sampled").queryExecution)
> val query = sql("SELECT * FROM sampled WHERE key % 2 = 1")
> println(query.queryExecution)
> // Should print `true'
> println((1 to 10).map(_ => query.collect().isEmpty).reduce(_ && _))
> {code}
> Notice that when {{fraction}} is less than 0.4, {{GapSamplingIterator}} is used 
> to do the sampling. My guess is that this has something to do with the 
> underlying mutable row objects used in {{HiveTableScan}}, but I haven't figured 
> out the root cause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
