[ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-17752:
------------------------------------------
    Fix Version/s: 2.1.0
                   2.0.1

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical
>             Fix For: 2.0.1, 2.1.0
>
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> # Point SPARK_HOME at a local Spark 2.0.0 installation.
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
> # Build a one-row local data frame with 1000 integer columns named X1..X1000.
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> # Convert to a Spark DataFrame, cache it, and collect it back to R.
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)
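
The reproduction above reports only that `df` and `cl` differ. A small helper can narrow down which columns are affected. This is a minimal sketch in base R: `diff_columns` is a hypothetical name, not a SparkR function, and the stand-in data frames take the place of `df` and `cl` so that the snippet runs without Spark.

```r
# Hypothetical helper (not part of SparkR): return the names of the columns
# whose contents are not identical between two data frames.
diff_columns <- function(expected, actual) {
  stopifnot(ncol(expected) == ncol(actual))
  # Data frames are lists of columns, so mapply compares them pairwise.
  same <- mapply(identical, expected, actual)
  names(expected)[!same]
}

# Stand-in data so the sketch runs on its own; in the reproduction,
# call diff_columns(df, cl) instead.
expected <- data.frame(X1 = 1L, X2 = 1L, X3 = 1L)
actual   <- data.frame(X1 = 1L, X2 = 0L, X3 = 1L)
print(diff_columns(expected, actual))
```

Because `identical` also treats a type change (for example integer to double) as a difference, the helper flags columns whose values look equal but whose storage type changed during the round trip.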



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
