[ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shivaram Venkataraman updated SPARK-17752:
------------------------------------------
    Fix Version/s: 2.1.0
                   2.0.1

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many
> columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical
>             Fix For: 2.0.1, 2.1.0
>
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is
> an error in the Java / Scala Spark sources.
> For posterity:
>
> sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org