[ https://issues.apache.org/jira/browse/SPARK-24756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24756.
----------------------------------
    Resolution: Incomplete

> Incorrect Statistics
> --------------------
>
>                 Key: SPARK-24756
>                 URL: https://issues.apache.org/jira/browse/SPARK-24756
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Nick Jordan
>            Priority: Major
>              Labels: bulk-closed
>
> I'm getting some odd results when looking at the statistics for a simple data
> frame:
> {code:java}
> val df = spark.sparkContext.parallelize(Seq("y")).toDF("y")
> df.queryExecution.stringWithStats{code}
> {noformat}
> == Optimized Logical Plan ==
> Project [value#7 AS y#9], Statistics(sizeInBytes=8.0 EB, hints=none)
> +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7], Statistics(sizeInBytes=8.0 EB, hints=none)
>    +- ExternalRDD [obj#6], Statistics(sizeInBytes=8.0 EB, hints=none)
> {noformat}
> 8.0 Exabytes is clearly not right here. It is worth noting that if I don't
> parallelize the Seq then I get the expected results.
> This surfaced when I was running unit tests that verified that a broadcast
> hint was preserved which was failing because of the incorrect statistics.
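
For anyone trying to reproduce the report, the comparison it describes (parallelized Seq versus local Seq) can be sketched as below. This is only a sketch, not part of the original report: the object name SizeStatsRepro, the helper toExabytes, and the local[1] master are illustrative assumptions. It also rests on the assumption that Spark 2.3 has no real size information for an RDD-backed plan and falls back to spark.sql.defaultSizeInBytes, whose default is Long.MaxValue (about 8 EB), so the last line prints a comparison rather than asserting it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction harness; names are illustrative, not from the report.
object SizeStatsRepro {
  // Converts a byte count to binary exabytes (2^60 bytes).
  def toExabytes(bytes: BigInt): Double =
    (BigDecimal(bytes) / BigDecimal(BigInt(2).pow(60))).toDouble

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: a local session is enough to see the estimate
      .appName("SPARK-24756-repro")
      .getOrCreate()
    import spark.implicits._

    // Case from the report: DataFrame built on top of an RDD.
    val fromRdd = spark.sparkContext.parallelize(Seq("y")).toDF("y")
    // Case the reporter says behaves as expected: DataFrame built from a local Seq.
    val fromLocal = Seq("y").toDF("y")

    val rddSize = fromRdd.queryExecution.optimizedPlan.stats.sizeInBytes
    val localSize = fromLocal.queryExecution.optimizedPlan.stats.sizeInBytes

    println(s"RDD-backed estimate: $rddSize bytes (~${toExabytes(rddSize)} EB)")
    println(s"Local Seq estimate:  $localSize bytes")
    // If the fallback assumption holds, the RDD-backed estimate is at least Long.MaxValue.
    println(s"RDD-backed estimate >= Long.MaxValue: ${rddSize >= BigInt(Long.MaxValue)}")

    spark.stop()
  }
}
```

If the assumption is right, the 8.0 EB figure is a placeholder for "size unknown" rather than a miscalculation, which would also explain why a test that depends on the size estimate behaves differently for the parallelized case.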