[ https://issues.apache.org/jira/browse/SPARK-22035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-22035.
----------------------------------
    Resolution: Incomplete

> the value of statistical logicalPlan.stats.sizeInBytes which is not expected
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-22035
>                 URL: https://issues.apache.org/jira/browse/SPARK-22035
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: caoxuewen
>            Priority: Major
>              Labels: bulk-closed
>
> Currently, the estimate assumes a node has the same number of rows as its
> child. The statistic `logicalPlan.stats.sizeInBytes` is computed by scaling
> the child's sizeInBytes by the ratio of the current node's row size to the
> child's row size, where each row size is the sum of the data types' default
> sizes. The resulting statistics are not very accurate.
> For example:
> ```
> val N = 1 << 3
> val df = spark.range(N).selectExpr("id as k1",
>   s"cast(id % 3 as string) as idString1",
>   s"cast((id + 1) % 5 as string) as idString3")
> val sizeInBytes = df.logicalPlan.stats.sizeInBytes
> println("sizeInBytes : " + sizeInBytes)
> ```
> Before the change:
> sizeInBytes is 224 (8 * 8 * ((8 + 20 + 20 + 8) / (8 + 8))).
> Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
> ```
> p.child.dataType: LongType defaultSize: 8
> p.dataType: LongType defaultSize: 8
> p.dataType: StringType defaultSize: 20
> p.dataType: StringType defaultSize: 20
> childRowSize: 16 outputRowSize: 56
> p.child.stats.sizeInBytes : 64
> p.stats.sizeInBytes : 224
> sizeInBytes: 224
> ```
> But sizeInBytes should be 384 (8 * (8 + 20 + 20)).
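
The two figures in the report (224 and 384) can be reproduced with plain arithmetic, without a Spark session. The following is only a sketch: the object and method names are hypothetical, and the 8-byte per-row overhead is an assumption inferred from the debug output above (childRowSize is 16 for a single LongType column, outputRowSize is 56 for 8 + 20 + 20). It is not Spark's actual implementation.

```
// Hypothetical helper names; this only mirrors the arithmetic quoted in the report.
object SizeEstimateSketch {
  // Assumption: per-row overhead inferred from "childRowSize: 16" for one 8-byte column.
  val rowOverhead: Long = 8L
  val longSize: Long = 8L    // LongType defaultSize, per the debug output
  val stringSize: Long = 20L // StringType defaultSize, per the debug output

  // Row size as the sum of column default sizes plus the assumed overhead.
  def rowSize(columnSizes: Seq[Long]): Long = columnSizes.sum + rowOverhead

  // Ratio-based estimate: scale the child's size by outputRowSize / childRowSize.
  def ratioEstimate(childSizeInBytes: BigInt,
                    childCols: Seq[Long],
                    outputCols: Seq[Long]): BigInt =
    childSizeInBytes * rowSize(outputCols) / rowSize(childCols)

  // Estimate the reporter expects: row count times the sum of column default sizes.
  def rowCountEstimate(rowCount: Long, outputCols: Seq[Long]): BigInt =
    BigInt(rowCount) * outputCols.sum

  def main(args: Array[String]): Unit = {
    val n = 1L << 3                                         // 8 rows, as in the example
    val childCols = Seq(longSize)                           // spark.range output: one long column
    val outputCols = Seq(longSize, stringSize, stringSize)  // k1, idString1, idString3
    val childSize = BigInt(n * longSize)                    // 64, as in p.child.stats.sizeInBytes

    println("ratio-based estimate : " + ratioEstimate(childSize, childCols, outputCols))
    println("row-count estimate   : " + rowCountEstimate(n, outputCols))
  }
}
```

With the inputs from the report, the first value is 64 * 56 / 16 = 224 and the second is 8 * 48 = 384. The gap between them comes from the assumed per-row overhead, which is included in both row sizes of the ratio but not in the row-count formula.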