caoxuewen created SPARK-22035:
---------------------------------

             Summary: Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected
                 Key: SPARK-22035
                 URL: https://issues.apache.org/jira/browse/SPARK-22035
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: caoxuewen

Currently, a unary node is assumed to have the same number of rows as its child, and `logicalPlan.stats.sizeInBytes` is calculated by scaling the child's size by the ratio of the output row size to the child row size, where each row size is the sum of the default sizes of its data types. The problem is that the resulting statistics are not very accurate. For example:

```
// Note: logicalPlan is not part of the public Dataset API; the snippet
// as written is assumed to run from code inside Spark's own sql package
// (for example a test suite).
val N = 1 << 3  // 8 rows
val df = spark.range(N).selectExpr(
  "id as k1",
  "cast(id % 3 as string) as idString1",
  "cast((id + 1) % 5 as string) as idString3")
val sizeInBytes = df.logicalPlan.stats.sizeInBytes
println("sizeInBytes : " + sizeInBytes)
```

Before the change, sizeInBytes is 224 (8 * 8 * ((8 + 20 + 20 + 8) / (8 + 8))). Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:

```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 16
outputRowSize: 56
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 224
sizeInBytes: 224
```

After the change, sizeInBytes is 384 (8 * 8 * ((8 + 20 + 20) / 8)). Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:

```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 8
outputRowSize: 48
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 384
sizeInBytes: 384
```
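
The arithmetic behind the two numbers can be reproduced without Spark. The following is a minimal sketch, not the code in `SizeInBytesOnlyStatsPlanVisitor`: the object name `SizeEstimateSketch`, the helper `estimate`, and the constant `RowOverhead` are hypothetical, and the extra 8 in the "before" formula is assumed here to be a fixed per-row term added to both row sizes.

```scala
// Standalone model of the estimate; no Spark dependency.
// All names here are illustrative, not Spark APIs.
object SizeEstimateSketch {
  // Default sizes taken from the debug output in this issue.
  val LongSize = 8
  val StringSize = 20
  // Assumption: the "+ 8" in the "before" formula is a fixed per-row term.
  val RowOverhead = 8

  /** Scale the child's size by outputRowSize / childRowSize. */
  def estimate(childSizeInBytes: BigInt, childRowSize: Int, outputRowSize: Int): BigInt =
    childSizeInBytes * outputRowSize / childRowSize

  def main(args: Array[String]): Unit = {
    val childSize = BigInt(8 * 8)                       // 8 rows, one LongType column
    val outputColumns = LongSize + StringSize + StringSize // 8 + 20 + 20 = 48

    // Before: the per-row term is included in both row sizes (16 and 56).
    val before = estimate(childSize, LongSize + RowOverhead, outputColumns + RowOverhead)
    // After: only the column default sizes are compared (8 and 48).
    val after = estimate(childSize, LongSize, outputColumns)

    println(s"before: $before") // before: 224
    println(s"after: $after")   // after: 384
  }
}
```

Multiplying before dividing keeps the integer arithmetic exact for these inputs (64 * 56 / 16 and 64 * 48 / 8).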