[ https://issues.apache.org/jira/browse/SPARK-18853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749397#comment-15749397 ]
Michael Allman commented on SPARK-18853:
----------------------------------------

[~rxin] [~hvanhovell] Should we move the overridden {{statistics}} method to {{Project}} before marking this issue as resolved?

> Project (UnaryNode) is way too aggressive in estimating statistics
> -------------------------------------------------------------------
>
>                 Key: SPARK-18853
>                 URL: https://issues.apache.org/jira/browse/SPARK-18853
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.0.3, 2.1.0
>
>
> We currently define statistics in UnaryNode:
> {code}
>   override def statistics: Statistics = {
>     // There should be some overhead in Row object, the size should not be zero when there is
>     // no columns, this help to prevent divide-by-zero error.
>     val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
>     val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
>     // Assume there will be the same number of rows as child has.
>     var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
>     if (sizeInBytes == 0) {
>       // sizeInBytes can't be zero, or sizeInBytes of BinaryNode will also be zero
>       // (product of children).
>       sizeInBytes = 1
>     }
>     child.statistics.copy(sizeInBytes = sizeInBytes)
>   }
> {code}
> This has a few issues:
> 1. This can aggressively underestimate the size for Project. We assume each array/map has 100 elements, which is an overestimate. If the user projects a single field out of a deeply nested field, this would lead to huge underestimation. A safer sane default is probably 1.
> 2. It is not a property of UnaryNode to propagate statistics this way. It should be a property of Project.
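For illustration only (not part of this comment or the merged patch), here is a minimal sketch of what moving the size estimation onto {{Project}} could look like, assuming the {{Statistics}}/{{LogicalPlan}} API shown in the quoted snippet and the Spark 2.x catalyst package layout. The class shell below is a simplified stand-in for Spark's actual {{Project}} node, shown only to indicate where the override would live.

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics, UnaryNode}

// Sketch: the size estimate lives on Project instead of UnaryNode.
// Simplified stand-in for Spark's existing Project; not the actual patch.
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
  extends UnaryNode {

  override def output: Seq[Attribute] = projectList.map(_.toAttribute)

  override def statistics: Statistics = {
    // Add a small constant so an empty schema never yields a zero row size,
    // which would cause a divide-by-zero below.
    val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
    val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
    // Scale the child's size by the ratio of output row width to child row
    // width, assuming the same number of rows as the child.
    var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
    if (sizeInBytes == 0) {
      // Keep sizeInBytes non-zero so binary nodes that multiply children's
      // sizes do not collapse to zero.
      sizeInBytes = 1
    }
    child.statistics.copy(sizeInBytes = sizeInBytes)
  }
}
{code}

With the override on {{Project}}, {{UnaryNode}} can fall back to simply propagating the child's statistics, which avoids applying the row-width scaling to unary operators where it does not apply.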