Yeah, sure, thanks for doing it!

On Sat, Jan 6, 2018 at 9:25 PM, Jacek Laskowski <> wrote:

> Hi Wenchen,
> I just stumbled across this comment in
> `LeafNode.computeStats` [1]:
> > Leaf nodes that can survive analysis must define their own statistics.
> And another in the scaladoc of `AnalysisBarrier` [2]:
> > This analysis barrier will be removed at the end of analysis stage.
> That makes a lot of sense now, and it makes `QueryExecution.analyzed` crucial
> (since an unanalyzed plan may still contain `AnalysisBarrier`s). Thanks again!
> Regarding your comment:
> > For this particular case, we can make it work by defining `computeStats`
> > in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
> > doesn't break any real use cases.
> Don't you think `AnalysisBarrier.computeStats` could simply delegate to
> `child.computeStats`? Even though "this doesn't break any real use cases",
> it would avoid questions like mine: less to worry about, and easier to
> understand. Mind if I propose a PR with the change (and another for the
> typo in the scaladoc where it says: "The SQL Analyzer goes through a whole
> query plan even most part of it is analyzed." [3])?
> [1]
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/
> plans/logical/LogicalPlan.scala#L231
> [2]
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/
> basicLogicalOperators.scala#L903
> [3]
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/
> basicLogicalOperators.scala#L895
> Regards,
> Jacek Laskowski
> ----
> Mastering Spark SQL
> Spark Structured Streaming
> Mastering Kafka Streams
> Follow me at
> On Thu, Jan 4, 2018 at 11:57 AM, Wenchen Fan <> wrote:
>> First of all, I think you know that `QueryExecution` is a developer API,
>> right? By definition, `QueryExecution.logical` is the input plan, which can
>> even be unresolved. Developers should be aware of that and not apply
>> operations that need the plan to be resolved. Obviously, `LogicalPlan.stats`
>> needs the plan to be resolved.
>> For this particular case, we can make it work by defining `computeStats`
>> in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
>> doesn't break any real use cases.
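
The option mentioned above, defining `computeStats` in `AnalysisBarrier`, could look roughly like the sketch below. It is a toy model, not Spark code: `Plan`, `Leaf`, `Statistics`, `LocalRelation`, and `AnalysisBarrier` are simplified stand-ins declared inside a hypothetical `DelegatingBarrierDemo` object, and the real Catalyst classes have different signatures.

```scala
// Toy model only: these types are simplified stand-ins, NOT Spark's Catalyst classes.
object DelegatingBarrierDemo {
  final case class Statistics(sizeInBytes: BigInt)

  sealed trait Plan { def computeStats: Statistics }

  // Default for leaves, modelled on the behaviour seen in the stack trace:
  // a leaf that does not define its own statistics throws.
  sealed trait Leaf extends Plan {
    def computeStats: Statistics = throw new UnsupportedOperationException
  }

  // A leaf that knows its own size.
  final case class LocalRelation(sizeInBytes: BigInt) extends Leaf {
    override def computeStats: Statistics = Statistics(sizeInBytes)
  }

  // The barrier is a leaf from the analyzer's point of view, but it can
  // still forward statistics requests to the plan it wraps.
  final case class AnalysisBarrier(child: Plan) extends Leaf {
    override def computeStats: Statistics = child.computeStats
  }

  // Assumed example value: a 48-byte relation hidden behind a barrier.
  val wrapped: Plan = AnalysisBarrier(LocalRelation(48))

  def main(args: Array[String]): Unit =
    println(wrapped.computeStats)
}
```

With the override in place, asking the barrier for statistics returns whatever the wrapped plan reports instead of hitting the throwing default.
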
>> On Thu, Jan 4, 2018 at 4:36 PM, Jacek Laskowski <> wrote:
>>> Hi,
>>> I'm using Spark built from master today.
>>> $ ./bin/spark-shell --version
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>>>       /_/
>>> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
>>> Branch master
>>> Compiled by user jacek on 2018-01-04T05:44:05Z
>>> Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8
>>> Can anyone explain why some queries have stats in the logical plan while
>>> others don't (so that I had to use the analyzed logical plan)?
>>> I can show the difference with code, but I don't know why the
>>> difference exists.
>>> spark.range(1000).write.parquet("/tmp/p1000")
>>> // The stats are available in logical plan (in logical "phase")
>>> scala>"/tmp/p1000").queryExecution.logical.stats
>>> res21: org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(sizeInBytes=6.9 KB, hints=none)
>>> // logical plan fails, but it worked fine above --> WHY?!
>>> val names = Seq((1, "one"), (2, "two")).toDF("id", "name")
>>> scala> names.queryExecution.logical.stats
>>> java.lang.UnsupportedOperationException
>>>   at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
>>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
>>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)
>>> // analyzed logical plan works fine
>>> scala> names.queryExecution.analyzed.stats
>>> res23: org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(sizeInBytes=48.0 B, hints=none)
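
A rough model of why the two calls above behave differently, assuming (as the replies in this thread suggest) that the unanalyzed plan still contains an `AnalysisBarrier` leaf with no statistics of its own, and that analysis removes it. Everything here is a simplified stand-in rather than Spark's actual classes; `eliminateBarriers` in particular is a hypothetical helper, not a Catalyst rule name taken from the source.

```scala
import scala.util.Try

// Toy model only: simplified stand-ins, NOT Spark's Catalyst classes.
object BarrierDemo {
  final case class Statistics(sizeInBytes: BigInt)

  sealed trait Plan { def computeStats: Statistics }

  // Leaves throw unless they define their own statistics.
  sealed trait Leaf extends Plan {
    def computeStats: Statistics = throw new UnsupportedOperationException
  }

  final case class LocalRelation(sizeInBytes: BigInt) extends Leaf {
    override def computeStats: Statistics = Statistics(sizeInBytes)
  }

  // No override here, so the barrier inherits the throwing default.
  final case class AnalysisBarrier(child: Plan) extends Leaf

  // A unary node that simply passes its child's statistics through.
  final case class Project(child: Plan) extends Plan {
    def computeStats: Statistics = child.computeStats
  }

  // Hypothetical helper: strips barriers, as analysis is said to do at its end.
  def eliminateBarriers(plan: Plan): Plan = plan match {
    case AnalysisBarrier(child) => eliminateBarriers(child)
    case Project(child)         => Project(eliminateBarriers(child))
    case leaf: LocalRelation    => leaf
  }

  // Assumed shapes: the "logical" plan still has a barrier, the "analyzed" one does not.
  val logical: Plan  = Project(AnalysisBarrier(LocalRelation(48)))
  val analyzed: Plan = eliminateBarriers(logical)

  def main(args: Array[String]): Unit = {
    println(Try(logical.computeStats))  // a Failure wrapping UnsupportedOperationException
    println(analyzed.computeStats)      // statistics of the underlying relation
  }
}
```

In this model the failure comes only from the barrier leaf. Once the barrier is gone, the statistics request reaches a leaf that defines its own statistics.
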
>>> Regards,
>>> Jacek Laskowski
