Yea, sure, thanks for doing it!

On Sat, Jan 6, 2018 at 9:25 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi Wenchen,
>
> It was just now that I stumbled across this comment in
> `LeafNode.computeStats` [1]:
>
> > Leaf nodes that can survive analysis must define their own statistics.
>
> And the other in the scaladoc of AnalysisBarrier [2]:
>
> > This analysis barrier will be removed at the end of analysis stage.
>
> That makes a lot of sense now and makes QueryExecution.analyzed crucial
> (since there could be AnalysisBarriers in a plan). Thanks again!
>
> Regarding your comment:
>
> > For this particular case, we can make it work by defining `computeStats`
> > in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
> > doesn't break any real use cases.
>
> Don't you think that `AnalysisBarrier.computeStats` could just dispatch
> to child.computeStats? Although "this doesn't break any real use cases",
> it would avoid questions like this one. Less to worry about, and it would
> make things more comprehensible. Mind if I proposed a PR with the change
> (and another for the typo in the scaladoc where it says: "The SQL Analyzer
> goes through a whole query plan even most part of it is analyzed." [3])?
>
> [1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L231
> [2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L903
> [3] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L895
>
> Regards,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Thu, Jan 4, 2018 at 11:57 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> First of all, I think you know that `QueryExecution` is a developer API,
>> right? By definition `QueryExecution.logical` is the input plan, which can
>> even be unresolved. Developers should be aware of that and should not apply
>> operations that need the plan to be resolved. Obviously `LogicalPlan.stats`
>> needs the plan to be resolved.
>>
>> For this particular case, we can make it work by defining `computeStats`
>> in `AnalysisBarrier`. But it's also OK to just leave it as it is, as this
>> doesn't break any real use cases.
>>
>> On Thu, Jan 4, 2018 at 4:36 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> Hi,
>>>
>>> I use Spark from the master today.
>>>
>>> $ ./bin/spark-shell --version
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/ '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>>>       /_/
>>>
>>> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_152
>>> Branch master
>>> Compiled by user jacek on 2018-01-04T05:44:05Z
>>> Revision 7d045c5f00e2c7c67011830e2169a4e130c3ace8
>>>
>>> Can anyone explain why some queries have stats in the logical plan while
>>> others don't (and I had to use the analyzed logical plan)?
>>>
>>> I can explain the difference using the code, but I don't know why there
>>> is a difference.
>>>
>>> spark.range(1000).write.parquet("/tmp/p1000")
>>>
>>> // The stats are available in the logical plan (in the logical "phase")
>>> scala> spark.read.parquet("/tmp/p1000").queryExecution.logical.stats
>>> res21: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=6.9 KB, hints=none)
>>>
>>> // The logical plan fails, but it worked fine above --> WHY?!
>>> val names = Seq((1, "one"), (2, "two")).toDF("id", "name")
>>> scala> names.queryExecution.logical.stats
>>> java.lang.UnsupportedOperationException
>>>   at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:232)
>>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:55)
>>>   at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:27)
>>>
>>> // The analyzed logical plan works fine
>>> scala> names.queryExecution.analyzed.stats
>>> res23: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=48.0 B, hints=none)
>>>
>>> Regards,
>>> Jacek Laskowski
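
The delegation discussed in the thread can be sketched without Spark. The following is a minimal, self-contained illustration in plain Scala; `Stats`, `Node`, `Relation`, and `Barrier` are made-up stand-ins for Spark's `Statistics`, `LogicalPlan`, leaf relations, and `AnalysisBarrier`, not the actual classes. It shows why a barrier that inherits the throwing default breaks `stats`, and why forwarding to the child fixes it:

```scala
// Illustrative sketch only -- made-up stand-ins for Spark's internals.
final case class Stats(sizeInBytes: Long)

trait Node {
  def children: Seq[Node]
  // Mirrors the LeafNode contract quoted above: nodes that define no
  // statistics of their own throw, like names.queryExecution.logical.stats.
  def computeStats: Stats =
    throw new UnsupportedOperationException(s"$this defines no statistics")
}

// A leaf that knows its own size, like a file-based relation after analysis.
final case class Relation(sizeInBytes: Long) extends Node {
  def children: Seq[Node] = Nil
  override def computeStats: Stats = Stats(sizeInBytes)
}

// The proposed change: the barrier forwards to its child instead of
// inheriting the throwing default, so stats survive until the barrier
// is removed at the end of analysis.
final case class Barrier(child: Node) extends Node {
  def children: Seq[Node] = Seq(child)
  override def computeStats: Stats = child.computeStats
}
```

With this delegation, `Barrier(Relation(48)).computeStats` yields `Stats(48)` instead of throwing, which is the behaviour the thread observes only after the barrier has been stripped by the analyzer.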