Thank you Takeshi. After executing df3.explain(true) I realised that the optimiser batches are being executed after all, including predicate pushdown.
I think that only the analyser batches are executed when the DataFrame is created by context.sql(query). It seems that the optimiser batches are executed only when some action such as collect or explain takes place.

scala> d3.explain(true)
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 't.id to id#0L
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 'u.id to id#1L
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 't.id to id#0L
16/05/13 02:08:12 DEBUG SQLContext$$anon$1:
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('t.id = 1)
   +- 'Join Inner, Some(('t.id = 'u.id))
      :- 'UnresolvedRelation `t`, None
      +- 'UnresolvedRelation `u`, None

== Analyzed Logical Plan ==
id: bigint, id: bigint
Project [id#0L,id#1L]
+- Filter (id#0L = cast(1 as bigint))
   +- Join Inner, Some((id#0L = id#1L))
      :- Subquery t
      :  +- Relation[id#0L] JSONRelation
      +- Subquery u
         +- Relation[id#1L] JSONRelation

== Optimized Logical Plan ==
Project [id#0L,id#1L]
+- Join Inner, Some((id#0L = id#1L))
   :- Filter (id#0L = 1)
   :  +- Relation[id#0L] JSONRelation
   +- Relation[id#1L] JSONRelation

== Physical Plan ==
Project [id#0L,id#1L]
+- BroadcastHashJoin [id#0L], [id#1L], BuildRight
   :- Filter (id#0L = 1)
   :  +- Scan JSONRelation[id#0L] InputPaths: file:/persons.json, PushedFilters: [EqualTo(id,1)]
   +- Scan JSONRelation[id#1L] InputPaths: file:/cars.json

2016-05-12 16:34 GMT+01:00 Takeshi Yamamuro <linguin....@gmail.com>:

> Hi,
>
> What's the result of `df3.explain(true)`?
>
> // maropu
>
> On Thu, May 12, 2016 at 10:04 AM, Telmo Rodrigues <
> telmo.galante.rodrig...@gmail.com> wrote:
>
>> I'm building Spark from the branch-1.6 source with `mvn -DskipTests
>> package` and I'm running the following code with the spark-shell.
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>
>> import sqlContext.implicits._
>>
>> val df = sqlContext.read.json("persons.json")
>> val df2 = sqlContext.read.json("cars.json")
>>
>> df.registerTempTable("t")
>> df2.registerTempTable("u")
>>
>> val d3 = sqlContext.sql("select * from t join u on t.id = u.id where t.id = 1")
>>
>> With the log4j root category level at WARN, the last printed messages
>> refer to the execution of the Batch Resolution rules.
>>
>> === Result of Batch Resolution ===
>> !'Project [unresolvedalias(*)]             Project [id#0L,id#1L]
>> !+- 'Filter ('t.id = 1)                    +- Filter (id#0L = cast(1 as bigint))
>> !   +- 'Join Inner, Some(('t.id = 'u.id))     +- Join Inner, Some((id#0L = id#1L))
>> !      :- 'UnresolvedRelation `t`, None          :- Subquery t
>> !      +- 'UnresolvedRelation `u`, None          :  +- Relation[id#0L] JSONRelation
>> !                                                +- Subquery u
>> !                                                   +- Relation[id#1L] JSONRelation
>>
>> I think that only the analyser rules are being executed.
>>
>> Shouldn't the optimiser rules also run in this case?
>>
>> 2016-05-11 19:24 GMT+01:00 Michael Armbrust <mich...@databricks.com>:
>>
>>>> logical plan after optimizer execution:
>>>>
>>>> Project [id#0L,id#1L]
>>>> !+- Filter (id#0L = cast(1 as bigint))
>>>> !   +- Join Inner, Some((id#0L = id#1L))
>>>> !      :- Subquery t
>>>> !      :  +- Relation[id#0L] JSONRelation
>>>> !      +- Subquery u
>>>> !         +- Relation[id#1L] JSONRelation
>>>
>>> I think you are mistaken. If this was the optimized plan there would be
>>> no subqueries.
>>
>
>
> --
> ---
> Takeshi Yamamuro
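P.S. The behaviour discussed in this thread can be sketched with a toy model. This is plain Scala, not Spark's actual code: the names mirror Catalyst's QueryExecution (analyzed, optimizedPlan, executedPlan, assertAnalyzed), but the bodies are illustrative stand-ins. The point is the `lazy val` pattern: analysis is forced eagerly when the DataFrame is created, while the optimiser and planner only run when an action such as collect or explain forces them.

```scala
// Toy model (NOT Spark's real code) of how Catalyst's QueryExecution
// defers work: each phase is a `lazy val`, so later phases only run
// once something (collect, explain, ...) forces them.
object LazyPlanDemo {
  class QueryExecution(sql: String) {
    // Records which phases have actually run, for demonstration only.
    val ran = scala.collection.mutable.Buffer[String]()

    lazy val analyzed: String      = { ran += "analyzer";  s"Analyzed($sql)" }
    lazy val optimizedPlan: String = { ran += "optimizer"; s"Optimized($analyzed)" }
    lazy val executedPlan: String  = { ran += "planner";   s"Physical($optimizedPlan)" }

    // sqlContext.sql(...) eagerly forces analysis, which is why only the
    // analyser batches show up in the logs at DataFrame creation time.
    def assertAnalyzed(): Unit = analyzed

    // explain(true) forces the whole pipeline, so the optimiser batches
    // (and predicate pushdown) finally appear.
    def explain(): String = executedPlan
  }

  def main(args: Array[String]): Unit = {
    val qe = new QueryExecution("select * from t join u on t.id = u.id")
    qe.assertAnalyzed()
    println(qe.ran.mkString(", "))   // analyzer
    qe.explain()
    println(qe.ran.mkString(", "))   // analyzer, optimizer, planner
  }
}
```

Note that forcing executedPlan does not re-run analysis: a `lazy val` is evaluated at most once, which matches seeing the analyser batches only at sql() time and the optimiser batches only at the first action.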