Srini E created SPARK-29335:
-------------------------------

             Summary: Cost Based Optimizer stats are not used while evaluating 
query plans in Spark Sql
                 Key: SPARK-29335
                 URL: https://issues.apache.org/jira/browse/SPARK-29335
             Project: Spark
          Issue Type: Question
          Components: Optimizer
    Affects Versions: 2.3.0
         Environment: We tried to execute the same query through spark-sql and 
through the Thrift server (using SQLWorkbench over JDBC), but in neither case 
are the stats used.
            Reporter: Srini E


We are trying to leverage CBO to get better plans for a few critical queries 
run through spark-sql or through the Thrift server using the JDBC driver.

The following settings were added to spark-defaults.conf:

*spark.sql.cbo.enabled true* 

*spark.experimental.extrastrategies intervaljoin* 

*spark.sql.cbo.joinReorder.enabled true* 
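As a quick sanity check, the effective values can be echoed from the running 
spark-sql session (in Spark SQL, `SET key` with no value prints the current 
setting):

```sql
-- If these do not come back as "true", the entries in spark-defaults.conf
-- were not picked up by this session (e.g. a typo in the key name).
SET spark.sql.cbo.enabled;
SET spark.sql.cbo.joinReorder.enabled;
```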

 

The tables that we are using are not partitioned.

Spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;

analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, 
ptype, absref, fpid , pid ;

analyze table arrow.t_fdata_sundar compute statistics ;

analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
absref;

The ANALYZE commands completed successfully.
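Assuming the ANALYZE commands succeeded, the collected stats should be 
inspectable from the same spark-sql session (DESCRIBE EXTENDED with a trailing 
column name prints column-level stats in Spark 2.3+):

```sql
-- Table-level stats: look for a "Statistics" row (sizeInBytes, rowCount)
describe extended arrow.t_fperiods_sundar;

-- Column-level stats for one analyzed column (min, max, num_nulls,
-- distinct_count); NULLs here mean the column stats were not persisted
describe extended arrow.t_fperiods_sundar eid;
```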

DESCRIBE EXTENDED does not show column-level stats, and queries are not 
leveraging table-level or column-level stats.

We are using Oracle as our Hive metastore backend, not Glue.

 

+*When we run queries through spark-sql, we do not see the stats being used in 
the explain plan, and we are not sure whether CBO is actually applied.*+ 

 

*A quick response would be helpful.*

 

*Explain Plan:*

The following EXPLAIN command output does not reference any statistics:
 
spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
 
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID: 
decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 3 
more fields>
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(absref#4569),(absref#4569 = 
Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM: 
string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: string 
... 3 more fields>
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
== Parsed Logical Plan ==
'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
+- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
(('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
('a12.eid = 29940)) && isnull('a12.PID)))
 +- 'Join Inner
 :- 'SubqueryAlias a12
 : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
 +- 'SubqueryAlias a13
 +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
 
== Analyzed Logical Plan ==
imnem: string, fvalue: string, ptype: string, absref: string
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 = 
2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
 +- Join Inner
 :- SubqueryAlias a12
 : +- SubqueryAlias t_fperiods_sundar
 : +- 
Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
 parquet
 +- SubqueryAlias a13
 +- SubqueryAlias t_fdata_sundar
 +- 
Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
 parquet
 
== Optimized Logical Plan ==
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as 
decimal(38,0))))
 :- Project [FPID#4523, EID#4542, PTYPE#4546]
 : +- Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) && 
isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 
29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
 : +- 
Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
 parquet
 +- Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
 +- Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) && 
isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
 +- 
Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
 parquet
 
== Physical Plan ==
*(2) Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- *(2) BroadcastHashJoin [eid#4542, fpid#4523], [eid#4566, cast(fpid#4567 as 
decimal(38,0))], Inner, BuildLeft
 :- BroadcastExchange HashedRelationBroadcastMode(List(input[1, decimal(10,0), 
true], input[0, decimal(38,0), true]))
 : +- *(1) Project [FPID#4523, EID#4542, PTYPE#4546]
 : +- *(1) Filter (((((((isnotnull(ptype#4546) && isnotnull(year#4545)) && 
isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 
29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
 : +- *(1) FileScan parquet 
arrow.t_fperiods_sundar[FPID#4523,PID#4527,EID#4542,YEAR#4545,PTYPE#4546] 
Batched: true, Format: Parquet, Location: 
InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FPERIODS_S...,
 PartitionFilters: [], PushedFilters: [IsNotNull(PTYPE), IsNotNull(YEAR), 
IsNotNull(EID), EqualTo(YEAR,2017), EqualTo(PTYPE,A), EqualTo..., ReadSchema: 
struct<FPID:decimal(38,0),PID:string,EID:decimal(10,0),YEAR:int,PTYPE:string>
 +- *(2) Project [IMNEM#4548, FVALUE#4552, EID#4566, FPID#4567, ABSREF#4569]
 +- *(2) Filter ((((isnotnull(absref#4569) && (absref#4569 = Y2017)) && 
isnotnull(fpid#4567)) && isnotnull(eid#4566)) && (eid#4566 = 29940))
 +- *(2) FileScan parquet 
arrow.t_fdata_sundar[IMNEM#4548,FVALUE#4552,EID#4566,FPID#4567,ABSREF#4569] 
Batched: true, Format: Parquet, Location: 
InMemoryFileIndex[s3://spr-idf-uat-processed/replication/base/APPLICATION_SCHEMA/ARROW/FDATA_SUNDAR],
 PartitionFilters: [], PushedFilters: [IsNotNull(ABSREF), 
EqualTo(ABSREF,Y2017), IsNotNull(FPID), IsNotNull(EID), EqualTo(EID,29940)], 
ReadSchema: 
struct<IMNEM:string,FVALUE:string,EID:decimal(10,0),FPID:decimal(10,0),ABSREF:string>
Time taken: 0.35 seconds, Fetched 1 row(s)
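One caveat worth noting: EXPLAIN EXTENDED does not print Statistics annotations 
even when CBO is working, so their absence above does not by itself prove that 
the stats are unused. EXPLAIN COST (available since Spark 2.3, as far as I 
know) renders the optimized logical plan with its statistics, e.g.:

```sql
-- Each operator of the optimized logical plan is annotated with
-- "Statistics(sizeInBytes=..., rowCount=...)"; a rowCount only shows up
-- when CBO is enabled and the ANALYZE-collected stats are actually in use.
explain cost select a13.imnem, a13.fvalue, a12.ptype, a13.absref
from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13
where a12.eid = a13.eid and a12.fpid = a13.fpid
  and a13.absref = 'Y2017' and a12.year = 2017
  and a12.ptype = 'A' and a12.eid = 29940 and a12.pid is null;
```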



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
