[ 
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19890:
-------------------------------
    Description: 
Currently the MetastoreRelation statistics is retrieved on the analyze phase, 
and the size is based on the table scope. But for partitioned table, this 
statistics is not useful as table size may 100x+ larger than the input 
partition size. As a result, the join optimization techniques is not applicable.

It would be great if we can postpone the statistics to the optimization phase 
to get partition information but before physical plan generation phase so that 
JoinSelection can choose better join methd (broadcast, shuffledjoin, or 
sortmerjoin).

Although the metastorerelation does not associated with partitions, but through 
PhysicalOperation we can get the partition info for the table. Multiple plan 
can use the same meatstorerelation, but the estimation is still much better 
than table size. This way, retrieving statistics is straightforward.

Another possible way is to have a another data structure associating the 
metastore relation and partitions with the plan to get most accurate estimation.

  was:
Currently the MetastoreRelation statistics is retrieved on the analyze phase, 
and the size is based on the table scope. But for partitioned table, this 
statistics is not useful as table size may 100x+ larger than the input 
partition size. As a result, the join optimization techniques is not applicable.

It would be great if we can postpone the statistics to the optimization phase 
to get partition information but before physical plan generation phase so that 
JoinSelection can choose better join methd (broadcast, shuffledjoin, or 
sortmerjoin).

Although the metastorerelation does not associated with partitions, but through 
PhysicalOperation we can get the partition info for the table. Although 
multiple plan can use the same meatstorerelation, but the estimation still much 
better than table size. This way, retrieving statistics is straightforward.

Another possible way is to have a another data structure associating the 
metastore relation and partitions with the plan to get most accurate estimation.


> Make MetastoreRelation statistics estimation more accurately
> ------------------------------------------------------------
>
>                 Key: SPARK-19890
>                 URL: https://issues.apache.org/jira/browse/SPARK-19890
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Zhan Zhang
>            Priority: Minor
>
> Currently the MetastoreRelation statistics is retrieved on the analyze phase, 
> and the size is based on the table scope. But for partitioned table, this 
> statistics is not useful as table size may 100x+ larger than the input 
> partition size. As a result, the join optimization techniques is not 
> applicable.
> It would be great if we can postpone the statistics to the optimization phase 
> to get partition information but before physical plan generation phase so 
> that JoinSelection can choose better join methd (broadcast, shuffledjoin, or 
> sortmerjoin).
> Although the metastorerelation does not associated with partitions, but 
> through PhysicalOperation we can get the partition info for the table. 
> Multiple plan can use the same meatstorerelation, but the estimation is still 
> much better than table size. This way, retrieving statistics is 
> straightforward.
> Another possible way is to have a another data structure associating the 
> metastore relation and partitions with the plan to get most accurate 
> estimation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to