Zhan Zhang created SPARK-19890: ---------------------------------- Summary: Make MetastoreRelation statistics estimation more accurately Key: SPARK-19890 URL: https://issues.apache.org/jira/browse/SPARK-19890 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor
Currently the MetastoreRelation statistics is retrieved on the analyze phase, and the size is based on the table scope. But for partitioned table, this statistics is not useful as table size may 100x+ larger than the input partition size. As a result, the join optimization techniques is not applicable. It would be great if we can postpone the statistics to the optimization phase to get partition information but before physical plan generation phase so that JoinSelection can choose better join methd (broadcast, shuffledjoin, or sortmerjoin). Although the metastorerelation does not associated with partitions, but through PhysicalOperation we can get the partition info for the table. Although multiple plan can use the same meatstorerelation, but the estimation still much better than table size. This way, retrieving statistics is straightforward. Another possible way is to have a another data structure associating the metastore relation and partitions with the plan to get most accurate estimation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org