[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated

Asif (Jira) Fri, 29 Sep 2023 15:15:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Asif updated SPARK-45373:
-------------------------
    Description: 
In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets converted 
to an InMemoryFileIndex, the HMS calls can get very expensive if:
1) The translated filter string for push-down to the HMS layer becomes empty, 
resulting in all partitions being fetched, and the same table is referenced 
multiple times in the query. 
2) The same table is referenced multiple times in the query with 
different partition filters.
In such cases the current code results in multiple calls to the HMS layer. 
This can be avoided by grouping the tables based on CatalogFileIndex, 
passing a common minimum filter (filter1 || filter2), and getting a base 
PrunedInMemoryFileIndex, which can then serve as the basis for each specific 
table reference.

Opened the following PR for this ticket:
[SPARK-45373-PR|https://github.com/apache/spark/pull/43183]
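
The grouping idea described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code from the PR and not Spark's API: FakeMetastore, prune_per_scan and prune_grouped are hypothetical names, partitions are plain dicts, and partition filters are Python predicates. It shows the baseline making one metastore call per table reference, and the grouped variant making one call per distinct table with the OR of that table's filters, then narrowing the shared result locally for each reference.

```python
from collections import defaultdict


class FakeMetastore:
    """Hypothetical stand-in for HMS that counts partition-listing calls."""

    def __init__(self, partitions_by_table):
        self._partitions = partitions_by_table
        self.calls = 0

    def list_partitions(self, table, predicate):
        self.calls += 1
        return [p for p in self._partitions[table] if predicate(p)]


def prune_per_scan(metastore, scans):
    """Baseline: one metastore call for every (table, filter) reference."""
    return [metastore.list_partitions(table, pred) for table, pred in scans]


def prune_grouped(metastore, scans):
    """Grouped: one metastore call per distinct table, using the OR of its filters."""
    by_table = defaultdict(list)
    for idx, (table, pred) in enumerate(scans):
        by_table[table].append((idx, pred))

    results = [None] * len(scans)
    for table, entries in by_table.items():
        preds = [pred for _, pred in entries]
        # filter1 || filter2 || ...: a superset of what any single reference needs.
        base = metastore.list_partitions(
            table, lambda p, ps=preds: any(f(p) for f in ps)
        )
        # Each reference narrows the shared base result locally,
        # without going back to the metastore.
        for idx, pred in entries:
            results[idx] = [p for p in base if pred(p)]
    return results


if __name__ == "__main__":
    partitions = {"sales": [{"year": y} for y in range(2018, 2024)]}
    scans = [
        ("sales", lambda p: p["year"] == 2020),
        ("sales", lambda p: p["year"] >= 2022),
    ]
    baseline, grouped = FakeMetastore(partitions), FakeMetastore(partitions)
    assert prune_per_scan(baseline, scans) == prune_grouped(grouped, scans)
    print(baseline.calls, grouped.calls)
```

In this model, a filter that cannot be pushed down (case 1) corresponds to a predicate that always returns True: the combined filter then fetches every partition, but still only once per table.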

  was:
In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets converted 
to an InMemoryFileIndex, the HMS calls can get very expensive if:
1) The translated filter string for push-down to the HMS layer becomes empty, 
resulting in all partitions being fetched, and the same table is referenced 
multiple times in the query. 
2) The same table is referenced multiple times in the query with 
different partition filters.
In such cases the current code results in multiple calls to the HMS layer. 
This can be avoided by grouping the tables based on CatalogFileIndex, 
passing a common minimum filter (filter1 || filter2), and getting a base 
PrunedInMemoryFileIndex, which can then serve as the basis for each specific 
table reference.


> Minimizing calls to HiveMetaStore layer for getting partitions,  when tables 
> are repeated
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-45373
>                 URL: https://issues.apache.org/jira/browse/SPARK-45373
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Asif
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.1
>
>
> In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets 
> converted to an InMemoryFileIndex, the HMS calls can get very expensive if:
> 1) The translated filter string for push-down to the HMS layer becomes empty, 
> resulting in all partitions being fetched, and the same table is referenced 
> multiple times in the query. 
> 2) The same table is referenced multiple times in the query with 
> different partition filters.
> In such cases the current code results in multiple calls to the HMS layer. 
> This can be avoided by grouping the tables based on CatalogFileIndex, 
> passing a common minimum filter (filter1 || filter2), and getting a base 
> PrunedInMemoryFileIndex, which can then serve as the basis for each specific 
> table reference.
> Opened the following PR for this ticket:
> [SPARK-45373-PR|https://github.com/apache/spark/pull/43183]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
