[ https://issues.apache.org/jira/browse/SPARK-44199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738673#comment-17738673 ]
Ignite TC Bot commented on SPARK-44199:
---------------------------------------

User 'vihangk1' has created a pull request for this issue:

https://github.com/apache/spark/pull/41749

> CacheManager refreshes the fileIndex unnecessarily
> --------------------------------------------------
>
>                 Key: SPARK-44199
>                 URL: https://issues.apache.org/jira/browse/SPARK-44199
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Vihang Karajgaonkar
>            Priority: Major
>
> The CacheManager on this line
> [https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372]
> uses prefix-based string matching to decide which file indexes need to be
> refreshed. However, that can be incorrect if users have paths that are not
> subdirectories of each other but merely share a common prefix. Consider the
> function below:
>
> {code:java}
> private def refreshFileIndexIfNecessary(
>     fileIndex: FileIndex,
>     fs: FileSystem,
>     qualifiedPath: Path): Boolean = {
>   val prefixToInvalidate = qualifiedPath.toString
>   val needToRefresh = fileIndex.rootPaths
>     .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
>     .exists(_.startsWith(prefixToInvalidate))
>   if (needToRefresh) fileIndex.refresh()
>   needToRefresh
> }
> {code}
>
> If {{prefixToInvalidate}} is {{s3://bucket/mypath/table_dir}} and the file
> index has {{s3://bucket/mypath/table_dir_2/part=1}} among its root paths,
> then {{needToRefresh}} will be true and the file index gets refreshed
> unnecessarily. This is not just wasted CPU cycles: it can also cause query
> failures if there are access restrictions on the path being refreshed.
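A separator-aware comparison would avoid the false positive described above. The sketch below illustrates one possible shape of such a check, assuming Hadoop's {{Path.SEPARATOR}} constant; it is an illustration only and not necessarily the change made in the linked pull request:

{code:java}
// Minimal sketch of a separator-aware check (illustration only, not
// necessarily the fix in https://github.com/apache/spark/pull/41749).
// A root path matches only if it equals the invalidated path exactly,
// or lies strictly underneath it as a subdirectory.
private def refreshFileIndexIfNecessary(
    fileIndex: FileIndex,
    fs: FileSystem,
    qualifiedPath: Path): Boolean = {
  val prefixToInvalidate = qualifiedPath.toString
  val needToRefresh = fileIndex.rootPaths
    .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
    .exists { rootPath =>
      // s3://bucket/mypath/table_dir_2/part=1 does not start with
      // "s3://bucket/mypath/table_dir/", so it no longer matches.
      rootPath == prefixToInvalidate ||
        rootPath.startsWith(prefixToInvalidate + Path.SEPARATOR)
    }
  if (needToRefresh) fileIndex.refresh()
  needToRefresh
}
{code}

Appending the separator treats the invalidated path as a directory boundary rather than a raw string prefix, so sibling directories that share a name prefix are left alone while the table's own partitions are still refreshed.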