[ https://issues.apache.org/jira/browse/SPARK-44199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738673#comment-17738673 ]
Ignite TC Bot commented on SPARK-44199:
---------------------------------------

User 'vihangk1' has created a pull request for this issue:

https://github.com/apache/spark/pull/41749

> CacheManager refreshes the fileIndex unnecessarily
> --------------------------------------------------
>
>                 Key: SPARK-44199
>                 URL: https://issues.apache.org/jira/browse/SPARK-44199
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Vihang Karajgaonkar
>            Priority: Major
>
> The CacheManager on this line
> [https://github.com/apache/spark/blob/680ca2e56f2c8fc759743ad6755f6e3b1a19c629/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala#L372]
> uses prefix-based string matching to decide which file indexes need to be
> refreshed. However, that can be incorrect if users have paths that are not
> subdirectories of each other but merely share a common prefix. Consider the
> function below:
>
> {code:java}
> private def refreshFileIndexIfNecessary(
>     fileIndex: FileIndex,
>     fs: FileSystem,
>     qualifiedPath: Path): Boolean = {
>   val prefixToInvalidate = qualifiedPath.toString
>   val needToRefresh = fileIndex.rootPaths
>     .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
>     .exists(_.startsWith(prefixToInvalidate))
>   if (needToRefresh) fileIndex.refresh()
>   needToRefresh
> }
> {code}
>
> If {{prefixToInvalidate}} is {{s3://bucket/mypath/table_dir}} and the file
> index has {{s3://bucket/mypath/table_dir_2/part=1}} among its root paths,
> then {{needToRefresh}} will be true and the file index gets refreshed
> unnecessarily. This is not just wasted CPU cycles: it can also cause query
> failures if there are access restrictions on the path being refreshed.
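A separator-aware comparison would avoid the false positive described above. The sketch below illustrates one possible shape of such a check, assuming Hadoop's {{Path.SEPARATOR}} constant; it is an illustration only and not necessarily the change made in the linked pull request:

{code:java}
// Minimal sketch of a separator-aware check (illustration only, not
// necessarily the fix in https://github.com/apache/spark/pull/41749).
// A root path matches only if it equals the invalidated path exactly,
// or lies strictly underneath it as a subdirectory.
private def refreshFileIndexIfNecessary(
    fileIndex: FileIndex,
    fs: FileSystem,
    qualifiedPath: Path): Boolean = {
  val prefixToInvalidate = qualifiedPath.toString
  val needToRefresh = fileIndex.rootPaths
    .map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
    .exists { rootPath =>
      // s3://bucket/mypath/table_dir_2/part=1 does not start with
      // "s3://bucket/mypath/table_dir/", so it no longer matches.
      rootPath == prefixToInvalidate ||
        rootPath.startsWith(prefixToInvalidate + Path.SEPARATOR)
    }
  if (needToRefresh) fileIndex.refresh()
  needToRefresh
}
{code}

Appending the separator treats the invalidated path as a directory boundary rather than a raw string prefix, so sibling directories that share a name prefix are left alone while the table's own partitions are still refreshed.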