[ https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim resolved SPARK-44924.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45362
[https://github.com/apache/spark/pull/45362]

> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
>                 Key: SPARK-44924
>                 URL: https://issues.apache.org/jira/browse/SPARK-44924
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: kevin nacios
>            Assignee: kevin nacios
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files was added to Structured Streaming to reduce the cost of relisting files from the filesystem on each batch. The settings that drive this caching are currently hardcoded, and there is no way to change them.
>
> This impacts some of our workloads where we process large datasets and it's unknown how "heavy" individual files are, so a single batch can take a long time. When we set maxFilesPerTrigger to 100k files, a subsequent batch uses only the cached maximum of 10k files, which makes the job take longer: the cluster is capable of handling 100k files but is stuck doing 10% of the workload. The benefit of the caching doesn't outweigh its performance cost on the rest of the job.
>
> With config settings available for this, we could either absorb some increased driver memory usage to cache the next 100k files, or disable caching entirely and relist files on each batch by setting the cache size to 0.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
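The batching behavior the reporter describes can be sketched in plain Python. This is an illustrative model only, not Spark's actual FileStreamSource internals: the class name `FileSourceCache` and the parameter `max_cached_files` are hypothetical stand-ins for the previously hardcoded limit that SPARK-44924 makes configurable.

```python
# Illustrative model of the file-listing cache behavior described in the
# issue. Names here (FileSourceCache, max_cached_files) are hypothetical,
# not Spark's real internals.

class FileSourceCache:
    def __init__(self, max_cached_files=10_000):
        # max_cached_files=0 disables caching: every batch relists files.
        self.max_cached_files = max_cached_files
        self._cached = []

    def next_batch(self, max_files_per_trigger, list_files):
        """Return up to max_files_per_trigger files, draining the cache first."""
        if self._cached:
            # A cached batch is served even if it is smaller than
            # max_files_per_trigger -- this is the 10k-vs-100k problem.
            batch = self._cached[:max_files_per_trigger]
            self._cached = self._cached[max_files_per_trigger:]
            return batch
        listed = list_files()  # expensive filesystem listing
        batch = listed[:max_files_per_trigger]
        # Cache leftover listed files for later batches, up to the cap.
        self._cached = listed[max_files_per_trigger:][:self.max_cached_files]
        return batch
```

With the cap hardcoded at 10k, a job configured with maxFilesPerTrigger=100k alternates between full 100k batches (fresh listing) and small 10k batches (cache drain); setting the cap to 0 in this model forces a relisting every batch, matching the "disable caching" option the issue asks for.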