[ 
https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-44924.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45362
[https://github.com/apache/spark/pull/45362]

> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
>                 Key: SPARK-44924
>                 URL: https://issues.apache.org/jira/browse/SPARK-44924
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: kevin nacios
>            Assignee: kevin nacios
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed
> files was added to Structured Streaming to reduce the cost of relisting from
> the filesystem each batch.  The settings that drive this are currently
> hardcoded and there is no way to change them.
>  
> This impacts some of our workloads where we process large datasets where it's
> unknown how "heavy" some files are, so a single batch can take a long
> time.  When we set maxFilesPerTrigger to 100k files, a subsequent batch
> using the cached max of 10k files causes the job to take longer, since the
> cluster is capable of handling the 100k files but is stuck doing 10% of the
> workload.  The benefit of the caching doesn't outweigh its cost to the
> performance of the rest of the job.
>  
> With config settings available for this, we could either absorb some 
> increased driver memory usage for caching the next 100k files, or opt to 
> disable caching entirely and just relist files each batch by setting the 
> cache amount to 0.
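
The batch-size effect described above can be illustrated with a small,
self-contained model. This is only a sketch under stated assumptions: it is
not Spark's FileStreamSource code, the names plan_batch, simulate and
max_cached_files are hypothetical, and it ignores details such as the
threshold below which leftover files are discarded instead of cached. It
assumes that a batch reads from the cache whenever the cache is non-empty,
and relists only when the cache is empty.

```python
from typing import List, Tuple


def plan_batch(cached: List[int], pending: List[int],
               max_files_per_trigger: int,
               max_cached_files: int) -> Tuple[List[int], List[int]]:
    """Return (files for this batch, files cached for the next batch)."""
    # Assumption: a non-empty cache is used instead of relisting.
    candidates = cached if cached else pending
    batch = candidates[:max_files_per_trigger]
    leftover = candidates[max_files_per_trigger:]
    # Only a capped number of leftover files is kept for the next batch.
    return batch, leftover[:max_cached_files]


def simulate(total_files: int, max_files_per_trigger: int,
             max_cached_files: int, batches: int) -> List[int]:
    """Return the number of files processed in each batch."""
    pending = list(range(total_files))  # stand-in for unprocessed file paths
    cached: List[int] = []
    sizes = []
    for _ in range(batches):
        batch, cached = plan_batch(cached, pending,
                                   max_files_per_trigger, max_cached_files)
        # The batch is always a prefix of the pending files in this model.
        pending = pending[len(batch):]
        sizes.append(len(batch))
    return sizes


if __name__ == "__main__":
    # Cache capped at 10k: every second batch does 10% of the work.
    print(simulate(500_000, 100_000, 10_000, 4))
    # Cache sized to match maxFilesPerTrigger: full batches, more memory.
    print(simulate(500_000, 100_000, 100_000, 4))
    # Cache size 0: caching disabled, every batch relists.
    print(simulate(500_000, 100_000, 0, 4))
```

In this model, a 10k cache combined with a 100k trigger limit alternates
between full and 10k batches, while a 100k cache or a cache size of 0 keeps
every batch at 100k files, which matches the two options proposed in the
description.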



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
