kinolaev commented on issue #11648: URL: https://github.com/apache/iceberg/issues/11648#issuecomment-4103763216
I'd like to reopen this issue. I enabled the executor cache in `RewriteDataFilesSparkAction` and it did not cause any slowness or hangs. Instead, it saved me millions of requests to an S3 bucket.

Without the cache, Spark loads every delete file once per data file it applies to. For example, if a table has data files with id bounds [11, 20] and [20, 30] and a delete file with id bounds [15, 25], Spark will load that delete file twice. In my case, if I accidentally skip daily maintenance for several days, Spark makes hundreds of millions of unnecessary requests to S3. Here are my cache stats for 3 days of operation after 3 days without maintenance:

- hitCount=219680918, loadSuccessCount=434508
- hitCount=18618032, loadSuccessCount=106231
- hitCount=4895082, loadSuccessCount=24220

That is why I opened PR #15714, which reverts #13820 and #13868. The option to disable the cache is left untouched in case the problem still exists under some special conditions.

By the way, I found two cases where Spark gets stuck when the number of simultaneous connections is limited: #15712, #15713. Maybe the problem with the cache was only a side effect and not the root cause of the slowness.
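The duplicate-load pattern described above can be sketched as follows. This is a simplified illustration of the counting argument, not Iceberg's actual implementation; the data structures and names are hypothetical:

```python
# Hypothetical sketch: each data file whose id bounds overlap a delete
# file's bounds must apply that delete file, so without a shared cache
# the delete file is fetched from object storage once per overlapping
# data file.

def overlaps(a, b):
    """True if two inclusive [lo, hi] id ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

data_files = [(11, 20), (20, 30)]        # id bounds of two data files
delete_files = {"del-1": (15, 25)}       # one delete file overlapping both

# Without a cache: one storage request per overlapping
# (data file, delete file) pair.
uncached_loads = sum(
    1
    for bounds in data_files
    for d_bounds in delete_files.values()
    if overlaps(bounds, d_bounds)
)

# With a cache keyed by delete-file path: at most one load per delete file,
# no matter how many data files it applies to.
cache = {}
cached_loads = 0
for bounds in data_files:
    for path, d_bounds in delete_files.items():
        if overlaps(bounds, d_bounds) and path not in cache:
            cache[path] = True           # simulate fetching and caching
            cached_loads += 1

print(uncached_loads, cached_loads)      # 2 loads without cache, 1 with
```

With real tables the number of overlapping pairs grows with every skipped maintenance run, which is why the uncached request count reaches the hundreds of millions.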
