[ https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prashant Sharma resolved SPARK-27664.
-------------------------------------
    Resolution: Won't Fix

I am marking this as Won't Fix because it is now difficult to reproduce. In version 2.3.x the problem was more evident, but after the merge of SPARK-23896, versions 2.4.x and above do not do much re-listing in the general case. However, the re-listing problem still exists (in version 2.4.3 and the current, unreleased 3.0.0), and the following code can be used to reproduce it.

{code:java}
// First, create the object store data for testing.
spark.range(0, 1000000, 1, 100000)
  .selectExpr("id", "id < 100 as p")
  .write.partitionBy("p")
  .save("<object store bucket>")

// Then the following commands reproduce it (timestamps shown).
// 19/05/24 03:07:56
val s = s"""
  |CREATE EXTERNAL TABLE test11(id bigint)
  |PARTITIONED BY (p boolean)
  |STORED AS parquet
  |LOCATION '<object store bucket>'""".stripMargin
spark.sql(s)
spark.sql("ALTER TABLE test11 ADD PARTITION (p=true)")
spark.sql("ALTER TABLE test11 ADD PARTITION (p=false)")
spark.sql("SELECT * FROM test11 WHERE id < 10").show()
// 19/05/24 03:50:43
spark.sql("SELECT * FROM test11 WHERE id < 100").show()
// 19/05/24 04:28:19
{code}

As the timestamps above show, the overall time taken is much more than the time taken for one extra re-listing, so the difference in performance is hard to notice. This issue, along with the fix, can be reconsidered later if the problem resurfaces with larger impact.

> Performance issue with FileStatusCache, while reading from object stores.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-27664
>                 URL: https://issues.apache.org/jira/browse/SPARK-27664
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 2.4.3
>            Reporter: Prashant Sharma
>            Priority: Major
>
> In short, this issue (i.e. degraded performance) surfaces when the number of files is large (> 100K) and the data is stored on an object store or other remote storage.
> The actual issue: everything is inserted as a single entry in the FileStatusCache (a guava cache), which does not fit unless the cache is configured to be very large, around 4x the actual size needed. Reason: [https://github.com/google/guava/issues/3462].
>
> Full story, with possible solutions.
> When we read a directory in Spark with
> {code:java}
> spark.read.parquet("/dir/data/test").limit(1).show()
> {code}
> behind the scenes it fetches the FileStatus objects and caches them inside a FileStatusCache, so that it does not need to re-fetch them. Internally, it scans using the listLeafFiles function at the driver.
> Inside the cache, the entire content of the listing, an array of FileStatus objects, is inserted as a single entry with the key "/dir/data/test". The default size of this cache is 250MB, and it is configurable. The underlying cache is a guava cache.
> The guava cache has one interesting property: a single entry can only be as large as
> {code:java}
> maximumSize / concurrencyLevel
> {code}
> (see [https://github.com/google/guava/issues/3462] for details). So for a cache size of 250MB, a single entry can be at most 250MB/4, since the default concurrency level in guava is 4. That is around 62MB, which is good enough for most datasets, but it does not work well for directories with larger listings. The effect is especially evident when such listings come from object stores like Amazon S3 or IBM Cloud Object Storage.
> So, currently one can work around this problem by setting the cache size (i.e. `spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it needs to be more than 4x what is actually required. But this has a drawback: either the driver must be started with more memory than required, or one risks an OOM, because the cache does not evict older entries when the size is configured at 4x.
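The per-entry cap described above can be illustrated with a small arithmetic sketch (plain Scala, no guava dependency; `maxSingleEntryBytes` is a hypothetical helper named here for illustration, and the sizes are the defaults mentioned in this issue):

```scala
// Guava's LRU cache is split into `concurrencyLevel` segments, and the
// maximum size/weight is divided evenly among them, so a single entry
// larger than one segment's share cannot be retained (guava issue #3462).
object GuavaEntryCap {
  def maxSingleEntryBytes(maximumSizeBytes: Long, concurrencyLevel: Int): Long =
    maximumSizeBytes / concurrencyLevel

  def main(args: Array[String]): Unit = {
    // Spark's default spark.sql.hive.filesourcePartitionFileCacheSize: 250MB.
    val defaultCacheBytes = 250L * 1024 * 1024
    val cap = maxSingleEntryBytes(defaultCacheBytes, concurrencyLevel = 4)
    // Largest listing that fits as one entry, in MB.
    println(cap / (1024 * 1024)) // prints 62
  }
}
```

This is also why the workaround needs the cache configured to more than 4x the listing size: only `1/concurrencyLevel` of the configured maximum is available to any single entry.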
> In order to fix this issue, we can take 3 different approaches:
> 1) A stop-gap fix: reduce the concurrency level of the guava cache to 1. For a few entries of very large size, we do not lose much by doing this.
> 2) Alternatively, divide the input array into multiple entries in the cache, instead of inserting everything against a single key. This can be done using directories as keys when there are multiple nested directories under a directory; but if a user has everything listed under a single directory, this solution does not help, and we cannot depend on users creating partitions in their hive/sql tables.
> 3) One more alternative: make the concurrency level configurable for those who want to change it, and while inserting an entry into the cache, divide it into `concurrencyLevel` (or even 2x or 3x of it) parts before inserting. This way the cache performs optimally, and even if there is an eviction, only a part of the entries is evicted, as opposed to all of the entries in the current implementation. How many entries are evicted due to size depends on the configured concurrencyLevel. This approach can be taken even without making `concurrencyLevel` configurable.
> The problem with this approach is that the sub-entries in the cache are of no use on their own: if even one part is evicted, all the other parts for that key must also be evicted, otherwise the results would be wrong.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
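Approach 3 above, including the all-or-nothing read that its last paragraph requires, can be sketched as follows. This is a minimal illustration with a plain `Map` standing in for the guava cache and `Long` ids standing in for FileStatus objects; `putLeafFiles`/`getLeafFiles` are hypothetical names, not Spark's actual FileStatusCache API:

```scala
import scala.collection.mutable

object ChunkedCacheSketch {
  // Stand-in for the guava cache; real eviction would remove arbitrary keys.
  private val cache = mutable.Map.empty[String, Array[Long]]

  // Split one large listing into `parts` sub-entries keyed "<path>#<i>",
  // so eviction can drop a chunk instead of the whole listing.
  def putLeafFiles(path: String, files: Array[Long], parts: Int): Unit = {
    val chunkSize = math.max(1, (files.length + parts - 1) / parts)
    val chunks = files.grouped(chunkSize).toArray
    chunks.zipWithIndex.foreach { case (chunk, i) => cache.put(s"$path#$i", chunk) }
    cache.put(s"$path#count", Array(chunks.length.toLong))
  }

  // Reassemble the listing; return None if ANY part is missing, because a
  // partial listing would silently produce wrong query results.
  def getLeafFiles(path: String): Option[Array[Long]] =
    cache.get(s"$path#count").flatMap { n =>
      val parts = (0L until n(0)).map(i => cache.get(s"$path#$i"))
      if (parts.forall(_.isDefined)) Some(parts.flatMap(_.get).toArray)
      else None // treat the whole key as evicted
    }
}
```

The `None` branch is the weakness the comment points out: the chunks only reduce how much one eviction removes from the cache; any read after a partial eviction must still discard and re-list the entire directory.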