gabotechs commented on code in PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#discussion_r2499053803


##########
datafusion/core/tests/datasource/object_store_access.rs:
##########
@@ -194,17 +183,8 @@ async fn query_partitioned_csv_file() {
     +---------+-------+-------+---+----+-----+
     ------- Object Store Request Summary -------
     RequestCountingObjectStore()
-    Total Requests: 11
-    - LIST (with delimiter) prefix=data
-    - LIST (with delimiter) prefix=data/a=1
-    - LIST (with delimiter) prefix=data/a=2
-    - LIST (with delimiter) prefix=data/a=3
-    - LIST (with delimiter) prefix=data/a=1/b=10
-    - LIST (with delimiter) prefix=data/a=2/b=20
-    - LIST (with delimiter) prefix=data/a=3/b=30
-    - LIST (with delimiter) prefix=data/a=1/b=10/c=100
-    - LIST (with delimiter) prefix=data/a=2/b=20/c=200
-    - LIST (with delimiter) prefix=data/a=3/b=30/c=300
+    Total Requests: 2
+    - LIST prefix=data
     - GET  (opts) path=data/a=2/b=20/c=200/file_2.csv

Review Comment:
   Maybe you are already taking this into account, but leaving this thought 
here in case it helps:
   
   I see that when calling `list_all_files()` there's a caching mechanism that 
caches the listing results globally under the RuntimeEnv scope:
   
   
https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/url.rs#L321-L326
   
   The caching key is the prefix, which in this case the value is `"data"`, so 
for a single big LIST request, the cache will be populated with a single entry 
like this:
   
   ```json
   {
     "data": [...ObjectMeta]
   }
   ```
   
   I imagine that before, because of the nature of how partitions where listed, 
there was an opportunity to populate many more granular cache entries for 
different partition combinations:
   
   ```json
   {
     "prefix=data/a=1/b=10/c=100": [...ObjectMeta],
     "prefix=data/a=2/b=20/c=200": [...ObjectMeta],
     "prefix=data/a=3/b=30/c=300": [...ObjectMeta],
   }
   ```
   
   I suppose that with some smarter cache population the cache entries could 
actually be unwrapped an populated in a more granular manner so that there are 
greater chances of cache hits. Maybe this is what 
https://github.com/apache/datafusion/issues/17211 is referring to?
   
   Note that I'm unfamiliar with this mechanism, so if I'm saying something 
dumb please let me know.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to