BlakeOrth commented on code in PR #18855:
URL: https://github.com/apache/datafusion/pull/18855#discussion_r2582905212
##########
datafusion/core/tests/datasource/object_store_access.rs:
##########
@@ -98,6 +98,59 @@ async fn create_multi_file_csv_file() {
);
}
+#[tokio::test]
+async fn multi_query_multi_file_csv_file() {
+ let test = Test::new().with_multi_file_csv().await;
+ assert_snapshot!(
+ test.query("select * from csv_table").await,
+ @r"
+ ------- Query Output (6 rows) -------
+ +---------+-------+-------+
+ | c1 | c2 | c3 |
+ +---------+-------+-------+
+ | 0.0 | 0.0 | true |
+ | 0.00003 | 5e-12 | false |
+ | 0.00001 | 1e-12 | true |
+ | 0.00003 | 5e-12 | false |
+ | 0.00002 | 2e-12 | true |
+ | 0.00003 | 5e-12 | false |
+ +---------+-------+-------+
+ ------- Object Store Request Summary -------
+ RequestCountingObjectStore()
+ Total Requests: 4
+ - LIST prefix=data
+ - GET (opts) path=data/file_0.csv
+ - GET (opts) path=data/file_1.csv
+ - GET (opts) path=data/file_2.csv
+ "
+ );
+
+ // the second query should re-use the cached LIST results and should not
reissue LIST
+ assert_snapshot!(
+ test.query("select * from csv_table").await,
+ @r"
+ ------- Query Output (6 rows) -------
+ +---------+-------+-------+
+ | c1 | c2 | c3 |
+ +---------+-------+-------+
+ | 0.0 | 0.0 | true |
+ | 0.00003 | 5e-12 | false |
+ | 0.00001 | 1e-12 | true |
+ | 0.00003 | 5e-12 | false |
+ | 0.00002 | 2e-12 | true |
+ | 0.00003 | 5e-12 | false |
+ +---------+-------+-------+
+ ------- Object Store Request Summary -------
+ RequestCountingObjectStore()
+ Total Requests: 4
Review Comment:
With the current state of this PR I think this result is actually expected.
I chose to remove the code that enables the cache by default, as well as this
test (they were in the POC for illustration purposes). I thought since the
cache currently cannot be configured, and really specifically that cannot be
disabled, it would be best to leave it disabled by default to avoid breaking
users.
Additionally, quite a few of the `sqllogic` tests seem to rely on the `copy`
functionality and the existing expectation that DataFusion will pick up the
changes on query. The easiest way to handle those tests will be to disable this
cache for the duration of the tests, but that again relies on the cache having
some configuration.
All that being said, I'm more than happy to move forward on this effort in
whatever way you think is best! I can add a very simple configuration option
(and docs etc.) to enable/disable this cache so we can keep this test you've
written and have it show a reduction in `LIST` operations, or we can leave that
as a portion of the "configuration issue" associated with this cache.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]