HaochengLIU opened a new issue, #41493: URL: https://github.com/apache/arrow/issues/41493
### Describe the enhancement requested

Hi,

I have a use case in which thousands of jobs write hive-partitioned parquet files daily to the same bucket via the S3FS filesystem. Each job may generate anywhere from a single-digit number to a few thousand parquet files, depending on the volume from its data source. After abstraction, these jobs follow path patterns like `s3fs://my-S3-bucket/<vendor-name>/<fruit-type>/<color>/<origination>/<creation_date>/<data-center-location>/date=YYYY-MM-DD/...`.

The gist is that a lot of keys are created at the same time, hence the jobs frequently hit **AWS Error SLOW_DOWN. during Put Object operation: The object exceeded the rate limit for object mutation operations(create, update, and delete). Please reduce your rate request error.**

After investigation I realized that they create too many objects, one by one, in the [`S3FileSystem::CreateDir(..)`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2873-L2874) function. My local experiments show that if the implementation first checks whether the path exists and calls `impl_->CreateEmptyDir(...)` only when necessary, the issue goes away in my production environment. (I understand that cloud vendors impose various [IO limits](https://cloud.google.com/storage/docs/request-rate) on a single bucket; completely fixing the issue in my daily work is another story.)

I'm proposing the code change shown in the diff below.

Hi @pitrou, I see you are the main author of s3fs.cc. Could you please share your insights when you have time? Also, even with a vanilla build, quite a few S3FS tests fail on my Mac. Could you advise how to make them run successfully? Many thanks.
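To illustrate why checking before creating reduces mutation requests, here is a minimal, self-contained sketch. It is a toy model, not Arrow code: `FakeBucket`, its counters, and `CreateDirRecursive` are hypothetical names made up for this example, and the in-memory set only stands in for a real bucket.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bucket (not part of Arrow).
struct FakeBucket {
  std::set<std::string> keys;
  int put_requests = 0;   // mutation requests: the kind that gets throttled
  int head_requests = 0;  // read-only existence checks

  bool Exists(const std::string& key) {
    ++head_requests;
    return keys.count(key) > 0;
  }

  void CreateEmptyDir(const std::string& key) {
    ++put_requests;
    keys.insert(key);
  }
};

// Walks the key parts and creates a marker for every parent "directory".
// With check_first == true, markers that already exist are skipped, which
// mirrors the idea of the proposed change.
void CreateDirRecursive(FakeBucket& bucket,
                        const std::vector<std::string>& key_parts,
                        bool check_first) {
  std::string parent_key;
  for (const auto& part : key_parts) {
    parent_key += part;
    parent_key += '/';
    if (check_first && bucket.Exists(parent_key)) {
      continue;  // already present: no mutation request needed
    }
    bucket.CreateEmptyDir(parent_key);
  }
}
```

In this model, creating `vendor/apple/red/date=2024-05-01/` and then `vendor/apple/red/date=2024-05-02/` costs 8 mutation requests without the check, and 5 mutation requests plus 8 existence checks with it. The check only trades writes for reads, so whether it helps in practice presumably depends on how much of the prefix hierarchy the jobs share.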
```C++
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 640888e1c..782d5f75d 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2871,7 +2871,10 @@ Status S3FileSystem::CreateDir(const std::string& s, bool recursive) {
     for (const auto& part : path.key_parts) {
       parent_key += part;
       parent_key += kSep;
-      RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+      ARROW_ASSIGN_OR_RAISE(FileInfo parent_key_info, this->GetFileInfo(parent_key));
+      if (parent_key_info.type() == FileType::NotFound) {
+        RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+      }
     }
     return Status::OK();
   } else {
```

```
TestS3FS.CreateDir even fails with a Clean build :sigh
➜  build ninja && ./debug/arrow-s3fs-test --gtest_filter="TestS3FS.CreateDir"
ninja: no work to do.
Running main() from /Users/haochengliu/Documents/projects/Arrow/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = TestS3FS.CreateDir
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestS3FS
[ RUN      ] TestS3FS.CreateDir
/Users/haochengliu/Documents/projects/Arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[  FAILED  ] TestS3FS.CreateDir (219 ms)
[----------] 1 test from TestS3FS (219 ms total)
```

### Component(s)

C++