HaochengLIU opened a new issue, #41493:
URL: https://github.com/apache/arrow/issues/41493

   ### Describe the enhancement requested
   
   Hi, I have a use case where thousands of jobs write hive-partitioned parquet files daily to the same bucket via the S3FS filesystem. Note that each job may generate anywhere from a handful to a few thousand parquet files, depending on the volume from its data source.
   In the abstract, these jobs write to paths matching `s3fs://my-S3-bucket/<vendor-name>/<fruit-type>/<color>/<origination>/<creation_date>/<data-center-location>/date=YYYY-MM-DD/...`. The gist is that a lot of keys are created at the same time, so the jobs frequently hit **AWS Error SLOW_DOWN during PutObject operation: The object exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate.**
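
   To give a sense of the write pattern, each job does something roughly like this (a simplified sketch; the bucket, prefix, schema, and partition column below are placeholders, not our real production code):
   ```C++
   // Simplified sketch of one job (placeholder names): write a table as
   // hive-partitioned parquet through arrow::fs::S3FileSystem.
   #include <arrow/api.h>
   #include <arrow/dataset/api.h>
   #include <arrow/dataset/file_parquet.h>
   #include <arrow/dataset/partition.h>
   #include <arrow/filesystem/s3fs.h>

   namespace ds = arrow::dataset;
   namespace fs = arrow::fs;

   arrow::Status WriteDailyPartition(std::shared_ptr<arrow::Table> table) {
     ARROW_RETURN_NOT_OK(fs::EnsureS3Initialized());
     ARROW_ASSIGN_OR_RAISE(auto s3, fs::S3FileSystem::Make(fs::S3Options::Defaults()));

     auto format = std::make_shared<ds::ParquetFileFormat>();
     ds::FileSystemDatasetWriteOptions write_options;
     write_options.filesystem = s3;
     // Deep hive-style prefix; thousands of jobs write under prefixes like this
     // in the same bucket at the same time.
     write_options.base_dir = "my-S3-bucket/vendor-a/apple/red/us-east-1";
     write_options.partitioning = std::make_shared<ds::HivePartitioning>(
         arrow::schema({arrow::field("date", arrow::utf8())}));
     write_options.basename_template = "part-{i}.parquet";
     write_options.file_write_options = format->DefaultWriteOptions();

     // Scan the in-memory table and let the dataset writer create the directory
     // markers and parquet files on S3.
     auto dataset = std::make_shared<ds::InMemoryDataset>(std::move(table));
     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
     return ds::FileSystemDataset::Write(write_options, scanner);
   }
   ```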
   
   After investigating, I realized the jobs are creating too many objects one by one in the [`S3FileSystem::CreateDir(...)`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L2873-L2874) function. My local experiments show that if the implementation first checks whether the path already exists and only calls `impl_->CreateEmptyDir(...)` when necessary, the issue goes away in my production environment.
   
   (I understand that different cloud vendors impose different [IO limits](https://cloud.google.com/storage/docs/request-rate) on a single bucket; fixing the issue completely is another story for my day-to-day work.)
   
   I'm proposing a code change like the one below. Hi @pitrou, I see you are the main author of s3fs.cc; could you please share your insights when you have time?
   Also, even with a vanilla build, quite a few of the S3FS tests fail on my Mac (see the log after the diff)... could you point me to how to run them successfully?
   
   Many thanks.
   ```C++
   diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
   index 640888e1c..782d5f75d 100644
   --- a/cpp/src/arrow/filesystem/s3fs.cc
   +++ b/cpp/src/arrow/filesystem/s3fs.cc
   @@ -2871,7 +2871,10 @@ Status S3FileSystem::CreateDir(const std::string& s, bool recursive) {
        for (const auto& part : path.key_parts) {
          parent_key += part;
          parent_key += kSep;
   -      RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
   +      ARROW_ASSIGN_OR_RAISE(FileInfo parent_key_info, this->GetFileInfo(parent_key));
   +      if (parent_key_info.type() == FileType::NotFound) {
   +        RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
   +      }
        }
        return Status::OK();
      } else {
   
   ```
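
   To make the intent concrete, the caller-side effect I'm after is roughly the following (a hypothetical illustration with placeholder paths, not part of the patch):
   ```C++
   // Hypothetical illustration (placeholder paths): with the check-before-create
   // change, a recursive CreateDir over an already-populated prefix issues
   // GetFileInfo lookups instead of one PutObject per path segment.
   #include <arrow/filesystem/s3fs.h>
   #include <arrow/result.h>
   #include <arrow/status.h>

   arrow::Status CreateDailyDirs() {
     ARROW_RETURN_NOT_OK(arrow::fs::EnsureS3Initialized());
     ARROW_ASSIGN_OR_RAISE(
         auto s3, arrow::fs::S3FileSystem::Make(arrow::fs::S3Options::Defaults()));

     // The first job creates the whole chain of directory markers.
     ARROW_RETURN_NOT_OK(
         s3->CreateDir("my-S3-bucket/vendor-a/apple/red/us-east-1/date=2024-05-01",
                       /*recursive=*/true));
     // A later job under the same prefix only needs to create the new leaf
     // instead of re-PUTting every parent key.
     return s3->CreateDir("my-S3-bucket/vendor-a/apple/red/us-east-1/date=2024-05-02",
                          /*recursive=*/true);
   }
   ```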
   
   `TestS3FS.CreateDir` even fails with a clean build, *sigh*:

   ```
   ➜  build ninja && ./debug/arrow-s3fs-test --gtest_filter="TestS3FS.CreateDir"
   ninja: no work to do.
   Running main() from /Users/haochengliu/Documents/projects/Arrow/build/_deps/googletest-src/googletest/src/gtest_main.cc
   Note: Google Test filter = TestS3FS.CreateDir
   [==========] Running 1 test from 1 test suite.
   [----------] Global test environment set-up.
   [----------] 1 test from TestS3FS
   [ RUN      ] TestS3FS.CreateDir
   
   /Users/haochengliu/Documents/projects/Arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
   Failed
   Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
   [  FAILED  ] TestS3FS.CreateDir (219 ms)
   [----------] 1 test from TestS3FS (219 ms total)
   ```
   
   ### Component(s)
   
   C++

