westonpace commented on a change in pull request #10955:
URL: https://github.com/apache/arrow/pull/10955#discussion_r717804266
##########
File path: cpp/src/arrow/dataset/file_base.h
##########

```cpp
@@ -343,6 +343,18 @@ class ARROW_DS_EXPORT FileWriter {
   fs::FileLocator destination_locator_;
 };

+/// \brief Controls what happens if files exist in an output directory during a dataset
+/// write
+enum ExistingDataBehavior : int8_t {
+  /// Deletes all files in a directory the first time that directory is encountered
+  kDeleteMatchingPartitions,
+  /// Ignores existing files, overwriting any that happen to have the same name as an
+  /// output file
+  kOverwriteOrIgnore,
+  /// Returns an error if there are any files or subdirectories in the output directory
+  kError,
```

Review comment:

Unfortunately, this still has the same logic, which is a bit coarse. It will give an error in the case you describe because it only checks whether any files exist in the root directory. If we wanted to check subdirectories as well, we would either have to check on the fly or do two passes. Checking on the fly could leave data partially written when an error occurs. Two passes would be fairly inefficient, and we would probably need to spill to disk to avoid exploding memory, so it is too complex for us at the moment.

Maybe someday in the future we could have a rollback mechanism, so we could check on the fly and then roll back when finished. Or we could write to a temporary directory and do a `mv` onto the destination, but that wouldn't be supported on S3.
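For illustration, the semantics of the three enum values can be simulated on a local filesystem in a few lines. This is a hedged, stdlib-only Python sketch of the intended behavior, not Arrow's implementation; `prepare_output_dir` and the string behavior names are hypothetical stand-ins for the C++ enum. Note that the `"error"` branch mirrors the coarseness discussed in the review comment: it only inspects the root directory's immediate entries.

```python
import os
import shutil
import tempfile

def prepare_output_dir(path, behavior):
    """Apply an ExistingDataBehavior-style policy before writing.

    `behavior` is one of "delete_matching", "overwrite_or_ignore", or
    "error" (hypothetical names mirroring the enum in the diff).
    """
    os.makedirs(path, exist_ok=True)
    entries = os.listdir(path)
    if behavior == "error":
        # kError: refuse to write if anything exists in the root directory.
        # Note this check is shallow, as discussed in the review comment.
        if entries:
            raise FileExistsError(f"output directory {path!r} is not empty")
    elif behavior == "delete_matching":
        # kDeleteMatchingPartitions: clear the directory the first time it
        # is encountered during the write.
        for name in entries:
            full = os.path.join(path, name)
            if os.path.isdir(full):
                shutil.rmtree(full)
            else:
                os.remove(full)
    elif behavior == "overwrite_or_ignore":
        # kOverwriteOrIgnore: leave existing files alone; outputs with the
        # same name will simply overwrite them later.
        pass
    else:
        raise ValueError(f"unknown behavior: {behavior!r}")

# Example: "error" raises when the directory already holds a file,
# while "delete_matching" empties it.
root = tempfile.mkdtemp()
open(os.path.join(root, "part-0.parquet"), "w").close()
try:
    prepare_output_dir(root, "error")
    raised = False
except FileExistsError:
    raised = True
prepare_output_dir(root, "delete_matching")
remaining = os.listdir(root)
```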
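The "write to a temporary directory and `mv`" idea from the comment's last sentence can be sketched as follows (stdlib-only; `write_dataset_atomically` and the `files` mapping are hypothetical illustrations, not Arrow API). The final `os.replace` is the local-filesystem rename that object stores like S3 don't provide, which is exactly why the approach wouldn't carry over there.

```python
import os
import shutil
import tempfile

def write_dataset_atomically(dest, files):
    """Stage every output file in a temporary directory, then move the
    whole staging directory onto `dest` in one final step.

    `files` maps relative paths to bytes content (a made-up shape for
    illustration). If any write fails, `dest` is never touched.
    """
    staging = tempfile.mkdtemp(prefix="dataset-staging-")
    try:
        for rel_path, data in files.items():
            full = os.path.join(staging, rel_path)
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "wb") as f:
                f.write(data)
    except Exception:
        # Rollback: discard the half-written staging area, leaving the
        # destination exactly as it was.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # The "mv": only reached once every file has been written. This rename
    # is cheap locally but has no equivalent on S3.
    if os.path.isdir(dest):
        shutil.rmtree(dest)
    os.replace(staging, dest)

# Example: two files, one inside a partition subdirectory.
base = tempfile.mkdtemp()
dest = os.path.join(base, "dataset")
write_dataset_atomically(dest, {"part=0/data.parquet": b"x", "top.parquet": b"y"})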