westonpace commented on a change in pull request #10955:
URL: https://github.com/apache/arrow/pull/10955#discussion_r717804266
##########
File path: cpp/src/arrow/dataset/file_base.h
##########

```cpp
@@ -343,6 +343,18 @@ class ARROW_DS_EXPORT FileWriter {
   fs::FileLocator destination_locator_;
 };

+/// \brief Controls what happens if files exist in an output directory during a dataset
+/// write
+enum ExistingDataBehavior : int8_t {
+  /// Deletes all files in a directory the first time that directory is encountered
+  kDeleteMatchingPartitions,
+  /// Ignores existing files, overwriting any that happen to have the same name as an
+  /// output file
+  kOverwriteOrIgnore,
+  /// Returns an error if there are any files or subdirectories in the output directory
+  kError,
```

Review comment:

Unfortunately, this still has the same logic, which is a bit coarse. It will give an error in the case you describe because it only checks whether any files exist in the root directory. If we wanted to check subdirectories as well, we would either have to check on the fly or do two passes. Checking on the fly could leave data partially written when an error occurs. Two passes would be fairly inefficient, and we would probably need to spill to disk to avoid exploding memory, so it is too complex for us at the moment.

Maybe someday in the future we could have a rollback mechanism, so we could check on the fly and then roll back when finished. Or we could write to a temporary directory and do a `mv` onto the destination, but that wouldn't be supported on S3.
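For illustration, the semantics of the three enum values can be simulated on a local filesystem in a few lines. This is a hedged, stdlib-only Python sketch of the intended behavior, not Arrow's implementation; `prepare_output_dir` and the string behavior names are hypothetical stand-ins for the C++ enum. Note that the `"error"` branch mirrors the coarseness discussed in the review comment: it only inspects the root directory's immediate entries.

```python
import os
import shutil
import tempfile

def prepare_output_dir(path, behavior):
    """Apply an ExistingDataBehavior-style policy before writing.

    `behavior` is one of "delete_matching", "overwrite_or_ignore", or
    "error" (hypothetical names mirroring the enum in the diff).
    """
    os.makedirs(path, exist_ok=True)
    entries = os.listdir(path)
    if behavior == "error":
        # kError: refuse to write if anything exists in the root directory.
        # Note this check is shallow, as discussed in the review comment.
        if entries:
            raise FileExistsError(f"output directory {path!r} is not empty")
    elif behavior == "delete_matching":
        # kDeleteMatchingPartitions: clear the directory the first time it
        # is encountered during the write.
        for name in entries:
            full = os.path.join(path, name)
            if os.path.isdir(full):
                shutil.rmtree(full)
            else:
                os.remove(full)
    elif behavior == "overwrite_or_ignore":
        # kOverwriteOrIgnore: leave existing files alone; outputs with the
        # same name will simply overwrite them later.
        pass
    else:
        raise ValueError(f"unknown behavior: {behavior!r}")

# Example: "error" raises when the directory already holds a file,
# while "delete_matching" empties it.
root = tempfile.mkdtemp()
open(os.path.join(root, "part-0.parquet"), "w").close()
try:
    prepare_output_dir(root, "error")
    raised = False
except FileExistsError:
    raised = True
prepare_output_dir(root, "delete_matching")
remaining = os.listdir(root)
```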
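The "write to a temporary directory and `mv`" idea from the comment's last sentence can be sketched as follows (stdlib-only; `write_dataset_atomically` and the `files` mapping are hypothetical illustrations, not Arrow API). The final `os.replace` is the local-filesystem rename that object stores like S3 don't provide, which is exactly why the approach wouldn't carry over there.

```python
import os
import shutil
import tempfile

def write_dataset_atomically(dest, files):
    """Stage every output file in a temporary directory, then move the
    whole staging directory onto `dest` in one final step.

    `files` maps relative paths to bytes content (a made-up shape for
    illustration). If any write fails, `dest` is never touched.
    """
    staging = tempfile.mkdtemp(prefix="dataset-staging-")
    try:
        for rel_path, data in files.items():
            full = os.path.join(staging, rel_path)
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "wb") as f:
                f.write(data)
    except Exception:
        # Rollback: discard the half-written staging area, leaving the
        # destination exactly as it was.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # The "mv": only reached once every file has been written. This rename
    # is cheap locally but has no equivalent on S3.
    if os.path.isdir(dest):
        shutil.rmtree(dest)
    os.replace(staging, dest)

# Example: two files, one inside a partition subdirectory.
base = tempfile.mkdtemp()
dest = os.path.join(base, "dataset")
write_dataset_atomically(dest, {"part=0/data.parquet": b"x", "top.parquet": b"y"})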