westonpace commented on a change in pull request #11552: URL: https://github.com/apache/arrow/pull/11552#discussion_r737037093
##########
File path: r/R/dataset-write.R
##########
@@ -97,6 +97,7 @@ write_dataset <- function(dataset,
                           partitioning = dplyr::group_vars(dataset),
                           basename_template = paste0("part-{i}.", as.character(format)),
                           hive_style = TRUE,
+                          existing_data_behavior = c("overwrite", "error", "delete_matching"),

Review comment:
   I'll add docs.

   An append behavior would be nice, but I think it has been rejected in the past. There are several approaches that could be taken:

   1. Scan the directory before we start writing to find the largest counter value currently in use, and start counting from there.
   2. When we're about to write a file, check whether the filename already exists and, if so, increment a counter (e.g. when downloading from Firefox/Chrome you get `foo.txt` and `foo(1).txt`).
   3. Allow a UUID to be used instead of a counter in the basename template. For example, you could use a basename template of `{uuid}-{i}`.

   The JIRA for this is https://issues.apache.org/jira/browse/ARROW-10695 and the outcome was that users are capable of fixing this themselves. For example, users can generate a UUID every time they call `write_dataset` and include it as part of the basename template (e.g. see https://stackoverflow.com/questions/69184289/pyarrow-overwrites-dataset-when-using-s3-filesystem/69185178#69185178).

   "Delete matching" is pretty niche. The origin of the feature is https://issues.apache.org/jira/browse/ARROW-12358 and the use case was something like:

   * Every Friday the user downloads data for the week and gets only partial data for the current day (Friday).
   * The next week the user does the same thing. This time they have the full data for the previous Friday, and they want to overwrite that partition of data but keep all of the other days.
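   To make the three append approaches above concrete, here is a minimal sketch in Python using only the standard library. It is not Arrow's implementation and does not call any Arrow API; the function names (`next_counter`, `first_free_path`, `uuid_basename_template`) and the default `part-` / `.parquet` naming are assumptions chosen for illustration.

```python
import re
import tempfile
import uuid
from pathlib import Path


def next_counter(directory: Path, prefix: str = "part-", suffix: str = ".parquet") -> int:
    """Approach 1: scan once and return one past the largest counter in use."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))
    used = [
        int(m.group(1))
        for p in directory.iterdir()
        if (m := pattern.fullmatch(p.name))  # ignore files that don't match the template
    ]
    return max(used, default=-1) + 1  # empty directory -> start at 0


def first_free_path(directory: Path, stem: str, suffix: str) -> Path:
    """Approach 2: probe for a free name, browser-download style."""
    candidate = directory / f"{stem}{suffix}"
    n = 1
    while candidate.exists():
        candidate = directory / f"{stem}({n}){suffix}"
        n += 1
    return candidate


def uuid_basename_template(extension: str = "parquet") -> str:
    """Approach 3: a fresh UUID per write call; "{i}" is left for the writer to fill in."""
    return f"{uuid.uuid4()}-{{i}}.{extension}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        (out / "part-0.parquet").touch()
        (out / "part-1.parquet").touch()
        print("next counter:", next_counter(out))
        print("free path:", first_free_path(out, "part-0", ".parquet").name)
        print("template:", uuid_basename_template())
```

   Approaches 1 and 2 both check the directory and then write, so two writers running at the same time could still pick the same name. The UUID template avoids that without inspecting existing files, which is likely part of why it works well as a user-side workaround.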