Weston Pace created ARROW-12811:
-----------------------------------

             Summary: [C++] [Dataset] Dataset repartition / filter / update
                 Key: ARROW-12811
                 URL: https://issues.apache.org/jira/browse/ARROW-12811
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace

This feature would add support for an "update" workflow that scans a set of batches and writes them (potentially filtered or modified) back out to the same place. The existing dataset read / dataset write features wouldn't work because they would append the data rather than replace it.

There is some discussion in ARROW-12358 and ARROW-12509 of an "overwrite mode", but an "overwrite partition" workflow wouldn't work unless you can scan in entire partitions at once (and in general this should probably be avoided).

A naive "write to a different directory and rename" approach could work, but it would be inefficient since it would require copying the entire dataset to modify a small part of it.

The feature could be implemented using temporary directories, created in place, that get renamed on top of the existing directory at the end. Files that are unchanged would be moved into the temporary directory instead of copied.

Presumably no ACID guarantees would be made (and they would be quite hard to provide) since Arrow datasets do not currently make ACID guarantees of any kind.