Weston Pace created ARROW-12811:
-----------------------------------

             Summary: [C++] [Dataset] Dataset repartition / filter / update
                 Key: ARROW-12811
                 URL: https://issues.apache.org/jira/browse/ARROW-12811
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


This feature would add support for an "update" workflow that scans a set of 
batches and writes them (potentially filtered/modified) back out to the same 
place.


The existing dataset read / dataset write features wouldn't work for this 
because the write would append the data to what is already there instead of 
replacing it.

There is some discussion in ARROW-12358 and ARROW-12509 of an "overwrite mode", 
but an "overwrite partition" workflow wouldn't work unless you can scan in 
entire partitions at once (and in general scanning in an entire partition at 
once should probably be avoided).

A naive "write to a different directory and rename" approach could work but it 
would be inefficient since it would require a copy of the entire dataset to 
modify a small part of it.

 

The feature could be implemented using temporary directories in place that get 
renamed on top of the existing directory at the end.  Files that are unchanged 
would be moved into the temporary directory instead of copied.
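
The temporary-directory approach described above can be sketched roughly as 
follows. This is only an illustration using the Python standard library, not 
Arrow code: the function name update_dataset_in_place and the rewrite callback 
are hypothetical, files are treated as opaque bytes, and the staging directory 
is assumed to be a sibling of the dataset directory so that moves and the final 
rename stay on one filesystem.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def update_dataset_in_place(
    dataset_dir: Path,
    rewrite: Callable[[Path], Optional[bytes]],
) -> None:
    """Rewrite a directory of files via a staging directory (hypothetical helper).

    `rewrite(path)` returns the new contents for a file, or None when the
    file is unchanged and can simply be moved.
    """
    dataset_dir = Path(dataset_dir)

    # Assumption: staging lives next to the dataset so renames never cross
    # a filesystem boundary (which would turn them into copies or errors).
    staging = Path(
        tempfile.mkdtemp(prefix=dataset_dir.name + ".tmp-", dir=dataset_dir.parent)
    )

    # Materialize the file list first because files are moved during the loop.
    files = sorted(p for p in dataset_dir.rglob("*") if p.is_file())
    for src in files:
        dst = staging / src.relative_to(dataset_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        new_contents = rewrite(src)
        if new_contents is None:
            # Unchanged file: move it instead of copying it.
            os.replace(src, dst)
        else:
            # Modified file: write only the new version into staging.
            dst.write_bytes(new_contents)

    # Swap the staging directory into place. This is two steps, so it is
    # not atomic: a crash in between leaves only the staging directory.
    shutil.rmtree(dataset_dir)
    os.rename(staging, dataset_dir)
```

Only the modified files are written in this sketch; everything else is a 
rename, which is what avoids the full copy. The final swap is where a failure 
would be visible to readers, since the old directory is removed before the 
staging directory is renamed into place.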

Presumably no ACID guarantees would be made (and they would be quite hard to 
provide) since Arrow datasets do not currently make ACID guarantees of any 
kind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
