DavidEscott opened a new issue, #39693:
URL: https://github.com/apache/arrow/issues/39693
### Describe the enhancement requested
The documentation for `write_dataset` says that `existing_data_behavior='overwrite_or_ignore'` combined with
a unique `basename_template` enables an append workflow. I struggle to see how
to do this reliably in practice.
Suppose my dataset is partitioned by month, and I want to append some data
that spans multiple months. It will be pyarrow's responsibility to split the
data into two subsets written to two different folders, but it is my
responsibility to generate a basename_template that is unique across these two
folders.
I believe the expected resolution is that the caller generates a UUID and
calls the function with `basename_template=f"{uuid.uuid4()}-{{i}}.parquet"`.
While there are a couple of other approaches (e.g. using a datestamp or
timestamp instead of a UUID), I generally prefer more opinionated APIs and
would rather have a way to ask pyarrow directly to append, letting it handle
the uniqueness problem in whatever way it deems best.
Alternatively, the dataset itself might contain something useful for
generating a unique basename. For instance, if the data is partitioned
by year/month, the desired basename_template might be
`{day}-{i}.parquet`, with the behavior "append the data if the
year-month-day is not already written to disk, otherwise ignore". In that
situation I cannot provide a single `basename_template` that is unique for all
the data I might want to conditionally append. What I need is a callback that
builds the basename_template for each subset after partition splitting.
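To make the callback idea concrete, here is a stdlib-only sketch of what such a callback might compute. This is a hypothetical API, not anything pyarrow currently exposes; the `partition_keys` dict, the hive-style `year=`/`month=` folder layout, and both function names are assumptions for illustration:

```python
import os

def basename_for_partition(partition_keys: dict) -> str:
    # Hypothetical callback: the writer would invoke this per partition
    # subset after splitting, passing that subset's partition keys.
    # Here we name files by day within a year/month partition folder.
    return f"{partition_keys['day']}-{{i}}.parquet"

def should_write(root: str, partition_keys: dict) -> bool:
    # The "append if the year-month-day is not already on disk,
    # otherwise ignore" check, against a hive-style folder layout.
    folder = os.path.join(
        root,
        f"year={partition_keys['year']}",
        f"month={partition_keys['month']}",
    )
    first_file = basename_for_partition(partition_keys).format(i=0)
    return not os.path.exists(os.path.join(folder, first_file))
```

The point is that the template varies per partition subset, so no single `basename_template` string passed up front can express it.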
### Component(s)
C++, Python