DavidEscott opened a new issue, #39693:
URL: https://github.com/apache/arrow/issues/39693
### Describe the enhancement requested
The documentation for `write_dataset` says that `existing_data_behavior='overwrite_or_ignore'` combined with
a unique `basename_template` enables an append workflow. I struggle to see how
to do this reliably in practice.
Suppose my dataset is partitioned by month, and I want to append some data
that spans multiple months. It will be pyarrow's responsibility to split the
data into two subsets written to two different folders, but it is my
responsibility to generate a basename_template that is unique across these two
folders.
I believe the expected resolution is that the caller generates a UUID and
calls the function with `basename_template=f"{uuid.uuid4()}-{{i}}.parquet"`.
While there are a couple of other approaches (e.g. using a datestamp or
timestamp instead of a UUID), I generally prefer more opinionated APIs and
would rather have a way to ask pyarrow directly to append, letting it handle
the uniqueness problem in whatever way it deems best.
Alternatively, the dataset itself might contain something useful for
generating a unique basename. For instance, if the data is partitioned
by year/month, the desired basename_template might be
`{day}-{i}.parquet`, with the behavior "append the data if the
year-month-day is not already written to disk, otherwise ignore". In that
situation I cannot provide a single `basename_template` that is unique for all
the data I might want to conditionally append. What I need is a callback that
builds the basename_template for each subset after partition splitting.
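To make the callback idea concrete, here is a stdlib-only sketch of what such a callback might compute. This is a hypothetical API, not anything pyarrow currently exposes; the `partition_keys` dict, the hive-style `year=`/`month=` folder layout, and both function names are assumptions for illustration:

```python
import os

def basename_for_partition(partition_keys: dict) -> str:
    # Hypothetical callback: the writer would invoke this per partition
    # subset after splitting, passing that subset's partition keys.
    # Here we name files by day within a year/month partition folder.
    return f"{partition_keys['day']}-{{i}}.parquet"

def should_write(root: str, partition_keys: dict) -> bool:
    # The "append if the year-month-day is not already on disk,
    # otherwise ignore" check, against a hive-style folder layout.
    folder = os.path.join(
        root,
        f"year={partition_keys['year']}",
        f"month={partition_keys['month']}",
    )
    first_file = basename_for_partition(partition_keys).format(i=0)
    return not os.path.exists(os.path.join(folder, first_file))
```

The point is that the template varies per partition subset, so no single `basename_template` string passed up front can express it.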
### Component(s)
C++, Python