[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617228#comment-17617228
]
Apache Arrow JIRA Bot commented on ARROW-12358:
-----------------------------------------------
This issue was last updated over 90 days ago, which may indicate that it is
no longer being actively worked on. To better reflect the current state, the issue
is being unassigned per [project
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
Please feel free to re-take assignment of the issue if it is being actively
worked on, or if you plan to start that work soon.
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Weston Pace
> Priority: Major
> Labels: dataset
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you write to an existing dataset with this default template, you de facto
> overwrite previous data.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data
> in the target directory is maybe the safest default, because silently
> appending and silently overwriting can both be surprising, depending on your
> expectations.
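
For illustration only, the behaviours discussed in the description above could be sketched roughly as follows. This is a standard-library sketch, not Arrow's implementation: the helper names {{unique_basename_template}} and {{prepare_target}}, the mode names "error", "append" and "overwrite", and the "part-" prefix are all assumptions made for this example. The only thing taken from the issue is the idea of a unique {{basename_template}} that keeps the "{i}" placeholder, plus a mode that errors by default.

```python
import os
import shutil
import uuid


def unique_basename_template(extension="parquet"):
    """Build a filename template with a random prefix (hypothetical helper).

    The literal "{i}" placeholder is preserved so that a dataset writer
    can still substitute the file counter into it.
    """
    return f"part-{uuid.uuid4().hex}-{{i}}.{extension}"


def prepare_target(base_dir, mode="error"):
    """Apply a hypothetical write mode to the target directory before writing.

    "error":     refuse to write if the directory already contains anything
    "append":    keep existing files (pair this with a unique template)
    "overwrite": remove all existing content first
    """
    os.makedirs(base_dir, exist_ok=True)
    existing = os.listdir(base_dir)
    if mode == "error":
        if existing:
            raise FileExistsError(
                f"{base_dir!r} already contains {len(existing)} entries"
            )
    elif mode == "append":
        pass  # nothing to clean up; uniqueness comes from the template
    elif mode == "overwrite":
        shutil.rmtree(base_dir)
        os.makedirs(base_dir)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
```

Under these assumptions, a caller that wants to append would run {{prepare_target(base_dir, "append")}} and then pass the result of {{unique_basename_template()}} as the {{basename_template}} argument of {{pyarrow.dataset.write_dataset}}, so that new files cannot collide with existing ones.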