[ 
https://issues.apache.org/jira/browse/ARROW-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-16204.
-------------------------------------------
    Resolution: Fixed

Issue resolved by pull request 12898
[https://github.com/apache/arrow/pull/12898]

> [C++][Dataset] Default error existing_data_behaviour for writing dataset 
> ignores a single file
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16204
>                 URL: https://issues.apache.org/jira/browse/ARROW-16204
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 8.0.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> While trying to understand a failing test in 
> https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed 
> that the {{write_dataset}} function does not actually always raise an error 
> by default if there is already existing data in the target location.
> The documentation says it will raise "if any data exists in the destination" 
> (which is also what I would expect), but in practice it seems that it does 
> ignore certain file names:
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>> table = pa.table({'a': [1, 2, 3]})
> # write a first time to a new directory: OK
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write a second time to the same directory: passes, but should raise?
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write another time to the same directory with a different name: still passes
> >>> ds.write_dataset(table, "test_overwrite", format="parquet",
> ...                  basename_template="data-{i}.parquet")
> >>> !ls test_overwrite
> data-0.parquet        part-0.parquet
> # now writing again finally raises an error
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> ...
> ArrowInvalid: Could not write to test_overwrite as the directory is not empty 
> and existing_data_behavior is to error
> {code}
> So when checking whether data already exists, it seems to ignore any files 
> that match the basename template pattern.
> cc [~westonpace] do you know if this was intentional? (I would find that a 
> strange corner case, and in any case it is also not documented)
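For context, the documented contract (raise if *any* data exists in the destination, not just files matching the basename template) can be sketched with a plain-stdlib guard. `assert_destination_empty` below is a hypothetical helper for illustration only, not part of the pyarrow API:

```python
import os
import tempfile

def assert_destination_empty(path):
    # Hypothetical guard illustrating the documented contract of
    # existing_data_behavior="error": raise if *any* file exists in
    # the destination, regardless of whether its name matches the
    # basename template.
    if os.path.isdir(path) and os.listdir(path):
        raise FileExistsError(f"{path} is not empty")

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "test_overwrite")
    assert_destination_empty(target)  # directory does not exist yet: OK
    os.makedirs(target)
    # simulate an earlier write leaving a file behind
    open(os.path.join(target, "part-0.parquet"), "w").close()
    try:
        assert_destination_empty(target)
        print("no error")
    except FileExistsError:
        print("raises, as the documentation promises")
```

Under this reading, the second `ds.write_dataset` call in the reproduction above should have raised `ArrowInvalid` even though `part-0.parquet` matches the default basename template.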



--
This message was sent by Atlassian Jira
(v8.20.7#820007)