[jira] [Commented] (ARROW-7706) [Python] saving a dataframe to the same partitioned location silently doubles the data

Joris Van den Bossche (Jira) Tue, 04 Feb 2020 00:50:56 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029654#comment-17029654
 ]


Joris Van den Bossche commented on ARROW-7706:
----------------------------------------------

[~tsvikas] Thanks for the report! I agree that silently adding data doesn't 
seem like the best default behaviour; at the very least, it can give 
surprising results. 

In addition to an overwrite flag (which would delete the contents of the 
directory?), we could maybe have an "append mode" to still have the possibility 
to write to a non-empty directory.
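
To make the three options concrete, here is a minimal sketch of what such a mode switch could look like if done by hand before the write. The helper name {{prepare_partitioned_target}} and its {{mode}} argument are hypothetical, not an existing pyarrow or pandas API, and it only handles a local directory:

```python
import shutil
from pathlib import Path


def prepare_partitioned_target(path, mode="error"):
    """Prepare a local directory before a partitioned write (hypothetical helper).

    mode="error":     refuse to write into a non-empty directory
    mode="overwrite": delete the existing contents first
    mode="append":    keep existing files; new files are added next to them
    """
    if mode not in ("error", "overwrite", "append"):
        raise ValueError(f"unknown mode: {mode!r}")

    target = Path(path)
    non_empty = target.is_dir() and any(target.iterdir())

    if non_empty and mode == "error":
        raise FileExistsError(
            f"{target} is not empty; pass mode='overwrite' or mode='append'"
        )
    if non_empty and mode == "overwrite":
        # Removes every existing partition directory and file under the target.
        shutil.rmtree(target)

    target.mkdir(parents=True, exist_ok=True)
    return target
```

A caller would run {{prepare_partitioned_target('/tmp/table', mode='overwrite')}} before the {{to_parquet}} call from the report. With the default {{mode="error"}} the second write fails loudly instead of doubling the data, and {{mode="append"}} keeps today's behaviour as an explicit choice.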

cc [~fsaintjacques] [~bkietz] this question will also be relevant once we start 
with the write part of Datasets

> [Python] saving a dataframe to the same partitioned location silently doubles 
> the data
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-7706
>                 URL: https://issues.apache.org/jira/browse/ARROW-7706
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Tsvika Shapira
>            Priority: Major
>              Labels: dataset
>
> When a user saves a dataframe:
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')
> {code}
> it will create sub-directories named "{{col_a=val1}}", "{{col_a=val2}}" in 
> {{/tmp/table}}. Each of them will contain one (or more?) parquet files with 
> random filenames.
> If a user runs the same command again, the code will reuse the existing 
> sub-directories but write new files with different (random) filenames. As a 
> result, any data loaded from this folder will be wrong: each row will be 
> present twice.
> For example, when using
> {code:python}
> df1.to_parquet('/tmp/table', partition_cols=['col_a'], engine='pyarrow')  # second time
> df2 = pd.read_parquet('/tmp/table', engine='pyarrow')
> assert len(df1) == len(df2)  # raises AssertionError: df2 has twice as many rows
> {code}
> This is a subtle change in the data that can pass unnoticed.
>  
> I would expect the code to prevent the user from using a non-empty 
> destination as a partitioned target. An overwrite flag could also be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
