jorisvandenbossche commented on a change in pull request #11632:
URL: https://github.com/apache/arrow/pull/11632#discussion_r745003542
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
def file_visitor(written_file):
visited_paths.append(written_file.path)
+ existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+ 'delete_matching'
+ Controls how the dataset will handle data that already exists in
+ the destination. The default behavior (error) is to raise an error
Review comment:
```suggestion
the destination. The default behavior ('error') is to raise an error
```
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
def file_visitor(written_file):
visited_paths.append(written_file.path)
+ existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+ 'delete_matching'
Review comment:
```suggestion
existing_data_behavior : 'error' | 'overwrite_or_ignore' | \
'delete_matching'
```
(I know this is ugly in the source code, but that's how to get it to render
properly in the Sphinx docs: the `name : type` entry should be a single line
for rendering purposes.)
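
To illustrate why the trailing backslash works: in a regular (non-raw) Python string literal, a backslash immediately followed by a newline is a line continuation, so the two source lines become one line in the docstring that Sphinx sees. The sketch below uses a hypothetical stub, `write_dataset_stub`, not the real `write_dataset`. It also puts the continuation at column 0 so the joined line has no extra spaces; with an indented continuation, as in the suggestion, that indentation becomes part of the joined line.

```python
import inspect


def write_dataset_stub(existing_data_behavior="error"):
    """Hypothetical stub used only to show the docstring layout.

    Parameters
    ----------
    existing_data_behavior : 'error' | 'overwrite_or_ignore' | \
'delete_matching'
        Controls how existing data in the destination is handled.
    """


# inspect.getdoc() strips the common indentation, much like doc tools do.
doc = inspect.getdoc(write_dataset_stub)

# Pick out the parameter line; the backslash-newline pair is gone, so the
# `name : type` entry is a single line.
param_line = next(
    line for line in doc.splitlines()
    if line.startswith("existing_data_behavior")
)
print(param_line)
```

Note that this only holds for non-raw docstrings; in an `r"""..."""` docstring the backslash and the newline would both be kept.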
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
def file_visitor(written_file):
visited_paths.append(written_file.path)
+ existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+ 'delete_matching'
+ Controls how the dataset will handle data that already exists in
+ the destination. The default behavior (error) is to raise an error
+ if any data exists in the destination.
+
+ overwrite_or_ignore will ignore any existing data and will
Review comment:
```suggestion
'overwrite_or_ignore' will ignore any existing data and will
```
##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3481,6 +3481,55 @@ def test_write_dataset_with_dataset(tempdir):
assert dict(load_back_table.to_pydict()) == table.to_pydict()
[email protected]
+def test_write_dataset_existing_data(tempdir):
+ directory = tempdir / 'ds'
+ table = pa.table({'b': ['x', 'y', 'z'], 'c': [1, 2, 3]})
+ partitioning = ds.partitioning(schema=pa.schema(
+ [pa.field('c', pa.int64())]), flavor='hive')
+
+ def compare_tables_ignoring_order(t1, t2):
+ df1 = t1.to_pandas().sort_values('b').reset_index(drop=True)
+ df2 = t2.to_pandas().sort_values('b').reset_index(drop=True)
+ assert df1.equals(df2)
+
+ # First write is ok
+ ds.write_dataset(table, directory, partitioning=partitioning, format='ipc')
+
+ table = pa.table({'b': ['a', 'b', 'c'], 'c': [2, 3, 4]})
+
+ # Second write should fail
+ with pytest.raises(pa.ArrowInvalid):
+ ds.write_dataset(table, directory,
+ partitioning=partitioning, format='ipc')
+
+ extra_table = pa.table({'b': ['e']})
+ extra_file = directory / 'c=2' / 'foo.arrow'
+ pyarrow.feather.write_feather(extra_table, extra_file)
+
+ # Should be ok and overwrite with overwrite behaivor
Review comment:
```suggestion
# Should be ok and overwrite with overwrite behavior
```
##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None,
format=None,
def file_visitor(written_file):
visited_paths.append(written_file.path)
+ existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+ 'delete_matching'
+ Controls how the dataset will handle data that already exists in
+ the destination. The default behavior (error) is to raise an error
+ if any data exists in the destination.
+
+ overwrite_or_ignore will ignore any existing data and will
+ overwrite files with the same name as an output file. Other
+ existing files will be ignored. This behavior, in combination
+ with a unique basename_template for each write, will allow for
+ an append workflow.
+
+ delete_matching is useful when you are writing a partitioned
Review comment:
```suggestion
'delete_matching' is useful when you are writing a partitioned
```
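
For readers following the docstring wording, here is a toy, pure-Python model of the three modes. It is only a sketch of the semantics as described in the docstring, not pyarrow's implementation: `apply_write` is a hypothetical helper, the destination is modelled as a dict of relative path to content, and `ValueError` stands in for the `pa.ArrowInvalid` raised in the test. The `'delete_matching'` branch assumes that every partition directory receiving new files is cleared before the write, which is my reading of the intent rather than something quoted in this hunk.

```python
from pathlib import PurePosixPath


def apply_write(existing, new_files, behavior="error"):
    """Toy model: return the destination contents after one write.

    `existing` and `new_files` map relative paths to file contents.
    """
    if behavior == "error":
        # Refuse to write if anything is already in the destination.
        if existing:
            raise ValueError("destination already contains data")
        return dict(new_files)
    if behavior == "overwrite_or_ignore":
        # Same-named files are replaced; all other files are left alone.
        return {**existing, **new_files}
    if behavior == "delete_matching":
        # Assumption: directories that receive new files are emptied first.
        touched = {PurePosixPath(p).parent for p in new_files}
        kept = {
            path: content
            for path, content in existing.items()
            if PurePosixPath(path).parent not in touched
        }
        return {**kept, **new_files}
    raise ValueError(f"unknown behavior: {behavior!r}")


# Layout loosely mirroring the test: partitions c=1..3 plus a stray file.
existing = {
    "c=1/part-0.arrow": "x",
    "c=2/part-0.arrow": "y",
    "c=2/foo.arrow": "e",
    "c=3/part-0.arrow": "z",
}
new_files = {
    "c=2/part-0.arrow": "a",
    "c=3/part-0.arrow": "b",
    "c=4/part-0.arrow": "c",
}

overwritten = apply_write(existing, new_files, "overwrite_or_ignore")
replaced = apply_write(existing, new_files, "delete_matching")

print(sorted(overwritten))  # c=2/foo.arrow survives
print(sorted(replaced))     # c=2/foo.arrow is gone, c=1 is untouched
```

In this model the append workflow from the docstring falls out of `'overwrite_or_ignore'`: if each write uses file names that do not collide with earlier ones, nothing is replaced and the new files simply land next to the old ones.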