[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11632: ARROW-14620: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

GitBox Mon, 08 Nov 2021 10:54:54 -0800


jorisvandenbossche commented on a change in pull request #11632:
URL: https://github.com/apache/arrow/pull/11632#discussion_r745003542




##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
 
             def file_visitor(written_file):
                 visited_paths.append(written_file.path)
+    existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+                             'delete_matching'
+        Controls how the dataset will handle data that already exists in
+        the destination.  The default behavior (error) is to raise an error

Review comment:
       ```suggestion
           the destination.  The default behavior ('error') is to raise an error
   ```

##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
 
             def file_visitor(written_file):
                 visited_paths.append(written_file.path)
+    existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+                             'delete_matching'

Review comment:
       ```suggestion
       existing_data_behavior : 'error' | 'overwrite_or_ignore' | \
   'delete_matching'
   ```
   
   (I know this is ugly in the source code, but that's how to get it to render properly in the Sphinx docs: the `name : type` part should be a single line for rendering purposes.)
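
A minimal sketch of why the trailing backslash works, assuming an ordinary (non-raw) docstring: Python drops a backslash followed by a newline inside the string literal, so the docstring ends up with the `name : type` part on one line. The function `write_dataset_stub` is a hypothetical stand-in used only for illustration, not the real `write_dataset`.

```python
def write_dataset_stub(existing_data_behavior="error"):
    """Hypothetical stand-in used only to show the docstring layout.

    Parameters
    ----------
    existing_data_behavior : 'error' | 'overwrite_or_ignore' | \
'delete_matching'
        Controls how data that already exists in the destination is handled.
    """
    return existing_data_behavior


# The backslash-newline pair is removed when the string literal is parsed,
# so the parameter entry is a single line in the resulting docstring.
doc_lines = write_dataset_stub.__doc__.splitlines()
param_line = next(
    line for line in doc_lines
    if line.strip().startswith("existing_data_behavior")
)
print(param_line.strip())
```

The continuation line starts at column 0 here so that no extra spaces end up in the joined line. This would not work in a raw docstring (`r"""..."""`), where the backslash is kept literally.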

##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
 
             def file_visitor(written_file):
                 visited_paths.append(written_file.path)
+    existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+                             'delete_matching'
+        Controls how the dataset will handle data that already exists in
+        the destination.  The default behavior (error) is to raise an error
+        if any data exists in the destination.
+
+        overwrite_or_ignore will ignore any existing data and will

Review comment:
       ```suggestion
           'overwrite_or_ignore' will ignore any existing data and will
   ```

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -3481,6 +3481,55 @@ def test_write_dataset_with_dataset(tempdir):
         assert dict(load_back_table.to_pydict()) == table.to_pydict()
 
 
[email protected]
+def test_write_dataset_existing_data(tempdir):
+    directory = tempdir / 'ds'
+    table = pa.table({'b': ['x', 'y', 'z'], 'c': [1, 2, 3]})
+    partitioning = ds.partitioning(schema=pa.schema(
+        [pa.field('c', pa.int64())]), flavor='hive')
+
+    def compare_tables_ignoring_order(t1, t2):
+        df1 = t1.to_pandas().sort_values('b').reset_index(drop=True)
+        df2 = t2.to_pandas().sort_values('b').reset_index(drop=True)
+        assert df1.equals(df2)
+
+    # First write is ok
+    ds.write_dataset(table, directory, partitioning=partitioning, format='ipc')
+
+    table = pa.table({'b': ['a', 'b', 'c'], 'c': [2, 3, 4]})
+
+    # Second write should fail
+    with pytest.raises(pa.ArrowInvalid):
+        ds.write_dataset(table, directory,
+                         partitioning=partitioning, format='ipc')
+
+    extra_table = pa.table({'b': ['e']})
+    extra_file = directory / 'c=2' / 'foo.arrow'
+    pyarrow.feather.write_feather(extra_table, extra_file)
+
+    # Should be ok and overwrite with overwrite behaivor

Review comment:
       ```suggestion
       # Should be ok and overwrite with overwrite behavior
   ```

##########
File path: python/pyarrow/dataset.py
##########
@@ -798,6 +799,22 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
 
             def file_visitor(written_file):
                 visited_paths.append(written_file.path)
+    existing_data_behavior : 'error' | 'overwrite_or_ignore' |
+                             'delete_matching'
+        Controls how the dataset will handle data that already exists in
+        the destination.  The default behavior (error) is to raise an error
+        if any data exists in the destination.
+
+        overwrite_or_ignore will ignore any existing data and will
+        overwrite files with the same name as an output file.  Other
+        existing files will be ignored.  This behavior, in combination
+        with a unique basename_template for each write, will allow for
+        an append workflow.
+
+        delete_matching is useful when you are writing a partitioned

Review comment:
       ```suggestion
           'delete_matching' is useful when you are writing a partitioned
   ```
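
For readers of the docstring in the hunk above, the two behaviors it fully describes ('error' and 'overwrite_or_ignore') can be modelled with a toy, standard-library-only sketch. This is an assumption-laden illustration of the documented semantics, not pyarrow's implementation: the helper `write_files` and the file names are hypothetical, and 'delete_matching' is left out because its description is cut off in the quoted diff.

```python
import tempfile
from pathlib import Path


def write_files(base_dir, files, existing_data_behavior="error"):
    """Toy model of the documented semantics; NOT pyarrow's implementation.

    `files` maps a file name to the bytes that should be written.
    """
    base = Path(base_dir)
    if existing_data_behavior not in ("error", "overwrite_or_ignore"):
        raise ValueError(f"unsupported behavior: {existing_data_behavior!r}")
    # 'error': refuse to write if the destination already holds any data.
    if existing_data_behavior == "error" and base.exists() and any(base.iterdir()):
        raise FileExistsError(f"{base} is not empty")
    base.mkdir(parents=True, exist_ok=True)
    # 'overwrite_or_ignore': same-named files are replaced, others are untouched.
    for name, payload in files.items():
        (base / name).write_bytes(payload)


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "ds"
    # First write into an empty destination succeeds with the default.
    write_files(dest, {"part-0.arrow": b"first"})
    # Second write with the default behavior fails because data exists.
    try:
        write_files(dest, {"part-0.arrow": b"second"})
        second_write_failed = False
    except FileExistsError:
        second_write_failed = True
    # A unique file name per write gives the append workflow.
    write_files(dest, {"part-1.arrow": b"second"}, "overwrite_or_ignore")
    # Reusing a name overwrites only that file.
    write_files(dest, {"part-0.arrow": b"replaced"}, "overwrite_or_ignore")
    result = {p.name: p.read_bytes() for p in sorted(dest.iterdir())}
```

The unique name per write plays the role that a unique `basename_template` plays in the docstring: existing files are kept, so each write adds to the dataset.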




