This is an automated email from the ASF dual-hosted git repository.

uwe pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new d235f69  ARROW-5436: [Python] parquet.read_table add filters keyword
d235f69 is described below

commit d235f69bc7ebfbbd03f031b2191c69244861cf4f
Author: Joris Van den Bossche <jorisvandenboss...@gmail.com>
AuthorDate: Thu Jun 6 11:35:34 2019 -0400

    ARROW-5436: [Python] parquet.read_table add filters keyword
    
    https://issues.apache.org/jira/browse/ARROW-5436
    
    I suppose `parquet.read_table` dispatched to FileSystem.read_parquet for
    historical reasons (that function was added before ParquetDataset
    existed), but calling ParquetDataset directly looks cleaner than going
    through FileSystem.read_parquet, so I changed that as well.
    
    In addition, I made sure the `memory_map` keyword is actually passed
    through; I think this was an oversight in
    https://github.com/apache/arrow/pull/2954.
    
    (Those two changes should be useful anyway, regardless of whether the
    `filters` keyword is added.)
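    
    A minimal usage sketch of the new keyword (the dataset path and the
    column name here are hypothetical):
    
        import pyarrow.parquet as pq
    
        # Only the files in partition directories satisfying the filter
        # are read.
        table = pq.read_table('my_dataset/', filters=[('year', '=', 2019)])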
    
    Author: Joris Van den Bossche <jorisvandenboss...@gmail.com>
    
    Closes #4409 from jorisvandenbossche/ARROW-5436-parquet-read_table
    and squashes the following commits:
    
    85e5b0e1 <Joris Van den Bossche> lint
    0ae1488d <Joris Van den Bossche> add test with nested list
    9baf420b <Joris Van den Bossche> add filters to read_pandas
    0df8c881 <Joris Van den Bossche> Merge remote-tracking branch 'upstream/master' into ARROW-5436-parquet-read_table
    4ea7b77d <Joris Van den Bossche> fix test
    4eb2ea7f <Joris Van den Bossche> add filters keyword
    9c10f700 <Joris Van den Bossche> fix passing of memory_map (leftover from ARROW-2807)
    896abb2a <Joris Van den Bossche> simplify read_table (use ParquetDataset directly)
---
 python/pyarrow/parquet.py            | 20 ++++++++++++--------
 python/pyarrow/tests/test_parquet.py | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index d44deee..34a9c42 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -1181,6 +1181,11 @@ memory_map : boolean, default True
     If the source is a file path, use a memory map to read file, which can
     improve performance in some environments
 {1}
+filters : List[Tuple] or List[List[Tuple]] or None (default)
+    List of filters to apply, like ``[[('x', '=', 0), ...], ...]``. This
+    implements partition-level (hive) filtering only, i.e., it prevents the
+    loading of some files of the dataset if `source` is a directory.
+    See the docstring of ParquetDataset for more details.
 
 Returns
 -------
@@ -1190,14 +1195,12 @@ Returns
 
 def read_table(source, columns=None, use_threads=True, metadata=None,
                use_pandas_metadata=False, memory_map=True,
-               filesystem=None):
+               filesystem=None, filters=None):
     if _is_path_like(source):
-        fs, path = _get_filesystem_and_path(filesystem, source)
-        return fs.read_parquet(path, columns=columns,
-                               use_threads=use_threads, metadata=metadata,
-                               use_pandas_metadata=use_pandas_metadata)
-
-    pf = ParquetFile(source, metadata=metadata)
+        pf = ParquetDataset(source, metadata=metadata, memory_map=memory_map,
+                            filesystem=filesystem, filters=filters)
+    else:
+        pf = ParquetFile(source, metadata=metadata, memory_map=memory_map)
     return pf.read(columns=columns, use_threads=use_threads,
                    use_pandas_metadata=use_pandas_metadata)
 
@@ -1212,10 +1215,11 @@ read_table.__doc__ = _read_table_docstring.format(
 
 
 def read_pandas(source, columns=None, use_threads=True, memory_map=True,
-                metadata=None):
+                metadata=None, filters=None):
     return read_table(source, columns=columns,
                       use_threads=use_threads,
-                      metadata=metadata, memory_map=True,
+                      metadata=metadata, memory_map=memory_map,
+                      filters=filters,
                       use_pandas_metadata=True)
 
 
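A sketch of the resulting dispatch in `read_table` (the file and directory
names here are hypothetical): a path-like source now goes through
ParquetDataset, so `filters` applies, while a file-like source still goes
through ParquetFile, where `filters` has no effect.

    import pyarrow.parquet as pq

    # Path-like source: handled by ParquetDataset; `filters` is honored.
    table = pq.read_table('dataset_dir/', filters=[('x', '=', 0)])

    # File-like source: handled by ParquetFile; `filters` is not applied.
    with open('single.parquet', 'rb') as f:
        table = pq.read_table(f)
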
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index 76ec864..a97e885 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1523,6 +1523,38 @@ def test_invalid_pred_op(tempdir):
                           ])
 
 
+@pytest.mark.pandas
+def test_filters_read_table(tempdir):
+    # test that filters keyword is passed through in read_table
+    fs = LocalFileSystem.get_instance()
+    base_path = tempdir
+
+    integer_keys = [0, 1, 2, 3, 4]
+    partition_spec = [
+        ['integers', integer_keys],
+    ]
+    N = 5
+
+    df = pd.DataFrame({
+        'index': np.arange(N),
+        'integers': np.array(integer_keys, dtype='i4'),
+    }, columns=['index', 'integers'])
+
+    _generate_partition_directories(fs, base_path, partition_spec, df)
+
+    table = pq.read_table(
+        base_path, filesystem=fs, filters=[('integers', '<', 3)])
+    assert table.num_rows == 3
+
+    table = pq.read_table(
+        base_path, filesystem=fs, filters=[[('integers', '<', 3)]])
+    assert table.num_rows == 3
+
+    table = pq.read_pandas(
+        base_path, filters=[('integers', '<', 3)])
+    assert table.num_rows == 3
+
+
 @pytest.yield_fixture
 def s3_example():
     access_key = os.environ['PYARROW_TEST_S3_ACCESS_KEY']

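As the new docstring notes, `filters` also accepts a list of lists of
tuples, interpreted as a disjunction (OR) of conjunctions (AND). A sketch
against the partitioned dataset built in the test above, assuming the same
one-row-per-partition layout:

    # (integers < 1) OR (integers > 3): reads only the integers=0 and
    # integers=4 partition directories.
    table = pq.read_table(
        base_path, filesystem=fs,
        filters=[[('integers', '<', 1)], [('integers', '>', 3)]])
    assert table.num_rows == 2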