[ https://issues.apache.org/jira/browse/ARROW-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393593#comment-17393593 ]

Weston Pace edited comment on ARROW-13517 at 8/5/21, 4:02 AM:
--------------------------------------------------------------

> Query results (e.g., a list of rows matching a condition) can be cached and 
> reused to re-load data without having to perform linear scans over the 
> complete data set.

I don't think you'd be able to reduce the amount of I/O actually required, since 
you'd need to load whole row groups / column chunks either way.  I do agree 
that you'd reduce the computational load somewhat, as handling indices should be 
simpler than applying a query.  However, if you're doing actual I/O the compute 
savings are likely to be a small fraction of the total cost, and if you're 
working on in-memory data you'd be better off memory mapping an IPC file.
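
For reference, a minimal sketch of the memory-mapping approach (the file path 
and columns are just illustrative, not from this issue):

{code:python}
import pyarrow as pa

# Write a small Arrow IPC file (the path is only for illustration).
table = pa.table({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
with pa.OSFile('/tmp/example.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back; the record batches reference the mapped pages
# directly, so "reading" does not copy the data into process memory.
with pa.memory_map('/tmp/example.arrow', 'r') as source:
    mapped = pa.ipc.open_file(source).read_all()
    print(mapped.to_pydict())
{code}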

> The C++ Stream API seems to support both skipping over row groups and 
> skipping over column chunks. This can potentially reduce reading by a 
> significant factor when recalling data for queries that have been processed 
> in the past.

The datasets API supports both of these types of skips.
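
For example (the path is hypothetical), a dataset scan only reads the projected 
column chunks and can skip row groups based on their statistics:

{code:python}
import pyarrow.dataset as ds

# Hypothetical directory of Parquet files.
dataset = ds.dataset('/tmp/my_parquet_dir', format='parquet')

# The column projection skips the other column chunks entirely, and the
# filter lets the scanner skip row groups whose statistics show that no
# row can match.
table = dataset.to_table(columns=['x'], filter=ds.field('x') > 100)
{code}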

> I believe that the AWS S3 select has similar capabilities

From the S3 Select documentation 
(https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html) 
I see "For Parquet objects, all of the row groups that start within the scan 
range requested are processed." so I believe it is doing the same column-chunk 
and row-group skipping that the dataset API currently supports.  There will be 
an advantage with S3 Select in that the parquet metadata is read and processed 
in the data center, although I am not sure how much of a difference that will 
make.
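
For comparison, an S3 Select call against a Parquet object looks roughly like 
the following (the bucket, key, and query are hypothetical; this is a sketch of 
the boto3 call, not anything Arrow-specific):

{code:python}
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket/key; S3 evaluates the SQL server-side and only
# streams back the matching rows.
response = s3.select_object_content(
    Bucket='my-bucket',
    Key='data/foo.parquet',
    ExpressionType='SQL',
    Expression="SELECT s.x, s.y FROM s3object s WHERE s.x > 100",
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}},
)

for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
{code}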

> I get much faster performance than the performance I see on my desktop Python

Can you describe what you are trying in Python on your desktop and which S3 
Select query you are running to get the comparable results?

> How hard will it be to build this logic into Python to realize the above 
> savings? While it might not be trivial to implement, for certain cases it 
> will be extremely valuable.

Given that you are going to have to load the entire column chunk into memory 
either way, you could probably do this in pure Python using the compute module 
with something like...

{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # for the first-pass filtering sketched below
import pyarrow.parquet as pq

# Write a small example file.
table = pa.Table.from_pydict({'x': [1, 2, 3], 'y': ['x', 'y', 'z']})
pq.write_table(table, '/tmp/foo.parquet')

# Read one row group, then pull out only the rows you want.  The index
# list here stands in for the output of a first pass that evaluated the
# (complex) condition with pyarrow.compute.
pfile = pq.ParquetFile('/tmp/foo.parquet')
row_group = pfile.read_row_group(0)
row_group.take([0, 2]).to_pydict()
{code}
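
To connect this with the two-pass idea from the description, the first pass 
could compute that index list with the compute module before the second pass 
calls take. A minimal sketch (not from the original comment), reusing the 
/tmp/foo.parquet file above and using x > 1 as a stand-in for a more complex 
condition such as field1 + field2 > field3:

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

pfile = pq.ParquetFile('/tmp/foo.parquet')

# First pass: read only the columns needed to evaluate the condition.
filter_cols = pfile.read_row_group(0, columns=['x'])
mask = pc.greater(filter_cols['x'], 1)  # stand-in for a more complex condition

# Convert the boolean mask into the row indices that matched.
indices = [i for i, keep in enumerate(mask.to_pylist()) if keep]

# Second pass: materialize just the matching rows from the row group.
pfile.read_row_group(0).take(indices).to_pydict()
{code}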


> Selective reading of rows for parquet file
> ------------------------------------------
>
>                 Key: ARROW-13517
>                 URL: https://issues.apache.org/jira/browse/ARROW-13517
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Yair Lenga
>            Priority: Major
>
> The current interface for selective reading is to use *filters* 
> [https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
> The approach works well when the filters are simple (field in (v1, v2, v3, 
> …)) and when the number of columns is small. It does not work well for the 
> following conditions, which currently require reading the complete data set 
> into (Python) memory:
>  * when the condition is complex (e.g. a condition between attributes: field1 
> + field2 > field3)
>  * when the file has many columns (making it costly to create Python 
> structures).
> I have a repository of a large number of parquet files (thousands of files, 
> 500 MB each, 200 columns), where specific records have to be selected quickly 
> based on a logical condition that does not fit the filter syntax. A very 
> small number of rows (<500) has to be returned.
> The proposed feature is to extend read_row_group to support passing an array 
> of rows to read (a list of integers in ascending order).
> {code:python}
> pf = pyarrow.parquet.ParquetFile(…)
> dd = pf.read_row_group(…, rows=[5, 35, …])
> {code}
> Using this method will enable complex filtering in two stages, eliminating 
> the need to read all rows into memory.
>  # First pass - read the attributes needed for filtering and collect the row 
> numbers that match the (complex) condition.
>  # Second pass - create a Python table with the matching rows using the 
> proposed rows= parameter to read_row_group.
> I believe it is possible to achieve something similar using the C++ 
> stream_reader 
> ([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]),
>  which is not exposed to Python.


