[https://issues.apache.org/jira/browse/ARROW-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393579#comment-17393579]

Yair Lenga edited comment on ARROW-13517 at 8/5/21, 3:25 AM:
-------------------------------------------------------------

Thanks for pointing to the new Dataset API. For my situation (reading a small 
number of rows from a large data set), I believe it would be beneficial if the 
above were implemented. In particular, two benefits:
 * Query results (e.g., the list of rows matching a condition) can be cached 
and reused to re-load the data without performing linear scans over the 
complete data set.
 * The C++ stream API appears to support skipping over both row groups and 
column chunks. This could reduce reading by a significant factor when 
recalling data for queries that have been processed in the past.

How hard would it be to build this logic into Python to realize these savings? 
While it might not be trivial to implement, for certain cases it would be 
extremely valuable.
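
For what it's worth, below is a minimal sketch (assuming the current pyarrow 
API: ParquetFile.read_row_group plus Table.take; the file path and row list 
are hypothetical) of how the two-pass read could be approximated today. It 
skips row groups that contain none of the requested rows, but still decodes 
each touched row group in full, which is the gap a native rows= parameter 
would close:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def read_rows(path, rows, columns=None):
    """Read the given global row indices (ascending) from a parquet
    file, touching only the row groups that contain them."""
    pf = pq.ParquetFile(path)
    pieces, start, i = [], 0, 0
    for rg in range(pf.num_row_groups):
        n = pf.metadata.row_group(rg).num_rows
        local = []
        # Collect the requested indices that fall inside this row group.
        while i < len(rows) and rows[i] < start + n:
            local.append(rows[i] - start)
            i += 1
        if local:
            # Decodes the whole row group, then keeps only the wanted rows.
            table = pf.read_row_group(rg, columns=columns)
            pieces.append(table.take(local))
        start += n
        if i == len(rows):
            break
    return pa.concat_tables(pieces)

# e.g. read_rows("part-0000.parquet", [5, 35, 1200], columns=["field1"])
{code}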

I believe that AWS S3 Select 
([https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-select.html]) 
has similar capabilities. For situations like the one I've described, it 
delivers results much faster than what I see with Python on my desktop, 
leading me to believe that they have figured out a way to selectively skip 
over Parquet data quickly.
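
For comparison, a hedged sketch of such an S3 Select call over a parquet 
object (the bucket, key, and field names are made up):

{code:python}
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="my-bucket",             # hypothetical bucket
    Key="data/part-0000.parquet",   # hypothetical key
    ExpressionType="SQL",
    Expression=(
        "SELECT s.field1, s.field2 FROM s3object s "
        "WHERE s.field1 + s.field2 > s.field3"
    ),
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)
# Results stream back as events; print the record payloads.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
{code}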


> Selective reading of rows for parquet file
> ------------------------------------------
>
>                 Key: ARROW-13517
>                 URL: https://issues.apache.org/jira/browse/ARROW-13517
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Yair Lenga
>            Priority: Major
>
> The current interface for selective reading is to use *filters* 
> [https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
> The approach works well when the filters are simple (field in (v1, v2, v3, 
> …)) and when the number of columns is small. It does not work well under the 
> following conditions, which currently require reading the complete data set 
> into (Python) memory.
>  * when the condition is complex (e.g. a condition between attributes: 
> field1 + field2 > field3)
>  * when the file has many columns (making it costly to create Python 
> structures).
> I have a repository with a large number of parquet files (thousands of 
> files, 500 MB each, 200 columns), where specific records have to be selected 
> quickly based on a logical condition that does not fit the filter mechanism. 
> Very small numbers of rows (<500) have to be returned.
> The proposed feature is to extend read_row_group to support passing an array 
> of rows to read (a list of integers in ascending order).
> {code:python}
> pf = pyarrow.parquet.ParquetFile(…)
> dd = pf.read_row_group(…, rows=[5, 35, …])
> {code}
> Using this method will enable complex filtering in two stages, eliminating 
> the need to read all rows into memory.
>  # First pass: read the attributes needed for filtering and collect the row 
> numbers that match the (complex) condition (see the sketch after this list).
>  # Second pass: create a Python table with the matching rows, using the 
> proposed rows= parameter to read_row_group.
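> A sketch of the first pass, assuming pyarrow.compute and made-up file and 
> field names:
> {code:python}
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
> 
> pf = pq.ParquetFile("data.parquet")
> # Read only the columns needed to evaluate the condition.
> t = pf.read(columns=["field1", "field2", "field3"])
> mask = pc.greater(pc.add(t["field1"], t["field2"]), t["field3"])
> # Row numbers that would feed the proposed rows= parameter.
> rows = [i for i, hit in enumerate(mask.to_pylist()) if hit]
> {code}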
> I believe it is possible to achieve something similar using the C++ 
> stream_reader 
> ([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]), 
> which is not exposed to Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
