Bulat Yaminov created ARROW-7972:
------------------------------------

             Summary: Allow reading CSV in chunks
                 Key: ARROW-7972
                 URL: https://issues.apache.org/jira/browse/ARROW-7972
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Bulat Yaminov


Currently in the Python API you can read a CSV using 
[{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html].
 There are some settings you can pass to the reader via 
[{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions],
 but I don't see an option to read only part of the CSV file instead of the 
whole thing (other than starting from {{skip_rows}}). As a result, if I have a 
big CSV file that cannot fit into memory, I cannot process it with this API.

Would it be possible to implement a chunked iterator, similar to the way 
[Pandas 
allows 
it|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]?
{code:python}
from pyarrow import csv

read_opts = csv.ReadOptions(chunksize=1_000_000)
for table_chunk in csv.read_csv("big.csv", read_options=read_opts):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}
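In the meantime, a workaround sketch: chunked reading can be emulated with Pandas' existing {{chunksize}} support, converting each chunk to an Arrow Table with {{pyarrow.Table.from_pandas}}. Note this goes through Pandas rather than Arrow's native CSV reader, so it won't match {{pyarrow.csv.read_csv}} in speed or type inference; the function name {{iter_csv_chunks}} is just an illustration, not an existing API:

{code:python}
import pandas as pd
import pyarrow as pa


def iter_csv_chunks(path, chunksize=1_000_000):
    """Yield pyarrow.Table chunks of at most `chunksize` rows from a CSV.

    Workaround only: uses pandas' chunked reader under the hood, then
    converts each pandas DataFrame chunk to an Arrow Table.
    """
    for df_chunk in pd.read_csv(path, chunksize=chunksize):
        yield pa.Table.from_pandas(df_chunk, preserve_index=False)
{code}

Each yielded table can then be filtered and written out (e.g. to Parquet) without ever materializing the whole file in memory.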

Thanks in advance for your feedback.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
