Bulat Yaminov created ARROW-7972:
------------------------------------

             Summary: Allow reading CSV in chunks
                 Key: ARROW-7972
                 URL: https://issues.apache.org/jira/browse/ARROW-7972
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Bulat Yaminov
Currently in the Python API you can read a CSV using [{{pyarrow.csv.read_csv("big.csv")}}|https://arrow.apache.org/docs/python/csv.html]. There are some settings for the reader that you can pass via [{{pyarrow.csv.ReadOptions}}|https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions], but I don't see an option to read only part of the CSV file instead of the whole thing (or to start from {{skip_rows}}). As a result, if I have a big CSV file that cannot fit into memory, I cannot process it with this API.

Would it be possible to implement a chunked iterator, in a similar way to [how Pandas allows it|https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking]?

{code:python}
from pyarrow import csv

for table_chunk in csv.read_csv("big.csv", read_options=csv.ReadOptions(chunksize=1_000_000)):
    # do something with the table_chunk, e.g. filter and save to disk
    pass
{code}

Thanks in advance for your reaction.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
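In the meantime, a workaround along these lines is possible by routing through Pandas' existing chunked reader and converting each chunk with {{pyarrow.Table.from_pandas}}. A minimal sketch (the helper name {{iter_csv_chunks}} and the arguments are illustrative, not part of any existing API):

{code:python}
import pandas as pd
import pyarrow as pa


def iter_csv_chunks(path, chunksize):
    """Yield pyarrow.Table objects for successive chunks of a CSV file.

    Only one chunk of rows is held in memory at a time, so files larger
    than memory can be processed incrementally.
    """
    for df_chunk in pd.read_csv(path, chunksize=chunksize):
        yield pa.Table.from_pandas(df_chunk)


# Usage: filter each chunk and append it to an output, without ever
# materializing the whole file.
# for table_chunk in iter_csv_chunks("big.csv", 1_000_000):
#     ...
{code}

This goes through Pandas' (slower) CSV parser rather than Arrow's native one, so it is a stopgap rather than a substitute for the requested feature.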