Juan Luis Cano Rodríguez created ARROW-18084:
------------------------------------------------
Summary: "CSV parser got out of sync with chunker" on subsequent batches regardless of block size Key: ARROW-18084 URL: https://issues.apache.org/jira/browse/ARROW-18084 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 9.0.0, 7.0.0 Environment: Ubuntu Linux pyarrow 9.0.0 installed with pip (manylinux wheel) Python 3.9.0 from conda-forge GCC 9.4.0 Reporter: Juan Luis Cano Rodríguez Attachments: Screenshot 2022-10-18 at 10-11-29 JupyterLab · Orchest.png I'm trying to read a specific large CSV file (`the-reddit-climate-change-dataset-comments.csv` from [this dataset|https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset]) by batches. This is my code: {code:python} import os import pyarrow as pa from pyarrow.csv import open_csv, ReadOptions import pyarrow.parquet as pq filename = "/data/reddit-climate/the-reddit-climate-change-dataset-comments.csv" print(f"Reading {filename}...") mmap = pa.memory_map(filename) reader = open_csv(mmap) while True: try: batch = reader.read_next_batch() print(len(batch)) except StopIteration: break {code} But, after a few batches, I get an exception: {noformat} Reading /data/reddit-climate/the-reddit-climate-change-dataset-comments.csv... 1233 1279 1293 --------------------------------------------------------------------------- ArrowInvalid Traceback (most recent call last) Input In [1], in <cell line: 14>() 13 while True: 14 try: ---> 15 batch = reader.read_next_batch() 16 print(len(batch)) 17 except StopIteration: File /opt/conda/lib/python3.9/site-packages/pyarrow/ipc.pxi:683, in pyarrow.lib.RecordBatchReader.read_next_batch() File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status() ArrowInvalid: CSV parser got out of sync with chunker {noformat} I have tried changing the block size, but I always end up with that error sooner or later: - With {{read_options=ReadOptions(block_size=10_000)}}, it reads 1 batch of 11 rows and then crashes - With 100_000, 103 rows and then crashes - 1_000_000: 1164 rows and then crashes - 10_000_000: 12370 rows and then crashes I am not sure what else to try here. According to [the C++ source code|https://github.com/apache/arrow/blob/cd33544533ee7d70cd8ff7556e59ef8f1d33a176/cpp/src/arrow/csv/reader.cc#L266-L267], this "should not happen". I have tried with pyarrow 7.0 and 9.0, identical result and traceback. -- This message was sent by Atlassian Jira (v8.20.10#820010)