Hi Ruben,

I'm not an expert here, but is it possible the CSV has newlines inside quoted fields or some similar oddity? There are a lot of configuration options for the CSV reader, and you might want to check whether the defaults are really the most conservative settings for your file.
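For example (just an untested sketch; the filename is taken from your script and the option choice is an assumption), you could explicitly tell the parser that quoted values may contain line breaks:

    import pyarrow.csv as csv

    # Hypothetical sketch: allow quoted fields to contain embedded
    # newlines, which the default parse options do not expect.
    parse_options = csv.ParseOptions(newlines_in_values=True)
    reader = csv.open_csv('inspect.csv', parse_options=parse_options)

If the reader then gets through the whole file (or the failure moves), that would point at embedded newlines or quoting as the trigger.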
-Micah

On Fri, Mar 5, 2021 at 12:40 PM Ruben Laguna <[email protected]> wrote:
> Hi,
>
> I'm getting "CSV parser got out of sync with chunker", any idea on how to
> troubleshoot this?
> If I feed the original file it fails after 1477218 rows,
> if I remove the first line after the header then it fails after 2919443 rows,
> if I remove the first 2 lines after the header then it fails after 55339 rows,
> if I remove the first 3 lines after the header then it fails after 8200437 rows,
> if I remove the first 4 lines after the header then it fails after 1866573 rows.
> To me it doesn't make sense, the failure shows up at different, seemingly
> random places.
>
> What can be causing this? Source code below ->
>
> Traceback (most recent call last):
>   File "pa_inspect.py", line 15, in <module>
>     for b in reader:
>   File "pyarrow/ipc.pxi", line 497, in __iter__
>   File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
>
> import pyarrow as pa
> from pyarrow import csv
> import pyarrow.parquet as pq
>
> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
> reader = csv.open_csv('inspect.csv')
>
> # ParquetWriter: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
> # RecordBat
> # http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> crow = 0
> with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
>     for b in reader:
>         print(b.num_rows, b.num_columns)
>         crow = crow + b.num_rows
>         print(crow)
>         writer.write_table(pa.Table.from_batches([b]))
>
> --
> /Rubén
