Bogdan Klichuk created ARROW-6481:
-------------------------------------

             Summary: Bad performance of read_csv() with column_types
                 Key: ARROW-6481
                 URL: https://issues.apache.org/jira/browse/ARROW-6481
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: ubuntu xenial
            Reporter: Bogdan Klichuk
         Attachments: 20k_cols.csv
Case: a dataset with 20k columns. The number of rows can be 0.

`pyarrow.csv.read_csv()` works fine if no convert_options are provided: it takes about 700 ms.

Now I call `read_csv()` with a column_types mapping that marks 2000 of these columns as string:

`pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))`

(K1..K19999 are column names in the attached dataset.)

My overall goal is to read everything as string and avoid any type inference. This call takes several minutes and consumes around 4 GB of memory. This doesn't look sane at all.
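For reference, a minimal sketch of a reproduction based on the description above. The generated CSV is an assumption standing in for the attached 20k_cols.csv (a header row of 20,000 columns and no data rows); the two read_csv() calls mirror the fast and slow paths reported:

```python
import pyarrow as pa
import pyarrow.csv as csv

# Assumption: synthesize a wide CSV similar to the attached 20k_cols.csv,
# with 20,000 columns named K0..K19999 and zero data rows.
with open('20k_cols.csv', 'w') as f:
    f.write(','.join('K%d' % i for i in range(20000)) + '\n')

# Fast path: no convert_options (reported ~700 ms).
table = csv.read_csv('20k_cols.csv')

# Slow path: mapping 2000 columns to string
# (reported to take several minutes and ~4 GB of memory).
opts = csv.ConvertOptions(
    column_types={'K%d' % i: pa.string() for i in range(2000)})
table = csv.read_csv('20k_cols.csv', convert_options=opts)
```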
--
This message was sent by Atlassian Jira
(v8.3.2#803003)