Bogdan Klichuk created ARROW-6481:
-------------------------------------

             Summary: Bad performance of read_csv() with column_types
                 Key: ARROW-6481
                 URL: https://issues.apache.org/jira/browse/ARROW-6481
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
         Environment: ubuntu xenial
            Reporter: Bogdan Klichuk
         Attachments: 20k_cols.csv
Case: a dataset with 20k columns. The number of rows can be 0.

`pyarrow.csv.read_csv()` works fine if no convert_options are provided: it takes about 700 ms.

Now I call `read_csv()` with a column_types mapping that marks 2000 of these columns as string:

`pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))`

(K1..K19999 are column names in the attached dataset.)

My overall goal is to read everything as string and avoid any type inference. This call takes several minutes and consumes around 4 GB of memory. This doesn't look sane at all.
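For reference, a minimal sketch of a reproduction based on the description above. The generated CSV is an assumption standing in for the attached 20k_cols.csv (a header row of 20,000 columns and no data rows); the two read_csv() calls mirror the fast and slow paths reported:

```python
import pyarrow as pa
import pyarrow.csv as csv

# Assumption: synthesize a wide CSV similar to the attached 20k_cols.csv,
# with 20,000 columns named K0..K19999 and zero data rows.
with open('20k_cols.csv', 'w') as f:
    f.write(','.join('K%d' % i for i in range(20000)) + '\n')

# Fast path: no convert_options (reported ~700 ms).
table = csv.read_csv('20k_cols.csv')

# Slow path: mapping 2000 columns to string
# (reported to take several minutes and ~4 GB of memory).
opts = csv.ConvertOptions(
    column_types={'K%d' % i: pa.string() for i in range(2000)})
table = csv.read_csv('20k_cols.csv', convert_options=opts)
```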
--
This message was sent by Atlassian Jira
(v8.3.2#803003)