[ https://issues.apache.org/jira/browse/ARROW-9474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158007#comment-17158007 ]
Antoine Pitrou commented on ARROW-9474:
---------------------------------------

Well, the aim is to produce a homogeneous stream of record batches that all share the same types. We could perhaps add an option to return record batches with different types.

cc [~npr]

> [C++] Column type inference in read_csv vs. open_csv. CSV conversion error to null
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-9474
>                 URL: https://issues.apache.org/jira/browse/ARROW-9474
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Sep Dehpour
>            Priority: Minor
>
> The open_csv stream does not adjust the inferred column type based on the new
> data seen in new blocks.
> For example, if a CSV has null values in the first few blocks of the open_csv
> reader, the column is inferred as Null type. As PyArrow iterates over blocks
> and sees non-null values in that column, it crashes.
> Example Error:
> {code:java}
> pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400'
> {code}
>
> This problem is resolved if a read_option with a huge block size is passed to
> open_csv. But that negates the whole point of having a stream vs. read_csv.
>
> System info:
> PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4
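
To make the failure mode in the report concrete, here is a toy model in plain Python. It is only a sketch and not Arrow's implementation: `infer_type`, `convert`, `stream_column`, and the sample data are hypothetical names invented for illustration, and the assumption is simply that the column type is inferred from the first block and then frozen for the rest of the stream.

```python
import csv
import io


def infer_type(values):
    """Toy inference: 'null' if every cell is empty, else 'int64' or 'string'."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "null"
    try:
        for v in non_empty:
            int(v)
        return "int64"
    except ValueError:
        return "string"


def convert(value, type_name):
    """Convert one cell to the frozen column type, or raise ValueError."""
    if type_name == "null":
        if value != "":
            raise ValueError(
                f"CSV conversion error to null: invalid value {value!r}"
            )
        return None
    if value == "":
        return None
    return int(value) if type_name == "int64" else value


def stream_column(text, column, rows_per_block, column_type=None):
    """Yield one list of converted cells per block of rows.

    If column_type is None, it is inferred from the first block only
    and never revised, which is the behavior the report describes.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    idx = header.index(column)
    cells = [row[idx] for row in reader]
    for start in range(0, len(cells), rows_per_block):
        block = cells[start:start + rows_per_block]
        if column_type is None:
            column_type = infer_type(block)  # frozen after the first block
        yield [convert(c, column_type) for c in block]


# Hypothetical sample: the first two rows have an empty "value" cell.
DATA = "id,value\n1,\n2,\n3,-176400\n4,12\n"

# Small blocks: the first block is all empty, so the type is frozen as null
# and the second block fails to convert.
try:
    list(stream_column(DATA, "value", rows_per_block=2))
except ValueError as exc:
    print("small blocks:", exc)

# One huge block (the workaround from the report): inference sees every row.
print("huge block:", list(stream_column(DATA, "value", rows_per_block=4)))

# Declaring the type up front keeps small blocks and still converts.
print("explicit type:",
      list(stream_column(DATA, "value", rows_per_block=2, column_type="int64")))
```

In this toy model, declaring the type up front avoids the error without giving up small blocks. In PyArrow the analogous step would presumably be to pass explicit types through `pyarrow.csv.ConvertOptions(column_types=...)` to `open_csv`, although the report does not say whether that was tried.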