[ 
https://issues.apache.org/jira/browse/ARROW-9474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158007#comment-17158007
 ] 

Antoine Pitrou commented on ARROW-9474:
---------------------------------------

Well, the aim is to produce a homogeneous stream of record batches that all 
have the same column types. 
We could perhaps add an option to return record batches with different types.
cc [~npr]
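
To make the failure mode concrete, here is a minimal pure-Python sketch of a streaming reader that fixes a column's type from the first block only. It is a simplified model, not Arrow's actual implementation: the names infer_type, convert, stream_column and the declared_type parameter are hypothetical and exist only for this illustration.

```python
def infer_type(values):
    """Guess a column type from one block of raw CSV strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "null"  # nothing but empty cells seen so far
    try:
        for v in non_empty:
            int(v)
    except ValueError:
        return "string"
    return "int64"


def convert(value, type_name):
    """Convert one raw string to the already-decided column type."""
    if type_name == "null":
        if value != "":
            raise ValueError(
                f"CSV conversion error to null: invalid value {value!r}"
            )
        return None
    if type_name == "int64":
        return None if value == "" else int(value)
    return value


def stream_column(rows, block_rows, declared_type=None):
    """Yield converted blocks; the type is frozen after the first block."""
    col_type = declared_type
    for start in range(0, len(rows), block_rows):
        block = rows[start:start + block_rows]
        if col_type is None:
            col_type = infer_type(block)  # inference happens only once
        yield [convert(v, col_type) for v in block]


data = ["", "", "", "-176400"]

# Small blocks: the first block is all empty, so the column is frozen as null
# and the later non-null value cannot be converted.
try:
    list(stream_column(data, block_rows=2))
    small_block_error = None
except ValueError as exc:
    small_block_error = str(exc)

# One large block: inference sees the integer and picks int64.
large_block = list(stream_column(data, block_rows=4))

# Small blocks with the type declared up front: no inference is needed.
declared = list(stream_column(data, block_rows=2, declared_type="int64"))
```

In PyArrow itself, the analogue of declared_type should be passing explicit column_types through pyarrow.csv.ConvertOptions to open_csv, which keeps the stream homogeneous without needing a huge block size.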

> [C++] Column type inference in read_csv vs. open_csv. CSV conversion error to 
> null
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-9474
>                 URL: https://issues.apache.org/jira/browse/ARROW-9474
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Sep Dehpour
>            Priority: Minor
>
> The open_csv stream does not adjust the inferred column type based on the 
> data seen in later blocks.
> For example, if a CSV file has only null values in a column in the first 
> block read by open_csv, the column is inferred as Null type. As PyArrow 
> iterates over later blocks and sees non-null values in that column, it 
> raises an error.
> Example Error:
> {code:java}
> pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: 
> invalid value '-176400' {code}
>  
> This problem is resolved if read_options with a huge block size is passed to 
> open_csv. But that negates the whole point of having a stream vs. 
> read_csv.
>  
> System info:
> PyArrow 0.17.1, Mac OS Catalina, Python 3.7.4



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
