[
https://issues.apache.org/jira/browse/ARROW-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025026#comment-17025026
]
Joris Van den Bossche commented on ARROW-7655:
----------------------------------------------
Currently, I think the {{column_types}} option is only meant to specify the
types, while nullability is part of the Field in a Schema, and is not a
fundamental property of the _type_ itself.
> [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
> ------------------------------------------------------------------------------
>
> Key: ARROW-7655
> URL: https://issues.apache.org/jira/browse/ARROW-7655
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Reproduced on Ubuntu 18.04 and OSX Catalina in Python
> 3.7.4.
> Reporter: Tim Lantz
> Priority: Major
> Labels: CSV
>
>
> Originally mentioned in: [https://github.com/apache/arrow/issues/6243]
> *High level description of the issue:*
> * It is possible ([though not
> documented|https://issues.apache.org/jira/browse/ARROW-7654]) to assign a
> Schema object, rather than a Dict[str, DataType], to the column_types field
> of ConvertOptions.
> * Expected result: both the type and the nullable attribute of each Field in
> the supplied Schema are reflected in the schema used when reading the CSV
> data.
> * Actual result: the Field type information is preserved, but nullability is
> lost. All fields become nullable.
> *Minimal reproduction case:*
> * Use case notes: this is especially noticeable when using pyarrow as a
> means to save data with a known schema to parquet, since the ParquetWriter
> checks that the schema of a table being written matches the schema supplied
> to the writer. If that same schema is used to read the CSV data and contains
> a non-nullable field, a mismatch is detected and an error is raised, as
> demonstrated below.
>
> {code:java}
> $ cat test.csv
> 0
> 1
> $ python
> >>> import pyarrow
> >>> from pyarrow import csv
> >>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(),
> ...                                        nullable=False)])
> >>> read_options = csv.ReadOptions(column_names=["foo"])
> >>> convert_options = csv.ConvertOptions(column_types=schema)
> >>> table = csv.read_csv("test.csv", convert_options=convert_options,
> ...                      read_options=read_options)
> >>> schema
> foo: bool not null
> >>> table.schema
> foo: bool
> >>> from pyarrow import parquet as pq
> >>> writer = pq.ParquetWriter("test.parquet", schema)
> >>> writer.write_table(table)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> table:
> foo: bool vs.
> file:
> foo: bool not null
> >>> pyarrow.__version__
> '0.15.1'
> >>> exit()
> $ python --version
> Python 3.7.4{code}
>
> * As a side note: if I don't set column_names in read_options when calling
> read_csv, but do set convert_options with column_types, type inference is
> still performed, which seems to contradict what the docs state. That appears
> to be a related but independent bug. I haven't yet searched to see whether
> it is already known, but I'm happy to file it with a repro case if someone
> reading this thinks it should be. I only noticed it while minimizing the
> repro case, as my original code was setting column_names.
> *Potential source of issue:*
> * I have not yet looked at how hard this is to fix, but I note that
> [here|https://github.com/apache/arrow/blob/ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf/python/pyarrow/_csv.pyx#L411]
> only the name and type are passed down from each Field, dropping nullability.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)