[
https://issues.apache.org/jira/browse/ARROW-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025026#comment-17025026
]
Joris Van den Bossche commented on ARROW-7655:
----------------------------------------------
Currently, I think the {{column_types}} option is only meant to specify the
types, while nullability is part of the Field in a Schema, and is not a
fundamental property of the _type_ itself.
> [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
> ------------------------------------------------------------------------------
>
> Key: ARROW-7655
> URL: https://issues.apache.org/jira/browse/ARROW-7655
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Reproduced on Ubuntu 18.04 and OSX Catalina in Python
> 3.7.4.
> Reporter: Tim Lantz
> Priority: Major
> Labels: CSV
>
>
> Originally mentioned in: [https://github.com/apache/arrow/issues/6243]
> *High level description of the issue:*
> * It is possible ([though not
> documented|https://issues.apache.org/jira/browse/ARROW-7654]) to assign a
> Schema object, rather than a Dict[str, DataType], to the column_types field
> of ConvertOptions.
> * Expected result: both the type and the nullable attribute of each Field in
> the supplied Schema are reflected in the schema used when reading the CSV
> data.
> * Actual result: the Field type information is preserved, but nullability is
> lost. All fields become nullable.
> *Minimal reproduction case:*
> * Use case notes: this is especially noticeable when using pyarrow as a
> means to save data with a known schema to parquet, since the ParquetWriter
> checks that the schema of a table being written matches the schema supplied
> to the writer. If that same schema is used to read the CSV data and contains
> a non-nullable field, a mismatch is detected and an error is raised, as
> demonstrated below.
>
> {code:java}
> $ cat test.csv
> 0
> 1
> $ python
> >>> import pyarrow
> >>> from pyarrow import csv
> >>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(),
> ...                                        nullable=False)])
> >>> read_options = csv.ReadOptions(column_names=["foo"])
> >>> convert_options = csv.ConvertOptions(column_types=schema)
> >>> table = csv.read_csv("test.csv", convert_options=convert_options,
> ...                      read_options=read_options)
> >>> schema
> foo: bool not null
> >>> table.schema
> foo: bool
> >>> from pyarrow import parquet as pq
> >>> writer = pq.ParquetWriter("test.parquet", schema)
> >>> writer.write_table(table)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> table:
> foo: bool vs.
> file:
> foo: bool not null
> >>> pyarrow.__version__
> '0.15.1'
> >>> exit()
> $ python --version
> Python 3.7.4{code}
>
> * As a side note: if I don't set column_names in read_options when calling
> read_csv, but do set convert_options with column_types, type inference is
> still performed, which seems to contradict what the docs state. That appears
> to be a related but independent bug. I haven't yet searched to see whether
> it is already known, but I'm happy to file it with a repro case if someone
> reading this thinks it should be. I only noticed it while minimizing the
> repro case, as my original code was setting column_names.
> *Potential source of issue:*
> * I have not yet looked at how hard this is to fix, but I note that
> [here|https://github.com/apache/arrow/blob/ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf/python/pyarrow/_csv.pyx#L411]
> only the name and type are passed down from each Field, dropping nullability.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)