[ https://issues.apache.org/jira/browse/ARROW-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021293#comment-17021293 ]
Tim Lantz commented on ARROW-7655:
----------------------------------

Re: my side note above, I filed https://issues.apache.org/jira/browse/ARROW-7656 as well. I see that ARROW-6536 discusses why, in the C++ API, you need to set both, and that makes perfect sense, so this is just a documentation issue.

> [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-7655
>                 URL: https://issues.apache.org/jira/browse/ARROW-7655
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>        Environment: Reproduced on Ubuntu 18.04 and OSX Catalina in Python 3.7.4.
>            Reporter: Tim Lantz
>            Priority: Major
>              Labels: CSV
>
> Originally mentioned in: [https://github.com/apache/arrow/issues/6243]
> *High level description of the issue:*
>  * It is possible ([though not documented|https://issues.apache.org/jira/browse/ARROW-7654]) to assign a Schema object to the column_types field of ConvertOptions instead of a Dict[str, DataType].
>  * Expected result: the nullable attribute of each Field in the supplied Schema, in addition to its type, is present on the Schema used when reading the CSV data.
>  * Actual result: the Field type information is present, but nullable is lost; all fields come back nullable.
> *Minimal reproduction case:*
>  * Use case notes: this is especially noticeable when using pyarrow as a means to save data with a known schema to Parquet, because ParquetWriter checks that the schema of a table being written matches the schema supplied to the writer. If that same schema is used to read the CSV data and contains a non-nullable field, a mismatch is detected, resulting in the error demonstrated below.
>
> {code:java}
> $ cat test.csv
> 0
> 1
> $ python
> >>> import pyarrow
> >>> from pyarrow import csv
> >>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)])
> >>> read_options = csv.ReadOptions(column_names=["foo"])
> >>> convert_options = csv.ConvertOptions(column_types=schema)
> >>> table = csv.read_csv("test.csv", convert_options=convert_options, read_options=read_options)
> >>> schema
> foo: bool not null
> >>> table.schema
> foo: bool
> >>> from pyarrow import parquet as pq
> >>> writer = pq.ParquetWriter("test.parquet", schema)
> >>> writer.write_table(table)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
>     raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> table:
> foo: bool vs.
> file:
> foo: bool not null
> >>> pyarrow.__version__
> '0.15.1'
> >>> exit()
> $ python --version
> Python 3.7.4{code}
>
>  * As a side note: if I don't set column_names in read_options when calling read_csv, but I do set convert_options with column_types, type inference is still performed, which seems like a bug versus what the docs state. That looks like a possibly related but independent bug; I haven't yet searched to see whether it is an open/known issue, but if someone reading this believes it should be filed with a repro case, I am happy to help! I only realized this while minimizing the repro case, as my original code was setting column_names. (See the sketch below.)
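>  * A minimal sketch of that interaction, using the same test.csv (this is my reading of the ARROW-6536 discussion; the inferred schema noted in the comments is what I would expect, not verified output):
> {code:python}
> import pyarrow
> from pyarrow import csv
>
> schema = pyarrow.schema(
>     [pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)]
> )
> convert_options = csv.ConvertOptions(column_types=schema)
>
> # No ReadOptions(column_names=...) here: the first row ("0") is treated
> # as the header, so the single column is named "0", not "foo". The
> # supplied column_types are keyed by name, never match any column, and
> # type inference runs instead.
> table = csv.read_csv("test.csv", convert_options=convert_options)
> print(table.schema)  # expected: "0: int64" (inferred), not "foo: bool"
> {code}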
> *Potential source of issue:*
>  * I have not yet looked at how hard this is to fix, but I note that [here|https://github.com/apache/arrow/blob/ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf/python/pyarrow/_csv.pyx#L411] only the name and type are passed down from each Field, so the nullable flag is dropped before it ever reaches the reader.
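>  * Until that is fixed, a possible Python-level workaround (an untested sketch; that Table.from_arrays accepts the table's columns together with a schema is my assumption here) is to re-apply the full Schema after reading, since the nullability flags are only dropped on the way into ConvertOptions:
> {code:python}
> import pyarrow
> from pyarrow import csv
>
> schema = pyarrow.schema(
>     [pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)]
> )
> read_options = csv.ReadOptions(column_names=["foo"])
> convert_options = csv.ConvertOptions(column_types=schema)
> table = csv.read_csv(
>     "test.csv", convert_options=convert_options, read_options=read_options
> )
>
> # read_csv returns "foo: bool" (nullable); re-wrap the same column data
> # under the caller's schema so ParquetWriter sees "foo: bool not null".
> # Note this sketch does not itself verify the data contains no nulls.
> fixed = pyarrow.Table.from_arrays(table.columns, schema=schema)
> assert fixed.schema == schema
> {code}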