Otávio Vasques created ARROW-7727:
-------------------------------------

Summary: Unable to read a ParquetDataset when schema validation is on.
Key: ARROW-7727
URL: https://issues.apache.org/jira/browse/ARROW-7727
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment:
    _libgcc_mutex 0.1 main
    arrow-cpp 0.15.1 py37h982ac2c_6 conda-forge
    attrs 19.3.0 py_0 conda-forge
    backcall 0.1.0 py_0 conda-forge
    bleach 3.1.0 py_0 conda-forge
    boost-cpp 1.70.0 h8e57a91_2 conda-forge
    brotli 1.0.7 he1b5a44_1000 conda-forge
    bzip2 1.0.8 h516909a_2 conda-forge
    c-ares 1.15.0 h516909a_1001 conda-forge
    ca-certificates 2019.11.28 hecc5488_0 conda-forge
    certifi 2019.11.28 py37_0 conda-forge
    decorator 4.4.1 py_0 conda-forge
    defusedxml 0.6.0 py_0 conda-forge
    double-conversion 3.1.5 he1b5a44_2 conda-forge
    entrypoints 0.3 py37_1000 conda-forge
    gflags 2.2.2 he1b5a44_1002 conda-forge
    glog 0.4.0 he1b5a44_1 conda-forge
    grpc-cpp 1.25.0 h213be95_2 conda-forge
    icu 64.2 he1b5a44_1 conda-forge
    importlib_metadata 1.4.0 py37_0 conda-forge
    inflect 4.0.0 py37_1 conda-forge
    ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
    ipython 7.11.1 py37h5ca1d4c_0 conda-forge
    ipython_genutils 0.2.0 py_1 conda-forge
    jaraco.itertools 5.0.0 py_0 conda-forge
    jedi 0.16.0 py37_0 conda-forge
    jinja2 2.10.3 py_0 conda-forge
    jsonschema 3.2.0 py37_0 conda-forge
    jupyter_client 5.3.4 py37_1 conda-forge
    jupyter_core 4.6.1 py37_0 conda-forge
    ld_impl_linux-64 2.33.1 h53a641e_7
    libblas 3.8.0 14_openblas conda-forge
    libcblas 3.8.0 14_openblas conda-forge
    libedit 3.1.20181209 hc058e9b_0
    libevent 2.1.10 h72c5cf5_0 conda-forge
    libffi 3.2.1 hd88cf55_4
    libgcc-ng 9.1.0 hdf63c60_0
    libgfortran-ng 7.3.0 hdf63c60_4 conda-forge
    liblapack 3.8.0 14_openblas conda-forge
    libopenblas 0.3.7 h5ec1e0e_6 conda-forge
    libprotobuf 3.11.0 h8b12597_0 conda-forge
    libsodium 1.0.17 h516909a_0 conda-forge
    libstdcxx-ng 9.1.0 hdf63c60_0
    lz4-c 1.8.3 he1b5a44_1001 conda-forge
    markupsafe 1.1.1 py37h516909a_0 conda-forge
    mistune 0.8.4 py37h516909a_1000 conda-forge
    more-itertools 8.1.0 py_0 conda-forge
    nbconvert 5.6.1 py37_0 conda-forge
    nbformat 5.0.4 py_0 conda-forge
    ncurses 6.1 he6710b0_1
    notebook 6.0.3 py37_0 conda-forge
    numpy 1.17.5 py37h95a1406_0 conda-forge
    openssl 1.1.1d h516909a_0 conda-forge
    pandas 0.25.3 py37hb3f55d8_0 conda-forge
    pandoc 2.9.1.1 0 conda-forge
    pandocfilters 1.4.2 py_1 conda-forge
    parquet-cpp 1.5.1 2 conda-forge
    parso 0.6.0 py_0 conda-forge
    pexpect 4.8.0 py37_0 conda-forge
    pickleshare 0.7.5 py37_1000 conda-forge
    pip 20.0.2 py37_0
    prometheus_client 0.7.1 py_0 conda-forge
    prompt_toolkit 3.0.2 py_0 conda-forge
    ptyprocess 0.6.0 py_1001 conda-forge
    pyarrow 0.15.1 py37h8b68381_1 conda-forge
    pygments 2.5.2 py_0 conda-forge
    pyrsistent 0.15.7 py37h516909a_0 conda-forge
    python 3.7.6 h0371630_2
    python-dateutil 2.8.1 py_0 conda-forge
    pytz 2019.3 py_0 conda-forge
    pyzmq 18.1.1 py37h1768529_0 conda-forge
    re2 2020.01.01 he1b5a44_0 conda-forge
    readline 7.0 h7b6447c_5
    send2trash 1.5.0 py_0 conda-forge
    setuptools 45.1.0 py37_0
    six 1.14.0 py37_0 conda-forge
    snappy 1.1.7 he1b5a44_1003 conda-forge
    sqlite 3.30.1 h7b6447c_0
    terminado 0.8.3 py37_0 conda-forge
    testpath 0.4.4 py_0 conda-forge
    thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
    tk 8.6.8 hbc83047_0
    tornado 6.0.3 py37h516909a_0 conda-forge
    traitlets 4.3.3 py37_0 conda-forge
    uriparser 0.9.3 he1b5a44_1 conda-forge
    wcwidth 0.1.8 py_0 conda-forge
    webencodings 0.5.1 py_1 conda-forge
    wheel 0.33.6 py37_0
    xz 5.2.4 h14c3975_4
    zeromq 4.3.2 he1b5a44_2 conda-forge
    zipp 2.1.0 py_0 conda-forge
    zlib 1.2.11 h7b6447c_3
    zstd 1.4.4 h3b9ef0a_1 conda-forge
Reporter: Otávio Vasques
Fix For: 0.16.0

I was trying to read a subset of my Parquet files using the `ParquetDataset` object with a predefined schema. When it tries to validate the schema, `to_arrow_schema` is called on it, and the schema object I passed does not have that method. I don't know what is happening. This is a sample:

``` python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])

...

pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
```

If we check the type of the schema as defined above, we get:

```
type(schema)
pyarrow.lib.Schema
```

But the required type according to the docs is `pyarrow.parquet.Schema`. I don't know how to produce an object of this type, since we are forbidden from using the Schema constructor directly.

If we check the implementation on GitHub, the error comes directly from this line [here|https://github.com/apache/arrow/blob/apache-arrow-0.15.1/python/pyarrow/parquet.py#L1097]:

```
dataset_schema = self.schema.to_arrow_schema()
```

Is this a problem in the schema builder or in the Parquet dataset object?