[jira] [Created] (ARROW-7727) Unable to read a ParquetDataset when schema validation is on.

Jira Thu, 30 Jan 2020 06:49:45 -0800

Otávio Vasques created ARROW-7727:
-------------------------------------

             Summary: Unable to read a ParquetDataset when schema validation is on.
                 Key: ARROW-7727
                 URL: https://issues.apache.org/jira/browse/ARROW-7727
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment:
_libgcc_mutex             0.1                        main
arrow-cpp                 0.15.1           py37h982ac2c_6    conda-forge
attrs                     19.3.0                     py_0    conda-forge
backcall                  0.1.0                      py_0    conda-forge
bleach                    3.1.0                      py_0    conda-forge
boost-cpp                 1.70.0               h8e57a91_2    conda-forge
brotli                    1.0.7             he1b5a44_1000    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
certifi                   2019.11.28               py37_0    conda-forge
decorator                 4.4.1                      py_0    conda-forge
defusedxml                0.6.0                      py_0    conda-forge
double-conversion         3.1.5                he1b5a44_2    conda-forge
entrypoints               0.3                   py37_1000    conda-forge
gflags                    2.2.2             he1b5a44_1002    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
grpc-cpp                  1.25.0               h213be95_2    conda-forge
icu                       64.2                 he1b5a44_1    conda-forge
importlib_metadata        1.4.0                    py37_0    conda-forge
inflect                   4.0.0                    py37_1    conda-forge
ipykernel                 5.1.4            py37h5ca1d4c_0    conda-forge
ipython                   7.11.1           py37h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jaraco.itertools          5.0.0                      py_0    conda-forge
jedi                      0.16.0                   py37_0    conda-forge
jinja2                    2.10.3                     py_0    conda-forge
jsonschema                3.2.0                    py37_0    conda-forge
jupyter_client            5.3.4                    py37_1    conda-forge
jupyter_core              4.6.1                    py37_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7  
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libedit                   3.1.20181209         hc058e9b_0  
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_4    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge
libprotobuf               3.11.0               h8b12597_0    conda-forge
libsodium                 1.0.17               h516909a_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0  
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
markupsafe                1.1.1            py37h516909a_0    conda-forge
mistune                   0.8.4           py37h516909a_1000    conda-forge
more-itertools            8.1.0                      py_0    conda-forge
nbconvert                 5.6.1                    py37_0    conda-forge
nbformat                  5.0.4                      py_0    conda-forge
ncurses                   6.1                  he6710b0_1  
notebook                  6.0.3                    py37_0    conda-forge
numpy                     1.17.5           py37h95a1406_0    conda-forge
openssl                   1.1.1d               h516909a_0    conda-forge
pandas                    0.25.3           py37hb3f55d8_0    conda-forge
pandoc                    2.9.1.1                       0    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.6.0                      py_0    conda-forge
pexpect                   4.8.0                    py37_0    conda-forge
pickleshare               0.7.5                 py37_1000    conda-forge
pip                       20.0.2                   py37_0  
prometheus_client         0.7.1                      py_0    conda-forge
prompt_toolkit            3.0.2                      py_0    conda-forge
ptyprocess                0.6.0                   py_1001    conda-forge
pyarrow                   0.15.1           py37h8b68381_1    conda-forge
pygments                  2.5.2                      py_0    conda-forge
pyrsistent                0.15.7           py37h516909a_0    conda-forge
python                    3.7.6                h0371630_2  
python-dateutil           2.8.1                      py_0    conda-forge
pytz                      2019.3                     py_0    conda-forge
pyzmq                     18.1.1           py37h1768529_0    conda-forge
re2                       2020.01.01           he1b5a44_0    conda-forge
readline                  7.0                  h7b6447c_5  
send2trash                1.5.0                      py_0    conda-forge
setuptools                45.1.0                   py37_0  
six                       1.14.0                   py37_0    conda-forge
snappy                    1.1.7             he1b5a44_1003    conda-forge
sqlite                    3.30.1               h7b6447c_0  
terminado                 0.8.3                    py37_0    conda-forge
testpath                  0.4.4                      py_0    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.8                hbc83047_0  
tornado                   6.0.3            py37h516909a_0    conda-forge
traitlets                 4.3.3                    py37_0    conda-forge
uriparser                 0.9.3                he1b5a44_1    conda-forge
wcwidth                   0.1.8                      py_0    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.33.6                   py37_0  
xz                        5.2.4                h14c3975_4  
zeromq                    4.3.2                he1b5a44_2    conda-forge
zipp                      2.1.0                      py_0    conda-forge
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.4                h3b9ef0a_1    conda-forge
            Reporter: Otávio Vasques
             Fix For: 0.16.0



I was trying to read a subset of my Parquet files using the ParquetDataset 
object with a predefined schema. When it tries to validate the schema, 
`to_arrow_schema` is called on it, and the schema object does not support 
this method. I don't know what is happening; this is a sample:

 

``` python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])

# ... (code that builds file_groups is omitted)

pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)
# Raises:
# AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
```

If we check the type of the schema as defined above we get:
```
type(schema)
pyarrow.lib.Schema
```
But the required type according to the docs is `pyarrow.parquet.Schema`. I 
don't know how to produce an object of this type, since we are forbidden to 
use the Schema constructor directly.

If we check the implementation on GitHub, we land directly on this line 
[here|https://github.com/apache/arrow/blob/apache-arrow-0.15.1/python/pyarrow/parquet.py#L1097]:
 
```
dataset_schema = self.schema.to_arrow_schema()
```

Is this a problem in the schema builder or the parquet dataset object?
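
To make the question concrete, here is a sketch of the kind of guard that would accept either schema type before validation. `as_arrow_schema` is a hypothetical helper, not part of pyarrow and not necessarily how this should be fixed, and `FakeParquetSchema` is only a stand-in so the example runs without any Parquet files.

```python
def as_arrow_schema(schema):
    """Hypothetical helper: return an Arrow-style schema for either input.

    Objects that expose to_arrow_schema() are converted; anything else is
    assumed to already be an Arrow schema and is returned unchanged.
    """
    convert = getattr(schema, "to_arrow_schema", None)
    if callable(convert):
        return convert()
    return schema


class FakeParquetSchema:
    """Stand-in for a Parquet-level schema; this is not a pyarrow class."""

    def __init__(self, arrow_schema):
        self._arrow_schema = arrow_schema

    def to_arrow_schema(self):
        return self._arrow_schema


# Stand-in for an Arrow schema: any object without to_arrow_schema().
arrow_like = ("field1", "field2", "field3")

# Both inputs normalize to the same Arrow-style object.
print(as_arrow_schema(arrow_like) is arrow_like)
print(as_arrow_schema(FakeParquetSchema(arrow_like)) is arrow_like)
```

With a guard like this, the line quoted above would no longer fail when the user passes a `pyarrow.lib.Schema`, but whether the dataset should accept both types is exactly what I am asking.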



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
