[ https://issues.apache.org/jira/browse/ARROW-10008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman reassigned ARROW-10008:
------------------------------------

    Assignee: Ben Kietzman

> [Python] pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False
> -------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10008
> URL: https://issues.apache.org/jira/browse/ARROW-10008
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.17.1, 1.0.1
> Environment: Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
>     Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
>     [GCC 7.3.0]
>     Pandas version: 1.1.2
>     pyarrow version: 1.0.1
> Reporter: Caleb Hattingh
> Assignee: Ben Kietzman
> Priority: Major
> Labels: categorical, category, dataset, filters, parquet, predicate
> Fix For: 2.0.0
>
>
> I apologise if this is a known issue; I looked both in this issue tracker and
> on GitHub and didn't find it.
> There seems to be a problem reading a dataset with predicate pushdown
> (filters) on columns with categorical data. The problem only occurs with
> `use_legacy_dataset=False` (although with `use_legacy_dataset=True` the filter
> has no effect unless the column is a partition key).
> Reproducer:
> {code:python}
> import shutil
> import sys, platform
> from pathlib import Path
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # Settings
> CATEGORICAL_DTYPE = True
> USE_LEGACY_DATASET = False
> print('Platform:', platform.platform())
> print('Python version:', sys.version)
> print('Pandas version:', pd.__version__)
> print('pyarrow version:', pa.__version__)
> print('categorical enabled:', CATEGORICAL_DTYPE)
> print('use_legacy_dataset:', USE_LEGACY_DATASET)
> print()
> # Clean up test dataset if present
> path = Path('blah.parquet')
> if path.exists():
>     shutil.rmtree(str(path))
> # Simple data
> d = dict(col1=['a', 'b'], col2=[1, 2])
> # Either categorical or not
> if CATEGORICAL_DTYPE:
>     df = pd.DataFrame(data=d, dtype='category')
> else:
>     df = pd.DataFrame(data=d)
> # Write dataset
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, str(path))
> # Load dataset
> table = pq.read_table(
>     str(path),
>     filters=[('col1', '=', 'a')],
>     use_legacy_dataset=USE_LEGACY_DATASET,
> )
> df = table.to_pandas()
> print(df.dtypes)
> print(repr(df))
> {code}
> Output:
> {code:java}
> $ python categorical_predicate_pushdown.py
> Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
> categorical enabled: True
> use_legacy_dataset: False
> /arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary<values=string, indices=int32, ordered=0> vs string
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZZN5arrow7dataset19GetScanTaskIteratorENS_8IteratorISt10shared_ptrINS0_8FragmentEEEES2_INS0_11ScanOptionsEES2_INS0_11ScanContextEEENKUlS4_E_clES4_+0x91)[0x7f50433485a1]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorINS0_ISt10shared_ptrINS_7dataset8ScanTaskEEEEE4NextINS_11MapIteratorIZNS2_19GetScanTaskIteratorENS0_IS1_INS2_8FragmentEEEES1_INS2_11ScanOptionsEES1_INS2_11ScanContextEEEUlSA_E_SA_S5_EEEENS_6ResultIS5_EEPv+0xde)[0x7f504334b55e]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow15FlattenIteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextEv+0x127)[0x7f50433616b7]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow8IteratorISt10shared_ptrINS_7dataset8ScanTaskEEE4NextINS_15FlattenIteratorIS4_EEEENS_6ResultIS4_EEPv+0x14)[0x7f5043361874]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset7Scanner7ToTableEv+0x611)[0x7f5043336691]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x3b150)[0x7f50435c9150]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2c0eb)[0x7f50435ba0eb]
> /home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/_dataset.cpython-38-x86_64-linux-gnu.so(+0x2d9ab)[0x7f50435bb9ab]
> python(PyCFunction_Call+0x56)[0x562843a6dce6]
> python(_PyObject_MakeTpCall+0x22f)[0x562843a2b5cf]
> python(_PyEval_EvalFrameDefault+0x11d7)[0x562843aaf727]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(+0x18bb80)[0x562843a79b80]
> python(+0x1001e3)[0x5628439ee1e3]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(_PyFunction_Vectorcall+0x1e3)[0x562843a797a3]
> python(+0x1001e3)[0x5628439ee1e3]
> python(_PyEval_EvalCodeWithName+0x2d2)[0x562843a78802]
> python(PyEval_EvalCodeEx+0x44)[0x562843a795b4]
> python(PyEval_EvalCode+0x1c)[0x562843b07bdc]
> python(+0x219c84)[0x562843b07c84]
> python(+0x24be94)[0x562843b39e94]
> python(PyRun_FileExFlags+0xa1)[0x562843a0279a]
> python(PyRun_SimpleFileExFlags+0x3b4)[0x562843a02b7f]
> python(+0x115a44)[0x562843a03a44]
> python(Py_BytesMain+0x39)[0x562843b3c9b9]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5058f2a0b3]
> python(+0x1dea83)[0x562843acca83]
> Aborted (core dumped)
> {code}
> With `CATEGORICAL_DTYPE = False`, it works as expected:
> {code:java}
> $ python categorical_predicate_pushdown.py
> Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
> Python version: 3.8.5 (default, Aug 5 2020, 08:36:46)
> [GCC 7.3.0]
> Pandas version: 1.1.2
> pyarrow version: 1.0.1
> categorical enabled: False
> use_legacy_dataset: False
> col1    object
> col2     int64
> dtype: object
>   col1  col2
> 0    a     1
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)