[ https://issues.apache.org/jira/browse/ARROW-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-9880:
--------------------------------
    Summary: [Python] Lose access to indices & dictionary roundtripping DictionaryArray to parquet file  (was: Lose access to indices & dictionary roundtripping DictionaryArray to parquet file)

> [Python] Lose access to indices & dictionary roundtripping DictionaryArray to
> parquet file
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9880
>                 URL: https://issues.apache.org/jira/browse/ARROW-9880
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: Mac running macOS Catalina (10.15.2), Python 3.7.6.
>            Reporter: Nick Radcliffe
>            Priority: Major
>         Attachments: pyarraw_dictionaryarray_bug.py
>
>
> I am in the process of adding support for reading/writing Parquet to a data
> analysis tool (Miró: https://stochasticsolutions.com/miro/). The tool has a
> string column type that is extremely close to PyArrow's DictionaryArray, so
> it was natural to add support for that, but round-tripping doesn't seem to
> work, as this example shows:
> The code creates a table with a single column, a dictionary array, and
> writes it as a parquet file using `write_table`. On reading it back in, the
> column's `.type` indicates that it's a DictionaryArray, but Python reports
> its type as a `ChunkedArray`. Either way, it doesn't seem to have `indices`
> or `dictionary` properties. `to_pylist` works, so I can get the data in, but
> almost all the benefit of writing as a dictionary array is lost if I need to
> convert it to a Python list to access its values.
> I presume it isn't supposed to be like this.
>
> {code:python}
> $ python3
> Python 3.7.6 (v3.7.6:43364a7ae0, Dec 18 2019, 14:18:50)
> [Clang 6.0 (clang-600.0.57)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> print('PyArrow version:', pa.__version__)
> PyArrow version: 1.0.1
> >>>
> >>>
> >>> dictionary = ['zero', 'one', 'two']
> >>> indices = [None, 0, 1, 2, 0, 1, 0]
> >>>
> >>> col = pa.DictionaryArray.from_arrays(indices, dictionary)
> >>> print('col:', col)
> col:
> -- dictionary:
>   [
>     "zero",
>     "one",
>     "two"
>   ]
> -- indices:
>   [
>     null,
>     0,
>     1,
>     2,
>     0,
>     1,
>     0
>   ]
> >>> print('col.to_pylist():', col.to_pylist())
> col.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
> >>> print('col.type:', col.type)
> col.type: dictionary<values=string, indices=int64, ordered=0>
> >>> print('type(col):', type(col))
> type(col): <class 'pyarrow.lib.DictionaryArray'>
> >>> print('col.indices:', col.indices)
> col.indices: [
>   null,
>   0,
>   1,
>   2,
>   0,
>   1,
>   0
> ]
> >>> print('col.dictionary:', col.dictionary)
> col.dictionary: [
>   "zero",
>   "one",
>   "two"
> ]
> >>>
> >>> path = '/tmp/zot.parquet'
> >>> pq.write_table(pa.lib.Table.from_pydict({'zot': col}), path)
> >>> table = pq.read_table(path)
> >>>
> >>> zot = table['zot']
> >>> print('zot:', zot)
> zot: [
>   -- dictionary:
>     [
>       "zero",
>       "one",
>       "two"
>     ]
>   -- indices:
>     [
>       null,
>       0,
>       1,
>       2,
>       0,
>       1,
>       0
>     ]
> ]
> >>> print('zot.to_pylist():', zot.to_pylist())
> zot.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
> >>> print('zot.type:', zot.type)
> zot.type: dictionary<values=string, indices=int32, ordered=0>
> >>> print('type(zot):', type(zot))
> type(zot): <class 'pyarrow.lib.ChunkedArray'>
> >>> print('zot.indices:', zot.indices)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'indices'
> >>> print('zot.dictionary:', zot.dictionary)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'dictionary'
> >>> ^D
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)