[ 
https://issues.apache.org/jira/browse/ARROW-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9880:
--------------------------------
    Summary: [Python] Lose access to indices & dictionary roundtripping 
DictionaryArray to parquet file  (was: Lose access to indices & dictionary 
roundtripping DictionaryArray to parquet file)

> [Python] Lose access to indices & dictionary roundtripping DictionaryArray to 
> parquet file
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9880
>                 URL: https://issues.apache.org/jira/browse/ARROW-9880
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>         Environment: Mac running macOS Catalina (10.15.2), Python 3.7.6.
>            Reporter: Nick Radcliffe
>            Priority: Major
>         Attachments: pyarraw_dictionaryarray_bug.py
>
>
> I am in the process of adding support for reading/writing Parquet to a data 
> analysis tool (Miró: https://stochasticsolutions.com/miro/). The tool has a 
> string column type that is extremely close to PyArrow's DictionaryArray, so 
> it was natural to add support for that, but round-tripping doesn't seem to 
> work, as this example shows:
> The code creates a table with a single column, a dictionary array, and 
> writes it as a parquet file using `write_table`. On reading it back in, the 
> column's `.type` indicates that it's a DictionaryArray, but Python reports 
> its type as a `ChunkedArray`. Either way, it doesn't seem to have `indices` 
> or `dictionary` properties. `to_pylist` works, so I can get the data in, but 
> almost all the benefit of writing as a dictionary array is lost if I need to 
> convert it to a Python list to access its values.
> I presume it isn't supposed to be like this.
>  
> {code:python}
> $ python3
> Python 3.7.6 (v3.7.6:43364a7ae0, Dec 18 2019, 14:18:50) 
> [Clang 6.0 (clang-600.0.57)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> print('PyArrow version:', pa.__version__)
> PyArrow version: 1.0.1
> >>> 
> >>> 
> >>> dictionary = ['zero', 'one', 'two']
> >>> indices = [None, 0, 1, 2, 0, 1, 0]
> >>> 
> >>> col = pa.DictionaryArray.from_arrays(indices, dictionary)
> >>> print('col:', col)
> col: 
> -- dictionary:
>   [
>     "zero",
>     "one",
>     "two"
>   ]
> -- indices:
>   [
>     null,
>     0,
>     1,
>     2,
>     0,
>     1,
>     0
>   ]
> >>> print('col.to_pylist():', col.to_pylist())
> col.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
> >>> print('col.type:', col.type)
> col.type: dictionary<values=string, indices=int64, ordered=0>
> >>> print('type(col):', type(col))
> type(col): <class 'pyarrow.lib.DictionaryArray'>
> >>> print('col.indices:', col.indices)
> col.indices: [
>   null,
>   0,
>   1,
>   2,
>   0,
>   1,
>   0
> ]
> >>> print('col.dictionary:', col.dictionary)
> col.dictionary: [
>   "zero",
>   "one",
>   "two"
> ]
> >>> 
> >>> path = '/tmp/zot.parquet'
> >>> pq.write_table(pa.lib.Table.from_pydict({'zot': col}), path)
> >>> table = pq.read_table(path)
> >>> 
> >>> zot = table['zot']
> >>> print('zot:', zot)
> zot: [
>   -- dictionary:
>     [
>       "zero",
>       "one",
>       "two"
>     ]
>   -- indices:
>     [
>       null,
>       0,
>       1,
>       2,
>       0,
>       1,
>       0
>     ]
> ]
> >>> print('zot.to_pylist():', zot.to_pylist())
> zot.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
> >>> print('zot.type:', zot.type)
> zot.type: dictionary<values=string, indices=int32, ordered=0>
> >>> print('type(zot):', type(zot))
> type(zot): <class 'pyarrow.lib.ChunkedArray'>
> >>> print('zot.indices:', zot.indices)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'indices'
> >>> print('zot.dictionary:', zot.dictionary)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 
> 'dictionary'
> >>> ^D
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
