[ https://issues.apache.org/jira/browse/ARROW-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-1658:
-----------------------------------

    Assignee: Wes McKinney

> [Python] Out of bounds dictionary indices causes segfault after converting to pandas
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-1658
>                 URL: https://issues.apache.org/jira/browse/ARROW-1658
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>             Fix For: 0.8.0
>
>
> Minimal reproduction:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> num = 100
> arr = pa.DictionaryArray.from_arrays(
>     np.arange(0, num),       # indices 0..99, but the dictionary has one entry
>     np.array(['a'], object), # dictionary values
>     np.zeros(num, bool),     # mask: no slot is marked null
>     True)                    # ordered
> print(arr.to_pandas())
> {code}
> At no time in the Arrow codebase do we validate that the dictionary indices
> are in bounds. It seems that pandas is overly trusting of the validity of the
> indices. So we should add a method someplace to validate that the dictionary
> non-null indices are not out of bounds (perhaps in
> {{CategoricalBlock::WriteIndices}}).
> As an aside: there may be other times when doing analytics on categorical
> data that external data will have out of bounds index values. We should plan
> for these and decide whether to raise an exception or treat them as null.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
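
The bounds check proposed in the description can be sketched in plain NumPy. This is only an illustration of the idea, not the fix that went into Arrow: the helper name check_dictionary_indices and its on_error parameter are hypothetical, and the mask convention (True marks a null slot) is assumed to match the reproduction. It covers both options raised in the aside: raise an exception, or treat out-of-bounds slots as null.

```python
import numpy as np


def check_dictionary_indices(indices, dict_size, null_mask=None, on_error="raise"):
    """Hypothetical helper: validate dictionary indices against dict_size.

    null_mask: True marks a null slot (assumed convention).
    on_error="raise" raises IndexError on the first bad index;
    on_error="null" returns a mask with bad slots marked null instead.
    """
    indices = np.asarray(indices)
    if null_mask is None:
        null_mask = np.zeros(len(indices), dtype=bool)
    else:
        null_mask = np.asarray(null_mask, dtype=bool)

    # Only non-null slots need to hold a valid index.
    out_of_bounds = ~null_mask & ((indices < 0) | (indices >= dict_size))

    if not out_of_bounds.any():
        return null_mask
    if on_error == "raise":
        first = int(np.flatnonzero(out_of_bounds)[0])
        raise IndexError(
            f"dictionary index {int(indices[first])} at position {first} "
            f"is out of bounds for a dictionary of size {dict_size}"
        )
    if on_error == "null":
        return null_mask | out_of_bounds
    raise ValueError(f"unknown on_error policy: {on_error!r}")
```

With the data from the reproduction (indices 0..99, a one-entry dictionary), the default policy raises IndexError at position 1, while on_error="null" returns a mask that marks every slot except the first as null.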