[ 
https://issues.apache.org/jira/browse/ARROW-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1658:
----------------------------------
    Labels: pull-request-available  (was: )

> [Python] Out of bounds dictionary indices causes segfault after converting to 
> pandas
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-1658
>                 URL: https://issues.apache.org/jira/browse/ARROW-1658
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> Minimal reproduction:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>  
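> num = 100
> # Indices run from 0 to 99, but the dictionary has a single entry,
> # so every index other than 0 is out of bounds.
> arr = pa.DictionaryArray.from_arrays(
>     np.arange(0, num),
>     np.array(['a'], dtype=object),
>     np.zeros(num, dtype=bool),  # mask: no index is marked null
>     True)  # ordered
> print(arr.to_pandas())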
> {code}
> Nowhere in the Arrow codebase do we validate that dictionary indices are in 
> bounds, and pandas appears to trust the indices it is given without checking 
> them. We should add a method somewhere to validate that the non-null 
> dictionary indices are in bounds (perhaps in 
> {{CategoricalBlock::WriteIndices}}).
> As an aside: other analytics on categorical data may also encounter external 
> data with out-of-bounds index values. We should plan for these cases and 
> decide whether to raise an exception or treat such values as null.
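
The bounds check described in the issue can be sketched with NumPy alone. This is only an illustration of the idea, not the fix that went into Arrow: the helper names `out_of_bounds_positions`, `check_dictionary_indices` and `mask_out_of_bounds` are hypothetical, and the mask is assumed to follow the convention of the reproduction above (True marks a null index).

```python
import numpy as np


def out_of_bounds_positions(indices, dictionary_size, mask=None):
    """Return positions of non-null indices outside [0, dictionary_size).

    Hypothetical helper; `mask` uses True to mark a null index.
    """
    indices = np.asarray(indices)
    if mask is None:
        valid = np.ones(len(indices), dtype=bool)
    else:
        valid = ~np.asarray(mask, dtype=bool)
    bad = valid & ((indices < 0) | (indices >= dictionary_size))
    return np.flatnonzero(bad)


def check_dictionary_indices(indices, dictionary_size, mask=None):
    """Option 1: raise an exception if any non-null index is out of bounds."""
    bad = out_of_bounds_positions(indices, dictionary_size, mask)
    if bad.size:
        first = int(bad[0])
        raise IndexError(
            "dictionary index %d at position %d is out of bounds for a "
            "dictionary of size %d"
            % (int(np.asarray(indices)[first]), first, dictionary_size))


def mask_out_of_bounds(indices, dictionary_size, mask=None):
    """Option 2: return a new mask that also marks out-of-bounds indices null."""
    bad = out_of_bounds_positions(indices, dictionary_size, mask)
    if mask is None:
        new_mask = np.zeros(len(indices), dtype=bool)
    else:
        new_mask = np.array(mask, dtype=bool)  # np.array copies its input
    new_mask[bad] = True
    return new_mask
```

Applied to the reproduction's inputs (indices `np.arange(0, 100)`, a one-entry dictionary, an all-False mask), the first option raises on position 1, while the second marks positions 1 through 99 as null and leaves position 0 valid.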



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
