[ 
https://issues.apache.org/jira/browse/ARROW-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227100#comment-16227100
 ] 

ASF GitHub Bot commented on ARROW-1658:
---------------------------------------

wesm commented on issue #1270: ARROW-1658: [Python] Add boundschecking of 
dictionary indices when creating CategoricalBlock
URL: https://github.com/apache/arrow/pull/1270#issuecomment-340828132
 
 
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Out of bounds dictionary indices causes segfault after converting to 
> pandas
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-1658
>                 URL: https://issues.apache.org/jira/browse/ARROW-1658
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> Minimal reproduction:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>  
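> num = 100
> # Indices 0..99 point into a one-element dictionary, so every index except 0
> # is out of bounds. The all-False mask marks no index as null.
> arr = pa.DictionaryArray.from_arrays(
>     np.arange(0, num),
>     np.array(['a'], dtype=object),
>     np.zeros(num, dtype=bool),
>     True)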
> print(arr.to_pandas())
> {code}
> Nowhere in the Arrow codebase do we validate that the dictionary indices 
> are in bounds, and pandas appears to trust the indices it is given without 
> checking them. We should add a method somewhere to validate that the 
> non-null dictionary indices are in bounds (perhaps in 
> {{CategoricalBlock::WriteIndices}}).
> As an aside: analytics on categorical data may run into external data with 
> out-of-bounds index values in other places as well. We should plan 
> for these and decide whether to raise an exception or treat them as null.
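
For illustration only, the bounds check described in the issue could be sketched in NumPy roughly as follows. The helper name `check_dictionary_indices` and its `on_invalid` parameter are hypothetical and not part of pyarrow; the real fix would presumably live in the C++ conversion path that the issue mentions. The sketch assumes the mask convention of the reproduction, where `True` marks a null slot, and shows both policies raised in the aside: raising an exception or treating the offending slots as null.

```python
import numpy as np


def check_dictionary_indices(indices, dictionary_length, mask=None,
                             on_invalid="raise"):
    """Check that non-null dictionary indices fall in [0, dictionary_length).

    Hypothetical helper for illustration; not a pyarrow API.
    `mask` uses True to mark a null slot. Returns (indices, mask); with
    on_invalid="null" the returned mask also covers out-of-bounds slots.
    """
    indices = np.asarray(indices)
    if mask is None:
        mask = np.zeros(len(indices), dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool)

    # Null slots are ignored: their index values carry no meaning.
    out_of_bounds = ~mask & ((indices < 0) | (indices >= dictionary_length))
    if not out_of_bounds.any():
        return indices, mask

    if on_invalid == "raise":
        first = int(np.flatnonzero(out_of_bounds)[0])
        raise IndexError(
            f"dictionary index {int(indices[first])} at position {first} "
            f"is out of bounds for a dictionary of length {dictionary_length}"
        )
    if on_invalid == "null":
        # Treat every out-of-bounds slot as null instead of failing.
        return indices, mask | out_of_bounds
    raise ValueError(f"unknown on_invalid policy: {on_invalid!r}")
```

Applied to the reproduction above, `check_dictionary_indices(np.arange(0, 100), 1)` raises an `IndexError` for the index at position 1, while passing `on_invalid="null"` returns a mask that marks positions 1 through 99 as null.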



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
