[ 
https://issues.apache.org/jira/browse/ARROW-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351882#comment-17351882
 ] 

Ying Zhou commented on ARROW-10635:
-----------------------------------

[~rgsl888] Cool! I will close it.

> [C++] ORC reader issue with bool column
> ---------------------------------------
>
>                 Key: ARROW-10635
>                 URL: https://issues.apache.org/jira/browse/ARROW-10635
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.0.1
>            Reporter: Ramakrishna Prabhu
>            Assignee: Ying Zhou
>            Priority: Minor
>              Labels: orc
>         Attachments: bool_pq.parquet, broken_bool.zip
>
>
> The ORC file contains a single column of boolean type. From row number `20000` 
> onward, the values read back do not match the expected values.
>  
> As far as I can tell, the writer used for this ORC file assumes that the RLE 
> stream is aligned with row index boundaries. That means no two row groups 
> share a byte, and there is no bit offset within a byte. I think pyarrow 
> instead treats the leftover bits of the partial byte at the end of a row 
> group as data, which shifts the values.
>  
> I have attached a Parquet file with the same data for reference. You will 
> notice that the ORC reader treats the last two bits of the partial byte as 
> data and shifts the values by two rows.
>  
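> {code:python}
> import pandas as pd
> from pyarrow import orc
>
> # Read the same data from the ORC file and from the reference Parquet file.
> f = orc.ORCFile('broken_bool.orc')
> pdf_orc = f.read().to_pandas()
> pdf_pq = pd.read_parquet("bool_pq.parquet")
>
> # Select the non-null ORC values that differ from the Parquet values.
> mismatch = pdf_orc.col_bool.dropna()[pdf_orc.col_bool.dropna() !=
>                                      pdf_pq.col_bool.dropna()]
> print(mismatch)
> # Output observed by the reporter:
> # 20002 False
> # 20004 False
> # 20005 True
> # 20007 False
> # 20014 True
> # ...
> # 21973 False
> # 21974 False
> # 21985 True
> # 21988 True
> # 21993 False
> {code}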
>  
>  
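
The misalignment described in the report can be illustrated with a toy model. This is only a sketch under stated assumptions: it packs booleans MSB-first and pads each row group to a whole byte, and it ignores the byte-level RLE that real ORC files apply on top. The function names and the group size of 6 are hypothetical and chosen so that each group leaves two padding bits, matching the two-row shift in the report.

```python
from typing import List


def pack_groups_byte_aligned(values: List[bool], group_size: int) -> bytes:
    """Pack booleans MSB-first, padding every row group to a whole byte."""
    out = bytearray()
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        for i in range(0, len(group), 8):
            byte = 0
            for bit_pos, v in enumerate(group[i:i + 8]):
                if v:
                    byte |= 0x80 >> bit_pos
            out.append(byte)
    return bytes(out)


def unpack_continuous(data: bytes, count: int) -> List[bool]:
    """Naive reader: treats every bit, including padding, as data."""
    bits = [bool(b & (0x80 >> k)) for b in data for k in range(8)]
    return bits[:count]


def unpack_group_aligned(data: bytes, count: int, group_size: int) -> List[bool]:
    """Reader that skips the padding bits at the end of each row group."""
    bytes_per_group = (group_size + 7) // 8
    out: List[bool] = []
    for g_start in range(0, len(data), bytes_per_group):
        group_bytes = data[g_start:g_start + bytes_per_group]
        bits = [bool(b & (0x80 >> k)) for b in group_bytes for k in range(8)]
        out.extend(bits[:group_size])
    return out[:count]


# Two row groups of 6 values each; each group leaves 2 padding bits.
values = [True, False, True, True, False, True] * 2
packed = pack_groups_byte_aligned(values, group_size=6)

aligned = unpack_group_aligned(packed, len(values), group_size=6)
naive = unpack_continuous(packed, len(values))

print(aligned == values)  # the group-aware reader recovers the input
print(naive == values)    # the naive reader does not
```

In this toy model the naive reader returns the first group correctly, then reads the two padding bits as rows 6 and 7, so the second group's values appear two rows late.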



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
