[ https://issues.apache.org/jira/browse/ARROW-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351882#comment-17351882 ]
Ying Zhou commented on ARROW-10635:
-----------------------------------

[~rgsl888] Cool! I will close it.

> [C++] ORC reader issue with bool column
> ---------------------------------------
>
>                 Key: ARROW-10635
>                 URL: https://issues.apache.org/jira/browse/ARROW-10635
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.0.1
>            Reporter: Ramakrishna Prabhu
>            Assignee: Ying Zhou
>            Priority: Minor
>              Labels: orc
>         Attachments: bool_pq.parquet, broken_bool.zip
>
>
> The ORC file contains a single column of boolean type. From row number `20000`
> onward, the values read back do not match the expected values.
>
> From what I can tell, the writer used for this ORC file assumes that the
> boolean RLE is aligned with row index boundaries. That means no two row groups
> share the same byte, and there is no bit offset within a byte at the start of
> a row group. I think pyarrow instead treats the leftover bits of the partial
> byte at the end of a row group as data, which shifts the values that follow.
>
> I have attached a Parquet file with the same data for reference. You will
> notice that the ORC reader consumes the last two bits of the partial byte as
> data and shifts the data by two rows.
>
> {code:python}
> import pandas as pd
> from pyarrow import orc
>
> f = orc.ORCFile('broken_bool.orc')
> pdf_orc = f.read().to_pandas()
> pdf_pq = pd.read_parquet("bool_pq.parquet")
> # Show the non-null rows where the ORC and Parquet results disagree.
> pdf_orc.col_bool.dropna()[pdf_orc.col_bool.dropna() != pdf_pq.col_bool.dropna()]
> 20002 False
> 20004 False
> 20005 True
> 20007 False
> 20014 True
> ...
> 21973 False
> 21974 False
> 21985 True
> 21988 True
> 21993 False
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
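
The alignment hypothesis in the quoted description can be illustrated with a small, self-contained sketch. It is not the ORC or Arrow implementation: the helper names (`pack_aligned`, `read_aligned`, `read_continuous`) are hypothetical, the MSB-first bit order is an assumption made for the illustration, the byte-level RLE layer is omitted, and the two "row groups" hold only 6 and 5 values. It only shows how a reader that consumes the two padding bits at the end of a byte-aligned group shifts every later value by two rows.

```python
from typing import List, Sequence


def pack_aligned(groups: Sequence[Sequence[bool]]) -> bytes:
    """Bit-pack each group MSB-first, zero-padding so each group starts on a fresh byte."""
    out = bytearray()
    for group in groups:
        for start in range(0, len(group), 8):
            byte = 0
            for i, value in enumerate(group[start:start + 8]):
                if value:
                    byte |= 1 << (7 - i)
            out.append(byte)
    return bytes(out)


def unpack_bits(data: bytes) -> List[bool]:
    """Expand every byte into 8 booleans, most significant bit first."""
    return [bool((byte >> (7 - i)) & 1) for byte in data for i in range(8)]


def read_aligned(data: bytes, group_sizes: Sequence[int]) -> List[bool]:
    """Reader that skips padding by jumping to the next byte boundary after each group."""
    bits = unpack_bits(data)
    values: List[bool] = []
    pos = 0
    for size in group_sizes:
        values.extend(bits[pos:pos + size])
        pos += -(-size // 8) * 8  # round the group up to whole bytes
    return values


def read_continuous(data: bytes, total: int) -> List[bool]:
    """Reader that ignores group boundaries, so padding bits are consumed as values."""
    return unpack_bits(data)[:total]


# Two tiny "row groups"; the first leaves 2 unused bits in its last byte.
groups = [[True] * 6, [True, False, True, True, False]]
expected = [v for g in groups for v in g]

packed = pack_aligned(groups)
aligned = read_aligned(packed, [len(g) for g in groups])
continuous = read_continuous(packed, len(expected))

# Row indices where the boundary-unaware reader disagrees with the original data.
mismatches = [i for i, (a, b) in enumerate(zip(continuous, expected)) if a != b]
print(mismatches)  # [6, 9, 10]
```

In this toy case the boundary-aware reader reproduces the input exactly, while the continuous reader returns the second group's values two positions late (`continuous[8:]` equals `expected[6:9]`), which mirrors the two-row shift described in the report.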