[jira] [Comment Edited] (ARROW-11629) [C++] Writing float32 values with "Dictionary Encoding" makes parquet files not readable for some tools

Micah Kornfield (Jira) Thu, 01 Apr 2021 00:04:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-11629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312926#comment-17312926
 ]


Micah Kornfield edited comment on ARROW-11629 at 4/1/21, 7:03 AM:
------------------------------------------------------------------

[~matthros] Sorry, I posted a lot of thoughts in a row, so my communication might 
have been unclear.  I created Java code using parquet-mr 
([gist|https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe]) 
that parses the parquet file written by pyarrow.

The Java code then reads through all the data and selects two columns to write 
out in the Arrow format.  When I read the Arrow file produced from Java back in 
Python, the columns are identical.  So it seems the latest version of 
parquet-mr (Java, which I believe Drill relies on) is able to read the files 
produced by pyarrow.  If there are other columns I should compare, I can add 
them (I compared the first column, which appears to be the row number, and one 
of the float columns, 'I_Injection_IA').

So my question is: what do you mean when you say values are "displaced" in 
Drill?  Was it for 'I_Injection_IA' or other columns?



> [C++] Writing float32 values with "Dictionary Encoding" makes parquet files 
> not readable for some tools
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11629
>                 URL: https://issues.apache.org/jira/browse/ARROW-11629
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 3.0.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: foo.parquet, image-2021-02-15-15-49-41-908.png, 
> output.csv, output.parquet
>
>
> If I read the attached CSV file with pyarrow, change the float64 
> columns to float32, and export it to Parquet, the Parquet file gets corrupted. 
> It is no longer readable by Apache Drill or Parquet.Net.
>  
> Update: Bug in "*Dictionary Encoding*" feature. If I switch it off for 
> float32 columns, everything works as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
