[ https://issues.apache.org/jira/browse/ARROW-11629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312926#comment-17312926 ]
Micah Kornfield edited comment on ARROW-11629 at 4/1/21, 7:03 AM:
------------------------------------------------------------------

[~matthros] Sorry, I posted a lot of thoughts in a row, so my communication might have been unclear.

I created Java code using parquet-mr ([gist|https://gist.github.com/emkornfield/efd3a4c3c1012dc19cf9769198e3bffe]) that parses the parquet file written by pyarrow. The Java code then reads through all the data and selects two columns to write out in the Arrow format. When I read the Arrow file produced from Java back in Python, the columns are identical. So it seems the latest version of parquet-mr (the Java library that I believe Drill relies on) is able to read the files produced by pyarrow. If there are other columns I should compare, I can add them (I compared the first column, which appears to be a row number, and one of the float columns, 'I_Injection_IA').

So my question is: what do you mean by values are "displaced" in Drill? Was it for 'I_Injection_IA' or other columns?
> [C++] Writing float32 values with "Dictionary Encoding" makes parquet files not readable for some tools
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11629
>                 URL: https://issues.apache.org/jira/browse/ARROW-11629
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 3.0.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: foo.parquet, image-2021-02-15-15-49-41-908.png, output.csv, output.parquet
>
>
> If I try to read the attached csv file with pyarrow, changing the float64
> columns to float32 and export it to parquet, the parquet file gets corrupted.
> It is not readable for apache drill or Parquet.Net any longer.
>
> Update: Bug in "*Dictionary Encoding*" feature. If I switch it off for
> float32 columns, everything works as expected.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
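
The workaround mentioned in the issue's update (switching dictionary encoding off for float32 columns only) could look roughly like the sketch below. It is not taken from the issue or the gist: the helper names `non_float32_columns` and `write_without_float32_dictionary` are hypothetical, and the sketch assumes a flat (non-nested) schema. It relies on `pyarrow.parquet.write_table` accepting a list of column names for `use_dictionary`, and on Arrow printing the float32 type as `float`.

```python
def non_float32_columns(schema):
    """Return the names of all columns whose type is not float32.

    Arrow's float32 type prints as "float" (float64 prints as "double"),
    so comparing the string form avoids importing pyarrow here.
    """
    return [field.name for field in schema if str(field.type) != "float"]


def write_without_float32_dictionary(table, path):
    """Write `table` to `path`, dictionary-encoding every column except float32 ones.

    Hypothetical helper: assumes a flat schema, since nested columns would
    need their full column paths in `use_dictionary`.
    """
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    # Passing a list enables dictionary encoding only for the named columns.
    pq.write_table(table, path, use_dictionary=non_float32_columns(table.schema))
```

With a table whose float64 columns have already been cast to float32, calling `write_without_float32_dictionary(table, "output.parquet")` should leave those columns with a non-dictionary encoding while the remaining columns keep dictionary encoding.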