[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338074#comment-17338074 ]

Vova Vysotskyi commented on DRILL-7864:
---------------------------------------

[~matthros], I queried the attached parquet file on a fresh build of Drill master, and it returned the correct results, so it looks like the issue has already been fixed (perhaps by the parquet update). Could you please confirm that it works as expected?

> Parquet file could not be read correctly
> ----------------------------------------
>
>                 Key: DRILL-7864
>                 URL: https://issues.apache.org/jira/browse/DRILL-7864
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.18.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: drill_query.csv, output.parquet, parquet-dotnet.csv
>
>
> The following parquet file, which is generated by ParquetSharp (which uses
> the underlying Apache Arrow C++ lib), is not readable by Drill. The values of
> the columns are displaced. If I write the affected float32 columns
> "InjectionRate" and "I_Injection_IA" as float64, everything is fine.
> Update: It seems that the bug is *caused by dictionary encoding*. If I turn
> this feature off, Drill is able to read it. So please take a look into reading
> dictionary-encoded columns in Drill to solve the bug.
> I also created a ticket for the Arrow project, but they redirected me to the
> Drill project. https://issues.apache.org/jira/browse/ARROW-11629

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
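For readers unfamiliar with the feature named in the issue description, here is a minimal, library-free sketch of the idea behind dictionary encoding: each distinct value is stored once, and the column becomes a list of indices into that dictionary. The function names and sample values are hypothetical, and this is not how Parquet or Drill implement it (the real format also bit-packs and run-length encodes the indices). The last lines only illustrate why a decoding fault could surface as shifted values rather than a read error; the thread does not establish the actual cause.

```python
from typing import Dict, List, Sequence, Tuple


def dict_encode(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Return (dictionary, indices); each distinct value is stored once."""
    dictionary: List[float] = []
    positions: Dict[float, int] = {}  # value -> its index in the dictionary
    indices: List[int] = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dict_decode(dictionary: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Rebuild the column by looking every index up in the dictionary."""
    return [dictionary[i] for i in indices]


# Hypothetical sample values, not taken from output.parquet.
column = [0.5, 0.25, 0.5, 0.75, 0.25]
dictionary, indices = dict_encode(column)
assert dictionary == [0.5, 0.25, 0.75]
assert indices == [0, 1, 0, 2, 1]
assert dict_decode(dictionary, indices) == column

# Illustration only: pairing the same indices with a different dictionary
# still yields plausible-looking floats, just the wrong ones.
wrong = dict_decode([0.75, 0.5, 0.25], indices)
assert wrong != column
```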
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313650#comment-17313650 ]

Matthias Rosenthaler commented on DRILL-7864:
---------------------------------------------

[~emkornfield]: I don't think so, because the same problem occurs when I create the file with pyarrow.
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313644#comment-17313644 ]

Micah Kornfield commented on DRILL-7864:
----------------------------------------

FWIW, I hacked together a Java application using the Avro Parquet-MR bindings, converted the parquet file generated by pyarrow from the sample CSV to an Arrow file, and was able to round-trip it with Parquet-MR versions 1.11.0, 1.11.1 and 1.12. More details in
https://issues.apache.org/jira/browse/ARROW-11629?focusedCommentId=17313258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17313258
and
https://issues.apache.org/jira/browse/ARROW-11629?focusedCommentId=17312926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17312926

One other option is that there is a subtle bug in the ParquetSharp bindings to Parquet-cpp.
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313065#comment-17313065 ]

Matthias Rosenthaler commented on DRILL-7864:
---------------------------------------------

[~apitrou]: yes, that indicates that the bug is on the Drill side. Not all float columns are affected, and only a small portion of the parquet files that include float columns are corrupted: around 1 % of the files. There must be some special criterion that has to be met to cause this problem.
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313058#comment-17313058 ]

Antoine Pitrou commented on DRILL-7864:
---------------------------------------

Note that Pandas and Parquet C++ give the following result on this example:

{code:python}
>>> import pyarrow.parquet as pq
>>> df = pq.read_table("../output.parquet").to_pandas()
>>> df[(df['operating_point'] == 214) & (df['statistic'] == 'mean')]
        Unnamed: 0  operating_point  time  statistic  InjectionTimingCursor  I_Injection_IA  InjectionRate   ADC02_IA
640000      640000              214     0       mean                    0.0        1.140145       0.079771  -9.997519
640001      640001              214    10       mean                    0.0        2.079204       0.177609  -9.997519
640002      640002              214    20       mean                    0.0        4.028618       0.272391  -9.997519
640003      640003              214    30       mean                    0.0        6.042390       0.325176  -9.997519
640004      640004              214    40       mean                    0.0        7.825881       0.327692  -9.997519
   ...         ...              ...   ...        ...                    ...             ...            ...        ...
640995      640995              214  9950       mean                    0.0       -0.036111      -0.032860  -9.997519
640996      640996              214  9960       mean                    0.0       -0.034652      -0.013963  -9.997519
640997      640997              214  9970       mean                    0.0       -0.034540       0.002608  -9.997519
640998      640998              214  9980       mean                    0.0       -0.034211       0.013444  -9.997519
640999      640999              214  9990       mean                    0.0       -0.033422       0.016216  -9.997519

[1000 rows x 8 columns]
{code}

... which seems the same as {{parquet-dotnet.csv}}.

Also note that only "InjectionRate" seems to differ with Drill, not "I_Injection_IA". But the two columns use the same types and encodings:

{code}
Column 0: Unnamed: 0 (INT64)
Column 1: operating_point (INT64)
Column 2: time (INT64)
Column 3: statistic (BYTE_ARRAY/UTF8)
Column 4: InjectionTimingCursor (DOUBLE)
Column 5: I_Injection_IA (FLOAT)
Column 6: InjectionRate (FLOAT)
Column 7: ADC02_IA (DOUBLE)
--- Row Group: 0 ---
--- Total Bytes: 9439795 ---
--- Total Compressed Bytes: 9439795 ---
--- Rows: 708000 ---
[...]
Column 5
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 21.7209, Min: -0.388101
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 3448473, Compressed Size: 3448636
Column 6
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 82.7128, Min: -2.17565
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 2789654, Compressed Size: 2789801
[...]
{code}
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313040#comment-17313040 ]

Matthias Rosenthaler commented on DRILL-7864:
---------------------------------------------

[~cgivre]: I uploaded CSV outputs from parquet-dotnet and Apache Drill so that you can identify the differences. I ran the following query on the sample data to get a smaller subset of it:

{{WHERE `operating_point` = 214 AND `statistic` = 'mean'}}
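One way to locate the differences between two such CSV exports is to count, per column, the rows whose values disagree. The sketch below uses only the Python standard library. The function names are hypothetical, and it assumes both exports have a header row and list the same rows in the same order, which the thread does not confirm for the attached files. The first data rows of the reference sample reuse values from the pandas output quoted in this thread; the differing value in the second sample is invented for illustration.

```python
import csv
import io
import math
from typing import Dict


def _same(x, y, rel_tol: float) -> bool:
    """Compare two cells numerically when both parse as floats, else as text."""
    try:
        return math.isclose(float(x), float(y), rel_tol=rel_tol, abs_tol=1e-9)
    except (TypeError, ValueError):
        return x == y


def count_mismatches(left_csv: str, right_csv: str, rel_tol: float = 1e-5) -> Dict[str, int]:
    """Count, for every column present in both inputs, the rows that differ."""
    left = list(csv.DictReader(io.StringIO(left_csv)))
    right = list(csv.DictReader(io.StringIO(right_csv)))
    if len(left) != len(right):
        raise ValueError("row counts differ: %d vs %d" % (len(left), len(right)))
    if not left:
        return {}
    shared = [c for c in left[0] if c in right[0]]
    mismatches = {c: 0 for c in shared}
    for a, b in zip(left, right):
        for c in shared:
            if not _same(a[c], b[c], rel_tol):
                mismatches[c] += 1
    return mismatches


reference = (
    "time,I_Injection_IA,InjectionRate\n"
    "0,1.140145,0.079771\n"
    "10,2.079204,0.177609\n"
)
# Hypothetical second export: the last InjectionRate value is made up.
suspect = (
    "time,I_Injection_IA,InjectionRate\n"
    "0,1.140145,0.079771\n"
    "10,2.079204,0.272391\n"
)
print(count_mismatches(reference, suspect))
# prints {'time': 0, 'I_Injection_IA': 0, 'InjectionRate': 1}
```

For the real attachments, the two strings would be replaced by the contents of the attached CSV files; a relative tolerance is used because float32 values may be printed with slightly different precision by different tools.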
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292905#comment-17292905 ]

Charles Givre commented on DRILL-7864:
--------------------------------------

[~matthros] Just as an update, DRILL-7825 was blocked by https://issues.apache.org/jira/browse/PARQUET-1898, which in turn had its own blockers. It looks like these issues have been resolved, so I think we'll see some progress this week.
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292903#comment-17292903 ]

Matthias Rosenthaler commented on DRILL-7864:
---------------------------------------------

[~cgivre], I don't think so, but as soon as you have merged the ticket I can run a test. The problem is that I don't know whether the bug is on the Arrow or the Drill side. But it is a very important bug that should be fixed, because as it stands the dictionary encoding feature is not usable. I wonder why nobody else is affected by this bug.
[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly
[ https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285954#comment-17285954 ]

Charles Givre commented on DRILL-7864:
--------------------------------------

[~matthros] There is a pending PR (https://issues.apache.org/jira/browse/DRILL-7825) which upgrades the parquet libraries to the latest version. This PR is blocked by one remaining issue on the Parquet side and should be merged soon. Do you think this could solve this issue?