[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-05-02 Thread Vova Vysotskyi (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338074#comment-17338074
 ] 

Vova Vysotskyi commented on DRILL-7864:
---

[~matthros], I tried querying the attached Parquet file on a fresh build of 
Drill master, and it returned the correct results, so it looks like this was 
already fixed (perhaps by the Parquet update). Could you please confirm that it 
works as expected?

> Parquet file could not be read correctly
> ----------------------------------------
>
>                 Key: DRILL-7864
>                 URL: https://issues.apache.org/jira/browse/DRILL-7864
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.18.0
>            Reporter: Matthias Rosenthaler
>            Priority: Major
>         Attachments: drill_query.csv, output.parquet, parquet-dotnet.csv
>
>
> The following Parquet file, generated by ParquetSharp (which uses the 
> underlying Apache Arrow C++ library), cannot be read correctly by Drill: the 
> values of the columns are displaced. If I write the affected float32 columns 
> "InjectionRate" and "I_Injection_IA" as float64, everything is fine.
> Update: It seems that the bug is *caused by dictionary encoding*. If I turn 
> this feature off, Drill is able to read the file. So please take a look at how 
> Drill reads dictionary-encoded columns to solve the bug.
> I also created a ticket for the Arrow project, but they redirected me to the 
> Drill project: https://issues.apache.org/jira/browse/ARROW-11629
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-04-02 Thread Matthias Rosenthaler (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313650#comment-17313650
 ] 

Matthias Rosenthaler commented on DRILL-7864:
-

[~emkornfield]: I don't think so, because the same problem occurs when I create 
the file with pyarrow.



[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-04-02 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313644#comment-17313644
 ] 

Micah Kornfield commented on DRILL-7864:


FWIW, I hacked together a Java application using the Avro parquet-mr bindings, 
converted the Parquet file generated by pyarrow from the sample CSV to an Arrow 
file, and was able to round-trip it with Parquet-MR versions 1.11.0, 1.11.1 
and 1.12.

More details in 
https://issues.apache.org/jira/browse/ARROW-11629?focusedCommentId=17313258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17313258
 

and

https://issues.apache.org/jira/browse/ARROW-11629?focusedCommentId=17312926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17312926

One other possibility is that there is a subtle bug in the ParquetSharp 
bindings to Parquet-cpp.



[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-04-01 Thread Matthias Rosenthaler (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313065#comment-17313065
 ] 

Matthias Rosenthaler commented on DRILL-7864:
-

[~apitrou]: yes, that proves that the bug seems to be on the Drill side. Not 
all float columns are affected, and only a small portion of the Parquet files 
that include float columns are corrupted, around 1 % of the files. There must 
be some special criterion that has to be met to cause this problem.



[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-04-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313058#comment-17313058
 ] 

Antoine Pitrou commented on DRILL-7864:
---

Note that Pandas and Parquet C++ give the following result on this example:
{code:python}
>>> import pyarrow.parquet as pq
>>> df = pq.read_table("../output.parquet").to_pandas()
>>> df[(df['operating_point'] == 214) & (df['statistic'] == 'mean')]
        Unnamed: 0  operating_point  time  statistic  InjectionTimingCursor  I_Injection_IA  InjectionRate   ADC02_IA
640000      640000              214     0       mean                    0.0        1.140145       0.079771  -9.997519
640001      640001              214    10       mean                    0.0        2.079204       0.177609  -9.997519
640002      640002              214    20       mean                    0.0        4.028618       0.272391  -9.997519
640003      640003              214    30       mean                    0.0        6.042390       0.325176  -9.997519
640004      640004              214    40       mean                    0.0        7.825881       0.327692  -9.997519
...            ...              ...   ...        ...                    ...             ...            ...        ...
640995      640995              214  9950       mean                    0.0       -0.036111      -0.032860  -9.997519
640996      640996              214  9960       mean                    0.0       -0.034652      -0.013963  -9.997519
640997      640997              214  9970       mean                    0.0       -0.034540       0.002608  -9.997519
640998      640998              214  9980       mean                    0.0       -0.034211       0.013444  -9.997519
640999      640999              214  9990       mean                    0.0       -0.033422       0.016216  -9.997519

[1000 rows x 8 columns]
{code}

... which seems to be the same as {{parquet-dotnet.csv}}.

Also note that only "InjectionRate" seems to differ in Drill, not 
"I_Injection_IA". But the two columns use the same types and encodings:
{code}
Column 0: Unnamed: 0 (INT64)
Column 1: operating_point (INT64)
Column 2: time (INT64)
Column 3: statistic (BYTE_ARRAY/UTF8)
Column 4: InjectionTimingCursor (DOUBLE)
Column 5: I_Injection_IA (FLOAT)
Column 6: InjectionRate (FLOAT)
Column 7: ADC02_IA (DOUBLE)
--- Row Group: 0 ---
--- Total Bytes: 9439795 ---
--- Total Compressed Bytes: 9439795 ---
--- Rows: 708000 ---
[...]
Column 5
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 21.7209, Min: -0.388101
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 3448473, Compressed Size: 3448636
Column 6
  Values: 708000, Null Values: 0, Distinct Values: 0
  Max: 82.7128, Min: -2.17565
  Compression: SNAPPY, Encodings: PLAIN_DICTIONARY PLAIN RLE PLAIN
  Uncompressed Size: 2789654, Compressed Size: 2789801
[...]
{code}




[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-04-01 Thread Matthias Rosenthaler (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313040#comment-17313040
 ] 

Matthias Rosenthaler commented on DRILL-7864:
-

[~cgivre]: I uploaded CSV output from both parquet-dotnet and Apache Drill so 
that you can identify the differences. To get a smaller subset of the sample 
data, I ran the query with the following filter: WHERE `operating_point` = 214 
AND `statistic` = 'mean'
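
One way to locate the differences between two such CSV exports is to count 
mismatching rows per column. The sketch below uses only the Python standard 
library and assumes both exports share the same header and row order. The 
function name differing_columns is hypothetical, and the "drill" values in the 
sample are made up to illustrate displaced values; they are not taken from the 
attached drill_query.csv.

```python
import csv
import io
import math


def _same(a, b, rel_tol):
    """Compare two CSV cells numerically when possible, else as strings."""
    if a is None or b is None:
        return a is b
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=1e-9)
    except ValueError:
        return a == b


def differing_columns(expected_csv, actual_csv, rel_tol=1e-6):
    """Return {column name: number of rows whose values differ}."""
    expected_rows = list(csv.DictReader(expected_csv))
    actual_rows = list(csv.DictReader(actual_csv))
    if len(expected_rows) != len(actual_rows):
        raise ValueError("row counts differ")
    mismatches = {}
    for expected, actual in zip(expected_rows, actual_rows):
        for name, expected_value in expected.items():
            if not _same(expected_value, actual.get(name), rel_tol):
                mismatches[name] = mismatches.get(name, 0) + 1
    return mismatches


# Reference values as reported by parquet-dotnet for the first two rows.
reference = io.StringIO(
    "time,statistic,I_Injection_IA,InjectionRate\n"
    "0,mean,1.140145,0.079771\n"
    "10,mean,2.079204,0.177609\n"
)
# Made-up second export in which one column's values are displaced.
drill = io.StringIO(
    "time,statistic,I_Injection_IA,InjectionRate\n"
    "0,mean,1.140145,0.177609\n"
    "10,mean,2.079204,0.079771\n"
)

result = differing_columns(reference, drill)
print(result)
```

With real files, the two StringIO objects would be replaced by the opened CSV 
attachments.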



[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-03-01 Thread Charles Givre (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292905#comment-17292905
 ] 

Charles Givre commented on DRILL-7864:
--

[~matthros] Just as an update, DRILL-7825 was blocked by 
https://issues.apache.org/jira/browse/PARQUET-1898, which in turn had its own 
blockers.  It looks like these issues have been resolved so I think we'll see 
some progress this week. 




[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-03-01 Thread Matthias Rosenthaler (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292903#comment-17292903
 ] 

Matthias Rosenthaler commented on DRILL-7864:
-

[~cgivre], I don't think so, but as soon as you have merged the ticket I can 
run a test. The problem is that I don't know whether the bug is on the Arrow or 
the Drill side. But it is a very important bug that should be fixed, because as 
it stands the dictionary encoding feature is not usable. I wonder why nobody 
else is affected by this bug.



[jira] [Commented] (DRILL-7864) Parquet file could not be read correctly

2021-02-17 Thread Charles Givre (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285954#comment-17285954
 ] 

Charles Givre commented on DRILL-7864:
--

[~matthros]
There is a pending PR (https://issues.apache.org/jira/browse/DRILL-7825) which 
is upgrading the parquet libraries to the latest version.  This PR is blocked 
by one remaining issue on the Parquet side.  This should be merged soon.  Do 
you think this could solve this issue?

