[jira] [Created] (ARROW-16983) Delta byte array encoder broken due to memory leak

2022-07-05 Thread Matt DePero (Jira)
Matt DePero created ARROW-16983:
---

 Summary: Delta byte array encoder broken due to memory leak
 Key: ARROW-16983
 URL: https://issues.apache.org/jira/browse/ARROW-16983
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Reporter: Matt DePero


The `DeltaByteArrayEncoder` has a memory leak caused by a bug in how 
`EstimatedDataEncodedSize` is calculated. `DeltaByteArrayEncoder` extends 
`encoder`, which calculates `EstimatedDataEncodedSize` by calling `Len()` on 
its `PooledBufferWriter` sink. However, `DeltaByteArrayEncoder` does not write 
data to the sink; it writes to `prefixEncoder` and `suffixEncoder` instead, so 
`EstimatedDataEncodedSize` always returns zero.
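The shape of the problem can be reproduced with a small self-contained model. This is only a sketch: `baseEncoder`, `deltaByteArrayEncoder`, `Put`, and `fixedEstimate` are hypothetical stand-ins (a `bytes.Buffer` replaces the `PooledBufferWriter` sink), not the actual Arrow types, and summing the two sub-encoders is one possible fix rather than the confirmed one.

```go
package main

import (
	"bytes"
	"fmt"
)

// baseEncoder is a hypothetical stand-in for the shared `encoder` type.
// It estimates its encoded size from the bytes written to its own sink.
type baseEncoder struct {
	sink bytes.Buffer
}

func (e *baseEncoder) EstimatedDataEncodedSize() int64 {
	return int64(e.sink.Len())
}

// deltaByteArrayEncoder mirrors the structure described in the report:
// it inherits the size estimate but never writes to its own sink.
type deltaByteArrayEncoder struct {
	baseEncoder
	prefixEncoder baseEncoder
	suffixEncoder baseEncoder
}

// Put records a prefix length and a suffix in the two sub-encoders only.
func (e *deltaByteArrayEncoder) Put(prefixLen byte, suffix []byte) {
	e.prefixEncoder.sink.WriteByte(prefixLen)
	e.suffixEncoder.sink.Write(suffix)
}

// fixedEstimate is an assumed fix: report the sum of both sub-encoders.
func (e *deltaByteArrayEncoder) fixedEstimate() int64 {
	return e.prefixEncoder.EstimatedDataEncodedSize() +
		e.suffixEncoder.EstimatedDataEncodedSize()
}

func main() {
	enc := &deltaByteArrayEncoder{}
	enc.Put(0, []byte("apple"))
	enc.Put(3, []byte("ly"))

	fmt.Println("inherited estimate:", enc.EstimatedDataEncodedSize())
	fmt.Println("summed estimate:   ", enc.fixedEstimate())
}
```

In this model the inherited estimate stays at zero no matter how much data is buffered, so any caller that relies on it to bound buffered data would never see it grow.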



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16813) [Go][Parquet] Disabling dictionary encoding per column in writer config broken

2022-06-10 Thread Matt DePero (Jira)
Matt DePero created ARROW-16813:
---

 Summary: [Go][Parquet] Disabling dictionary encoding per column in 
writer config broken
 Key: ARROW-16813
 URL: https://issues.apache.org/jira/browse/ARROW-16813
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Matt DePero


There is a small bug in how we set the column-level dictionary encoding 
config: it is always set to true rather than respecting the input value.
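A minimal sketch of this kind of bug, using a functional-options pattern. All names here (`writerConfig`, `option`, `buggyWithDictionaryFor`, `withDictionaryFor`, `newConfig`) are hypothetical and chosen for illustration; they are not the Arrow writer properties API.

```go
package main

import "fmt"

// writerConfig is a hypothetical per-column writer configuration.
type writerConfig struct {
	dictEnabled map[string]bool
}

type option func(*writerConfig)

// buggyWithDictionaryFor ignores its `enabled` argument and always stores true.
func buggyWithDictionaryFor(column string, enabled bool) option {
	return func(c *writerConfig) { c.dictEnabled[column] = true }
}

// withDictionaryFor stores the value the caller actually asked for.
func withDictionaryFor(column string, enabled bool) option {
	return func(c *writerConfig) { c.dictEnabled[column] = enabled }
}

func newConfig(opts ...option) *writerConfig {
	c := &writerConfig{dictEnabled: map[string]bool{}}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	buggy := newConfig(buggyWithDictionaryFor("id", false))
	fixed := newConfig(withDictionaryFor("id", false))
	fmt.Println("buggy:", buggy.dictEnabled["id"])
	fmt.Println("fixed:", fixed.dictEnabled["id"])
}
```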





[jira] [Created] (ARROW-16790) [Go][Parquet] Avoid allocating new memory when skipping rows

2022-06-08 Thread Matt DePero (Jira)
Matt DePero created ARROW-16790:
---

 Summary: [Go][Parquet] Avoid allocating new memory when skipping 
rows
 Key: ARROW-16790
 URL: https://issues.apache.org/jira/browse/ARROW-16790
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Matt DePero


As referenced in 
[apache/arrow#13277|https://github.com/apache/arrow/pull/13277], we allocate 
scratch space memory every time we want to skip rows in the Go Parquet reader, 
even though this memory is simply discarded. Ideally, we should reuse scratch 
space when skipping to avoid these unnecessary allocations.
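One way to reuse scratch space is to keep a buffer on the reader and grow it only when a larger skip is requested. The sketch below is an assumption about how that could look; `skipper`, `scratchFor`, and the `allocs` counter are hypothetical names used only to make the behaviour observable.

```go
package main

import "fmt"

// skipper is a hypothetical reader-side helper that owns a reusable
// scratch buffer instead of allocating a new one per skip call.
type skipper struct {
	scratch []byte
	allocs  int // counts real allocations, for illustration only
}

// scratchFor returns a slice of n bytes, allocating only when the
// retained buffer is too small for the request.
func (s *skipper) scratchFor(n int) []byte {
	if cap(s.scratch) < n {
		s.scratch = make([]byte, n)
		s.allocs++
	}
	return s.scratch[:n]
}

func main() {
	s := &skipper{}
	for _, n := range []int{1024, 512, 1024, 256} {
		buf := s.scratchFor(n)
		_ = buf // a real reader would decode into buf and discard the result
	}
	fmt.Println("allocations:", s.allocs)
}
```

With the request sizes above, only the first call allocates; later calls reslice the retained buffer.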





[jira] [Reopened] (ARROW-16638) [Go][Parquet] Boolean column reader fails to skip rows

2022-05-31 Thread Matt DePero (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt DePero reopened ARROW-16638:
-

Boolean skip still has an issue when skipping a large number of rows, because 
the buffer passed for `values` is not large enough. If the batch size to skip 
is 1024, the buffer allocated for values when skipping a boolean column is 
only 128, resulting in an index out of bounds panic.
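The numbers can be checked with a little arithmetic, assuming the 128 is a byte count for bit-packed booleans (one bit per row) and that def and rep levels are int16 (two bytes per row). The helper names below are hypothetical.

```go
package main

import "fmt"

// boolValueBytes is the space needed for bit-packed booleans: one bit per row.
func boolValueBytes(rows int) int { return (rows + 7) / 8 }

// levelBytes is the space needed for int16 def or rep levels: two bytes per row.
func levelBytes(rows int) int { return rows * 2 }

// levelsThatFit is how many int16 levels fit in a buffer of n bytes.
func levelsThatFit(n int) int { return n / 2 }

func main() {
	rows := 1024
	fmt.Println("bytes for boolean values:", boolValueBytes(rows))
	fmt.Println("bytes for int16 levels:  ", levelBytes(rows))
	fmt.Println("levels that fit in the boolean buffer:",
		levelsThatFit(boolValueBytes(rows)))
}
```

Under these assumptions a 1024-row skip gets a 128-byte buffer, which has room for only 64 int16 levels, far short of the 1024 that would be written.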

> [Go][Parquet] Boolean column reader fails to skip rows
> --
>
> Key: ARROW-16638
> URL: https://issues.apache.org/jira/browse/ARROW-16638
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Matt DePero
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Skipping values in the go parquet column reader is effectively implemented by 
> reading the target number of rows into scratch space which is then discarded. 
> In the boolean case, 
> [BytesRequired|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader.go#L439]
>  returns a scratch buffer that allocates one bit per row, however 
> that [same scratch 
> space|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader_types.gen.go#L212-L213]
>  is also attempted to be used for `defLvls` and `repLvls` (both int16), which 
> requires two bytes per row. Since the boolean `values` buffer is not large 
> enough to hold the same number of rows worth of def and rep levels, skipping 
> too many rows results in an index out of bounds panic.
>  
> Note that for other column types, this does not seem to be an issue since the 
> buffer needed for `values` is always larger than the buffer needed for def 
> and rep levels, however there still seems to be no reason to include any 
> non-nil value to `cr.ReadBatch(...)` for [rep and def 
> lvls|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader_types.gen.go#L212-L213]
>  when skipping any column in the reader.





[jira] [Created] (ARROW-16638) [Go][Parquet] Boolean column reader fails to skip rows

2022-05-23 Thread Matt DePero (Jira)
Matt DePero created ARROW-16638:
---

 Summary: [Go][Parquet] Boolean column reader fails to skip rows
 Key: ARROW-16638
 URL: https://issues.apache.org/jira/browse/ARROW-16638
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Matt DePero
 Fix For: 9.0.0


Skipping values in the Go Parquet column reader is effectively implemented by 
reading the target number of rows into scratch space, which is then discarded. 
In the boolean case, 
[BytesRequired|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader.go#L439]
 returns a scratch buffer size of one bit per row; however, that 
[same scratch 
space|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader_types.gen.go#L212-L213]
 is also used for `defLvls` and `repLvls` (both int16), which require two 
bytes per row. Since the boolean `values` buffer is not large enough to hold 
def and rep levels for the same number of rows, skipping too many rows results 
in an index out of bounds panic.

 

Note that for other column types this does not seem to be an issue, since the 
buffer needed for `values` is always larger than the buffer needed for def and 
rep levels. However, there still seems to be no reason to pass any non-nil 
value to `cr.ReadBatch(...)` for [rep and def 
lvls|https://github.com/apache/arrow/blob/4c21fd12f93e4853c03c05919ffb22c6bb8f09b0/go/parquet/file/column_reader_types.gen.go#L212-L213]
 when skipping any column in the reader.
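The suggestion to pass nil for the levels when skipping can be illustrated with a toy reader. Everything here is hypothetical (`toyReader`, `readBatch`, `skip`); it models a reader that fills level slices only when they are non-nil, and makes no claim about how the real `ReadBatch` treats nil arguments.

```go
package main

import "fmt"

// toyReader is a hypothetical in-memory column reader.
type toyReader struct {
	values []bool
	defs   []int16
	pos    int
}

// readBatch copies up to n rows into values and fills defLvls only when
// the caller supplied a non-nil slice. It returns the number of rows read.
func (r *toyReader) readBatch(n int, values []bool, defLvls []int16) int {
	if rem := len(r.values) - r.pos; n > rem {
		n = rem
	}
	if n > len(values) {
		n = len(values)
	}
	copy(values[:n], r.values[r.pos:r.pos+n])
	if defLvls != nil {
		copy(defLvls, r.defs[r.pos:r.pos+n])
	}
	r.pos += n
	return n
}

// skip discards n rows by reading them into scratch. Levels are passed as
// nil, so the scratch buffer only has to be sized for the values.
func (r *toyReader) skip(n int, scratch []bool) int {
	skipped := 0
	for skipped < n {
		got := r.readBatch(n-skipped, scratch, nil)
		if got == 0 {
			break // no more rows
		}
		skipped += got
	}
	return skipped
}

func main() {
	r := &toyReader{values: make([]bool, 10), defs: make([]int16, 10)}
	fmt.Println("skipped:", r.skip(7, make([]bool, 4)))
	fmt.Println("position:", r.pos)
}
```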





[jira] [Created] (ARROW-16563) [Go][Parquet] Plain boolean decoder is broken

2022-05-12 Thread Matt DePero (Jira)
Matt DePero created ARROW-16563:
---

 Summary: [Go][Parquet] Plain boolean decoder is broken
 Key: ARROW-16563
 URL: https://issues.apache.org/jira/browse/ARROW-16563
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Matt DePero


When reading Parquet files with `PLAIN` encoded boolean values, `bitOffset` in 
the boolean decoder is not properly managed, which results in garbled values 
being returned.
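For reference, `PLAIN` booleans in Parquet are bit-packed, least significant bit first, so a decoder has to remember its bit position across calls. The sketch below shows one way to track that offset; `plainBoolDecoder` and its fields are hypothetical and simplified, not the Arrow implementation.

```go
package main

import "fmt"

// plainBoolDecoder is a hypothetical decoder for bit-packed booleans
// (least significant bit first within each byte).
type plainBoolDecoder struct {
	data      []byte
	bitOffset int // index of the next unread bit, counted from the start of data
}

// Decode fills out with the next booleans and returns how many were read.
// bitOffset advances by exactly one per value, so consecutive calls
// continue mid-byte instead of restarting at a byte boundary.
func (d *plainBoolDecoder) Decode(out []bool) int {
	totalBits := len(d.data) * 8
	n := 0
	for n < len(out) && d.bitOffset < totalBits {
		mask := byte(1) << uint(d.bitOffset%8)
		out[n] = d.data[d.bitOffset/8]&mask != 0
		d.bitOffset++
		n++
	}
	return n
}

func main() {
	d := &plainBoolDecoder{data: []byte{0xB2, 0x01}}

	first := make([]bool, 3)
	d.Decode(first)
	second := make([]bool, 6) // spans the boundary between the two bytes
	d.Decode(second)

	fmt.Println(first)
	fmt.Println(second)
}
```

Reading three values and then six exercises the case where a batch ends in the middle of a byte and the next batch must resume from that bit.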


