[jira] [Updated] (PARQUET-1124) Add new compression codecs to the Parquet spec

2017-10-12 Thread Lars Volker (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated PARQUET-1124:
---------------------------------
Summary: Add new compression codecs to the Parquet spec  (was: Add new 
compression codecs to the Paruqet spec)

> Add new compression codecs to the Parquet spec
> ----------------------------------------------
>
> Key: PARQUET-1124
> URL: https://issues.apache.org/jira/browse/PARQUET-1124
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: format-2.3.2
>
>
> After [recent 
> tests|https://lists.apache.org/thread.html/2fc572ac5fd4ac414c39047b1e6e81c36c38fc0f92e85b9aa4e5493a@%3Cdev.parquet.apache.org%3E],
>  I think we should add Zstd to the spec.
> I'm also proposing we add LZ4 because it is widely available and outperforms 
> Snappy. As a successor to Snappy that favors speed over compression ratio, I 
> think it makes sense to have it.
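For context, the spec change amounts to new values in the CompressionCodec enum in parquet.thrift. A sketch of what the addition could look like (the numeric tags shown are illustrative and must match whatever the merged spec assigns):

```thrift
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4;
  LZ4 = 5;   // proposed: very fast compression, modest ratio
  ZSTD = 6;  // proposed: good ratio at competitive speed
}
```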



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PARQUET-1137) Check Makefile in Travis build

2017-10-12 Thread Lars Volker (JIRA)
Lars Volker created PARQUET-1137:


 Summary: Check Makefile in Travis build
 Key: PARQUET-1137
 URL: https://issues.apache.org/jira/browse/PARQUET-1137
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Lars Volker
Assignee: Lars Volker


We recently figured out that the Makefile was broken, and it would be best to 
check it during the Travis tests. I have a fix locally that I'll rebase and 
push once [PR #72|https://github.com/apache/parquet-format/pull/72] has been 
merged.





[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-12 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202552#comment-16202552
 ] 

Deepak Majeti commented on PARQUET-1065:


INT96 timestamps can be sorted using both signed and unsigned sort orders.
The date values are always positive since they are Julian day numbers. 
Therefore, both orders should work.
Discussion on how the values must be compared is here: 
https://github.com/apache/parquet-format/pull/55


> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)
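To see why byte-wise unsigned ordering in particular goes wrong, consider the conventional INT96 timestamp layout: 8 bytes of little-endian nanoseconds-of-day followed by 4 bytes of little-endian Julian day. A minimal Java sketch (class name and sample values are illustrative): because the nanoseconds field occupies the leading bytes, lexicographic comparison of the raw bytes can contradict chronological order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class Int96Order {
    // Conventional Parquet INT96 timestamp encoding:
    // 8 bytes little-endian nanoseconds-of-day, then
    // 4 bytes little-endian Julian day number.
    static byte[] int96(long nanosOfDay, int julianDay) {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    public static void main(String[] args) {
        byte[] later = int96(5L, 2457000);    // later day, tiny time-of-day
        byte[] earlier = int96(10L, 2456999); // earlier day, larger nanos
        // Unsigned lexicographic comparison of the raw bytes sorts the
        // chronologically later value first:
        System.out.println(Arrays.compareUnsigned(later, earlier) < 0); // prints true
    }
}
```

Here the chronologically later timestamp (a later Julian day with a small time-of-day) sorts before the earlier one, which is why stats computed under such an ordering cannot be trusted for timestamp predicates.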





[jira] [Assigned] (PARQUET-1070) Add CPack support to the build

2017-10-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1070:
-------------------------------------

Assignee: Mike Trinkala

> Add CPack support to the build
> ------------------------------
>
> Key: PARQUET-1070
> URL: https://issues.apache.org/jira/browse/PARQUET-1070
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Mike Trinkala
>Assignee: Mike Trinkala
>Priority: Minor
>  Labels: build
> Fix For: cpp-1.3.1
>
>






[jira] [Resolved] (PARQUET-1070) Add CPack support to the build

2017-10-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1070.
-----------------------------------
   Resolution: Fixed
Fix Version/s: cpp-1.3.1

Issue resolved by pull request 409
[https://github.com/apache/parquet-cpp/pull/409]

> Add CPack support to the build
> ------------------------------
>
> Key: PARQUET-1070
> URL: https://issues.apache.org/jira/browse/PARQUET-1070
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Mike Trinkala
>Priority: Minor
>  Labels: build
> Fix For: cpp-1.3.1
>
>






[jira] [Resolved] (PARQUET-1136) Makefile is broken

2017-10-12 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-1136.

Resolution: Fixed

Merged #73.

> Makefile is broken
> ------------------
>
> Key: PARQUET-1136
> URL: https://issues.apache.org/jira/browse/PARQUET-1136
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Lars Volker
>Assignee: Lars Volker
> Fix For: format-2.4.0
>
>
> The path to the parquet.thrift file in the Makefile is broken. I'll send a PR 
> shortly.





[jira] [Updated] (PARQUET-1136) Makefile is broken

2017-10-12 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1136:
-------------------------------
Fix Version/s: format-2.4.0

> Makefile is broken
> ------------------
>
> Key: PARQUET-1136
> URL: https://issues.apache.org/jira/browse/PARQUET-1136
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Lars Volker
>Assignee: Lars Volker
> Fix For: format-2.4.0
>
>
> The path to the parquet.thrift file in the Makefile is broken. I'll send a PR 
> shortly.





[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression

2017-10-12 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201945#comment-16201945
 ] 

Justin Uang commented on PARQUET-118:
-------------------------------------

Instead of switching to on-heap buffers, which I imagine would negatively 
affect performance, I would be in favor of making sure that the off-heap 
memory used doesn't scale with the size of what we're decompressing.

> Provide option to use on-heap buffers for Snappy compression/decompression
> --------------------------------------------------------------------------
>
> Key: PARQUET-118
> URL: https://issues.apache.org/jira/browse/PARQUET-118
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Patrick Wendell
>
> The current code uses direct off-heap buffers for decompression. If many 
> decompressors are instantiated across multiple threads, and/or the objects 
> being decompressed are large, this can lead to a huge amount of off-heap 
> allocation by the JVM. This can be exacerbated if, overall, there is no heap 
> contention, since no GC will be performed to reclaim the space used by these 
> buffers.
> It would be nice if there were a flag we could use to simply allocate on-heap 
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of 
> storage and caused our Java processes (running within containers) to be 
> terminated by the kernel OOM-killer.
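A sketch of what such a flag could look like (the helper and flag name are hypothetical, not actual parquet-mr API): the only change is choosing ByteBuffer.allocate over ByteBuffer.allocateDirect, trading native memory for GC-managed heap that the collector can reclaim under pressure.

```java
import java.nio.ByteBuffer;

public class BufferAlloc {
    // Hypothetical switch mirroring the proposal: allocate decompression
    // buffers on the heap instead of in native (direct) memory.
    static ByteBuffer newBuffer(int size, boolean onHeap) {
        return onHeap
                ? ByteBuffer.allocate(size)        // GC-managed heap buffer
                : ByteBuffer.allocateDirect(size); // native off-heap buffer
    }

    public static void main(String[] args) {
        ByteBuffer b = newBuffer(64 * 1024, true);
        System.out.println(b.isDirect()); // prints false
    }
}
```

On-heap buffers count against the ordinary -Xmx heap limit that container sizing already accounts for, whereas direct buffers live outside it and are only released when the owning ByteBuffer object is eventually collected.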





[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression

2017-10-12 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201943#comment-16201943
 ] 

Justin Uang commented on PARQUET-118:
-------------------------------------

We are running into this same issue with Spark. We have some rows that are 
fairly large, and because of the amount of off-heap storage being used, YARN 
is killing the containers for going over the memoryOverhead set in Spark. It 
seems like the amount of off-heap memory used scales with the size of a row, 
which appears wrong.

> Provide option to use on-heap buffers for Snappy compression/decompression
> --------------------------------------------------------------------------
>
> Key: PARQUET-118
> URL: https://issues.apache.org/jira/browse/PARQUET-118
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Patrick Wendell
>
> The current code uses direct off-heap buffers for decompression. If many 
> decompressors are instantiated across multiple threads, and/or the objects 
> being decompressed are large, this can lead to a huge amount of off-heap 
> allocation by the JVM. This can be exacerbated if, overall, there is no heap 
> contention, since no GC will be performed to reclaim the space used by these 
> buffers.
> It would be nice if there were a flag we could use to simply allocate on-heap 
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of 
> storage and caused our Java processes (running within containers) to be 
> terminated by the kernel OOM-killer.


