[jira] [Updated] (PARQUET-1124) Add new compression codecs to the Parquet spec
[ https://issues.apache.org/jira/browse/PARQUET-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Volker updated PARQUET-1124:
---------------------------------
    Summary: Add new compression codecs to the Parquet spec  (was: Add new compression codecs to the Paruqet spec)

> Add new compression codecs to the Parquet spec
> ----------------------------------------------
>
>                 Key: PARQUET-1124
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1124
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-format
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>             Fix For: format-2.3.2
>
> After [recent tests|https://lists.apache.org/thread.html/2fc572ac5fd4ac414c39047b1e6e81c36c38fc0f92e85b9aa4e5493a@%3Cdev.parquet.apache.org%3E], I think we should add Zstd to the spec.
> I'm also proposing we add LZ4 because it is widely available and outperforms Snappy. As a successor to Snappy for fast compression rather than high compression ratios, I think it makes sense to have it.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
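For context, codecs are enumerated in the CompressionCodec enum in parquet.thrift, so adding the new codecs would presumably mean appending values there. A hypothetical sketch of what the proposal implies; the existing values are from the spec, while the LZ4 and ZSTD entries and their numbering are an assumption, to be fixed by the spec change itself:

```
// Hypothetical extension of parquet.thrift's CompressionCodec enum.
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4;
  LZ4 = 5;   // proposed: fast compression, successor to Snappy
  ZSTD = 6;  // proposed: high compression ratio at good speed
}
```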
[jira] [Created] (PARQUET-1137) Check Makefile in Travis build
Lars Volker created PARQUET-1137:
------------------------------------

             Summary: Check Makefile in Travis build
                 Key: PARQUET-1137
                 URL: https://issues.apache.org/jira/browse/PARQUET-1137
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Lars Volker
            Assignee: Lars Volker


We recently discovered that the Makefile was broken, so it would be best to check it during the Travis tests. I have a fix locally that I'll rebase and push once [PR #72|https://github.com/apache/parquet-format/pull/72] has been merged.
[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type
[ https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202552#comment-16202552 ]

Deepak Majeti commented on PARQUET-1065:
----------------------------------------
INT96 timestamps can be sorted using both signed and unsigned sort orders. The date values are always positive since they are Julian day numbers, so both orders should work. Discussion on how the values must be compared is here: https://github.com/apache/parquet-format/pull/55

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>
> [parquet.thrift in parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37] defines the sort order for INT96 to be signed.
> [ParquetMetadataConverter.java in parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422] uses unsigned ordering instead. In practice, INT96 is only used for timestamps, and neither signed nor unsigned ordering of the numeric values is correct for this purpose. For this reason, the INT96 sort order should be specified as undefined.
> (As a special case, min == max signifies that all values are the same, and can be considered valid even for undefined orderings.)
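One way to see the pitfall the issue describes: a byte-wise comparison of the stored little-endian values does not match chronological order. A minimal Python sketch, assuming the conventional INT96 timestamp layout (8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day; the encoding helper is illustrative, not parquet-mr's actual code):

```python
import struct

def encode_int96(julian_day, nanos_of_day):
    # Assumed INT96 timestamp layout (the convention used by Impala/Hive):
    # 8-byte little-endian nanoseconds-of-day, then a 4-byte
    # little-endian Julian day number.
    return struct.pack('<qi', nanos_of_day, julian_day)

earlier = encode_int96(2457000, 1)    # 1 ns after midnight
later   = encode_int96(2457000, 256)  # 256 ns after midnight

# Chronologically `later` comes after `earlier`, but an unsigned
# byte-wise comparison of the encodings says the opposite, because the
# little-endian layout puts the least significant byte first.
assert later < earlier
```

This is why the ordering used for binary min/max statistics cannot simply be applied to the raw INT96 bytes, supporting the proposal to declare the INT96 sort order undefined.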
[jira] [Assigned] (PARQUET-1070) Add CPack support to the build
[ https://issues.apache.org/jira/browse/PARQUET-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned PARQUET-1070:
-------------------------------------
    Assignee: Mike Trinkala

> Add CPack support to the build
> ------------------------------
>
>                 Key: PARQUET-1070
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1070
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Mike Trinkala
>            Assignee: Mike Trinkala
>            Priority: Minor
>              Labels: build
>             Fix For: cpp-1.3.1
[jira] [Resolved] (PARQUET-1070) Add CPack support to the build
[ https://issues.apache.org/jira/browse/PARQUET-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1070.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: cpp-1.3.1

Issue resolved by pull request 409
[https://github.com/apache/parquet-cpp/pull/409]

> Add CPack support to the build
> ------------------------------
>
>                 Key: PARQUET-1070
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1070
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Mike Trinkala
>            Priority: Minor
>              Labels: build
>             Fix For: cpp-1.3.1
[jira] [Resolved] (PARQUET-1136) Makefile is broken
[ https://issues.apache.org/jira/browse/PARQUET-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue resolved PARQUET-1136.
--------------------------------
    Resolution: Fixed

Merged #73.

> Makefile is broken
> ------------------
>
>                 Key: PARQUET-1136
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1136
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Lars Volker
>            Assignee: Lars Volker
>             Fix For: format-2.4.0
>
> The path to the parquet.thrift file in the Makefile is broken. I'll send a PR shortly.
[jira] [Updated] (PARQUET-1136) Makefile is broken
[ https://issues.apache.org/jira/browse/PARQUET-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Blue updated PARQUET-1136:
-------------------------------
    Fix Version/s: format-2.4.0

> Makefile is broken
> ------------------
>
>                 Key: PARQUET-1136
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1136
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Lars Volker
>            Assignee: Lars Volker
>             Fix For: format-2.4.0
>
> The path to the parquet.thrift file in the Makefile is broken. I'll send a PR shortly.
[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression
[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201945#comment-16201945 ]

Justin Uang commented on PARQUET-118:
-------------------------------------
Instead of switching to on-heap buffers, which I imagine would negatively affect performance, I would be in favor of making sure that the off-heap memory used doesn't scale with the size of what we're decompressing.

> Provide option to use on-heap buffers for Snappy compression/decompression
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-118
>                 URL: https://issues.apache.org/jira/browse/PARQUET-118
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Patrick Wendell
>
> The current code uses direct off-heap buffers for decompression. If many decompressors are instantiated across multiple threads, and/or the objects being decompressed are large, this can lead to a huge amount of off-heap allocation by the JVM. This can be exacerbated if, overall, there is no heap contention, since no GC will be performed to reclaim the space used by these buffers.
> It would be nice if there were a flag we could use to simply allocate on-heap buffers here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of storage and caused our Java processes (running within containers) to be terminated by the kernel OOM-killer.
[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression
[ https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201943#comment-16201943 ]

Justin Uang commented on PARQUET-118:
-------------------------------------
We are running into this same issue with Spark. We have some rows that are fairly large, and because of the amount of off-heap storage being used, YARN is killing the job for exceeding the memoryOverhead set in Spark. It seems that the amount of off-heap memory used scales with the size of a row, which appears wrong.
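The "shouldn't scale with what we're decompressing" idea can be illustrated by decompressing in fixed-size chunks, so working memory is bounded by the chunk size rather than by the decompressed size. A rough Python sketch using zlib as a stand-in for Snappy (parquet-mr's decompressor is Java and works differently; this only illustrates the bounded-buffer approach the comment asks for):

```python
import zlib

def decompress_bounded(data: bytes, chunk_size: int = 64 * 1024) -> bytes:
    """Decompress with O(chunk_size) working memory per step instead of
    buffers that grow with the decompressed size. zlib stands in for
    Snappy here; this is an illustration, not parquet-mr's code."""
    d = zlib.decompressobj()
    pieces = []
    buf = data
    while True:
        # Produce at most chunk_size bytes per call; compressed input
        # that was not yet consumed is carried over in d.unconsumed_tail.
        pieces.append(d.decompress(buf, chunk_size))
        buf = d.unconsumed_tail
        if not buf:
            break
    pieces.append(d.flush())
    return b"".join(pieces)

original = b"row-data-" * 200_000  # ~1.8 MB once decompressed
assert decompress_bounded(zlib.compress(original)) == original
```

In the parquet-mr setting, the analogous fix would be capping (or reusing) the direct buffers instead of sizing them to each row, which would address the container OOM kills without paying the on-heap copy cost.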