[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269028#comment-16269028 ]

Ryan Blue commented on PARQUET-845:
---

Here's my initial write-up of the encodings I'm proposing:
https://lists.apache.org/thread.html/8fc11a8e1538b477162eed2a89946e49dbdcf595b5c7fbe80533432d@%3Cdev.parquet.apache.org%3E

> Efficient storage for several INT_8 and INT_16
> --
>
> Key: PARQUET-845
> URL: https://issues.apache.org/jira/browse/PARQUET-845
> Project: Parquet
> Issue Type: Wish
> Reporter: Fernando Pereira
> Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte
> array) can make a big difference.
> In parquet, efficient algorithms exist for INT32, so if the LogicalType is
> INT_8 the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or
> eventually INT_32)?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
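The aggregation described in the issue, several INT8 values stored in one INT32 field, might look roughly like the sketch below. This is only an illustration: `pack_int8x4` and `unpack_int8x4` are hypothetical helper names, the little-endian layout is an assumption, and nothing here is a Parquet API.

```python
import struct


def pack_int8x4(a, b, c, d):
    """Pack four signed 8-bit values into one signed 32-bit integer.

    Assumption: little-endian layout, so `a` ends up in the lowest byte.
    """
    return struct.unpack("<i", struct.pack("<4b", a, b, c, d))[0]


def unpack_int8x4(packed):
    """Recover the four signed 8-bit values from a packed 32-bit integer."""
    return struct.unpack("<4b", struct.pack("<i", packed))


packed = pack_int8x4(1, 2, 3, 4)
restored = unpack_int8x4(packed)
```

Each packed value occupies exactly one INT32 slot, so four INT8 columns cost one physical value per row instead of four.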
[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269020#comment-16269020 ]

Ryan Blue commented on PARQUET-845:
---

The main blocker for delta encoding is that we haven't finalized the spec for the set of 2.0 encodings. Current releases are backward-compatible, but we don't guarantee forward-compatibility if you use the current set of 2.0 encodings. In practice, if you upgrade to a new version in the future, you might start writing files that aren't supported by current readers. (But we do guarantee that new readers will be able to read files written by older writers.)

To make that forward-compatibility guarantee, we want to lock down what writers should produce. What writers should produce for delta encoding is still undecided. The delta encoding implementation isn't based on the RLE encoding (a combination of bit packing and run-length encoding) that Parquet uses in a lot of places, because the RLE encoding doesn't support negative integers. Instead, it is a complicated custom encoding. I've proposed an alternative: zig-zag encode and then use the existing RLE encoding to support negative numbers, and then layer deltas on top of that. Those encodings are in a branch:
https://github.com/rdblue/parquet-mr/commit/89b4f16bdfd3817ece42049748745a3b22b83335

I think the current blocker is finding time for people to evaluate the encodings and discuss them somewhere so we can decide. If you'd like to test out the encodings and push on this issue, that would be a great place to help out. Thanks!
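The zig-zag idea in the comment above can be sketched as follows. This is a minimal illustration, not the code from the linked branch: the function names are hypothetical, and taking deltas first and zig-zagging the differences is just one possible way to combine the two steps. Zig-zag maps signed integers to non-negative ones (0, -1, 1, -2 become 0, 1, 2, 3), so small-magnitude values stay small and can be passed to an encoder that only accepts non-negative integers.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed integer to a non-negative one.

    Assumption: n fits in a signed 32-bit range, so n >> 31 is 0 or -1.
    """
    return (n << 1) ^ (n >> 31)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def delta_zigzag(values):
    """Hypothetical helper: zig-zag the difference from the previous value.

    The first value is stored as a difference from 0.
    """
    prev = 0
    out = []
    for v in values:
        out.append(zigzag_encode(v - prev))
        prev = v
    return out


def undo_delta_zigzag(encoded):
    """Rebuild the original values by accumulating decoded differences."""
    prev = 0
    out = []
    for z in encoded:
        prev += zigzag_decode(z)
        out.append(prev)
    return out


values = [100, 98, 101, 101, 97]
encoded = delta_zigzag(values)          # all entries are non-negative
restored = undo_delta_zigzag(encoded)
# Bit width a bit-packing encoder would need for everything after the first value.
tail_bits = max(z.bit_length() for z in encoded[1:])
```

The differences here are 100, -2, 3, 0, -4; after zig-zag none of them is negative, which is the property a bit-packing or run-length encoder needs.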
[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268415#comment-16268415 ]

Fernando Pereira commented on PARQUET-845:
--

Great, thanks for the clarification!
I would be happy to contribute! Would you mind explaining in more detail the "work that needs to be done before we get there"?