[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16

2017-11-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269028#comment-16269028
 ] 

Ryan Blue commented on PARQUET-845:
---

Here's my initial write-up of the encodings I'm proposing: 
https://lists.apache.org/thread.html/8fc11a8e1538b477162eed2a89946e49dbdcf595b5c7fbe80533432d@%3Cdev.parquet.apache.org%3E

> Efficient storage for several INT_8 and INT_16
> --
>
> Key: PARQUET-845
> URL: https://issues.apache.org/jira/browse/PARQUET-845
> Project: Parquet
> Issue Type: Wish
> Reporter: Fernando Pereira
> Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte 
> array) can make a big difference.
> In Parquet, efficient algorithms exist for INT32, so if the LogicalType is 
> INT_8, the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better 
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or 
> eventually INT_32)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16

2017-11-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269020#comment-16269020
 ] 

Ryan Blue commented on PARQUET-845:
---

The main blocker for delta encoding is that we haven't finalized the spec for 
the set of 2.0 encodings. Current releases are backward-compatible, but we 
don't guarantee forward-compatibility if you use the current set of 2.0 
encodings. In practice, if you upgrade to a new version in the future, you 
might start writing files that aren't supported by current readers. (But we do 
guarantee that new readers will be able to read files written by older 
writers.) To make that forward-compatibility guarantee, we want to lock down 
what writers should produce.

What writers should produce for delta encoding is still undecided. The delta 
encoding implementation isn't based on the RLE encoding (a combination of bit 
packing and run-length encoding) that Parquet uses in a lot of places because 
the RLE encoding doesn't support negative integers. Instead, it is a 
complicated custom encoding. I've proposed an alternative: zig-zag encode and 
then use the existing RLE encoding to support negative numbers, and then layer 
deltas on top of that. Those encodings are in a branch: 
https://github.com/rdblue/parquet-mr/commit/89b4f16bdfd3817ece42049748745a3b22b83335
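
To illustrate the layering described above, here is a minimal Python sketch 
of zig-zag encoding applied to deltas. It is not taken from the linked branch 
or from the Parquet spec: the function names are hypothetical, the first value 
is assumed to be stored as a delta from 0, and the final RLE/bit-packing step 
is left out. It only shows why zig-zag makes small negative deltas small 
non-negative integers that an unsigned encoding could then handle.

```python
from typing import List


def zigzag_encode(n: int) -> int:
    """Map a signed integer to a non-negative one: 0, -1, 1, -2 -> 0, 1, 2, 3.

    For fixed-width 32-bit values this is usually written (n << 1) ^ (n >> 31);
    the arithmetic form below gives the same mapping for Python's unbounded ints.
    """
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def delta_zigzag_encode(values: List[int]) -> List[int]:
    """Replace each value with the zig-zag encoded difference from its predecessor.

    Assumption for this sketch: the first value is a delta from 0.
    """
    out = []
    prev = 0
    for v in values:
        out.append(zigzag_encode(v - prev))
        prev = v
    return out


def delta_zigzag_decode(encoded: List[int]) -> List[int]:
    """Invert delta_zigzag_encode by accumulating the decoded deltas."""
    out = []
    prev = 0
    for z in encoded:
        prev += zigzag_decode(z)
        out.append(prev)
    return out


values = [100, 98, 101, 101, 99]
encoded = delta_zigzag_encode(values)  # deltas 100, -2, 3, 0, -2
assert delta_zigzag_decode(encoded) == values
```

After the first entry, the encoded values are all small and non-negative, 
which is the property that would let a bit-packing or run-length step store 
them in a few bits each.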

I think the current blocker is that people need time to evaluate the encodings 
and discuss them somewhere so we can decide. If you'd like to test out the 
encodings and push on this issue, that would be a great place to help out. 
Thanks!


[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16

2017-11-28 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268415#comment-16268415
 ] 

Fernando Pereira commented on PARQUET-845:
--

Great, thanks for the clarification!
I would be happy to contribute! Would you mind explaining in more detail the 
"work that needs to be done before we get there"?
