[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268415#comment-16268415 ]

Fernando Pereira commented on PARQUET-845:
--

Great, thanks for the clarification!
I would be happy to contribute! Would you mind explaining in more detail the "work that needs to be done before we get there"?

> Efficient storage for several INT_8 and INT_16
> --
>
> Key: PARQUET-845
> URL: https://issues.apache.org/jira/browse/PARQUET-845
> Project: Parquet
> Issue Type: Wish
> Reporter: Fernando Pereira
> Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte
> array) can make a big difference.
> In parquet, efficient algorithms exist for INT32, so if the LogicalType is
> INT_8 the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or
> eventually INT_32)?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Comment Edited] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ]

Fernando Pereira edited comment on PARQUET-845 at 11/27/17 6:28 PM:

I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

In terms of logical types, we have INT8, INT16, etc., which sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Are there situations where it falls back to the PLAIN encoding and uses 32 physical bits?
The same goes for a field that is an array of INT8: is it going to use any run-length encoder? [This was the initial question, actually]

PS: my question targets parquet-cpp especially, even though I am interested in the "standard" too.

Thanks so much

was (Author: ferdonline):
I'm coming back to this issue, so hopefully we close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc, that sounds fine for me.
My question was regarding efficient storage, and whether parquet already chooses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Of it falls back to the PLAIN encoding and uses 32 physical bits?
Same for a field which is Array of INT8, will it use e.g. Delta-length byte array encoding?

Thanks
[jira] [Comment Edited] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ]

Fernando Pereira edited comment on PARQUET-845 at 11/27/17 6:21 PM:

I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?
The same goes for a field that is an array of INT8: will it use e.g. Delta-length byte array encoding?

Thanks

was (Author: ferdonline):
I'm coming back to this issue, so hopefully we close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc, that sounds fine for me.
My question was regarding efficient storage, and whether parquet by default already choses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Of it falls back to the PLAIN encoding and uses 32 physical bits?

Thanks
[jira] [Comment Edited] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ]

Fernando Pereira edited comment on PARQUET-845 at 11/23/17 1:31 PM:

I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?

Thanks

was (Author: ferdonline):
I'm coming back to this issue, so hopefully we close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc, that sounds fine for me.
My question was regarding efficient storage, and whether parquet by default already choses efficient encoders +by default+. (?)
If the uses doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Of it falls back to the PLAIN encoding and uses 32 physical bits?

Thanks
[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16
[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ]

Fernando Pereira commented on PARQUET-845:
--

I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+. (?)
If the user doesn't specify any encoding, does parquet-cpp use the most advanced Delta Encodings for e.g. INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?

Thanks
[jira] [Created] (PARQUET-845) Efficient storage for several INT_8 and INT_16
Fernando Pereira created PARQUET-845:

Summary: Efficient storage for several INT_8 and INT_16
Key: PARQUET-845
URL: https://issues.apache.org/jira/browse/PARQUET-845
Project: Parquet
Issue Type: Wish
Reporter: Fernando Pereira
Priority: Minor

In very large datasets, aggregating several INT8 into INT32 fields (or byte array) can make a big difference.
In parquet, efficient algorithms exist for INT32, so if the LogicalType is INT_8 the encoded int might take up only one byte.
However further optimizations could be made by allowing the user to better specify the types.
What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or eventually INT_32)?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
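
For illustration only, the aggregation of several INT8 values into one INT32 field mentioned in the issue description could look roughly like the sketch below. It is not a Parquet or parquet-cpp API: the helper names pack_int8x4 and unpack_int8x4 are hypothetical, and the little-endian byte layout is an assumption made for this example.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one signed 32-bit integer.

    Hypothetical helper; the little-endian layout is an assumption.
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # struct raises struct.error if any value is outside -128..127.
    raw = struct.pack("<4b", *values)
    # Reinterpret the same four bytes as a single signed int32.
    return struct.unpack("<i", raw)[0]


def unpack_int8x4(packed):
    """Recover the four signed 8-bit integers from a packed int32."""
    raw = struct.pack("<i", packed)
    return list(struct.unpack("<4b", raw))


samples = [1, -2, 3, -4]
packed = pack_int8x4(samples)
# The round trip returns the original four values.
assert unpack_int8x4(packed) == samples
```

Whether such manual packing saves space compared with a column of INT_8 values depends on the encoding the writer applies, which is the open question in this thread.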