[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ]
Fernando Pereira edited comment on PARQUET-845 at 11/27/17 6:21 PM:
--------------------------------------------------------------------

I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+.
If the user doesn't specify any encoding, does parquet-cpp use the more advanced delta encodings for, e.g., INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?
The same goes for a field that is an array of INT8: will it use, e.g., Delta-length byte array encoding?

Thanks


was (Author: ferdonline):
I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :)

I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.
My question was about efficient storage, and whether parquet already chooses efficient encoders +by default+.
If the user doesn't specify any encoding, does parquet-cpp use the more advanced delta encodings for, e.g., INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?

Thanks

> Efficient storage for several INT_8 and INT_16
> ----------------------------------------------
>
>                 Key: PARQUET-845
>                 URL: https://issues.apache.org/jira/browse/PARQUET-845
>             Project: Parquet
>          Issue Type: Wish
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte
> array) can make a big difference.
> In parquet, efficient algorithms exist for INT32, so if the LogicalType is
> INT_8 the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or
> eventually INT_32)?
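
For illustration, the aggregation mentioned in the issue description (several INT8 values stored in one INT32 field) could look roughly like this when done by hand in application code. This is only a sketch using Python's standard struct module: the helper names pack_int8x4 and unpack_int8x4 are hypothetical, the little-endian byte order is an assumption, and nothing here is part of parquet-cpp or any Parquet API.

```python
import struct


def pack_int8x4(values):
    """Pack exactly four signed 8-bit integers into one signed 32-bit integer.

    Hypothetical helper for illustration; the first value ends up in the
    least significant byte (little-endian layout, chosen arbitrarily).
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # struct.pack raises struct.error if a value is outside [-128, 127].
    raw = struct.pack("<4b", *values)
    return struct.unpack("<i", raw)[0]


def unpack_int8x4(packed):
    """Recover the four signed 8-bit integers from a packed 32-bit integer."""
    raw = struct.pack("<i", packed)
    return list(struct.unpack("<4b", raw))


if __name__ == "__main__":
    original = [1, -2, 127, -128]
    packed = pack_int8x4(original)
    # The round trip should give back the original four values.
    print(packed, unpack_int8x4(packed))
```

Packing by hand like this moves the type information out of the file schema, so readers need to know the layout; that is the trade-off behind asking whether the default encodings already store INT8 columns compactly.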