[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839660#comment-15839660 ]
Uwe L. Korn commented on PARQUET-845:
-------------------------------------

Storage-wise, it should not make a difference whether you have an INT8 or an INT32 physical type. Putting four INT8s into a single INT32 would actually decrease Parquet's efficiency, as some of the encoding "tricks" are no longer as effective. (Usually my INT8 columns take less than a bit per row when stored in Parquet.)

Or are you maybe talking about a particular API that should return INT8s instead of INT32s?

> Efficient storage for several INT_8 and INT_16
> ----------------------------------------------
>
>                 Key: PARQUET-845
>                 URL: https://issues.apache.org/jira/browse/PARQUET-845
>             Project: Parquet
>          Issue Type: Wish
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte
> array) can make a big difference.
> In parquet, efficient algorithms exist for INT32, so if the LogicalType is
> INT_8 the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or
> eventually INT_32)?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
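
As a rough illustration of the comment's point that packing four INT8s into one INT32 can hurt encoding efficiency, here is a minimal sketch under a deliberately simplified assumption: that the encoder bit-packs every value in a column to the width of the column's largest value. This is a toy model, not Parquet's actual encoder, and the names `bit_width` and `pack4` as well as the sample columns are hypothetical.

```python
def bit_width(values):
    """Bits per value when every value is packed to the width of the largest one."""
    return max(1, max(v.bit_length() for v in values))


def pack4(a, b, c, d):
    """Place four byte-sized values into the four bytes of one 32-bit integer."""
    return a | (b << 8) | (c << 16) | (d << 24)


rows = 1000
# Four hypothetical columns of small integers, each holding values in 0..3.
cols = [[(i + k) % 4 for i in range(rows)] for k in range(4)]

# Stored separately: each column only needs 2 bits per value.
separate_bits = sum(bit_width(col) * rows for col in cols)

# Stored packed: the high byte pushes the required width up to 26 bits.
packed = [pack4(*row) for row in zip(*cols)]
packed_bits = bit_width(packed) * rows

print("separate:", separate_bits / (4 * rows), "bits per original value")
print("packed:  ", packed_bits / (4 * rows), "bits per original value")
```

In this toy model the unused high bits of each byte become part of the packed value, so the packed column needs a wider bit width than the four separate columns combined. Real Parquet files also apply dictionary and run-length encoding, so actual numbers will differ.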