[ https://issues.apache.org/jira/browse/HIVE-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Teddy Choi updated HIVE-13306:
------------------------------
    Attachment: HIVE-13306.2.patch

This patch is an improved implementation of the new decimal vectorization. I wanted to see whether it passes the integration tests. However, it still needs to be integrated with the execution engine. I will keep working on this topic.

> Better Decimal vectorization
> ----------------------------
>
>                 Key: HIVE-13306
>                 URL: https://issues.apache.org/jira/browse/HIVE-13306
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Matt McCline
>            Assignee: Teddy Choi
>            Priority: Critical
>         Attachments: HIVE-13306.1.patch, HIVE-13306.2.patch
>
>
> Decimal Vectorization Requirements
> •  Today, the LongColumnVector, DoubleColumnVector, BytesColumnVector, and
>    TimestampColumnVector classes store the data as primitive Java data types
>    (long, double, or byte arrays) for efficiency.
> •  DecimalColumnVector is different - it has an array of Object references
>    to HiveDecimal objects.
> •  The HiveDecimal object uses an internal BigDecimal object for its
>    implementation. Further, BigDecimal itself uses an internal BigInteger
>    object for its implementation, and BigInteger uses an int array. 4
>    objects total.
> •  And HiveDecimal is an immutable object, which means arithmetic and
>    other operations produce a new HiveDecimal object with 3 new objects
>    underneath.
> •  A major reason Vectorization is fast is that the ColumnVector classes,
>    except DecimalColumnVector, do not have to allocate additional memory per
>    row. This avoids the memory fragmentation and pressure on the Java Garbage
>    Collector that DecimalColumnVector can generate. It is very significant.
> •  What can be done with DecimalColumnVector to make it much more
>    efficient?
>    o  Design several new decimal classes that allow the caller to manage the
>       decimal storage.
>    o  If it takes N int values to store a decimal (e.g. N=1..5), then a new
>       DecimalColumnVector would have an int[] of length N*1024 (where 1024 is
>       the default column vector size).
>    o  Why store a decimal in separate int values?
>       •  Java does not support 128 bit integers.
>       •  Java does not support unsigned integers.
>       •  In order to do multiplication of a decimal represented in a long you
>          need twice the storage (i.e. 128 bits). So you need to represent
>          parts in 32 bit integers.
>       •  But since we do not have unsigned integers, you can really only do
>          multiplications on N-1 bits, or 31 bits.
>       •  So, 5 ints are needed for decimal storage... of 38 digits.
>    o  It makes sense to have just one algorithm for decimals rather than one
>       for HiveDecimal and another for DecimalColumnVector. So, make
>       HiveDecimal store N int values, too.
>    o  A lower level primitive decimal class would accept decimals stored as
>       int arrays and produce results into int arrays. It would be used by
>       HiveDecimal and DecimalColumnVector.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
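
For illustration only, the storage layout described in the requirements above could look roughly like the following. This is a minimal sketch and not the code in the attached patches: the class name DecimalLimbSketch, its methods, and the caller-supplied scratch buffer are all hypothetical. It assumes non-negative unscaled values stored as five 31-bit limbs per row, least significant limb first, and it leaves out sign, scale, and the 38-digit precision check. The arithmetic behind "5 ints": 38 decimal digits need 127 bits, four 31-bit limbs give only 124, five give 155. BigInteger appears only in the load/store helpers used for checking; the multiply itself allocates nothing.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical sketch: decimals stored as 31-bit limbs in a flat int[].
final class DecimalLimbSketch {
    static final int LIMB_BITS = 31;
    static final int LIMB_MASK = 0x7FFFFFFF;
    static final int LIMBS = 5;          // 5 * 31 = 155 bits per value
    static final int VECTOR_SIZE = 1024; // default column vector size

    private static final BigInteger MASK_BIG = BigInteger.valueOf(LIMB_MASK);

    // Row r occupies ints [r * LIMBS, (r + 1) * LIMBS), lowest limb first.
    static int[] newVector() {
        return new int[LIMBS * VECTOR_SIZE];
    }

    // Test helper: write a non-negative unscaled value into one row.
    static void store(int[] dst, int off, BigInteger v) {
        for (int i = 0; i < LIMBS; i++) {
            dst[off + i] = v.and(MASK_BIG).intValue();
            v = v.shiftRight(LIMB_BITS);
        }
    }

    // Test helper: read one row back as a BigInteger.
    static BigInteger load(int[] src, int off) {
        BigInteger v = BigInteger.ZERO;
        for (int i = LIMBS - 1; i >= 0; i--) {
            v = v.shiftLeft(LIMB_BITS).or(BigInteger.valueOf(src[off + i]));
        }
        return v;
    }

    // Schoolbook multiply: out = a * b, with no per-row allocation.
    // scratch must hold at least 2 * LIMBS ints and is owned by the caller.
    // Returns false (leaving out untouched) if the product exceeds 155 bits.
    static boolean multiply(int[] a, int aOff, int[] b, int bOff,
                            int[] out, int outOff, int[] scratch) {
        Arrays.fill(scratch, 0, 2 * LIMBS, 0);
        for (int i = 0; i < LIMBS; i++) {
            long ai = a[aOff + i];
            long carry = 0;
            for (int j = 0; j < LIMBS; j++) {
                // Two 31-bit limbs multiply to under 2^62, so the product
                // plus the partial sum and carry fits in a signed long.
                long t = ai * b[bOff + j] + scratch[i + j] + carry;
                scratch[i + j] = (int) (t & LIMB_MASK);
                carry = t >>> LIMB_BITS;
            }
            scratch[i + LIMBS] = (int) carry;
        }
        for (int k = LIMBS; k < 2 * LIMBS; k++) {
            if (scratch[k] != 0) {
                return false; // high limbs in use: result does not fit
            }
        }
        System.arraycopy(scratch, 0, out, outOff, LIMBS);
        return true;
    }
}
```

Because the result is copied out of the scratch buffer only at the end, the output row may be the same row as either input. A real implementation would also have to carry sign and scale and reject results wider than 38 digits; none of that is attempted here.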