Matt McCline created HIVE-13306:
-----------------------------------
Summary: Better Decimal vectorization
Key: HIVE-13306
URL: https://issues.apache.org/jira/browse/HIVE-13306
Project: Hive
Issue Type: Bug
Components: Hive
Reporter: Matt McCline
Priority: Critical
Decimal Vectorization Requirements
• Today, the LongColumnVector, DoubleColumnVector, BytesColumnVector, and
TimestampColumnVector classes store their data as primitive Java data types
(long, double, or byte arrays) for efficiency.
• DecimalColumnVector is different - it has an array of Object references
to HiveDecimal objects.
• The HiveDecimal object is implemented on top of an internal BigDecimal
object. BigDecimal in turn uses an internal BigInteger object, and BigInteger
uses an int array. That is 4 objects in total for each decimal value.
• Moreover, HiveDecimal is immutable, which means arithmetic and other
operations produce a new HiveDecimal object with 3 new objects underneath.
• A major reason vectorization is fast is that the ColumnVector classes,
except DecimalColumnVector, do not have to allocate additional memory per row.
This avoids the memory fragmentation and the pressure on the Java garbage
collector that DecimalColumnVector can generate, which is a significant cost.
• What can be done with DecimalColumnVector to make it much more
efficient?
o Design several new decimal classes that allow the caller to manage the
decimal storage.
o If it takes N int values to store a decimal (e.g. N=1..5), then a new
DecimalColumnVector would have an int[] of length N*1024 (where 1024 is the
default column vector size).
o Why store a decimal in separate int values?
• Java does not support 128 bit integers.
• Java does not support unsigned integers.
• Multiplying two decimals that are each represented in a 64 bit long
needs twice the storage for the result (i.e. 128 bits). So the value has to be
split into parts held in 32 bit integers, whose products fit in a long.
• But since Java has no unsigned types, multiplication can really only
use 31 of the 32 bits in each int.
• So, 5 ints are needed to store a decimal of 38 digits: 38 digits take
about 127 bits, 4 x 31 = 124 bits is too few, and 5 x 31 = 155 bits is enough.
o It makes sense to have just one algorithm for decimals rather than one
for HiveDecimal and another for DecimalColumnVector. So, make HiveDecimal
store N int values, too.
o A lower level primitive decimal class would accept decimals stored as
int arrays and produce results into int arrays. It would be used by both
HiveDecimal and DecimalColumnVector.
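
The storage layout and the low level primitive described above could look
roughly like the following. This is only a minimal sketch, not a proposed
implementation: the class name FlatDecimalSketch and all of its methods are
hypothetical, it handles unsigned magnitudes only (no sign, no scale), and
its overflow check is against the 5 x 31 = 155 bit capacity rather than the
38 digit limit. BigInteger is used only to read a result back for checking.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical sketch: decimal magnitudes as 31-bit limbs in one flat int[]. */
final class FlatDecimalSketch {
    static final int LIMBS = 5;                 // ints per decimal (N in the text)
    static final int BITS = 31;                 // usable bits per int (ints are signed)
    static final int MASK = Integer.MAX_VALUE;  // 0x7FFFFFFF, the low 31 bits
    static final int VECTOR_SIZE = 1024;        // default column vector size

    /** One column vector: N * 1024 ints, no per-row objects. */
    static int[] allocate() {
        return new int[LIMBS * VECTOR_SIZE];
    }

    /** Stores a non-negative long into the given row, least significant limb first. */
    static void set(int[] v, int row, long value) {
        if (value < 0) throw new IllegalArgumentException("magnitudes only");
        int base = row * LIMBS;
        for (int i = 0; i < LIMBS; i++) {
            v[base + i] = (int) (value & MASK);
            value >>>= BITS;
        }
    }

    /** out[rowOut] = a[rowA] + b[rowB]; returns false if the sum needs more than LIMBS limbs. */
    static boolean add(int[] a, int rowA, int[] b, int rowB, int[] out, int rowOut) {
        long carry = 0;
        for (int i = 0; i < LIMBS; i++) {
            long sum = (long) a[rowA * LIMBS + i] + b[rowB * LIMBS + i] + carry;
            out[rowOut * LIMBS + i] = (int) (sum & MASK);
            carry = sum >>> BITS;
        }
        return carry == 0;
    }

    /**
     * out[rowOut] = a[rowA] * b[rowB] by schoolbook multiplication.
     * The caller supplies scratch (at least 2 * LIMBS ints), so nothing is allocated.
     * Returns false, leaving out untouched, if the product needs more than LIMBS limbs.
     */
    static boolean multiply(int[] a, int rowA, int[] b, int rowB,
                            int[] out, int rowOut, int[] scratch) {
        Arrays.fill(scratch, 0, 2 * LIMBS, 0);
        int baseA = rowA * LIMBS;
        int baseB = rowB * LIMBS;
        for (int i = 0; i < LIMBS; i++) {
            long ai = a[baseA + i];
            long carry = 0;
            for (int j = 0; j < LIMBS; j++) {
                // Two 31-bit limbs multiply to under 2^62, so the product plus the
                // existing limb plus the carry still fits in a signed 64-bit long.
                long t = ai * b[baseB + j] + scratch[i + j] + carry;
                scratch[i + j] = (int) (t & MASK);
                carry = t >>> BITS;
            }
            scratch[i + LIMBS] = (int) carry;
        }
        for (int i = LIMBS; i < 2 * LIMBS; i++) {
            if (scratch[i] != 0) return false;  // does not fit in 155 bits
        }
        System.arraycopy(scratch, 0, out, rowOut * LIMBS, LIMBS);
        return true;
    }

    /** Reads a row back as a BigInteger; for checking results only, not a hot path. */
    static BigInteger toBigInteger(int[] v, int row) {
        BigInteger r = BigInteger.ZERO;
        for (int i = LIMBS - 1; i >= 0; i--) {
            r = r.shiftLeft(BITS).or(BigInteger.valueOf(v[row * LIMBS + i]));
        }
        return r;
    }

    public static void main(String[] args) {
        int[] col = allocate();
        int[] scratch = new int[2 * LIMBS];
        set(col, 0, 9_000_000_000_000_000_000L);
        set(col, 1, 9_000_000_000_000_000_000L);
        boolean ok = multiply(col, 0, col, 1, col, 2, scratch);
        System.out.println(ok + " " + toBigInteger(col, 2));
    }
}
```

In this sketch the only allocations are the column array and the scratch
buffer, both made once by the caller and reused for every row, which is the
property the other ColumnVector classes already have.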