[ 
https://issues.apache.org/jira/browse/HIVE-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-13306:
------------------------------
    Attachment: HIVE-13306.1.patch

It's a working draft. Compared with the existing implementations, it is about 
70x faster for addition, 3x faster for multiplication, and 2x faster for 
division. I will modify this code further to cover more use cases and to 
improve performance and readability. Thanks. :)

{noformat}
# Run complete. Total time: 00:02:30

Benchmark                                                                            Mode  Samples            Score   Error  Units
o.a.h.b.v.VectorizedArithmeticBench.DecimalColAddDecimalColColumnBench.bench         avgt        2   4012665235.500 ±   NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalColDivideDecimalColColumnBench.bench      avgt        2  19167315269.000 ±   NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalColMultiplyDecimalColColumnBench.bench    avgt        2   3391096996.500 ±   NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColAddDecimalColColumnBench.bench       avgt        2     56848247.500 ±   NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColDivideDecimalColColumnBench.bench    avgt        2   9162374089.500 ±   NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColMultiplyDecimalColColumnBench.bench  avgt        2   1146261770.500 ±   NaN  ns/op
{noformat}
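
For reference, the speedup figures can be recomputed from the scores above. This is only a small sketch: the class and method names are made up for illustration, and the numbers are copied from the benchmark output (average ns/op, lower is better).

```java
class SpeedupSketch {
    // Ratio of baseline time to candidate time; a value above 1 means the candidate is faster.
    static double speedup(double baselineNsPerOp, double candidateNsPerOp) {
        return baselineNsPerOp / candidateNsPerOp;
    }

    public static void main(String[] args) {
        // Decimal* rows are the existing implementation, DecimalV2* rows are the patch.
        System.out.printf("add:      %.1fx%n", speedup(4012665235.5, 56848247.5));
        System.out.printf("multiply: %.1fx%n", speedup(3391096996.5, 1146261770.5));
        System.out.printf("divide:   %.1fx%n", speedup(19167315269.0, 9162374089.5));
    }
}
```

The ratios come out at roughly 70x, 3x, and 2x. With only 2 samples per benchmark and no error estimate (NaN), they should be read as rough indications.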

> Better Decimal vectorization
> ----------------------------
>
>                 Key: HIVE-13306
>                 URL: https://issues.apache.org/jira/browse/HIVE-13306
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Matt McCline
>            Assignee: Teddy Choi
>            Priority: Critical
>         Attachments: HIVE-13306.1.patch
>
>
> Decimal Vectorization Requirements
> •     Today, the LongColumnVector, DoubleColumnVector, BytesColumnVector, 
> TimestampColumnVector classes store the data as primitive Java data types 
> long, double, or byte arrays for efficiency.
> •     DecimalColumnVector is different - it has an array of Object references 
> to HiveDecimal objects.
> •     The HiveDecimal object uses an internal object BigDecimal for its 
> implementation.  Further, BigDecimal itself uses an internal object 
> BigInteger for its implementation, and BigInteger uses an int array.  4 
> objects total.
> •     And, HiveDecimal is an immutable object, which means arithmetic and 
> other operations produce a new HiveDecimal object with 3 new objects underneath.
> •     A major reason Vectorization is fast is the ColumnVector classes except 
> DecimalColumnVector do not have to allocate additional memory per row.   This 
> avoids memory fragmentation and pressure on the Java Garbage Collector that 
> DecimalColumnVector can generate.  It is very significant.
> •     What can be done with DecimalColumnVector to make it much more 
> efficient?
> o     Design several new decimal classes that allow the caller to manage the 
> decimal storage.
> o     If it takes N int values to store a decimal (e.g. N=1..5), then a new 
> DecimalColumnVector would have an int[] of length N*1024 (where 1024 is the 
> default column vector size).
> o     Why store a decimal in separate int values?
> •     Java does not support 128 bit integers.
> •     Java does not support unsigned integers.
> •     In order to do multiplication of a decimal represented in a long you 
> need twice the storage (i.e. 128 bits).  So you need to represent parts in 32 
> bit integers.
> •     But since Java has no unsigned integers, you can really only do 
> multiplications on N-1 bits, i.e. 31 bits.
> •     So, 5 ints are needed for decimal storage... of 38 digits.
> o     It makes sense to have just one algorithm for decimals rather than one 
> for HiveDecimal and another for DecimalColumnVector.  So, make HiveDecimal 
> store N int values, too.
> o     A lower-level primitive decimal class would accept decimals stored as 
> int arrays and produce results into int arrays.  It would be used by 
> HiveDecimal and DecimalColumnVector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
