[discuss] Fixing DecimalColumnVector cache misses

Gopal Vijayaraghavan Thu, 03 Nov 2016 22:08:20 -0700

Hi,

(x-posted for discussion)


Hive's storage-api + ORC vector readers have a cache miss built into them for 
the case of Decimal readers. 

With LLAP, two distinct cache misses are basically dragging Decimal performance 
down.

DecimalColumnVector -> HiveDecimalWritable -> HiveDecimal(BigInteger) -> new 
BigDecimal()

The writable is entirely overhead and so are the BigInteger -> BigDecimal 
conversions, particularly since the HiveDecimal type is not boxed unlike a 
"long".

Modifying the writable involves a fresh allocation of a HiveDecimal, which 
makes the object reference a rather unsightly cache miss (this is TPC-H Q1).
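
To make the cost concrete, here is a rough sketch using only java.math, not 
the actual Hive classes. BoxedDecimalHolder, sumBoxed and sumFlat are made-up 
names for illustration, and the flat layout assumes every unscaled value fits 
in 64 bits with one scale shared by the whole column, which is an assumption 
and not something storage-api does today.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalLayoutSketch {
    // Hypothetical stand-in for a mutable holder wrapping an immutable decimal.
    static final class BoxedDecimalHolder {
        private BigDecimal value = BigDecimal.ZERO;

        // Every update allocates a new immutable object, so the holder's
        // reference ends up pointing at freshly allocated memory.
        void set(BigInteger unscaled, int scale) {
            value = new BigDecimal(unscaled, scale);
        }

        BigDecimal get() {
            return value;
        }
    }

    // Pointer-chasing layout: array -> holder -> BigDecimal -> BigInteger.
    static BigDecimal sumBoxed(BoxedDecimalHolder[] column) {
        BigDecimal total = BigDecimal.ZERO;
        for (BoxedDecimalHolder h : column) {
            total = total.add(h.get());
        }
        return total;
    }

    // Flat layout: unscaled values sit contiguously in a long[].
    // Assumes all rows share one scale; addExact throws on overflow.
    static long sumFlat(long[] unscaled) {
        long total = 0L;
        for (long v : unscaled) {
            total = Math.addExact(total, v);
        }
        return total;
    }

    public static void main(String[] args) {
        int scale = 2;
        long[] unscaled = {12345L, 678L, -100L}; // 123.45, 6.78, -1.00

        BoxedDecimalHolder[] boxed = new BoxedDecimalHolder[unscaled.length];
        for (int i = 0; i < unscaled.length; i++) {
            boxed[i] = new BoxedDecimalHolder();
            boxed[i].set(BigInteger.valueOf(unscaled[i]), scale);
        }

        BigDecimal viaObjects = sumBoxed(boxed);
        BigDecimal viaLongs = BigDecimal.valueOf(sumFlat(unscaled), scale);
        System.out.println(viaObjects + " " + viaLongs); // prints "129.23 129.23"
    }
}
```

Both loops compute the same sum, but the boxed one dereferences a separate 
heap object per row while the flat one walks a single array.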



Changing this in hive/storage-api will produce a chicken-egg scenario between 
hive/storage-api -> orc -> hive/ql/exec/vectorization, across projects.

I'm conflicted on how to change DecimalColumnVector one-shot without breaking 
things (if possible, remove BigInteger allocations in the read-path as a 
possible optimization). 

Suggestions/discuss?

Cheers,
Gopal
