There is a WIP PR, https://github.com/apache/spark/pull/37536, which only implements the add operator of Decimal128.







At 2022-08-24 12:09:03, "beliefer" <belie...@163.com> wrote:

Hi all,




Recently, we found that many SQL queries could improve performance by replacing Spark 
Decimal with Double. This has been confirmed by many of our investigations and tests. We 
have also investigated other databases and big data systems and found that some 
projects have issues or PRs aimed at improving the performance of Java 
BigDecimal.




Please refer to:




Hive: 
https://github.com/apache/hive/commit/98893823dc57a9c187ad5c97c9c8ce03af1fa734#diff-b4a1ba785de48d9b18f680e38c39c76e86d939c7c7a2dc75be776ff6b11999df




Hive uses an int array with a length of 4 to implement the new Decimal128 type.
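
A rough sketch of what that representation amounts to (the class and field names below are illustrative, not Hive's actual code):

    // Hive-style Decimal128 layout (hypothetical names): the 128-bit magnitude is
    // split across four 32-bit words, with the sign and the scale kept as
    // separate fields alongside it.
    public final class HiveStyleDecimal128 {
        private final int[] mantissa = new int[4]; // 4 x 32 bits = 128-bit magnitude
        private byte signum;                       // -1, 0 or 1
        private short scale;                       // number of fractional digits
    }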




Trino: https://github.com/trinodb/trino/pull/10051




Trino references the GitHub project https://github.com/martint/int128. The 
int128 project uses a long array with a length of 2 to implement the Int128 
type. Trino further wraps Int128 to implement its new decimal type.
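
For reference, here is a minimal sketch of that representation and of 128-bit addition with carry propagation; the class and method names are illustrative, not the int128 library's actual API:

    public final class Int128Sketch {
        private final long high; // upper 64 bits, carries the sign
        private final long low;  // lower 64 bits, treated as unsigned

        public Int128Sketch(long high, long low) {
            this.high = high;
            this.low = low;
        }

        // 128-bit addition: add the low words first, then propagate the carry
        // into the high words.
        public Int128Sketch add(Int128Sketch other) {
            long newLow = this.low + other.low;
            // An unsigned overflow on the low word produces a carry of 1.
            long carry = Long.compareUnsigned(newLow, this.low) < 0 ? 1L : 0L;
            long newHigh = this.high + other.high + carry;
            return new Int128Sketch(newHigh, newLow);
        }
    }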




When combined with Spark SQL's memory format, the Hive method not only stores 4 ints 
into UnsafeRow, but also stores one byte for the sign and one short 
for the scale. The Trino method only needs to store 2 longs into UnsafeRow 
consecutively, plus one int for the scale. Because UnsafeRow 
stores one 8-byte word per field, the Hive method needs 6 * 8 = 48 
bytes, while the Trino method only needs 3 * 8 = 24 bytes.




Considering that the sign and scale information required by the Hive method can be 
expressed through type metadata instead, that method would still occupy 4 * 8 = 
32 bytes. Likewise, the scale information required by the Trino method can be 
stored in the type metadata, so it would only occupy 2 * 8 = 16 bytes.
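
The arithmetic behind the four numbers above, spelled out (UnsafeRow rounds every fixed-width field up to one 8-byte word; null-tracking overhead is ignored here):

    public class Decimal128Footprint {
        static final int WORD = 8; // UnsafeRow stores one 8-byte word per field

        public static void main(String[] args) {
            // Hive-style: 4 ints + 1 sign byte + 1 scale short = 6 fields.
            System.out.println("Hive, sign/scale in row:      " + 6 * WORD); // 48 bytes
            // Hive-style with sign and scale moved into type metadata = 4 fields.
            System.out.println("Hive, sign/scale in metadata: " + 4 * WORD); // 32 bytes
            // Trino-style: 2 longs + 1 scale int = 3 fields.
            System.out.println("Trino, scale in row:          " + 3 * WORD); // 24 bytes
            // Trino-style with scale moved into type metadata = 2 fields.
            System.out.println("Trino, scale in metadata:     " + 2 * WORD); // 16 bytes
        }
    }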




We have already implemented a new Decimal128 type backed by Int128, together with the + 
operator of Decimal128, in my forked Spark. After comparing the performance of 
Decimal128 (Int128) against Spark Decimal, the benchmark shows that 
Decimal128 has a 1.5x ~ 3x performance improvement over Spark Decimal.
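
As a rough illustration of what the + operator looks like on top of Int128, here is a simplified sketch, not the code from the forked Spark: it assumes both operands already share the same scale, while the real operator also has to rescale operands and check the result against the declared precision.

    public final class Decimal128Sketch {
        private final Int128Sketch unscaled; // 128-bit unscaled value, as sketched above
        private final int scale;             // carried in the type metadata

        public Decimal128Sketch(Int128Sketch unscaled, int scale) {
            this.unscaled = unscaled;
            this.scale = scale;
        }

        public Decimal128Sketch add(Decimal128Sketch other) {
            if (this.scale != other.scale) {
                // Rescaling to a common scale is omitted from this sketch.
                throw new UnsupportedOperationException("rescaling not shown");
            }
            return new Decimal128Sketch(this.unscaled.add(other.unscaled), this.scale);
        }
    }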




The attachment of this email contains the relevant design documents and benchmarks.




What do you think of this idea? 

If you support this idea, I could create an initial PR for further and deeper 
discussion.
