Zach Amsden has uploaded a new patch set (#16). Change subject: IMPALA-2020: Add rounding for decimal casts ......................................................................
IMPALA-2020: Add rounding for decimal casts This change adds support for DECIMAL_V2 rounding behavior for both DECIMAL to INT and DOUBLE to DECIMAL casts. The round behavior implemented for exact halves is round halves away from zero (e.g (0.5 -> 1) and (-0.5 -> -1)). Testing: Added expr-test and decimal-test test coverage as well as manual testing. I tried to update the expr benchmark to get some kind of results but the benchmark is pretty bit-rotted. It was throwing JNI exceptions. Fixed up the JNI init call, but there is still a lot of work to do to get this back in a runnable state. Even with the hack to get at the RuntimeContext, we end up getting null derefs due to the slot descriptor table not being initialized. I have decided to wait on expanding the python test until the bugs with overflow are fixed, which will make it easier to test sane behavior. [localhost:21000] > select cast(0.59999 AS int); +----------------------+ | cast(0.59999 as int) | +----------------------+ | 0 | +----------------------+ Fetched 1 row(s) in 0.01s [localhost:21000] > select cast(cast(0.5999 as float) as decimal(5,1)); +---------------------------------------------+ | cast(cast(0.5999 as float) as decimal(5,1)) | +---------------------------------------------+ | 0.5 | +---------------------------------------------+ Fetched 1 row(s) in 0.01s [localhost:21000] > set decimal_v2=1; DECIMAL_V2 set to 1 [localhost:21000] > select cast(0.59999 AS int); +----------------------+ | cast(0.59999 as int) | +----------------------+ | 1 | +----------------------+ Fetched 1 row(s) in 0.01s [localhost:21000] > select cast(cast(0.5999 as float) as decimal(5,1)); +---------------------------------------------+ | cast(cast(0.5999 as float) as decimal(5,1)) | +---------------------------------------------+ | 0.6 | +---------------------------------------------+ Fetched 1 row(s) in 0.01s Performance summary. In all cases I have tried multiple times and taken the fastest query results. Old version, head at 815c76f9cbbe6585ebed961da506fc54ce2ef4e3: [localhost:21000] > select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem; Query: select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 19:31:32 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=bf4d06517e3093ca:b053953b00000000 +--------------------------------------+ | sum(cast(l_extendedprice as bigint)) | +--------------------------------------+ | 2293784575265 | +--------------------------------------+ Fetched 1 row(s) in 0.53s With this change, and decimal_v2 off: [localhost:21000] > select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem; Query: select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 19:06:05 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=e54f6fc2c21d63d0:da7b6baa00000000 +--------------------------------------+ | sum(cast(l_extendedprice as bigint)) | +--------------------------------------+ | 2293784575265 | +--------------------------------------+ Fetched 1 row(s) in 0.52s Note that there is some noise / instability in these results and across invocations there is quite a bit of variance. Still we appear not to have regressed. With decimal V2 enabled, we loose some performance due to rounding. [localhost:21000] > set decimal_v2=1; DECIMAL_V2 set to 1 [localhost:21000] > select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem; Query: select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 21:04:45 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=444a15a709f0672:74197af00000000 +--------------------------------------+ | sum(cast(l_extendedprice as bigint)) | +--------------------------------------+ | 2293814088985 | +--------------------------------------+ Fetched 1 row(s) in 0.63s So we're about 20% slower. The variance is quite a lot so this is not a scientific number, but the trend is maintained. So we have some work to do to get this back. Casting from double seems to be roughly at parity: Old version, head at 815c76f9cbbe6585ebed961da506fc54ce2ef4e3: [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 19:41:07 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=2645812a55b081c7:8b8eecc800000000 +-------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(14,2))) | +-------------------------------------------------------------+ | 2293813121802.09 | +-------------------------------------------------------------+ Fetched 1 row(s) in 0.63s [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 20:59:38 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=72472ad54b4d2324:4e5354c500000000 +--------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(38,10))) | +--------------------------------------------------------------+ | 2293813156773.3596978911 | +--------------------------------------------------------------+ Fetched 1 row(s) in 0.72s [localhost:21000] > set decimal_v2=0; DECIMAL_V2 set to 0 [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 21:02:48 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=5f48db565c4426e1:9a9b255000000000 +-------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(14,2))) | +-------------------------------------------------------------+ | 2293813121802.09 | +-------------------------------------------------------------+ Fetched 1 row(s) in 0.64s [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 21:01:44 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=a148fc7f88508bf3:7d0381a00000000 +--------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(38,10))) | +--------------------------------------------------------------+ | 2293813156773.3596978911 | +--------------------------------------------------------------+ Fetched 1 row(s) in 0.73s [localhost:21000] > set decimal_v2=1; DECIMAL_V2 set to 1 [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(14,2))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 21:03:44 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=724bf16838f51065:78675af700000000 +-------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(14,2))) | +-------------------------------------------------------------+ | 2293813156773.36 | +-------------------------------------------------------------+ Fetched 1 row(s) in 0.63s [localhost:21000] > select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem; Query: select sum(cast(cast(l_extendedprice as double) as decimal(38,10))) from tpch10_parquet.lineitem Query submitted at: 2017-02-21 21:03:53 (Coordinator: http://impala-dev:25000) Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=d746528416873278:a94e61ef00000000 +--------------------------------------------------------------+ | sum(cast(cast(l_extendedprice as double) as decimal(38,10))) | +--------------------------------------------------------------+ | 2293813156773.3600000000 | +--------------------------------------------------------------+ Fetched 1 row(s) in 0.73s Interestingly, you can see the effect of the rounding as well - the decimal 38,10 result is now precise, where as the truncation before left artifacts from the division. Change-Id: I2daf186b4770a022f9cb349d512067a1dd624810 --- M be/src/benchmarks/expr-benchmark.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/expr-test.cc M be/src/exprs/expr.h M be/src/exprs/literal.cc M be/src/runtime/decimal-test.cc M be/src/runtime/decimal-value.h M be/src/runtime/decimal-value.inline.h M be/src/udf/udf.h 9 files changed, 485 insertions(+), 112 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/5951/16 -- To view, visit http://gerrit.cloudera.org:8080/5951 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2daf186b4770a022f9cb349d512067a1dd624810 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden <zams...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Zach Amsden <zams...@cloudera.com>