Hive 0.13 & Hadoop 2.4
I am having an issue when using the combination of vectorized query
execution, BETWEEN, and a custom UDF. When I have vectorization on, my
query returns an empty set. When I then turn vectorization off, my query
returns the correct results.
Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1
- X) and UDF_1
My UDFs seem to be working for everything else except this specific
circumstance. Is this a issue in the hive software or am I writing my UDFs
in such a way that they do not work with vectorization? If the latter,
what is the correct way?
I created a test scenario where I was able to reproduce this problem I am
seeing:
*TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000): *
package com.test;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import java.lang.String;
import java.lang.*;
public class tenThousand extends UDF {
private final LongWritable result = new LongWritable();
public LongWritable evaluate() {
result.set(10000);
return result;
}
}
*TEST DATA (test.input):*
1|CBCABC|12
2|DBCABC|13
3|EBCABC|14
40000|ABCABC|15
50000|BBCABC|16
60000|CBCABC|17
*CREATING ORC TABLE:*
0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint,
second varchar(20), third int) partitioned by (range int) clustered by
(first) sorted by (first) into 8 buckets stored as orc tblproperties
("orc.compress" = "SNAPPY", "orc.index" = "true");
*CREATE LOADING TABLE:*
0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint,
second varchar(20), third int) partitioned by (range int) row format
delimited fields terminated by '|' stored as textfile;
*COPY IN DATA:*
[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/.
*ORC DATA:*
[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf
hive.exec.dynamic.partition.mode=nonstrict --hiveconf
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range)
select * from loadingDir;"
*LOAD TEST FUNCTION:*
0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar
0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as
'com.test.tenThousand';
*TURN OFF VECTORIZATION:*
0: jdbc:hive2://server:10002/db> set
hive.vectorized.execution.enabled=false;
*QUERY (RESULTS AS EXPECTED):*
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first |
+--------+
| 1 |
| 2 |
| 3 |
+--------+
3 rows selected (15.286 seconds)
*TURN ON VECTORIZATION:*
0: jdbc:hive2://server:10002/db> set
hive.vectorized.execution.enabled=true;
*QUERY AGAIN (WRONG RESULTS):*
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
between ten_thousand()-10000 and ten_thousand()-9995;
+--------+
| first |
+--------+
+--------+
No rows selected (17.763 seconds)