RE: Vectorization with UDFs returns incorrect results
I downloaded the Hive13 and confirmed this problem existed. My suggestion is to change your query to select first from testTabOrc where first >= ten_thousand()-1 and first <= ten_thousand()-9995; which works in my environment (Hive 0.13 + hadoop 2.2) You can create a Jira for it. It looks like the problem caused by "between" with Vectorization and UDF. BTW, your UDF looks fine to me. Yong Date: Fri, 30 May 2014 11:22:18 -0400 Subject: Re: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Yong, Thanks for the reply. Yes the query "select ten_thousand() from testTabOrc" does return consistent results regardless of the vectorization setting. I ran that query with the vectorization off and then turned it on and ran it 5 more times. Each time it returned correct results. See output below. Thanks,Ben = 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (17.807 seconds)0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true; No rows affected (0.002 seconds)0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;++ | _c0 |++| 1 || 1 | | 1 || 1 || 1 || 1 | ++6 rows selected (17.23 seconds)0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (14.267 seconds)0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (18.259 seconds)0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++| _c0 |++| 1 | | 1 || 1 || 1 || 1 | | 1 |++6 rows selected (18.778 seconds) On Fri, May 30, 2014 at 10:52 AM, java8964 wrote: When you turn "vectorized" on, does the following query consistently return 1 in the output? select ten_thousand() from testTabOrc Yong Date: Fri, 30 May 2014 08:24:43 -0400 Subject: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Hive 0.13 & Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): package com.test; import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() {result.set(1);return result; }} TEST DATA (test.input):1|CBCABC|12 2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16 6|CBCABC|17 CREATING ORC TABLE: 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = "SNAPPY", "orc.index" = "true"); CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; COPY IN DATA:[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. ORC DATA:[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select * from loadingDir;" LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 'com.test.tenThousand'; TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabl
Re: Vectorization with UDFs returns incorrect results
Yong, Thanks for the reply. Yes the query "select ten_thousand() from testTabOrc" does return consistent results regardless of the vectorization setting. I ran that query with the vectorization off and then turned it on and ran it 5 more times. Each time it returned correct results. See output below. Thanks, Ben = 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (17.807 seconds) 0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true; No rows affected (0.002 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (17.23 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (14.267 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (18.259 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; ++ | _c0 | ++ | 1 | | 1 | | 1 | | 1 | | 1 | | 1 | ++ 6 rows selected (18.778 seconds) On Fri, May 30, 2014 at 10:52 AM, java8964 wrote: > When you turn "vectorized" on, does the following query consistently > return 1 in the output? > > select ten_thousand() from testTabOrc > > Yong > > -- > Date: Fri, 30 May 2014 08:24:43 -0400 > Subject: Vectorization with UDFs returns incorrect results > From: bbowman...@gmail.com > To: user@hive.apache.org > > > Hive 0.13 & Hadoop 2.4 > > I am having an issue when using the combination of vectorized query > execution, BETWEEN, and a custom UDF. When I have vectorization on, my > query returns an empty set. When I then turn vectorization off, my query > returns the correct results. > > Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 > - X) and UDF_1 > > My UDFs seem to be working for everything else except this specific > circumstance. Is this a issue in the hive software or am I writing my UDFs > in such a way that they do not work with vectorization? If the latter, > what is the correct way? > > I created a test scenario where I was able to reproduce this problem I am > seeing: > > *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): * > package com.test; > > import org.apache.hadoop.hive.ql.exec.Description; > import org.apache.hadoop.hive.ql.exec.UDF; > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import java.lang.String; > import java.lang.*; > > public class tenThousand extends UDF { > > private final LongWritable result = new LongWritable(); > > public LongWritable evaluate() { > result.set(1); > return result; > } > } > > *TEST DATA (test.input):* > 1|CBCABC|12 > 2|DBCABC|13 > 3|EBCABC|14 > 4|ABCABC|15 > 5|BBCABC|16 > 6|CBCABC|17 > > *CREATING ORC TABLE:* > 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, > second varchar(20), third int) partitioned by (range int) clustered by > (first) sorted by (first) into 8 buckets stored as orc tblproperties > ("orc.compress" = "SNAPPY", "orc.index" = "true"); > > *CREATE LOADING TABLE:* > 0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, > second varchar(20), third int) partitioned by (range int) row format > delimited fields terminated by '|' stored as textfile; > > *COPY IN DATA:* > [root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. > > *ORC DATA:* > [root@server]# beeline -u jdbc:hive2://server:10002/db -n root > --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf > hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) > select * from loadingDir;" > > *LOAD TEST FUNCTION:* > 0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar > 0: jdbc:hive2://server:10002/db> create temporary function ten_thousand > as 'com.test.tenThousand'; > > *TURN OFF VECTORIZATION:* > 0: jdbc:hive2://server:10002/db> set > hive.vectorized.execution.enabled=false; > > *QUERY (RESULTS AS EXPECTED):* > 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first > between ten_thousand()-1 and ten_thousand()-9995; > ++ > | first | > ++ > | 1 | > | 2 | > | 3 | > ++ > 3 rows selected (15.286 seconds) > > *TURN ON VECTORIZATION:* > 0: jdbc:hive2://server:10002/db> set > hive.vectorized.execution.enabled=true; > > *QUERY AGAIN (WRONG RESULTS):* > 0: jdbc:hive2://server:10002/db> s
RE: Vectorization with UDFs returns incorrect results
When you turn "vectorized" on, does the following query consistently return 1 in the output? select ten_thousand() from testTabOrc Yong Date: Fri, 30 May 2014 08:24:43 -0400 Subject: Vectorization with UDFs returns incorrect results From: bbowman...@gmail.com To: user@hive.apache.org Hive 0.13 & Hadoop 2.4 I am having an issue when using the combination of vectorized query execution, BETWEEN, and a custom UDF. When I have vectorization on, my query returns an empty set. When I then turn vectorization off, my query returns the correct results. Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) and UDF_1 My UDFs seem to be working for everything else except this specific circumstance. Is this a issue in the hive software or am I writing my UDFs in such a way that they do not work with vectorization? If the latter, what is the correct way? I created a test scenario where I was able to reproduce this problem I am seeing: TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1): package com.test; import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import java.lang.String; import java.lang.*; public class tenThousand extends UDF { private final LongWritable result = new LongWritable(); public LongWritable evaluate() {result.set(1);return result; }} TEST DATA (test.input):1|CBCABC|12 2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16 6|CBCABC|17 CREATING ORC TABLE: 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second varchar(20), third int) partitioned by (range int) clustered by (first) sorted by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = "SNAPPY", "orc.index" = "true"); CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, second varchar(20), third int) partitioned by (range int) row format delimited fields terminated by '|' stored as textfile; COPY IN DATA:[root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. ORC DATA:[root@server]# beeline -u jdbc:hive2://server:10002/db -n root --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) select * from loadingDir;" LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar 0: jdbc:hive2://server:10002/db> create temporary function ten_thousand as 'com.test.tenThousand'; TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=false; QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++| first |++ | 1 || 2 || 3 |++ 3 rows selected (15.286 seconds) TURN ON VECTORIZATION: 0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true; QUERY AGAIN (WRONG RESULTS): 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995; ++| first |++++ No rows selected (17.763 seconds)