Yong, Thanks for the reply. Yes the query "select ten_thousand() from testTabOrc" does return consistent results regardless of the vectorization setting. I ran that query with the vectorization off and then turned it on and ran it 5 more times. Each time it returned correct results. See output below.
Thanks, Ben ================================================= 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; +--------+ | _c0 | +--------+ | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | +--------+ 6 rows selected (17.807 seconds) 0: jdbc:hive2://server:10002/db> set hive.vectorized.execution.enabled=true; No rows affected (0.002 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; +--------+ | _c0 | +--------+ | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | +--------+ 6 rows selected (17.23 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; +--------+ | _c0 | +--------+ | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | +--------+ 6 rows selected (14.267 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; +--------+ | _c0 | +--------+ | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | +--------+ 6 rows selected (18.259 seconds) 0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc; +--------+ | _c0 | +--------+ | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | | 10000 | +--------+ 6 rows selected (18.778 seconds) On Fri, May 30, 2014 at 10:52 AM, java8964 <[email protected]> wrote: > When you turn "vectorized" on, does the following query consistently > return 10000 in the output? > > select ten_thousand() from testTabOrc > > Yong > > ------------------------------ > Date: Fri, 30 May 2014 08:24:43 -0400 > Subject: Vectorization with UDFs returns incorrect results > From: [email protected] > To: [email protected] > > > Hive 0.13 & Hadoop 2.4 > > I am having an issue when using the combination of vectorized query > execution, BETWEEN, and a custom UDF. When I have vectorization on, my > query returns an empty set. When I then turn vectorization off, my query > returns the correct results. > > Example Query: SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 > - X) and UDF_1 > > My UDFs seem to be working for everything else except this specific > circumstance. Is this a issue in the hive software or am I writing my UDFs > in such a way that they do not work with vectorization? If the latter, > what is the correct way? > > I created a test scenario where I was able to reproduce this problem I am > seeing: > > *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000): * > package com.test; > > import org.apache.hadoop.hive.ql.exec.Description; > import org.apache.hadoop.hive.ql.exec.UDF; > import org.apache.hadoop.io.LongWritable; > import org.apache.hadoop.io.Text; > import java.lang.String; > import java.lang.*; > > public class tenThousand extends UDF { > > private final LongWritable result = new LongWritable(); > > public LongWritable evaluate() { > result.set(10000); > return result; > } > } > > *TEST DATA (test.input):* > 1|CBCABC|12 > 2|DBCABC|13 > 3|EBCABC|14 > 40000|ABCABC|15 > 50000|BBCABC|16 > 60000|CBCABC|17 > > *CREATING ORC TABLE:* > 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, > second varchar(20), third int) partitioned by (range int) clustered by > (first) sorted by (first) into 8 buckets stored as orc tblproperties > ("orc.compress" = "SNAPPY", "orc.index" = "true"); > > *CREATE LOADING TABLE:* > 0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint, > second varchar(20), third int) partitioned by (range int) row format > delimited fields terminated by '|' stored as textfile; > > *COPY IN DATA:* > [root@server]# hadoop fs -copyFromLocal /tmp/test.input /db/loading/. > > *ORC DATA:* > [root@server]# beeline -u jdbc:hive2://server:10002/db -n root > --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf > hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) > select * from loadingDir;" > > *LOAD TEST FUNCTION:* > 0: jdbc:hive2://server:10002/db> add jar /opt/hadoop/lib/testFunction.jar > 0: jdbc:hive2://server:10002/db> create temporary function ten_thousand > as 'com.test.tenThousand'; > > *TURN OFF VECTORIZATION:* > 0: jdbc:hive2://server:10002/db> set > hive.vectorized.execution.enabled=false; > > *QUERY (RESULTS AS EXPECTED):* > 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first > between ten_thousand()-10000 and ten_thousand()-9995; > +--------+ > | first | > +--------+ > | 1 | > | 2 | > | 3 | > +--------+ > 3 rows selected (15.286 seconds) > > *TURN ON VECTORIZATION:* > 0: jdbc:hive2://server:10002/db> set > hive.vectorized.execution.enabled=true; > > *QUERY AGAIN (WRONG RESULTS):* > 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first > between ten_thousand()-10000 and ten_thousand()-9995; > +--------+ > | first | > +--------+ > +--------+ > No rows selected (17.763 seconds) > > >
