Re: Vectorization with UDFs returns incorrect results

Benjamin Bowman Fri, 30 May 2014 08:23:46 -0700

Yong,

Thanks for the reply.  Yes the query "select ten_thousand() from testTabOrc"
does return consistent results regardless of the vectorization setting.  I
ran that query with the vectorization off and then turned it on and ran it
5 more times.  Each time it returned correct results.  See output below.


Thanks,
Ben

=================================================

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
+--------+
6 rows selected (17.807 seconds)
0: jdbc:hive2://server:10002/db>  set
hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
+--------+
6 rows selected (17.23 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
+--------+
6 rows selected (14.267 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
+--------+
6 rows selected (18.259 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+
|  _c0   |
+--------+
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
| 10000  |
+--------+
6 rows selected (18.778 seconds)





On Fri, May 30, 2014 at 10:52 AM, java8964 <[email protected]> wrote:

> When you turn "vectorized" on, does the following query consistently
> return 10000 in the output?
>
> select ten_thousand() from testTabOrc
>
> Yong
>
> ------------------------------
> Date: Fri, 30 May 2014 08:24:43 -0400
> Subject: Vectorization with UDFs returns incorrect results
> From: [email protected]
> To: [email protected]
>
>
> Hive 0.13 & Hadoop 2.4
>
> I am having an issue when using the combination of vectorized query
> execution, BETWEEN, and a custom UDF.  When I have vectorization on, my
> query returns an empty set.  When I then turn vectorization off, my query
> returns the correct results.
>
> Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1
> - X) and UDF_1
>
> My UDFs seem to be working for everything else except this specific
> circumstance.  Is this a issue in the hive software or am I writing my UDFs
> in such a way that they do not work with vectorization?  If the latter,
> what is the correct way?
>
> I created a test scenario where I was able to reproduce this problem I am
> seeing:
>
> *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):  *
> package com.test;
>
> import org.apache.hadoop.hive.ql.exec.Description;
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import java.lang.String;
> import java.lang.*;
>
> public class tenThousand extends UDF {
>
>   private final LongWritable result = new LongWritable();
>
>   public LongWritable evaluate() {
>     result.set(10000);
>     return result;
>   }
> }
>
> *TEST DATA (test.input):*
> 1|CBCABC|12
> 2|DBCABC|13
> 3|EBCABC|14
> 40000|ABCABC|15
> 50000|BBCABC|16
> 60000|CBCABC|17
>
> *CREATING ORC TABLE:*
> 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint,
> second varchar(20), third int) partitioned by (range int) clustered by
> (first) sorted by (first) into 8 buckets stored as orc tblproperties
> ("orc.compress" = "SNAPPY", "orc.index" = "true");
>
> *CREATE LOADING TABLE:*
> 0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint,
> second varchar(20), third int) partitioned by (range int) row format
> delimited fields terminated by '|' stored as textfile;
>
> *COPY IN DATA:*
> [root@server]#  hadoop fs -copyFromLocal /tmp/test.input /db/loading/.
>
> *ORC DATA:*
> [root@server]#  beeline -u jdbc:hive2://server:10002/db -n root
> --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf
> hive.enforce.sorting=true -e "insert into table testTabOrc partition(range)
> select * from loadingDir;"
>
> *LOAD TEST FUNCTION:*
> 0: jdbc:hive2://server:10002/db>  add jar /opt/hadoop/lib/testFunction.jar
> 0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand
> as 'com.test.tenThousand';
>
> *TURN OFF VECTORIZATION:*
> 0: jdbc:hive2://server:10002/db>  set
> hive.vectorized.execution.enabled=false;
>
> *QUERY (RESULTS AS EXPECTED):*
> 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
> between ten_thousand()-10000 and ten_thousand()-9995;
> +--------+
> | first  |
> +--------+
> | 1      |
> | 2      |
> | 3      |
> +--------+
> 3 rows selected (15.286 seconds)
>
> *TURN ON VECTORIZATION:*
> 0: jdbc:hive2://server:10002/db>  set
> hive.vectorized.execution.enabled=true;
>
> *QUERY AGAIN (WRONG RESULTS):*
> 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
> between ten_thousand()-10000 and ten_thousand()-9995;
> +--------+
> | first  |
> +--------+
> +--------+
> No rows selected (17.763 seconds)
>
>
>

Re: Vectorization with UDFs returns incorrect results

Reply via email to