RE: Vectorization with UDFs returns incorrect results

java8964 Fri, 30 May 2014 12:48:07 -0700

I downloaded the Hive13 and confirmed this problem existed.
My suggestion is to change your query to
select first from testTabOrc where first >= ten_thousand()-10000 and first <= 
ten_thousand()-9995;
which works in my environment (Hive 0.13 + hadoop 2.2)
You can create a Jira for it. It looks like the problem caused by "between" 
with Vectorization and UDF.
BTW, your UDF looks fine to me.
Yong


Date: Fri, 30 May 2014 11:22:18 -0400
Subject: Re: Vectorization with UDFs returns incorrect results
From: [email protected]
To: [email protected]

Yong,
Thanks for the reply.  Yes the query "select ten_thousand() from testTabOrc" 
does return consistent results regardless of the vectorization setting.  I ran 
that query with the vectorization off and then turned it on and ran it 5 more 
times.  Each time it returned correct results.  See output below. 

Thanks,Ben

=================================================

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+|  _c0   |+--------+| 10000  |
| 10000  || 10000  || 10000  || 10000  |
| 10000  |+--------+6 rows selected (17.807 seconds)0: 
jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)0: jdbc:hive2://server:10002/db> select 
ten_thousand() from testTabOrc;+--------+
|  _c0   |+--------+| 10000  || 10000  |
| 10000  || 10000  || 10000  || 10000  |
+--------+6 rows selected (17.23 seconds)0: jdbc:hive2://server:10002/db> 
select ten_thousand() from testTabOrc;
+--------+|  _c0   |+--------+| 10000  |
| 10000  || 10000  || 10000  || 10000  |
| 10000  |+--------+6 rows selected (14.267 seconds)0: 
jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+|  _c0   |+--------+| 10000  |
| 10000  || 10000  || 10000  || 10000  |
| 10000  |+--------+6 rows selected (18.259 seconds)0: 
jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
+--------+|  _c0   |+--------+| 10000  |
| 10000  || 10000  || 10000  || 10000  |
| 10000  |+--------+6 rows selected (18.778 seconds) 





On Fri, May 30, 2014 at 10:52 AM, java8964 <[email protected]> wrote:




When you turn "vectorized" on, does the following query consistently return 
10000 in the output?

select ten_thousand() from testTabOrc

Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results

From: [email protected]
To: [email protected]

Hive 0.13 & Hadoop 2.4

I am having an issue when using the combination of vectorized query execution, 
BETWEEN, and a custom UDF.  When I have vectorization on, my query returns an 
empty set.  When I then turn vectorization off, my query returns the correct 
results. 


Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) 
and UDF_1
My UDFs seem to be working for everything else except this specific 
circumstance.  Is this a issue in the hive software or am I writing my UDFs in 
such a way that they do not work with vectorization?  If the latter, what is 
the correct way?


I created a test scenario where I was able to reproduce this problem I am 
seeing:
TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 10000):  
package com.test;


import org.apache.hadoop.hive.ql.exec.Description;import 
org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.LongWritable;import 
org.apache.hadoop.io.Text;import java.lang.String;

import java.lang.*;
public class tenThousand extends UDF {


  private final LongWritable result = new LongWritable();


  public LongWritable evaluate() {    result.set(10000);    return result;

  }}
TEST DATA (test.input):1|CBCABC|12

2|DBCABC|133|EBCABC|1440000|ABCABC|1550000|BBCABC|16

60000|CBCABC|17
CREATING ORC TABLE:

0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second 
varchar(20), third int) partitioned by (range int) clustered by (first) sorted 
by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = 
"SNAPPY", "orc.index" = "true");


CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir 
(first bigint, second varchar(20), third int) partitioned by (range int) row 
format delimited fields terminated by '|' stored as textfile;


COPY IN DATA:[root@server]#  hadoop fs -copyFromLocal /tmp/test.input 
/db/loading/.


ORC DATA:[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root 
--hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf 
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) 
select * from loadingDir;"


LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db>  add jar 
/opt/hadoop/lib/testFunction.jar


0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand as 
'com.test.tenThousand';



TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db>  set 
hive.vectorized.execution.enabled=false;


QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db> select first from 
testTabOrc where first between ten_thousand()-10000 and ten_thousand()-9995;

+--------+| first  |+--------+

| 1      || 2      || 3      |+--------+

3 rows selected (15.286 seconds)
TURN ON VECTORIZATION:

0: jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=true;
QUERY AGAIN (WRONG RESULTS):

0: jdbc:hive2://server:10002/db> select first from testTabOrc where first 
between ten_thousand()-10000 and ten_thousand()-9995;

+--------+| first  |+--------++--------+

No rows selected (17.763 seconds)

RE: Vectorization with UDFs returns incorrect results

Reply via email to