RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
I downloaded the Hive13 and confirmed this problem existed.
My suggestion is to change your query to
select first from testTabOrc where first >= ten_thousand()-1 and first <= 
ten_thousand()-9995;
which works in my environment (Hive 0.13 + hadoop 2.2)
You can create a Jira for it. It looks like the problem caused by "between" 
with Vectorization and UDF.
BTW, your UDF looks fine to me.
Yong

Date: Fri, 30 May 2014 11:22:18 -0400
Subject: Re: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Yong,
Thanks for the reply.  Yes the query "select ten_thousand() from testTabOrc" 
does return consistent results regardless of the vectorization setting.  I ran 
that query with the vectorization off and then turned it on and ran it 5 more 
times.  Each time it returned correct results.  See output below. 

Thanks,Ben

=

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++|  _c0   |++| 1  |
| 1  || 1  || 1  || 1  |
| 1  |++6 rows selected (17.807 seconds)0: 
jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)0: jdbc:hive2://server:10002/db> select 
ten_thousand() from testTabOrc;++
|  _c0   |++| 1  || 1  |
| 1  || 1  || 1  || 1  |
++6 rows selected (17.23 seconds)0: jdbc:hive2://server:10002/db> 
select ten_thousand() from testTabOrc;
++|  _c0   |++| 1  |
| 1  || 1  || 1  || 1  |
| 1  |++6 rows selected (14.267 seconds)0: 
jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++|  _c0   |++| 1  |
| 1  || 1  || 1  || 1  |
| 1  |++6 rows selected (18.259 seconds)0: 
jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++|  _c0   |++| 1  |
| 1  || 1  || 1  || 1  |
| 1  |++6 rows selected (18.778 seconds) 





On Fri, May 30, 2014 at 10:52 AM, java8964  wrote:




When you turn "vectorized" on, does the following query consistently return 
1 in the output?

select ten_thousand() from testTabOrc

Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results

From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13 & Hadoop 2.4

I am having an issue when using the combination of vectorized query execution, 
BETWEEN, and a custom UDF.  When I have vectorization on, my query returns an 
empty set.  When I then turn vectorization off, my query returns the correct 
results. 


Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) 
and UDF_1
My UDFs seem to be working for everything else except this specific 
circumstance.  Is this a issue in the hive software or am I writing my UDFs in 
such a way that they do not work with vectorization?  If the latter, what is 
the correct way?


I created a test scenario where I was able to reproduce this problem I am 
seeing:
TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):  
package com.test;


import org.apache.hadoop.hive.ql.exec.Description;import 
org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.LongWritable;import 
org.apache.hadoop.io.Text;import java.lang.String;

import java.lang.*;
public class tenThousand extends UDF {


  private final LongWritable result = new LongWritable();


  public LongWritable evaluate() {result.set(1);return result;

  }}
TEST DATA (test.input):1|CBCABC|12

2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16

6|CBCABC|17
CREATING ORC TABLE:

0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second 
varchar(20), third int) partitioned by (range int) clustered by (first) sorted 
by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = 
"SNAPPY", "orc.index" = "true");


CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir 
(first bigint, second varchar(20), third int) partitioned by (range int) row 
format delimited fields terminated by '|' stored as textfile;


COPY IN DATA:[root@server]#  hadoop fs -copyFromLocal /tmp/test.input 
/db/loading/.


ORC DATA:[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root 
--hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf 
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) 
select * from loadingDir;"


LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db>  add jar 
/opt/hadoop/lib/testFunction.jar


0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand as 
'com.test.tenThousand';



TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db>  set 
hive.vectorized.execution.enabl

Re: Vectorization with UDFs returns incorrect results

2014-05-30 Thread Benjamin Bowman
Yong,

Thanks for the reply.  Yes the query "select ten_thousand() from testTabOrc"
does return consistent results regardless of the vectorization setting.  I
ran that query with the vectorization off and then turned it on and ran it
5 more times.  Each time it returned correct results.  See output below.

Thanks,
Ben

=

0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (17.807 seconds)
0: jdbc:hive2://server:10002/db>  set
hive.vectorized.execution.enabled=true;
No rows affected (0.002 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (17.23 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (14.267 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (18.259 seconds)
0: jdbc:hive2://server:10002/db> select ten_thousand() from testTabOrc;
++
|  _c0   |
++
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
| 1  |
++
6 rows selected (18.778 seconds)





On Fri, May 30, 2014 at 10:52 AM, java8964  wrote:

> When you turn "vectorized" on, does the following query consistently
> return 1 in the output?
>
> select ten_thousand() from testTabOrc
>
> Yong
>
> --
> Date: Fri, 30 May 2014 08:24:43 -0400
> Subject: Vectorization with UDFs returns incorrect results
> From: bbowman...@gmail.com
> To: user@hive.apache.org
>
>
> Hive 0.13 & Hadoop 2.4
>
> I am having an issue when using the combination of vectorized query
> execution, BETWEEN, and a custom UDF.  When I have vectorization on, my
> query returns an empty set.  When I then turn vectorization off, my query
> returns the correct results.
>
> Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1
> - X) and UDF_1
>
> My UDFs seem to be working for everything else except this specific
> circumstance.  Is this a issue in the hive software or am I writing my UDFs
> in such a way that they do not work with vectorization?  If the latter,
> what is the correct way?
>
> I created a test scenario where I was able to reproduce this problem I am
> seeing:
>
> *TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):  *
> package com.test;
>
> import org.apache.hadoop.hive.ql.exec.Description;
> import org.apache.hadoop.hive.ql.exec.UDF;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import java.lang.String;
> import java.lang.*;
>
> public class tenThousand extends UDF {
>
>   private final LongWritable result = new LongWritable();
>
>   public LongWritable evaluate() {
> result.set(1);
> return result;
>   }
> }
>
> *TEST DATA (test.input):*
> 1|CBCABC|12
> 2|DBCABC|13
> 3|EBCABC|14
> 4|ABCABC|15
> 5|BBCABC|16
> 6|CBCABC|17
>
> *CREATING ORC TABLE:*
> 0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint,
> second varchar(20), third int) partitioned by (range int) clustered by
> (first) sorted by (first) into 8 buckets stored as orc tblproperties
> ("orc.compress" = "SNAPPY", "orc.index" = "true");
>
> *CREATE LOADING TABLE:*
> 0: jdbc:hive2://server:10002/db> create table loadingDir (first bigint,
> second varchar(20), third int) partitioned by (range int) row format
> delimited fields terminated by '|' stored as textfile;
>
> *COPY IN DATA:*
> [root@server]#  hadoop fs -copyFromLocal /tmp/test.input /db/loading/.
>
> *ORC DATA:*
> [root@server]#  beeline -u jdbc:hive2://server:10002/db -n root
> --hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf
> hive.enforce.sorting=true -e "insert into table testTabOrc partition(range)
> select * from loadingDir;"
>
> *LOAD TEST FUNCTION:*
> 0: jdbc:hive2://server:10002/db>  add jar /opt/hadoop/lib/testFunction.jar
> 0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand
> as 'com.test.tenThousand';
>
> *TURN OFF VECTORIZATION:*
> 0: jdbc:hive2://server:10002/db>  set
> hive.vectorized.execution.enabled=false;
>
> *QUERY (RESULTS AS EXPECTED):*
> 0: jdbc:hive2://server:10002/db> select first from testTabOrc where first
> between ten_thousand()-1 and ten_thousand()-9995;
> ++
> | first  |
> ++
> | 1  |
> | 2  |
> | 3  |
> ++
> 3 rows selected (15.286 seconds)
>
> *TURN ON VECTORIZATION:*
> 0: jdbc:hive2://server:10002/db>  set
> hive.vectorized.execution.enabled=true;
>
> *QUERY AGAIN (WRONG RESULTS):*
> 0: jdbc:hive2://server:10002/db> s

RE: Vectorization with UDFs returns incorrect results

2014-05-30 Thread java8964
When you turn "vectorized" on, does the following query consistently return 
1 in the output?
select ten_thousand() from testTabOrc
Yong

Date: Fri, 30 May 2014 08:24:43 -0400
Subject: Vectorization with UDFs returns incorrect results
From: bbowman...@gmail.com
To: user@hive.apache.org

Hive 0.13 & Hadoop 2.4
I am having an issue when using the combination of vectorized query execution, 
BETWEEN, and a custom UDF.  When I have vectorization on, my query returns an 
empty set.  When I then turn vectorization off, my query returns the correct 
results. 

Example Query:  SELECT column_1 FROM table_1 WHERE column_1 BETWEEN (UDF_1 - X) 
and UDF_1
My UDFs seem to be working for everything else except this specific 
circumstance.  Is this a issue in the hive software or am I writing my UDFs in 
such a way that they do not work with vectorization?  If the latter, what is 
the correct way?

I created a test scenario where I was able to reproduce this problem I am 
seeing:
TEST UDF (SIMPLE FUNCTION THAT TAKES NO ARGUMENTS AND RETURNS 1):  
package com.test;

import org.apache.hadoop.hive.ql.exec.Description;import 
org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;import 
org.apache.hadoop.io.Text;import java.lang.String;
import java.lang.*;
public class tenThousand extends UDF {

  private final LongWritable result = new LongWritable();

  public LongWritable evaluate() {result.set(1);return result;
  }}
TEST DATA (test.input):1|CBCABC|12
2|DBCABC|133|EBCABC|144|ABCABC|155|BBCABC|16
6|CBCABC|17
CREATING ORC TABLE:
0: jdbc:hive2://server:10002/db> create table testTabOrc (first bigint, second 
varchar(20), third int) partitioned by (range int) clustered by (first) sorted 
by (first) into 8 buckets stored as orc tblproperties ("orc.compress" = 
"SNAPPY", "orc.index" = "true");

CREATE LOADING TABLE:0: jdbc:hive2://server:10002/db> create table loadingDir 
(first bigint, second varchar(20), third int) partitioned by (range int) row 
format delimited fields terminated by '|' stored as textfile;

COPY IN DATA:[root@server]#  hadoop fs -copyFromLocal /tmp/test.input 
/db/loading/.

ORC DATA:[root@server]#  beeline -u jdbc:hive2://server:10002/db -n root 
--hiveconf hive.exec.dynamic.partition.mode=nonstrict --hiveconf 
hive.enforce.sorting=true -e "insert into table testTabOrc partition(range) 
select * from loadingDir;"

LOAD TEST FUNCTION:0: jdbc:hive2://server:10002/db>  add jar 
/opt/hadoop/lib/testFunction.jar

0: jdbc:hive2://server:10002/db>  create temporary function ten_thousand as 
'com.test.tenThousand';


TURN OFF VECTORIZATION:0: jdbc:hive2://server:10002/db>  set 
hive.vectorized.execution.enabled=false;

QUERY (RESULTS AS EXPECTED):0: jdbc:hive2://server:10002/db> select first from 
testTabOrc where first between ten_thousand()-1 and ten_thousand()-9995;
++| first  |++
| 1  || 2  || 3  |++
3 rows selected (15.286 seconds)
TURN ON VECTORIZATION:
0: jdbc:hive2://server:10002/db>  set hive.vectorized.execution.enabled=true;
QUERY AGAIN (WRONG RESULTS):
0: jdbc:hive2://server:10002/db> select first from testTabOrc where first 
between ten_thousand()-1 and ten_thousand()-9995;
++| first  |++++
No rows selected (17.763 seconds)