RE: same query works with TEXTFILE and fails with ORC

2015-04-13 Thread Verhaeghe Philippe
Bug created in JIRA as HIVE-10316

-Message d'origine-
De : Gopal Vijayaraghavan [mailto:go...@hortonworks.com] De la part de Gopal 
Vijayaraghavan
Envoyé : Monday, April 13, 2015 11:46 PM
À : user@hive.apache.org
Objet : Re: same query works with TEXTFILE and fails with ORC

> I¹m getting an error in Hive when executing a query on a table in ORC
>format.

This is not an ORC bug, this looks like a vectorization issue.

Can you try comparing both query plans (³explain ²) for the Execution 
mode: vectorized markers?

TextFile queries are not vectorized today, since you cannot find if any column 
is marked as isRepeating=true in a row-major format.

> SELECT CONCAT(TO_DATE(datetime), '-'),   SUM(gpa)  FROM students_orc
>GROUP BY CONCAT(TO_DATE(datetime), '-Œ);

...
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
>Unsuported vector output type: StringGroup
>at
>org.apache.hadoop.hive.ql.exec.vector.VectorColumnSetInfo.addKey(Vector
>Col
>umnSetInfo.java:139)
>at
>org.apache.hadoop.hive.ql.exec.vector.VectorHashKeyWrapperBatch.compile
>Key
>WrapperBatch(VectorHashKeyWrapperBatch.java:521)
>at
>org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.initializeO
>p(V
>ectorGroupByOperator.java:786)

The correct fix would be to handle this query pattern for vectorization (or 
automatically disable vectorization, like it has to do for Unions).

Can you log a bug on Apache JIRA against the correct version of hive which 
threw this error up?

Cheers,
Gopal



Ce message et les pièces jointes sont confidentiels et réservés à l'usage 
exclusif de ses destinataires. Il peut également être protégé par le secret 
professionnel. Si vous recevez ce message par erreur, merci d'en avertir 
immédiatement l'expéditeur et de le détruire. L'intégrité du message ne pouvant 
être assurée sur Internet, la responsabilité de Worldline ne pourra être 
recherchée quant au contenu de ce message. Bien que les meilleurs efforts 
soient faits pour maintenir cette transmission exempte de tout virus, 
l'expéditeur ne donne aucune garantie à cet égard et sa responsabilité ne 
saurait être recherchée pour tout dommage résultant d'un virus transmis.

This e-mail and the documents attached are confidential and intended solely for 
the addressee; it may also be privileged. If you receive this e-mail in error, 
please notify the sender immediately and destroy it. As its integrity cannot be 
secured on the Internet, the Worldline liability cannot be triggered for the 
message content. Although the sender endeavours to maintain a computer 
virus-free network, the sender does not warrant that this transmission is 
virus-free and will not be liable for any damages resulting from any virus 
transmitted.


Re: same query works with TEXTFILE and fails with ORC

2015-04-13 Thread Gopal Vijayaraghavan
> I¹m getting an error in Hive when executing a query on a table in ORC
>format.

This is not an ORC bug, this looks like a vectorization issue.

Can you try comparing both query plans (³explain ²) for the
Execution mode: vectorized markers?

TextFile queries are not vectorized today, since you cannot find if any
column is marked as isRepeating=true in a row-major format.

> SELECT CONCAT(TO_DATE(datetime), '-'),   SUM(gpa)  FROM students_orc
>GROUP BY CONCAT(TO_DATE(datetime), '-Œ);

...
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported
>vector output type: StringGroup
>at 
>org.apache.hadoop.hive.ql.exec.vector.VectorColumnSetInfo.addKey(VectorCol
>umnSetInfo.java:139)
>at 
>org.apache.hadoop.hive.ql.exec.vector.VectorHashKeyWrapperBatch.compileKey
>WrapperBatch(VectorHashKeyWrapperBatch.java:521)
>at 
>org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.initializeOp(V
>ectorGroupByOperator.java:786)

The correct fix would be to handle this query pattern for vectorization
(or automatically disable vectorization, like it has to do for Unions).

Can you log a bug on Apache JIRA against the correct version of hive which
threw this error up?

Cheers,
Gopal



same query works with TEXTFILE and fails with ORC

2015-04-13 Thread Verhaeghe Philippe

I'm getting an error in Hive when executing a query on a table in ORC format.
After several trials, I succeeded to run the same query on the same table in 
TEXTFILE format.
I 've been able to reproduce the error with the simple sql script below.
I create the same table in TEXFILE and in ORC and I run a SELECT ...GROUP BY on 
the tables.
The first SELECT issued on the TEXTFILE table succeeds.
The second SELECT issued on the ORC table fails.
NB : There is a CONCAT in the query. If I remove the CONCAT the query is 
running ok with both tables ...

Example script to reproduce the error :

USE pvr_temp;
DROP TABLE IF EXISTS students_text;
CREATE TABLE students_text (name VARCHAR(64), age INT, datetime TIMESTAMP, gpa 
DECIMAL(3, 2)) STORED AS TEXTFILE;
INSERT INTO TABLE students_text VALUES ('fred flintstone', 35, '2015-04-13 
13:40:00', 1.28), ('barney rubble', 32, '2015-04-13 13:40:00', 2.32);
SELECT CONCAT(TO_DATE(datetime), '-'),   SUM(gpa)  FROM students_text GROUP BY 
CONCAT(TO_DATE(datetime), '-');
DROP TABLE IF EXISTS students_orc;
CREATE TABLE students_orc (name VARCHAR(64), age INT, datetime TIMESTAMP, gpa 
DECIMAL(3, 2)) STORED AS ORC;
INSERT INTO TABLE students_orc VALUES ('fred flintstone', 35, '2015-04-13 
SELECT CONCAT(TO_DATE(datetime), '-'),   SUM(gpa)  FROM students_orc GROUP BY 
CONCAT(TO_DATE(datetime), '-');
13:40:00', 1.28), ('barney rubble', 32, '2015-04-13 13:40:00', 2.32);


Log where you can see the error :

[pvr@tpcalr01s ~]$ cat test.log
scan complete in 9ms
Connecting to jdbc:hive2://tpcrmm03s:1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/usr/hdp/2.2.0.0-2041/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/usr/hdp/2.2.0.0-2041/hive/lib/hive-jdbc-0.14.0.2.2.0.0-2041-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Connected to: Apache Hive (version 0.14.0.2.2.0.0-2041)
Driver: Hive JDBC (version 0.14.0.2.2.0.0-2041)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://tpcrmm03s:1> USE pvr_temp;
No rows affected (0.061 seconds)
0: jdbc:hive2://tpcrmm03s:1> DROP TABLE IF EXISTS students_text;
No rows affected (0.12 seconds)
0: jdbc:hive2://tpcrmm03s:1> CREATE TABLE students_text (name VARCHAR(64), 
age INT, datetime TIMESTAMP, gpa DECIMAL(3, 2)) STORED AS TEXTFILE;
No rows affected (0.057 seconds)
0: jdbc:hive2://tpcrmm03s:1> INSERT INTO TABLE students_text VALUES ('fred 
flintstone', 35, '2015-04-13 13:40:00', 1.28), ('barney rubble', 32, 
'2015-04-13 13:40:00', 2.32);
INFO  : Tez session hasn't been created yet. Opening session
INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id 
application_1428656093356_0047)

INFO  : Map 1: -/-
INFO  : Map 1: 0/1
No rows affected (14.134 seconds)
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table pvr_temp.students_text from 
hdfs://tpcrmm01s.priv.atos.fr:8020/tmp/hive/hive/bf19c354-de67-45ae-a3e4-cd57d81acd71/hive_2015-04-13_14-15-08_445_2811483497310651606-20/-ext-1
INFO  : Table pvr_temp.students_text stats: [numFiles=1, numRows=2, 
totalSize=86, rawDataSize=84]
0: jdbc:hive2://tpcrmm03s:1> SELECT CONCAT(TO_DATE(datetime), '-'),   
SUM(gpa)  FROM students_text GROUP BY CONCAT(TO_DATE(datetime), '-');
INFO  : Session is already open
INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id 
application_1428656093356_0047)

INFO  : Map 1: -/-  Reducer 2: 0/1
INFO  : Map 1: 0/1  Reducer 2: 0/1
INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
INFO  : Map 1: 1/1  Reducer 2: 0(+1)/1
INFO  : Map 1: 1/1  Reducer 2: 1/1
+--+--+--+
| _c0  | _c1  |
+--+--+--+
| 2015-04-13-  | 3.6  |
+--+--+--+
1 row selected (3.258 seconds)
0: jdbc:hive2://tpcrmm03s:1> DROP TABLE IF EXISTS students_orc;
No rows affected (0.109 seconds)
0: jdbc:hive2://tpcrmm03s:1> CREATE TABLE students_orc (name VARCHAR(64), 
age INT, datetime TIMESTAMP, gpa DECIMAL(3, 2)) STORED AS ORC;
No rows affected (0.063 seconds)
0: jdbc:hive2://tpcrmm03s:1> INSERT INTO TABLE students_orc VALUES ('fred 
flintstone', 35, '2015-04-13 13:40:00', 1.28), ('barney rubble', 32, 
'2015-04-13 13:40:00', 2.32);
No rows affected (2.125 seconds)
INFO  : Session is already open
INFO  :

INFO  : Status: Running (Executing on YARN cluster with App id 
application_1428656093356_0047)

INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table pvr_temp.students_orc from 
hdfs://tpcrmm01s.priv.atos.fr:8020/tmp/hive/hive/bf19c354-de67-45ae-a3e4-cd57d81acd71/hive_2015-04-13_14-15-26_056_1247475009666467472-20/-ext-1
INFO  : Table pvr_temp.students_orc stats: [numFiles=1, numRows=2, 
totalSize=590, rawDataSize=508