Victoria Markman created DRILL-3895:
---------------------------------------

             Summary: Can not join on int96 column coming from two different 
sources: hive and impala
                 Key: DRILL-3895
                 URL: https://issues.apache.org/jira/browse/DRILL-3895
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Data Types
    Affects Versions: 1.2.0
            Reporter: Victoria Markman


I would think that cross source join on a column of int96 type should work in 
drill, since we convert it to varbinary internally. It is very possible that 
I'm hitting a bug related to varbinary data type ...

{code}
Hive generated parquet: dfs.`test/type_test`

[Mon Oct 05 09:37:37] # ~/parquet-tools/parquet-schema 000000_0
message hive_schema {
  optional int32 num;
  optional binary word (UTF8);
  optional int96 dtg;
  optional double dollar;
}

Implala generated parquet: dfs.`drill/testdata/subqueries/imp_t1`
[Mon Oct 05 09:38:40 ] # ~/parquet-tools/parquet-schema 
243293260064ba0-808af32a4ab4e487_393209663_data.0.parq
message schema {
  optional binary c_varchar (UTF8);
  optional int32 c_integer;
  optional int64 c_bigint;
  optional float c_float;
  optional double c_double;
  optional binary c_date (UTF8);
  optional binary c_time (UTF8);
  optional int96 c_timestamp;
  optional boolean c_boolean;
  optional double d9;
  optional double d18;
  optional double d28;
  optional double d38;
}

0: jdbc:drill:schema=dfs> select count(*) from `test/type_test`;
+---------+
| EXPR$0  |
+---------+
| 2       |
+---------+
1 row selected (0.3 seconds)


0: jdbc:drill:schema=dfs> select count(*) from 
`drill/testdata/subqueries/imp_t1`;
+---------+
| EXPR$0  |
+---------+
| 10000   |
+---------+
1 row selected (0.259 seconds)
{code}

*Join 'AS IS' on int96 column results in an error*

-- IN clause
{code}
0: jdbc:drill:schema=dfs> select count(*) from 
`drill/testdata/subqueries/imp_t1` where c_timestamp IN ( select c_timestamp 
from `test/type_test`);
Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts 
between 1. Numeric data
 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: VARBINARY, Right 
type: INT. Add explicit casts to avoid this error
Fragment 0:0
[Error Id: 8bce65e1-0e8f-45fe-9990-eb980aeae53e on atsqa4-133.qa.lab:31010] 
(state=,code=0)
{code}

-- NOT IN clause
{code}
0: jdbc:drill:schema=dfs> select count(*) from 
`drill/testdata/subqueries/imp_t1` where c_timestamp NOT IN ( select 
c_timestamp from `test/type_test`);
Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts 
between 1. Numeric data
 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: VARBINARY, Right 
type: INT. Add explicit casts to avoid this error
Fragment 0:0
[Error Id: 4307937f-fbc0-40c7-b2d4-8e4835e79ae8 on atsqa4-133.qa.lab:31010] 
(state=,code=0)
{code}

-- JOIN
{code}
0: jdbc:drill:schema=dfs> select count(*) from dfs.`test/type_test` a, 
dfs.`drill/testdata/subqueries/imp_t1` b where a.c_timestamp = b.c_timestamp;
Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts 
between 1. Numeric data
 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: VARBINARY, Right 
type: INT. Add explicit casts to avoid this error
Fragment 0:0
[Error Id: e80225a3-eeb6-4c5b-bda1-a1f0d13d7edf on atsqa4-133.qa.lab:31010] 
(state=,code=0)
{code}

*Attempt to explicitly cast to varbinary type*(one of these queries should have 
returned non zero row count)

{code}
0: jdbc:drill:schema=dfs> select count(*) from 
dfs.`drill/testdata/subqueries/imp_t1` where cast(c_timestamp as varbinary(10)) 
NOT IN ( select cast(c_timestamp as varbinary(10)) from dfs.`test/type_test`);
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (0.422 seconds)

0: jdbc:drill:schema=dfs> select count(*) from 
dfs.`drill/testdata/subqueries/imp_t1` where cast(c_timestamp as varbinary(10)) 
IN ( select cast(c_timestamp as varbinary(10)) from dfs.`test/type_test`);
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
{code}

*Use CONVERT_FROM function* (one of these queries should have returned non zero 
row count)
{code}
0: jdbc:drill:schema=dfs> select count(*) from 
dfs.`drill/testdata/subqueries/imp_t1` where CONVERT_FROM(c_timestamp, 
'TIMESTAMP_IMPALA') IN ( select CONVERT_FROM(c_timestamp, 'TIMESTAMP_IMPALA') 
from dfs.`test/type_test`);
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (0.683 seconds)

0: jdbc:drill:schema=dfs> select count(*) from 
dfs.`drill/testdata/subqueries/imp_t1` where CONVERT_FROM(c_timestamp, 
'TIMESTAMP_IMPALA') NOT IN ( select CONVERT_FROM(c_timestamp, 
'TIMESTAMP_IMPALA') from dfs.`test/type_test`);
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (0.858 seconds)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to