Filters with IN clause are getting omitted

2019-04-23 Thread Rajat Khandelwal
Hi

I've recently noticed incorrect behaviour from Hive Query Planner. The
simplest example I could construct is as follows

SELECT
tbl3.col2 AS current_regularity_streak
FROM (select 1 col1) tbl1
LEFT JOIN
(select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
WHERE
tbl1.col1 in (select 1 col1 union all select 2)
AND
tbl3.col2 >= 2


The query should logically return no rows, but it does! It returns 1 row
with 1 column, with value = 1. The value=1 should have been filtered out by
the filter tbl3.col2 >= 2
  So  df

On further examination, I believe the culprit is the IN clause. If I remove
this, the query works correctly and returns 0 rows.

SELECT
tbl3.col2 AS current_regularity_streak
FROM (select 1 col1) tbl1
LEFT JOIN
(select 1 col1) tbl2 ON tbl1.col1 = tbl2.col1
LEFT JOIN (select 1 col1, 1 col2) tbl3 ON tbl1.col1 = tbl3.col1
WHERE
tbl3.col2 >= 2

Is this a known issue? I couldn't find anything on JIRA/Stack
overflow/Google.

I further analyzed using EXPLAIN FORMATTED and noticed that the plan of the
first query doesn't contain the >=2 predicate. The plan of the second query
does. I wonder how the planner could omit the filter clause altogether?

I'm using HDP 2.6.5.10-2.


Schema evolution in hive tables

2016-12-07 Thread Rajat Khandelwal
So far, my understanding has been that in Hive tables, each partition has a
schema and whenever you add a partition to a Hive table, the current table
schema is copied into the partition schema. This should allow a seamless
evolution of the schema. Recently I came across something that contradicts
this. Hence, looking for some clarification.

So we have a table, and we have fixed ORC format for it. The table has a
schema say (a,b,c,d). We added one partition. The data is stored in the
same order. When we query (a, b) from this partition, the data has the two
columns in the correct order. Now we go ahead and change the schema of the
*table* to (b,c,d,a). But the schema of the partition is still (a,b,c,d) as
verified by doing describe extended on the partition. Now we issue the same
query on the old partition projecting (a,b). Surprisingly, it projects (b,
c). Is this the expected behaviour or am I missing something obvious?

Coming back to the question of schema evolution, as business usecases grow,
there is a need to add fields in the table. So am I restricted by hive to
add my fields at the end only?

Thanks

-- 
Rajat Khandelwal
Software Engineer

-- 
_
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.