Hello, all.

Has anyone noticed that skew-joins aren't working on Hive 0.11 / Hadoop 0.23?

I've been running the TPC-H benchmarks against Hive 0.11, and none of the 
queries run to completion if hive.optimize.skewjoin is set to true.
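
For reference, here is a minimal sketch of the session settings involved. Only hive.optimize.skewjoin comes from my runs. The hive.skewjoin.key line is an assumption, shown at its documented default, and is there only to illustrate the threshold that decides when a join key is treated as skewed:

```sql
-- Enable the runtime skew-join optimization (the setting that triggers the failures).
set hive.optimize.skewjoin=true;
-- Assumption: left at its default. A join key that appears in more rows than
-- this threshold is treated as skewed and handled in a follow-up map-join job.
set hive.skewjoin.key=100000;
```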

I initially ran into problems like the following:

<quote>
Ended Job = job_1371646843240_1214
java.io.FileNotFoundException: File hdfs://fstaxxx.yyy.yahoo.com/tmp/hive_2013-07-12_03-22-31_737_6843191588894968654/-mr-10004/hive_skew_join_bigkeys_0 does not exist.
</quote> 

Patching Hive 0.11 with HIVE-4646 resolved that problem.

What I see now is that a couple of stages of the query run through 
successfully, after which I get the following output, and the remaining stages 
are skipped.

<quote>
2013-07-12 23:21:02,164 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 15985.47 sec
MapReduce Total cumulative CPU time: 0 days 4 hours 26 minutes 25 seconds 470 msec
Ended Job = job_1371646843240_1295
Stage-10 is filtered out by condition resolver.
MapReduce Jobs Launched:
Job 0: Map: 380  Reduce: 118   Cumulative CPU: 15900.35 sec   HDFS Read: 24574270287 HDFS Write: 4925478398 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 25 minutes 0 seconds 350 msec
OK
Time taken: 109.411 seconds
FAILED: SemanticException [Error 10001]: Line 10:5 Table not found 'q16_tmp_cached'
</quote>

In this particular case, the query is q16_parts_supplier_relationship.hive, 
part of which looks like:

<quote>
create table q16_tmp_cached as
select
  p_brand, p_type, p_size, ps_suppkey
from
  partsupp ps join part p
  on
    p.p_partkey = ps.ps_partkey and p.p_brand <> 'Brand#45'
    and not p.p_type like 'MEDIUM POLISHED%'
  join supplier_tmp_cached s
  on
    ps.ps_suppkey = s.s_suppkey;
</quote>

If I can isolate the problem to a smaller test case, I'll raise a JIRA. I was 
hoping one of you might have seen this already, or might have a better handle 
on how skew-joins work in Hive 0.11.

Many thanks,
Mithun
