Hello, all. Has anyone noticed that skew-joins aren't working on Hive 0.11 / Hadoop 0.23?
I've been running the TPC-H benchmarks against Hive 0.11, and none of the queries run through when hive.optimize.skewjoin is set to true. I initially ran into failures like the following:

<quote>
Ended Job = job_1371646843240_1214
java.io.FileNotFoundException: File hdfs://fstaxxx.yyy.yahoo.com/tmp/hive_2013-07-12_03-22-31_737_6843191588894968654/-mr-10004/hive_skew_join_bigkeys_0 does not exist.
</quote>

Patching Hive 0.11 with HIVE-4646 resolved that problem. What I see now is that a couple of stages of the query run through successfully, after which I get the following output, and the remaining stages are skipped:

<quote>
2013-07-12 23:21:02,164 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 15985.47 sec
MapReduce Total cumulative CPU time: 0 days 4 hours 26 minutes 25 seconds 470 msec
Ended Job = job_1371646843240_1295
Stage-10 is filtered out by condition resolver.
MapReduce Jobs Launched:
Job 0: Map: 380 Reduce: 118 Cumulative CPU: 15900.35 sec HDFS Read: 24574270287 HDFS Write: 4925478398 SUCCESS
Total MapReduce CPU Time Spent: 0 days 4 hours 25 minutes 0 seconds 350 msec
OK
Time taken: 109.411 seconds
FAILED: SemanticException [Error 10001]: Line 10:5 Table not found 'q16_tmp_cached'
</quote>

In this particular case, the query is q16_parts_supplier_relationship.hive, part of which looks like:

<quote>
create table q16_tmp_cached as
select p_brand, p_type, p_size, ps_suppkey
from partsupp ps
join part p
  on p.p_partkey = ps.ps_partkey
  and p.p_brand <> 'Brand#45'
  and not p.p_type like 'MEDIUM POLISHED%'
join supplier_tmp_cached s
  on ps.ps_suppkey = s.s_suppkey;
</quote>

If I can isolate the problem to a smaller test case, I'll raise a JIRA. I was hoping one of you might have seen this already, or might have a better handle on how skew joins work in Hive 0.11.

Many thanks,
Mithun