As part of some fairly complex processing, I am executing a self-join query using HiveContext against a Hive table to find the latest transaction, oldest transaction, etc. for a given set of attributes. I am still on Spark 1.3.1, so Window functions are not an option. The simplified query looks like this:
    val df = hiveContext.sql("""
      SELECT TAB1.KEY1, TAB1.KEY2,
             MAX(CASE WHEN (TAB1.FLD10 = TAB2.min_FLD10) THEN TAB1.FLD11 ELSE -9999999 END) AS NEW_FLD
      FROM TAB1
      INNER JOIN (
        SELECT KEY1, KEY2, MIN(FLD10) AS min_FLD10
        FROM TAB1
        WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
        GROUP BY KEY1, KEY2
      ) TAB2
        ON TAB1.KEY1 = TAB2.KEY1 AND TAB1.KEY2 = TAB2.KEY2
      WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
      GROUP BY TAB1.KEY1, TAB1.KEY2
    """)

I see that ~18,000 HDFS blocks are read TWICE, and then the shuffle happens. Is there a way to avoid reading the same blocks twice during the map stage? Is there a way to avoid the shuffle? Thank you.
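One workaround I have been considering (a sketch only, not tested on a real cluster; the table and column names are the ones from my query above, and `TAB1_CACHED` is a name I made up) is to read the partition range once, `cache()` it, register it as a temp table, and run both sides of the self-join against the cached table, so the second pass reads from memory instead of re-scanning the HDFS blocks:

```scala
// Assumes an existing HiveContext named `hiveContext` (Spark 1.3.1 API).

// Read the filtered partition range ONCE and cache it.
val base = hiveContext.sql("""
  SELECT KEY1, KEY2, FLD10, FLD11
  FROM TAB1
  WHERE partition_key >= '2015-01-01' AND partition_key < '2015-07-01'
""").cache()

// Expose the cached DataFrame to SQL under a hypothetical temp-table name.
base.registerTempTable("TAB1_CACHED")

// Both the outer query and the inner aggregate now hit the cache,
// not the underlying Hive table.
val df = hiveContext.sql("""
  SELECT T1.KEY1, T1.KEY2,
         MAX(CASE WHEN T1.FLD10 = T2.min_FLD10 THEN T1.FLD11 ELSE -9999999 END) AS NEW_FLD
  FROM TAB1_CACHED T1
  INNER JOIN (
    SELECT KEY1, KEY2, MIN(FLD10) AS min_FLD10
    FROM TAB1_CACHED
    GROUP BY KEY1, KEY2
  ) T2
    ON T1.KEY1 = T2.KEY1 AND T1.KEY2 = T2.KEY2
  GROUP BY T1.KEY1, T1.KEY2
""")
```

My understanding is that this can only address the double read: the GROUP BY and the join still need to co-locate rows by key, so some shuffle seems unavoidable unless the data is already bucketed/partitioned by (KEY1, KEY2). Is that correct, or is there a better approach?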