Re: Self Join reading the HDFS blocks TWICE

2015-09-29 Thread Michael Armbrust
You could try caching the table. This would avoid the double read, but not the shuffle (at least today with the current optimizer).

On Tue, Sep 29, 2015 at 5:21 PM, Data Science Education <datasci...@gmail.com> wrote:
> As part of fairly complex processing, I am executing a self join query
> us
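
The caching suggestion in the reply above can be sketched roughly as follows. This is only an illustration, assuming PySpark 1.3.x with a `HiveContext`. The original query is truncated in this archive, so the table name `transactions`, the columns `account_id` and `txn_ts`, and both helper functions are hypothetical. Only `cacheTable` and `sql` are actual context methods.

```python
def build_self_join_sql(table, key_col, ts_col):
    """Build a self-join that keeps the latest row per key.

    The schema is hypothetical; the thread's real query is not shown.
    """
    return (
        "SELECT a.* FROM {t} a "
        "JOIN (SELECT {k}, MAX({ts}) AS max_ts FROM {t} GROUP BY {k}) b "
        "ON a.{k} = b.{k} AND a.{ts} = b.max_ts"
    ).format(t=table, k=key_col, ts=ts_col)


def cached_self_join(ctx, table, key_col, ts_col):
    """Mark the table as cached, then run the self-join against it.

    ctx is assumed to be a pyspark.sql.HiveContext (Spark 1.3.x).
    """
    # Ask Spark SQL to keep the table in its in-memory columnar cache so
    # both sides of the self-join can read from memory instead of HDFS.
    ctx.cacheTable(table)
    return ctx.sql(build_self_join_sql(table, key_col, ts_col))
```

With a real context this might be called as `cached_self_join(HiveContext(sc), "transactions", "account_id", "txn_ts")`. As the reply notes, caching is expected to avoid the second HDFS read but not the shuffle for the join.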

Self Join reading the HDFS blocks TWICE

2015-09-29 Thread Data Science Education
As part of fairly complex processing, I am executing a self-join query using HiveContext against a Hive table to find the latest transaction, the oldest transaction, etc. for a given set of attributes. I am still using v1.3.1, so window functions are not an option. The simplified query looks like bel