Bzip2 is splittable for text files.
By the way, in ORC the question of splittability does not matter, because each stripe is compressed individually. Have you tried Tez? As far as I recall (at least this was the case in the first version of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck. Do you see any errors in the log file?

> On 28 Jun 2016, at 23:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> I have a simple join between table sales2 a compressed (snappy) ORC with 22
> million rows and another simple table sales_staging under a million rows
> stored as a text file with no compression.
>
> The join is very simple
>
> val s2 = HiveContext.table("sales2").select("PROD_ID")
> val s = HiveContext.table("sales_staging").select("PROD_ID")
>
> val rs =
> s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
>
> Now what is happening is it is sitting on SortMergeJoin operation on
> ZippedPartitionRDD as shown in the DAG diagram below
>
> <image.png>
>
> And at this rate only 10% is done and will take for ever to finish :(
>
> [Stage 3:==>                                            (10 + 2) / 200]
>
> Ok I understand that zipped files cannot be broken into blocks and operations
> on them cannot be parallelized.
>
> Having said that what are the alternatives? Never use compression and live
> with it. I emphasise that any operation on the compressed table itself is
> pretty fast as it is a simple table scan. However, a join between two tables
> on a column as above suggests seems to be problematic?
>
> Thanks
>
> P.S. the same is happening using Hive with MR
>
> select a.prod_id from sales2 a inner join sales_staging b on a.prod_id =
> b.prod_id order by a.prod_id;
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk.
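
On the Tez suggestion at the top of this reply: a minimal sketch of rerunning the Hive query from the quoted message on Tez rather than MR. It assumes Tez is installed and configured on the cluster, which nothing in this thread confirms. The query is the one from the quoted message, only reflowed.

```sql
-- Assumption: Tez is available to this Hive installation.
-- hive.execution.engine selects the execution engine for the session.
set hive.execution.engine=tez;

-- Same query as in the quoted message, unchanged apart from layout.
select a.prod_id
from sales2 a
inner join sales_staging b
  on a.prod_id = b.prod_id
order by a.prod_id;
```

Running `set hive.execution.engine=mr;` switches the session back, so the two engines can be compared on the same query.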