Hi,

I have a simple join between table sales2 a compressed (snappy) ORC with 22
million rows and another simple table sales_staging under a million rows
stored as a text file with no compression.

The join is very simple

  val s2 = HiveContext.table("sales2").select("PROD_ID")
  val s = HiveContext.table("sales_staging").select("PROD_ID")

  val rs =
s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)


Now what is happening is it is sitting on SortMergeJoin operation
on ZippedPartitionRDD as shown in the DAG diagram below


[image: Inline images 1]


And at this rate  only 10% is done and will take for ever to finish :(

Stage 3:==>                                                     (10 + 2) /
200]

Ok I understand that zipped files cannot be broken into blocks and
operations on them cannot be parallelized.

Having said that what are the alternatives? Never use compression and live
with it. I emphasise that any operation on the compressed table itself is
pretty fast as it is a simple table scan. However, a join between two
tables on a column as above suggests seems to be problematic?

Thanks

P.S. the same is happening using Hive with MR

select a.prod_id from sales2 a inner join sales_staging b on a.prod_id =
b.prod_id order by a.prod_id;

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Reply via email to