Bzip2 is splittable for text files.

Btw, with ORC the question of splittability does not matter, because each 
stripe is compressed individually.

Have you tried Tez? As far as I recall (at least this was the case in the first 
version of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck.
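
To illustrate the single-reducer point, here is a minimal plain-Scala sketch. It 
is only a toy model, not how Hive or Spark actually implement sorting: the 
object name, the sample prod_id values and both helper functions are made up 
for illustration. It contrasts sending every row through one sorter with 
letting each partition keep only its own top n before a small final merge.

```scala
// Toy illustration only: plain Scala collections stand in for partitions/reducers.
object TopNSketch {
  // Hypothetical data: each inner Seq plays the role of one partition of prod_id values.
  val partitions: Seq[Seq[Int]] = Seq(
    Seq(14, 3, 27, 8),
    Seq(45, 1, 19, 33),
    Seq(6, 52, 11, 40)
  )

  // "Single reducer" style: gather every row in one place, sort it all, then take n.
  def globalSortTopN(parts: Seq[Seq[Int]], n: Int): Seq[Int] =
    parts.flatten.sorted(Ordering[Int].reverse).take(n)

  // Parallel-friendly style: each partition keeps only its own top n,
  // so the final merge sees at most n * (number of partitions) rows.
  def partitionedTopN(parts: Seq[Seq[Int]], n: Int): Seq[Int] =
    parts
      .flatMap(_.sorted(Ordering[Int].reverse).take(n))
      .sorted(Ordering[Int].reverse)
      .take(n)

  def main(args: Array[String]): Unit = {
    println(globalSortTopN(partitions, 2))
    println(partitionedTopN(partitions, 2))
  }
}
```

Both functions return the same rows; the difference is how much data has to 
pass through the final single step.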

Do you see any errors in the log file?

> On 28 Jun 2016, at 23:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi,
> 
> 
> I have a simple join between table sales2 a compressed (snappy) ORC with 22 
> million rows and another simple table sales_staging under a million rows 
> stored as a text file with no compression.
> 
> The join is very simple
> 
>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>   val s = HiveContext.table("sales_staging").select("PROD_ID")
> 
>   val rs = s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
> 
> 
> Now what is happening is it is sitting on SortMergeJoin operation on 
> ZippedPartitionRDD as shown in the DAG diagram below
> 
> 
> <image.png>
> 
> 
> And at this rate only 10% is done, and it will take forever to finish :(
> 
> [Stage 3:==>                                                     (10 + 2) / 200]
> 
> Ok I understand that zipped files cannot be broken into blocks and operations 
> on them cannot be parallelized.
> 
> Having said that, what are the alternatives? Never use compression and live 
> with it? I emphasise that any operation on the compressed table itself is 
> pretty fast, as it is a simple table scan. However, a join between two tables 
> on a column, as shown above, seems to be problematic.
> 
> Thanks
> 
> P.S. the same is happening using Hive with MR
> 
> select a.prod_id from sales2 a inner join sales_staging b on a.prod_id = b.prod_id order by a.prod_id;
> 
> Dr Mich Talebzadeh
>  
