Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, I’m not sure I understand your initial question… Depending on the compression algo, you may or may not be able to split the file. So if it's not splittable, you have a single long-running thread. My guess is that you end up with a very long single partition. If so, if you repartition,
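
A minimal spark-shell sketch of the repartitioning idea, assuming Spark 1.x where sc is predefined; the gzip file path and the target partition count of 200 are hypothetical, not from the thread. A gzip-compressed text file is not splittable, so it arrives as a single partition, and repartition() redistributes the rows before any heavy downstream work.

// Non-splittable input (e.g. gzip) lands in one partition, so downstream work
// runs as one long task. Path and partition count below are made up.
val raw = sc.textFile("/data/staging/sales_staging.txt.gz")
println(s"partitions before: ${raw.partitions.length}")   // typically 1 for gzip

val spread = raw.repartition(200)                          // shuffle into 200 partitions
println(s"partitions after: ${spread.partitions.length}")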

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke
Does the same happen if all the tables are in ORC format? It might be just simpler to convert the text table to ORC since it is rather small.

> On 29 Jun 2016, at 15:14, Mich Talebzadeh wrote:
> Hi all,
> It finished in 2 hours 18 minutes!
> Started at
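
A hedged spark-shell sketch of that conversion, assuming a HiveContext instance named hiveContext; the target table name sales_staging_orc and the SNAPPY property are illustrative, not taken from the thread.

// Copy the small text table into an ORC-backed table with CREATE TABLE ... AS SELECT.
// "sales_staging_orc" is a made-up name for the ORC copy.
hiveContext.sql("""
  CREATE TABLE sales_staging_orc
  STORED AS ORC
  TBLPROPERTIES ("orc.compress"="SNAPPY")
  AS SELECT * FROM sales_staging
""")

// The join can then read the ORC copy instead of the uncompressed text table.
val staged = hiveContext.table("sales_staging_orc")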

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke
I think the Tez engine is much more actively maintained with respect to optimizations related to ORC, Hive, vectorization and querying than the MR engine. It will definitely be better to use it. MR is also deprecated in Hive 2.0. For me it does not make sense to use MR with Hive versions later than 1.1. As I

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Timur Shenkao
Hi, guys! As far as I remember, Spark does not use all of ORC's features and optimizations. Moreover, the ability to read ORC files was added to Spark only fairly recently. So, despite the "victorious" results announced in http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ ,
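
One concrete example of this point, as a hedged spark-shell sketch: in Spark 1.x, ORC predicate pushdown is controlled by spark.sql.orc.filterPushdown and is off by default, so it has to be enabled explicitly. The warehouse path and the filter value below are hypothetical.

// Enable ORC predicate pushdown (off by default in Spark 1.x).
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

// Read ORC files directly; the path is illustrative.
val orcDf = hiveContext.read.format("orc").load("/apps/hive/warehouse/sales2")
orcDf.filter("PROD_ID = 100").count()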

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
This is what I am getting in the container log for MR:

2016-06-28 23:25:53,808 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
That is a good point. The ORC table properties are as follows: TBLPROPERTIES ( "orc.compress"="SNAPPY", "orc.stripe.size"="268435456", "orc.row.index.stride"="1" ), which puts each stripe at 256 MB. Just to clarify, this is Spark running on Hive tables. I don't think the use of Tez, MR or Spark as
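
For context, a hedged sketch of where those properties would sit in a full DDL, assuming a hiveContext in spark-shell; the table name sales2_orc and the column list are invented for illustration, and the property values are simply the ones quoted above.

// Illustrative DDL only: table name and columns are made up; the TBLPROPERTIES
// values are copied from the message above (268435456 bytes = 256 MB stripes).
hiveContext.sql("""
  CREATE TABLE sales2_orc (
    PROD_ID BIGINT,
    CUST_ID BIGINT,
    AMOUNT  DOUBLE
  )
  STORED AS ORC
  TBLPROPERTIES (
    "orc.compress"="SNAPPY",
    "orc.stripe.size"="268435456",
    "orc.row.index.stride"="1"
  )
""")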

Re: Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Jörn Franke
Bzip2 is splittable for text files. By the way, in ORC the question of splittability does not matter because each stripe is compressed individually. Have you tried Tez? As far as I recall (at least it was the case in the first versions of Hive), MR uses a single reducer for ORDER BY, which is a bottleneck. Do you
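
To illustrate the ORDER BY point (a hedged sketch, not from the thread): under Hive-on-MR a global ORDER BY is funnelled through a single reducer, and a common workaround when only per-group order is needed is DISTRIBUTE BY plus SORT BY. The AMOUNT column is hypothetical, and note that Spark's own planner handles ORDER BY differently, so the single-reducer issue concerns queries submitted through Hive itself.

// Global total order: under Hive-on-MR this runs through one reducer.
hiveContext.sql("SELECT PROD_ID, AMOUNT FROM sales2 ORDER BY AMOUNT DESC")

// Parallel alternative when a total order is not required:
// rows are partitioned by PROD_ID and sorted within each reducer.
hiveContext.sql("""
  SELECT PROD_ID, AMOUNT FROM sales2
  DISTRIBUTE BY PROD_ID
  SORT BY AMOUNT DESC
""")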

Joining a compressed ORC table with a non compressed text table

2016-06-28 Thread Mich Talebzadeh
Hi, I have a simple join between table sales2, a compressed (Snappy) ORC table with 22 million rows, and another simple table, sales_staging, with under a million rows, stored as a text file with no compression. The join is very simple:

val s2 = HiveContext.table("sales2").select("PROD_ID")
val s =
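
The original snippet is cut off by the archive, so the following is only a hypothetical sketch of how such a join is typically written in Spark 1.x, reusing the poster's naming (HiveContext here is a HiveContext instance); the second select, the join condition and the use of count() to force execution are assumptions, not the poster's actual code.

// Hypothetical reconstruction for illustration only, not the poster's actual code.
val s2 = HiveContext.table("sales2").select("PROD_ID")
val s  = HiveContext.table("sales_staging").select("PROD_ID")

// Inner join on the product id; count() forces the job to run.
val joined = s2.join(s, s2("PROD_ID") === s("PROD_ID"))
println(joined.count())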