Does the same happen if all the tables are in ORC format? It might just be 
simpler to convert the text table to ORC, since it is rather small.
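
A minimal sketch of what I mean, reusing the HiveContext val from your
snippets and assuming Spark 1.4 or later; the target table name
sales_staging_orc is purely illustrative and I have not run this against
your data:

  // one-off conversion of the small text table into an ORC-backed table
  HiveContext.table("sales_staging")
    .write
    .format("orc")
    .saveAsTable("sales_staging_orc")

You could then point the join at sales_staging_orc instead (or do the
equivalent CREATE TABLE ... STORED AS ORC AS SELECT on the Hive side).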

> On 29 Jun 2016, at 15:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi all,
> 
> It finished in 2 hours 18 minutes!
> 
> Started at
> [29/06/2016 10:25:27.27]
> [148]
> [148]
> [148]
> [148]
> [148]
> Finished at
> [29/06/2016 12:43:33.33]
> 
> I need to dig in more. 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 29 June 2016 at 10:42, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Focusing on the Spark job: as I mentioned before, Spark is running in 
>> local mode with 8 GB of memory for both the driver and the executor.
>> 
>> However, I still see this enormous duration, which indicates something 
>> is badly wrong!
>> 
>> I have also got rid of the groupBy:
>> 
>>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>>   val rs = s2.join(s, "prod_id").sort(desc("prod_id")).take(5).foreach(println)
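>> 
>> One thing I have not tried yet: since sales_staging is well under a 
>> million rows, broadcasting it might let Spark skip the SortMergeJoin 
>> shuffle entirely. A sketch only (assumes Spark 1.5+ for the broadcast 
>> hint; s2 and s are the same DataFrames as above):
>> 
>>   import org.apache.spark.sql.functions.{broadcast, desc}
>> 
>>   // broadcast the small side so the join happens map-side
>>   s2.join(broadcast(s), "prod_id")
>>     .sort(desc("prod_id"))
>>     .take(5)
>>     .foreach(println)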
>> 
>> 
>> <image.png>
>> 
>> 
>> 
>> 
>>> On 29 June 2016 at 10:18, Jörn Franke <jornfra...@gmail.com> wrote:
>>> 
>>> I think the Tez engine is much better maintained with respect to 
>>> optimizations around ORC, Hive, vectorization and querying than the MR 
>>> engine. It will definitely be better to use it.
>>> MR is also deprecated as of Hive 2.0.
>>> For me it does not make sense to use MR with Hive versions later than 1.1.
>>> 
>>> As I said, ORDER BY might be inefficient to use (not sure if this has 
>>> changed). You may want to use SORT BY instead.
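>>> 
>>> Roughly what I mean, as an untested sketch using the names from your 
>>> query (on the Hive side the same DISTRIBUTE BY / SORT BY rewrite avoids 
>>> the single ORDER BY reducer, whether on MR or Tez). Keep in mind that 
>>> SORT BY only orders within each reducer, so it is not a total order:
>>> 
>>>   HiveContext.sql(
>>>     """SELECT a.prod_id
>>>        FROM sales2 a JOIN sales_staging b ON a.prod_id = b.prod_id
>>>        DISTRIBUTE BY a.prod_id SORT BY a.prod_id""").show(5)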
>>> 
>>> That being said there are many optimizations methods.
>>> 
>>>> On 29 Jun 2016, at 00:27, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>> wrote:
>>>> 
>>>> That is a good point.
>>>> 
>>>> The ORC table properties are as follows:
>>>> 
>>>> TBLPROPERTIES ( "orc.compress"="SNAPPY",
>>>> "orc.stripe.size"="268435456",
>>>> "orc.row.index.stride"="10000")
>>>> 
>>>> which puts each stripe at 256 MB.
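>>>> 
>>>> For reference, the DDL is along these lines (trimmed here to the one 
>>>> relevant column for illustration, so not the exact statement; 
>>>> 268435456 bytes = 256 * 1024 * 1024):
>>>> 
>>>>   HiveContext.sql(
>>>>     """CREATE TABLE sales2 (prod_id BIGINT)
>>>>        STORED AS ORC
>>>>        TBLPROPERTIES ("orc.compress"="SNAPPY",
>>>>                       "orc.stripe.size"="268435456",
>>>>                       "orc.row.index.stride"="10000")""")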
>>>> 
>>>> Just to clarify, this is Spark running against Hive tables. I don't 
>>>> think the choice of Tez, MR or Spark as the execution engine is going 
>>>> to make any difference?
>>>> 
>>>> This is the same query with Hive on MR:
>>>> 
>>>> select a.prod_id from sales2 a, sales_staging b where a.prod_id = b.prod_id order by a.prod_id;
>>>> 
>>>> 2016-06-28 23:23:51,203 Stage-1 map = 0%,  reduce = 0%
>>>> 2016-06-28 23:23:59,480 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.32 sec
>>>> 2016-06-28 23:24:08,771 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 18.21 sec
>>>> 2016-06-28 23:24:11,860 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 22.34 sec
>>>> 2016-06-28 23:24:18,021 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 30.33 sec
>>>> 2016-06-28 23:24:21,101 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 33.45 sec
>>>> 2016-06-28 23:24:24,181 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 37.5 sec
>>>> 2016-06-28 23:24:27,270 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 42.0 sec
>>>> 2016-06-28 23:24:30,349 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 45.62 sec
>>>> 2016-06-28 23:24:33,441 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 49.69 sec
>>>> 2016-06-28 23:24:36,521 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 52.92 sec
>>>> 2016-06-28 23:24:39,605 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 56.78 sec
>>>> 2016-06-28 23:24:42,686 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 60.36 sec
>>>> 2016-06-28 23:24:45,767 Stage-1 map = 81%,  reduce = 0%, Cumulative CPU 63.68 sec
>>>> 2016-06-28 23:24:48,842 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 66.92 sec
>>>> 2016-06-28 23:24:51,918 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 70.18 sec
>>>> 2016-06-28 23:25:52,354 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 127.99 sec
>>>> 2016-06-28 23:25:57,494 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 134.64 sec
>>>> 2016-06-28 23:26:57,847 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 141.01 sec
>>>> 
>>>> which basically sits at 67% all day
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 28 June 2016 at 23:07, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> Bzip2 is splittable for text files.
>>>>> 
>>>>> By the way, in ORC splittability is not an issue, because each stripe 
>>>>> is compressed individually.
>>>>> 
>>>>> Have you tried Tez? As far as I recall (at least it was the case in 
>>>>> the first versions of Hive), MR uses a single reducer for ORDER BY, 
>>>>> which is a bottleneck.
>>>>> 
>>>>> Do you see some errors in the log file?
>>>>> 
>>>>>> On 28 Jun 2016, at 23:53, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> 
>>>>>> I have a simple join between table sales2, a Snappy-compressed ORC 
>>>>>> table with 22 million rows, and another simple table sales_staging 
>>>>>> with under a million rows, stored as an uncompressed text file.
>>>>>> 
>>>>>> The join is very simple
>>>>>> 
>>>>>>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>>>>>>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>>>>>> 
>>>>>>   val rs = s2.join(s, "prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
>>>>>> 
>>>>>> 
>>>>>> What is happening is that it is sitting on the SortMergeJoin 
>>>>>> operation on ZippedPartitionRDD, as shown in the DAG diagram below.
>>>>>> 
>>>>>> 
>>>>>> <image.png>
>>>>>> 
>>>>>> 
>>>>>> And at this rate only 10% is done and it will take forever to finish :(
>>>>>> 
>>>>>> [Stage 3:==>                                                     (10 + 2) / 200]
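>>>>>> 
>>>>>> The 200 here looks like the default spark.sql.shuffle.partitions, 
>>>>>> and in local mode only a couple of those sort/shuffle tasks run at a 
>>>>>> time. One thing I may try is lowering it for this run; the value 8 
>>>>>> below is purely illustrative and untested:
>>>>>> 
>>>>>>   HiveContext.setConf("spark.sql.shuffle.partitions", "8")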
>>>>>> 
>>>>>> OK, I understand that zipped files cannot be broken into blocks, and 
>>>>>> that operations on them cannot be parallelized.
>>>>>> 
>>>>>> Having said that, what are the alternatives? Never use compression 
>>>>>> and live with it? I emphasise that any operation on the compressed 
>>>>>> table itself is pretty fast, as it is a simple table scan. However, 
>>>>>> a join between two tables on a column, as shown above, seems to be 
>>>>>> problematic?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> P.S. The same is happening using Hive with MR:
>>>>>> 
>>>>>> select a.prod_id from sales2 a inner join sales_staging b on a.prod_id = 
>>>>>> b.prod_id order by a.prod_id;
>>>>>> 
> 
