Re: the spark job is so slow - almost frozen

2016-07-21 Thread Gourav Sengupta
Andrew,

you have pretty much consolidated my entire experience, please give a
presentation in a meetup on this, and send across the links :)


Regards,
Gourav

On Wed, Jul 20, 2016 at 4:35 AM, Andrew Ehrlich  wrote:

> Try:
>
> - filtering down the data as early as possible in the job and dropping
> columns you don’t need
> - processing fewer partitions of the Hive tables at a time
> - caching frequently accessed data, for example dimension tables, lookup
> tables, or other datasets that are read repeatedly
> - using the Spark UI to identify the bottlenecked resource
> - removing features or columns from the output data until the job runs, then
> adding them back one at a time
> - creating a static dataset small enough to work with, then editing the query
> and retesting repeatedly until you cut the execution time by a significant
> fraction
> - using the Spark UI or the spark shell to check for skew and make sure
> partitions are evenly distributed (see the sketch after this list)
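
A rough sketch of that last check from the spark shell (Spark 1.6-style API;
the table name below is made up for illustration):

// sqlContext is the HiveContext provided by the spark shell.
// "fact_table" is a hypothetical Hive table name.
val df = sqlContext.table("fact_table")

// Count rows per partition; a few huge counts next to many tiny ones
// usually means the partitioning key is skewed.
val counts = df.rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()

counts.sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx: $n rows")
}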
>
> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu wrote:
>
> Thanks a lot for your reply .
>
> In effect, we tried to run the SQL on Kettle, on Hive, and on Spark with Hive
> (via HiveContext) respectively; in each case the job seems to freeze and
> never finishes.
>
> From the 6 tables we need to read different columns from each table for the
> specific information, then do some simple calculation before writing the
> output.
> Join operations are used most in the SQL.
>
> Best wishes!
>
>
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le  wrote:
>
>
> Hi,
> What about the network bandwidth between Hive and Spark?
> Did it run in Hive before you moved it to Spark?
> Because the SQL is complex, you can use something like the EXPLAIN command to
> see what is going on.
>
>
>
>
>
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu wrote:
>
> The SQL logic in the program is quite complex, so I will not describe the
> detailed code here.
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
>
>
> Hi All,
>
> We have one application that needs to extract different columns from 6 Hive
> tables and then do some simple calculation; there are around 100,000 rows in
> each table. Finally it needs to output another table or file (with a
> consistent column format).
>
>  However, after many days of trying, the Spark job over the Hive tables is
> unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>
> Could anyone offer some help? Any idea or clue would also be welcome.
>
> Thanks in advance~
>
> Zhiliang
>
>
>
>
>
>
>


Re: the spark job is so slow - almost frozen

2016-07-20 Thread Zhiliang Zhu
Thanks a lot for your kind help.  

On Wednesday, July 20, 2016 11:35 AM, Andrew Ehrlich wrote:

Try:

- filtering down the data as early as possible in the job and dropping columns
you don’t need
- processing fewer partitions of the Hive tables at a time (see the sketch
after this list)
- caching frequently accessed data, for example dimension tables, lookup
tables, or other datasets that are read repeatedly
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until the job runs, then
adding them back one at a time
- creating a static dataset small enough to work with, then editing the query
and retesting repeatedly until you cut the execution time by a significant
fraction
- using the Spark UI or the spark shell to check for skew and make sure
partitions are evenly distributed
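
A minimal sketch of the "fewer partitions at a time" idea, assuming the Hive
tables are partitioned by a date column (the table and column names below are
hypothetical):

import org.apache.spark.sql.hive.HiveContext

// Spark 1.x style; sc is an existing SparkContext (e.g. from spark-shell).
val sqlContext = new HiveContext(sc)

// Restrict the scan to one Hive partition at a time instead of the whole
// table; "events" and "dt" are made-up names.
val oneDay = sqlContext.sql(
  "SELECT user_id, amount FROM events WHERE dt = '2016-07-18'")

println(oneDay.count())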

On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu  wrote:
Thanks a lot for your reply.

In effect, we tried to run the SQL on Kettle, on Hive, and on Spark with Hive
(via HiveContext) respectively; in each case the job seems to freeze and never
finishes.

From the 6 tables we need to read different columns from each table for the
specific information, then do some simple calculation before writing the
output. Join operations are used most in the SQL.

Best wishes!

 

On Monday, July 18, 2016 6:24 PM, Chanh Le  wrote:
 

Hi,

What about the network bandwidth between Hive and Spark?
Did it run in Hive before you moved it to Spark?
Because the SQL is complex, you can use something like the EXPLAIN command to
see what is going on.



 
On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu  wrote:
The SQL logic in the program is quite complex, so I will not describe the
detailed code here.

On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu wrote:
 

Hi All,

We have one application that needs to extract different columns from 6 Hive
tables and then do some simple calculation; there are around 100,000 rows in
each table. Finally it needs to output another table or file (with a
consistent column format).

However, after many days of trying, the Spark job over the Hive tables is
unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.

Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang 

Re: the spark job is so slow - almost frozen

2016-07-19 Thread Andrew Ehrlich
Try:

- filtering down the data as early as possible in the job and dropping columns
you don’t need (see the sketch after this list)
- processing fewer partitions of the Hive tables at a time
- caching frequently accessed data, for example dimension tables, lookup
tables, or other datasets that are read repeatedly
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until the job runs, then
adding them back one at a time
- creating a static dataset small enough to work with, then editing the query
and retesting repeatedly until you cut the execution time by a significant
fraction
- using the Spark UI or the spark shell to check for skew and make sure
partitions are evenly distributed
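
As a rough illustration of the filtering, column-pruning, and caching points
(all table, column, and filter names below are made up):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

// Filter rows and select only the needed columns as early as possible,
// before any joins. Names are hypothetical.
val orders = sqlContext.table("orders")
  .filter("order_date >= '2016-07-01'")
  .select("order_id", "customer_id", "amount")

// Cache a small lookup table that is joined against repeatedly.
val customers = sqlContext.table("customers")
  .select("customer_id", "region")
  .cache()

val joined = orders.join(customers, "customer_id")
joined.explain()  // inspect the plan before running the full job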

> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu  wrote:
> 
> Thanks a lot for your reply .
> 
> In effect, we tried to run the SQL on Kettle, on Hive, and on Spark with Hive
> (via HiveContext) respectively; in each case the job seems to freeze and
> never finishes.
> 
> From the 6 tables we need to read different columns from each table for the
> specific information, then do some simple calculation before writing the
> output.
> Join operations are used most in the SQL.
> 
> Best wishes! 
> 
> 
> 
> 
> On Monday, July 18, 2016 6:24 PM, Chanh Le  wrote:
> 
> 
> Hi,
> What about the network bandwidth between Hive and Spark?
> Did it run in Hive before you moved it to Spark?
> Because the SQL is complex, you can use something like the EXPLAIN command to
> see what is going on.
> 
> 
> 
> 
>  
>> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu wrote:
>> 
>> The SQL logic in the program is quite complex, so I will not describe the
>> detailed code here.
>> 
>> 
>> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu wrote:
>> 
>> 
>> Hi All,  
>> 
>> We have one application that needs to extract different columns from 6 Hive
>> tables and then do some simple calculation; there are around 100,000 rows in
>> each table. Finally it needs to output another table or file (with a
>> consistent column format).
>> 
>> However, after many days of trying, the Spark job over the Hive tables is
>> unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>> 
>> Could anyone offer some help? Any idea or clue would also be welcome.
>> 
>> Thanks in advance~
>> 
>> Zhiliang 
>> 
>> 
> 
> 
> 



Re: the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
Thanks a lot for your reply.

In effect, we tried to run the SQL on Kettle, on Hive, and on Spark with Hive
(via HiveContext) respectively; in each case the job seems to freeze and never
finishes.

From the 6 tables we need to read different columns from each table for the
specific information, then do some simple calculation before writing the
output. Join operations are used most in the SQL.
Best wishes! 

 

On Monday, July 18, 2016 6:24 PM, Chanh Le  wrote:
 

Hi,

What about the network bandwidth between Hive and Spark?
Did it run in Hive before you moved it to Spark?
Because the SQL is complex, you can use something like the EXPLAIN command to
see what is going on.



 
On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu  wrote:
The SQL logic in the program is quite complex, so I will not describe the
detailed code here.

On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu wrote:
 

Hi All,

We have one application that needs to extract different columns from 6 Hive
tables and then do some simple calculation; there are around 100,000 rows in
each table. Finally it needs to output another table or file (with a
consistent column format).

However, after many days of trying, the Spark job over the Hive tables is
unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.

Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang 

Re: the spark job is so slow - almost frozen

2016-07-18 Thread Chanh Le
Hi,
What about the network bandwidth between Hive and Spark?
Did it run in Hive before you moved it to Spark?
Because the SQL is complex, you can use something like the EXPLAIN command to
see what is going on.
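
For example, from the Spark side a minimal sketch of looking at the plan (the
query and table names here are placeholders):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

// Hive-style EXPLAIN: returns the plan as rows of text.
val plan = sqlContext.sql(
  "EXPLAIN SELECT a.id, b.name FROM t1 a JOIN t2 b ON a.id = b.id")
plan.collect().foreach(println)

// DataFrame-style: print the logical and physical plans.
val df = sqlContext.sql(
  "SELECT a.id, b.name FROM t1 a JOIN t2 b ON a.id = b.id")
df.explain(true)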




 
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu  wrote:
> 
> The SQL logic in the program is quite complex, so I will not describe the
> detailed code here.
> 
> 
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu  
> wrote:
> 
> 
> Hi All,  
> 
> We have one application that needs to extract different columns from 6 Hive
> tables and then do some simple calculation; there are around 100,000 rows in
> each table. Finally it needs to output another table or file (with a
> consistent column format).
> 
> However, after many days of trying, the Spark job over the Hive tables is
> unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.
> 
> Could anyone offer some help? Any idea or clue would also be welcome.
> 
> Thanks in advance~
> 
> Zhiliang 
> 
> 



Re: the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
The SQL logic in the program is quite complex, so I will not describe the
detailed code here.

On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu 
 wrote:
 

Hi All,

We have one application that needs to extract different columns from 6 Hive
tables and then do some simple calculation; there are around 100,000 rows in
each table. Finally it needs to output another table or file (with a
consistent column format).

However, after many days of trying, the Spark job over the Hive tables is
unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.

Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang 

  

the spark job is so slow - almost frozen

2016-07-18 Thread Zhiliang Zhu
Hi All,

We have one application that needs to extract different columns from 6 Hive
tables and then do some simple calculation; there are around 100,000 rows in
each table. Finally it needs to output another table or file (with a
consistent column format).

However, after many days of trying, the Spark job over the Hive tables is
unbearably slow, sometimes almost frozen. The Spark cluster has 5 nodes.

Could anyone offer some help? Any idea or clue would also be welcome.
Thanks in advance~
Zhiliang
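
In case it helps to make the setup concrete, a stripped-down sketch of the kind
of job described above, using HiveContext (all table, column, and output names
are invented; the real query is much more complex):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

// Read only the needed columns from each Hive table (2 of the 6 shown).
val t1 = sqlContext.table("table1").select("id", "col_a")
val t2 = sqlContext.table("table2").select("id", "col_b")

// Join on a shared key and do a simple calculation.
val result = t1.join(t2, "id")
  .selectExpr("id", "col_a + col_b AS total")

// Write the consolidated output back as a Hive table.
result.write.mode("overwrite").saveAsTable("output_table")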