Try:

- filtering the data down as early as possible in the job and dropping 
columns you don’t need (see the first sketch after this list)
- processing fewer partitions of the Hive tables at a time (second sketch)
- caching frequently accessed data, for example dimension tables, lookup 
tables, or other datasets that are read repeatedly (third sketch)
- using the Spark UI to identify the bottlenecked resource
- removing features or columns from the output data until the job runs, then 
adding them back in one at a time
- creating a static dataset small enough to work with, then editing the 
query and retesting repeatedly until you cut the execution time by a 
significant fraction (fourth sketch)
- using the Spark UI or the Spark shell to check for skew and make sure 
partitions are evenly distributed (fifth sketch)
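
For the first point, a minimal sketch for the spark shell (the table and 
column names are made up for illustration; sc is provided by the shell):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Select only the columns you need and filter before any join,
    // so far less data is shuffled across the cluster.
    val orders = hiveContext.sql(
      "SELECT id, amount FROM orders WHERE status = 'ACTIVE'")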
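
For partition pruning, a sketch assuming a table partitioned by a date 
column (the table 'events' and column 'dt' are hypothetical):

    // Restricting the partition column in the WHERE clause lets Hive
    // prune partitions, so each run reads only a slice of the table.
    val oneDay = hiveContext.sql(
      "SELECT user_id, value FROM events WHERE dt = '2016-07-01'")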
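
For caching, a sketch with a small dimension table that the joins hit 
repeatedly (names again hypothetical):

    val dim = hiveContext.sql("SELECT dim_key, dim_name FROM dim_table")
    dim.cache()   // or: hiveContext.cacheTable("dim_table")
    dim.count()   // force materialization before the joins run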
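
For the small static dataset, one way is to write a fixed sample of each 
input once and iterate on the query against the samples (the sample size 
and names here are arbitrary):

    // Build a 10,000-row sample table once; retest against it instead
    // of the full data while tuning the query.
    hiveContext.sql("SELECT * FROM orders LIMIT 10000")
      .write.saveAsTable("orders_sample")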
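
For the skew check, a rough way from the shell is to count rows per 
partition of the joined result (the query here is only a placeholder):

    val joined = hiveContext.sql(
      "SELECT o.id, o.amount, d.dim_name " +
      "FROM orders o JOIN dim_table d ON o.id = d.dim_key")

    // A few huge partitions among many small ones mean the stage
    // stalls on a handful of straggler tasks.
    val sizes = joined.rdd
      .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
      .collect()
    sizes.sortBy(-_._2).take(10).foreach(println)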

> On Jul 18, 2016, at 3:33 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
> 
> Thanks a lot for your reply .
> 
> In fact, here we tried to run the SQL on Kettle, Hive, and Spark Hive (via 
> HiveContext) respectively; the job seems to freeze and never finishes.
> 
> From the 6 tables, we need to read different columns in each table for 
> specific information, then do some simple calculation before output. 
> The join operation is used most in the SQL. 
> 
> Best wishes! 
> 
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> 
> Hi,
> What about the network bandwidth between Hive and Spark? 
> Did it run in Hive before you moved it to Spark?
> Because it's complex, you can use something like the EXPLAIN command to 
> show what is going on.
> 
>> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID 
>> <mailto:zchl.j...@yahoo.com.invalid>> wrote:
>> 
>> The SQL logic in the program is very complex, so I will not describe the 
>> detailed code here. 
>> 
>> 
>> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID 
>> <mailto:zchl.j...@yahoo.com.invalid>> wrote:
>> 
>> 
>> Hi All,  
>> 
>> Here we have one application; it needs to extract different columns from 6 
>> Hive tables and then do some simple calculation. There are around 100,000 
>> rows in each table, and finally it needs to output another table or file 
>> (with a consistent column format).
>> 
>> However, after many days of trying, the Spark Hive job is unbelievably 
>> slow, sometimes almost frozen. There are 5 nodes in the Spark cluster. 
>>  
>> Could anyone offer some help? Any idea or clue would also be good. 
>> 
>> Thanks in advance~
>> 
>> Zhiliang 
>> 
