the spark job is so slow during shuffle - almost frozen

Zhiliang Zhu Mon, 18 Jul 2016 20:53:46 -0700

Hi all,
The Spark UI shows 198/200 and the job is almost frozen. During the shuffle
stage of one task, most executors show 0 bytes, but a single executor shows
1 GB.
Moreover, in several of the join operations, one table or pair RDD has only
40 keys, while the other table has about 10,000 keys.
Could this be a data skew issue?
Any help or comment would be deeply appreciated.
Thanks in advance.
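
As a rough illustration of why so few keys can leave most shuffle tasks idle: 40 distinct keys can occupy at most 40 of 200 hash partitions, so at least 160 stay empty, and a single heavy key puts nearly all records into one partition. The plain-Python sketch below simulates this and does not use Spark. The record counts, the key names, the helper partition_sizes, and the use of crc32 as the hash are all assumptions made for illustration; 200 is taken from the 198/200 shown in the UI and is assumed to be the number of shuffle partitions.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions=200):
    """Count how many records land in each hash partition."""
    sizes = Counter()
    for key in keys:
        # crc32 is a deterministic stand-in, not the hash Spark actually uses.
        partition = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        sizes[partition] += 1
    return sizes


# Hypothetical data: 40 distinct keys, one of which ("k0") dominates.
records = ["k0"] * 100_000 + [f"k{i}" for i in range(1, 40) for _ in range(10)]

sizes = partition_sizes(records)
non_empty = len(sizes)          # cannot exceed the number of distinct keys
largest = max(sizes.values())   # the partition that received the heavy key

print(f"non-empty partitions: {non_empty} of 200")
print(f"largest partition holds {largest} of {len(records)} records")
```

If the real per-key counts (for example from a GROUP BY on the join key) look like this, the slow tasks are probably the ones that received the heavy keys.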

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 We have an application that needs to extract different columns from 6
Hive tables and then do some simple calculations. There are around 100,000
rows in each table. Finally, it needs to output another table or file
(with a consistent column format).

 However, after many days of trying, the Spark Hive job is still extremely
slow, sometimes almost frozen. The Spark cluster has 5 nodes.

 Could anyone offer some help? Any idea or clue would be welcome.

 Thanks in advance.



On Tuesday, July 19, 2016 11:05 AM, Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

 Hi Mungeol,
Thanks a lot for your help. I will try that. 

    On Tuesday, July 19, 2016 9:21 AM, Mungeol Heo <mungeol....@gmail.com> 
wrote:
 

 Try running an action at an intermediate stage of your job, such as
save or insertInto.
I hope it helps.

On Mon, Jul 18, 2016 at 7:33 PM, Zhiliang Zhu
<zchl.j...@yahoo.com.invalid> wrote:
> Thanks a lot for your reply.
>
> In fact, we tried running the SQL on Kettle, Hive, and Spark Hive (via
> HiveContext) respectively, and the job seems to freeze before finishing.
>
> For the 6 tables, we need to read different columns from each table for
> specific information, then do some simple calculations before output.
> Joins are the most-used operation in the SQL.
>
> Best wishes!
>
>
>
>
> On Monday, July 18, 2016 6:24 PM, Chanh Le <giaosu...@gmail.com> wrote:
>
>
> Hi,
> What about the network (bandwidth) between Hive and Spark?
> Did it run in Hive before you moved to Spark?
> Because it's complex, you can use something like the EXPLAIN command to
> show what is going on.
>
>
>
>
>
>
> On Jul 18, 2016, at 5:20 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
> wrote:
>
> The SQL logic in the program is very complex, so I will not describe the
> detailed code here.
>
>
> On Monday, July 18, 2016 6:04 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
> wrote:
>
>
> Hi All,
>
> We have an application that needs to extract different columns from 6
> Hive tables and then do some simple calculations. There are around 100,000
> rows in each table.
> Finally, it needs to output another table or file (with a consistent
> column format).
>
> However, after many days of trying, the Spark Hive job is still extremely
> slow, sometimes almost frozen. The Spark cluster has 5 nodes.
>
> Could anyone offer some help? Any idea or clue would be welcome.
>
> Thanks in advance.
>
> Zhiliang
>
