Hello, Kylin community!

Sometimes my jobs stop unexpectedly. They can stop at any step.


The Kylin log looks like this:
2017-02-13 23:27:01,549 DEBUG [pool-8-thread-8] hbase.HBaseResourceStore:262 : Update row /execute_output/48dee96e-10fd-472b-b466-39505b6e57c0-02 from oldTs: 1486999611524, to newTs: 1486999621545, operation result: true
2017-02-13 23:27:13,384 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:14,387 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:15,388 INFO  [pool-8-thread-8] ipc.Client:842 : Retrying connect to server: jxhdp1datanode29/10.180.212.61:50504. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
2017-02-13 23:27:15,495 INFO  [pool-8-thread-8] mapred.ClientServiceDelegate:273 : Application state is completed. FinalApplicationStatus=KILLED. Redirecting to job history server
2017-02-13 23:27:15,539 DEBUG [pool-8-thread-8] dao.ExecutableDao:210 : updating job output, id: 48dee96e-10fd-472b-b466-39505b6e57c0-02



The CM log looks like this:
Job Name:       Kylin_Cube_Builder_user_all_cube_2_only_msisdn
User Name:      tmn
Queue:  root.tmn
State:  KILLED
Uberized:       false
Submitted:      Sun Feb 12 19:19:24 CST 2017
Started:        Sun Feb 12 19:19:38 CST 2017
Finished:       Sun Feb 12 20:30:13 CST 2017
Elapsed:        1hrs, 10mins, 35sec
Diagnostics:    
Kill job job_1486825738076_4205 received from tmn (auth:SIMPLE) at 10.180.212.38
Job received Kill while in RUNNING state.
Average Map Time        24mins, 48sec



The MapReduce job log shows:
Task KILL is received. Killing attempt!


When this happens, resuming the job succeeds. In other words, the job was not
stopped by an error.


What is the problem?


My Hadoop cluster is very busy, and this situation happens very often.


Can I set the retry count and the retry interval?
