Tuo Zhu created KYLIN-4000:
------------------------------

             Summary: Merge task hang at 
                 Key: KYLIN-4000
                 URL: https://issues.apache.org/jira/browse/KYLIN-4000
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
    Affects Versions: v2.6.1
            Reporter: Tuo Zhu


Auto triggered merge tasks all hanged at step: Convert Cuboid Data to HFile. 
The last reduce step in "Kylin_HFile_Generator_mycube_Step".

Hadoop UI shows job status:
 attempt_1555603539483_1251_r_000000_0 78.56 RUNNING reduce > reduce

>From the syslog, following outputs are printed at first few hours:

{code}
 2019-05-10 22:27:48,699 INFO [IPC Server handler 27 on 46084] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1555603539483_1149_r_000000_0 is : 0.8629564
 2019-05-10 22:28:24,743 INFO [Socket Reader #1 for port 46084|#1 for port 
46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:28:24,746 INFO [IPC Server handler 23 on 46084] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1555603539483_1149_r_000000_0 is : 0.86303306
 2019-05-10 22:29:00,794 INFO [Socket Reader #1 for port 46084|#1 for port 
46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:29:00,796 INFO [IPC Server handler 20 on 46084] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1555603539483_1149_r_000000_0 is : 0.86305153
 2019-05-10 22:29:36,833 INFO [Socket Reader #1 for port 46084|#1 for port 
46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:29:36,836 INFO [IPC Server handler 3 on 46084] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1555603539483_1149_r_000000_0 is : 0.8634469
 2019-05-10 22:30:12,883 INFO [Socket Reader #1 for port 46084|#1 for port 
46084] SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for 
job_1555603539483_1149 (auth:SIMPLE)
 2019-05-10 22:30:12,887 INFO [IPC Server handler 18 on 46084] 
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt 
attempt_1555603539483_1149_r_000000_0 is : 0.8643631
{code}


 After that, only ZK logs keeps printed following lines constantly:

{code}
2019-05-11 00:02:23,445 WARN [Thread-1388778] 
org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not 
chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
 2019-05-11 00:02:23,452 INFO [main] 
org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Connect 0x29758ab9 to 
boron:2181,carbon:2181,neon:2181 with session timeout=90000ms, retries 6, retry 
interval 1000ms, keepAlive=60000ms
 2019-05-11 00:02:23,452 INFO 
[ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9] 
org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=bo
 ron:2181,carbon:2181,neon:2181 sessionTimeout=90000 
watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$24/171414890@19fbce98
 2019-05-11 00:02:23,452 INFO 
[ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
 org.apache.zookeeper.ClientCnxn: Opening socket connection to server 
boron/172.16.211.20:2181. Will not attempt to authenticate using SASL (unknown 
error)
 2019-05-11 00:02:23,453 INFO 
[ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
 org.apache.zookeeper.ClientCnxn: Socket connection established, initiating 
session, client: /172.16.211.8:52214, server: boron/172.16.211.20:2181
 2019-05-11 00:02:23,453 INFO 
[ReadOnlyZKClient-boron:2181,carbon:2181,neon:2181@0x29758ab9-SendThread(boron:2181)]
 org.apache.zookeeper.ClientCnxn: Session establishment complete on server 
boron/172.16.211.20:2181, sessionid = 0x16a061f0770a354, negotiated timeout = 
60000
 2019-05-11 00:02:23,455 INFO [main] 
org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient: Close zookeeper connection 
0x29758ab9 to boron:2181,carbon:2181,neon:2181
 2019-05-11 00:02:23,455 INFO [main] 
org.apache.hadoop.hbase.io.hfile.CacheConfig: Created cacheConfig: 
CacheConfig:disabled
 2019-05-11 00:02:23,456 INFO [main] 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2: 
Writer=hdfs://berylium:8020/kylin/kylin_metadata/kylin-772502e8-c126-fec2-5808-08aa2df1fdea/pbs_uv_daily/hfile/_temporary/1/_temporary/attempt_1555603539483_1149_r_000000_0/F1/fdbf1bd8933242f2bd1da40a03c7a635,
 wrote=43
 2019-05-11 00:02:23,457 WARN [Thread-1388781] 
org.apache.hadoop.hdfs.DataStreamer: These favored nodes were specified but not 
chosen: [fluorine:16020] Specified favored nodes: [fluorine:16020]
{code}

It lasts for tens of hours. No error, no exceptions. In the end, it created too 
many blocks in HDFS. It only happens when merge task is triggered. Daily build 
jobs went very good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to