Attila Sasvari created MAPREDUCE-7003:
-----------------------------------------

             Summary: Indefinite retries of getJobSummary() if a job summary file is corrupt
                 Key: MAPREDUCE-7003
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7003
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobhistoryserver
            Reporter: Attila Sasvari

If a job summary file in the {{/user/history/done_intermediate}} directory in HDFS, e.g. {{/user/history/done_intermediate/oozie/job_1111111111111_111111.summary}}, becomes corrupt before it is moved to {{/user/history/done}}, the JobHistoryServer (JHS) retries {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary()}} indefinitely.

JHS will log recurring exceptions like:
{code}
2017-11-03 01:01:01,124 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/ABC.DEF.GHI:JKLMN, remote=/ABC.DEF.GHI:JKLMN, for file /user/history/done_intermediate/admin/job_1111111111111_1111.summary, for pool XX-999999999-ABC.DEF.GHI-1111111111111 block 1111111111_22222
    at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:881)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:759)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:376)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:652)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:879)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:932)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:732)
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:337)
    at java.io.DataInputStream.readUTF(DataInputStream.java:589)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary(HistoryFileManager.java:1059)
{code}
(INFO and ERROR logs are omitted)

To reproduce it:
- start JHS in debug mode (use the JVM parameter {{-agentlib:jdwp=transport=dt_socket,server=y,address=45555,suspend=n}} when starting it)
- attach a debugger to the process and set a breakpoint in {{org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getJobSummary()}}
- start a MapReduce job and wait until the breakpoint is hit
- delete or rename the physical block of the job summary file on the datanode(s) (e.g. use {{hdfs fsck /user/history/done_intermediate/oozie/job_1111111111111_111111.summary -blocks -locations -files}} to get the block name, then search for the block on the datanode(s) and remove/rename it)
- detach the debugger
- examine the JHS log files
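The stack trace above shows the read failing inside {{java.io.DataInputStream.readUTF()}}. As a rough illustration only, the following self-contained Java sketch shows one way a read of that kind could be limited to a fixed number of attempts instead of being retried forever. It is not the JHS code and not a proposed patch: the names {{SummaryReadSketch}}, {{StreamOpener}}, {{readSummary}} and {{maxAttempts}} are hypothetical, the summary text is a placeholder, and a truncated byte array stands in for the unreadable HDFS block.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class SummaryReadSketch {

    /** Hypothetical stand-in for opening the summary file; not a Hadoop API. */
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    /**
     * Reads one string with readUTF(), reopening the stream on each attempt
     * and giving up after maxAttempts failed reads (assumed policy).
     */
    static String readSummary(StreamOpener opener, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (DataInputStream in = new DataInputStream(opener.open())) {
                return in.readUTF();
            } catch (IOException e) {
                last = e; // remember the failure and try again, up to the limit
            }
        }
        throw new IOException("Giving up after " + maxAttempts + " attempts", last);
    }

    /** Encodes a string the way DataOutputStream.writeUTF() does. */
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = encode("placeholder summary text");
        // Simulate corruption by cutting the encoded data in half.
        byte[] corrupt = Arrays.copyOf(good, good.length / 2);

        System.out.println(readSummary(() -> new ByteArrayInputStream(good), 3));
        try {
            readSummary(() -> new ByteArrayInputStream(corrupt), 3);
        } catch (IOException e) {
            System.out.println("Read failed: " + e.getMessage());
        }
    }
}
```

With the truncated input, every attempt fails with an {{EOFException}} from {{readUTF()}}, and the caller gets a single {{IOException}} after the third attempt. Whether JHS should bound the retries, skip the file, or move it aside is a design decision this issue leaves open.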