[ https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hongbing Wang updated HDFS-15779:
---------------------------------
    Description: 
The DataNode log shows the following NullPointerException:
{code:java}
2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
//...
2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out
2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
    at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50010
{code}
The NPE is thrown at `writer.getTargetBuffer()` in the following code:
{code:java}
// StripedWriter#clearBuffers
void clearBuffers() {
  for (StripedBlockWriter writer : writers) {
    // NPE here when writer is null
    ByteBuffer targetBuffer = writer.getTargetBuffer();
    if (targetBuffer != null) {
      targetBuffer.clear();
    }
  }
}
{code}
So, why is the writer null?
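To make the failure mode concrete before tracing its cause, here is a minimal standalone sketch of the same loop shape. The `NullSlotDemo` class and its nested `Writer` are hypothetical stand-ins, not Hadoop classes. It only illustrates that the existing null check protects the buffer, not the writer that is dereferenced to obtain it.

```java
import java.nio.ByteBuffer;

public class NullSlotDemo {
  /** Hypothetical stand-in for StripedBlockWriter; only the buffer accessor is modeled. */
  static final class Writer {
    private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);

    ByteBuffer getTargetBuffer() {
      return targetBuffer;
    }
  }

  /** Same loop shape as the clearBuffers() quoted in the description. */
  static void clearBuffers(Writer[] writers) {
    for (Writer writer : writers) {
      // Dereferences writer before any null check, so a null slot throws here.
      ByteBuffer targetBuffer = writer.getTargetBuffer();
      if (targetBuffer != null) {
        targetBuffer.clear();
      }
    }
  }

  /** Returns true if clearing the given array throws a NullPointerException. */
  static boolean throwsNpe(Writer[] writers) {
    try {
      clearBuffers(writers);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Writer[] full = {new Writer(), new Writer()};
    Writer[] partial = {new Writer(), null};
    System.out.println("fully populated array throws NPE: " + throwsNpe(full));
    System.out.println("array with a null slot throws NPE: " + throwsNpe(partial));
  }
}
```

Running `main` reports `false` for the fully populated array and `true` for the array with a null slot.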
Let's trace when the writers are initialized and when reconstruct() is called:
{code:java}
// StripedBlockReconstructor#run
public void run() {
  try {
    initDecoderIfNecessary();
    getStripedReader().init();
    stripedWriter.init();  //①
    reconstruct();  //②
    stripedWriter.endTargetBlocks();
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    // ...
{code}
The writers are initialized at ① and reconstruct() is called at ②. `stripedWriter.init()` calls `initTargetStreams()`:
{code:java}
// StripedWriter#initTargetStreams
int initTargetStreams() {
  int nSuccess = 0;
  for (short i = 0; i < targets.length; i++) {
    try {
      writers[i] = createWriter(i);
      nSuccess++;
      targetsStatus[i] = true;
    } catch (Throwable e) {
      // writers[i] stays null when createWriter(i) throws
      LOG.warn(e.getMessage());
    }
  }
  return nSuccess;
}
{code}
The NPE occurs when createWriter() throws for some but not all targets, i.e. 0 < nSuccess < targets.length. Each failed slot in `writers` is left null, and clearBuffers(), reached from reconstruct() at ②, dereferences it without a null check.

> EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
> -------------------------------------------------------------------------
>
>                 Key: HDFS-15779
>                 URL: https://issues.apache.org/jira/browse/HDFS-15779
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Hongbing Wang
>            Assignee: Hongbing Wang
>            Priority: Major
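As a standalone sketch of the analysis in the description, the following models a partially failed `initTargetStreams()` and one possible null guard in the clearing loop. Everything here is hypothetical: `PartialInitDemo`, its nested `Writer`, the `failingTargets` set, and `clearBuffersGuarded()` are illustration names, and the guard is an assumption about how the loop could be made safe, not necessarily the change committed for this issue.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Set;

public class PartialInitDemo {
  /** Hypothetical stand-in for StripedBlockWriter. */
  static final class Writer {
    private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);

    ByteBuffer getTargetBuffer() {
      return targetBuffer;
    }
  }

  final Writer[] writers;
  final boolean[] targetsStatus;
  private final Set<Integer> failingTargets;

  PartialInitDemo(int numTargets, Set<Integer> failingTargets) {
    this.writers = new Writer[numTargets];
    this.targetsStatus = new boolean[numTargets];
    this.failingTargets = failingTargets;
  }

  /** Hypothetical createWriter: fails for the configured target indices. */
  private Writer createWriter(int i) throws IOException {
    if (failingTargets.contains(i)) {
      throw new IOException("Connection timed out");
    }
    return new Writer();
  }

  /** Same control flow as the quoted initTargetStreams(). */
  int initTargetStreams() {
    int nSuccess = 0;
    for (int i = 0; i < writers.length; i++) {
      try {
        writers[i] = createWriter(i);
        nSuccess++;
        targetsStatus[i] = true;
      } catch (Throwable e) {
        // Failure is swallowed, so writers[i] keeps its default value: null.
      }
    }
    return nSuccess;
  }

  /** Assumed guard: skip slots whose writer was never created. */
  void clearBuffersGuarded() {
    for (Writer writer : writers) {
      ByteBuffer targetBuffer = writer != null ? writer.getTargetBuffer() : null;
      if (targetBuffer != null) {
        targetBuffer.clear();
      }
    }
  }

  public static void main(String[] args) {
    PartialInitDemo demo = new PartialInitDemo(3, Collections.singleton(1));
    int nSuccess = demo.initTargetStreams();
    System.out.println("nSuccess = " + nSuccess + " of " + demo.writers.length);
    System.out.println("writers[1] is null: " + (demo.writers[1] == null));
    demo.clearBuffersGuarded(); // completes without an NPE
  }
}
```

With three targets and target 1 failing, `initTargetStreams()` returns 2, which satisfies 0 < nSuccess < targets.length and leaves `writers[1]` null. The guarded loop skips that slot, whereas the unguarded loop quoted in the description would throw there.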