[ https://issues.apache.org/jira/browse/HBASE-25898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351903#comment-17351903 ]
Hudson commented on HBASE-25898: -------------------------------- Results for branch branch-1 [build #131 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/131/]: (x) *{color:red}-1 overall{color}* ---- details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/131//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/131//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/131//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > RS getting aborted due to NPE in Replication WALEntryStream > ----------------------------------------------------------- > > Key: HBASE-25898 > URL: https://issues.apache.org/jira/browse/HBASE-25898 > Project: HBase > Issue Type: Bug > Components: Replication > Reporter: Anoop Sam John > Assignee: Anoop Sam John > Priority: Critical > Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.3.6, 2.4.4 > > > Below sequence of events happened in a customer cluster > An empty WAL file got roll req. > The close of file failed at HDFS side but as there file had all edits > synced, we continue. > New WAL file is created and old rolled. > This old WAL file got archived to oldWAL > {code} > 2021-05-13 13:38:46.000 Riding over failed WAL close of > hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678, > cause="Unexpected EOF while trying to read response from server", errors=1; > THIS FILE WAS NOT CLOSED BUT ALL EDITS SYNCED SO SHOULD BE OK > 2021-05-13 13:38:46.000 Rolled WAL > /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 > with entries=0, filesize=90 B; new WAL > /xx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620913126549 > 2021-05-13 13:38:46.000 Archiving > hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 > to hdfs://xxx/oldWALs/xxxt%2C16020%2C1620828102351.1620910673678 > 2021-05-13 13:38:46.000 Log > hdfs://xxx/WALs/xxx,16020,1620828102351/xxx%2C16020%2C1620828102351.1620910673678 > was moved to hdfs://xxx/oldWALs/xxx%2C16020%2C1620828102351.1620910673678 > {code} > As there was move of file, the WALEntryStream got IOE and we will recreate > the stream . > {code} > ReplicationSourceWALReader#run > while (isReaderRunning()) { > try { > entryStream = > new WALEntryStream(logQueue, conf, currentPosition, > source.getWALFileLengthProvider(), > source.getServerWALsBelongTo(), source.getSourceMetrics(), > walGroupId); > while (isReaderRunning()) { > ... > ... > } catch (IOException e) { // stream related > if (handleEofException(e, batch)) { > sleepMultiplier = 1; > } else { > LOG.warn("Failed to read stream of replication entries", e); > if (sleepMultiplier < maxRetriesMultiplier) { > sleepMultiplier++; > } > Threads.sleep(sleepForRetries * sleepMultiplier); > } > } > {code} > eofAutoRecovery is turned off anyways. So it will go to outer while loop and > create new WALEntryStream object > Then we do readWALEntries > {code} > protected WALEntryBatch readWALEntries(WALEntryStream entryStream, > WALEntryBatch batch) throws IOException, InterruptedException { > Path currentPath = entryStream.getCurrentPath(); > if (!entryStream.hasNext()) { > {code} > Here the currentPath will be still null. > WALEntryStream#hasNext -> tryAdvanceEntry -> checkReader -> openNextLog > {code} > private boolean openNextLog() throws IOException { > PriorityBlockingQueue<Path> queue = logQueue.getQueue(walGroupId); > Path nextPath = queue.peek(); > if (nextPath != null) { > openReader(nextPath); > > private void openReader(Path path) throws IOException { > try { > // Detect if this is a new file, if so get a new reader else > // reset the current reader so that we see the new data > if (reader == null || !getCurrentPath().equals(path)) { > closeReader(); > reader = WALFactory.createReader(fs, path, conf); > seek(); > setCurrentPath(path); > } else { > resetReader(); > } > } catch (FileNotFoundException fnfe) { > handleFileNotFound(path, fnfe); > } catch (RemoteException re) { > IOException ioe = re.unwrapRemoteException(FileNotFoundException.class); > if (!(ioe instanceof FileNotFoundException)) { > throw ioe; > } > handleFileNotFound(path, (FileNotFoundException)ioe); > } catch (LeaseNotRecoveredException lnre) { > // HBASE-15019 the WAL was not closed due to some hiccup. > LOG.warn("Try to recover the WAL lease " + currentPath, lnre); > recoverLease(conf, currentPath); > reader = null; > } catch (NullPointerException npe) { > // Workaround for race condition in HDFS-4380 > // which throws a NPE if we open a file before any data node has the > most recent block > // Just sleep and retry. Will require re-reading compressed WALs for > compressionContext. > LOG.warn("Got NPE opening reader, will retry."); > reader = null; > } > } > {code} > Here the call to WALFactory.createReader is not able to complete because of > issue from HDFS. (Ya its on the same WAL file for which the close had an > issue) We have a retry mechanism there in createReader () for 5 mns. But > eventually it throws LeaseNotRecoveredException. ya we try handle it. > But the problem here is in that call we pass the state variable currentPath > which is still null here! > This will throw NPE > {code} > java.lang.NullPointerException > at org.apache.hadoop.fs.FileSystem.fixRelativePart(FileSystem.java:2635) > at > org.apache.hadoop.hdfs.DistributedFileSystem.fixRelativePart(DistributedFileSystem.java:3087) > at > org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:297) > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283) > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216) > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.recoverLease(WALEntryStream.java:387) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:370) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.handleFileNotFound(WALEntryStream.java:343) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:362) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138) > {code} > This will kill the RS! > When next RS takes this Replication Q source from this failed RS, even that > also got NPE. > Here in the call recoverLease() we should have been passing the method param > 'path'.. All other calls from this method do that way but seems this was > miss! -- This message was sent by Atlassian Jira (v8.3.4#803005)