Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4666100537 Hi @Apache9 just a gentle reminder on this PR when you get a moment 🙂 Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4645132591 Hi @Apache9 , I have raised a new commit to address review concerns. Now persist exceptions are only handled by run() method. Please take a look and share your thoughts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4601529436 Failing test is not related to this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4601526915 Hi @Apache9 , I have raised a new commit to address review concerns. Now persist exceptions are only handled by run() method. Please take a look and share your thoughts -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r412317
##
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceShipperRestart.java:
##
@@ -0,0 +1,320 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hbase.replication.regionserver;
+
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.ArgumentMatchers.anyLong;
+import static org.mockito.ArgumentMatchers.eq;
+import static org.mockito.Mockito.atLeastOnce;
+import static org.mockito.Mockito.doNothing;
+import static org.mockito.Mockito.doReturn;
+import static org.mockito.Mockito.doThrow;
+import static org.mockito.Mockito.inOrder;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.never;
+import static org.mockito.Mockito.spy;
+import static org.mockito.Mockito.times;
+import static org.mockito.Mockito.verify;
+import static org.mockito.Mockito.when;
+
+import java.io.IOException;
+import java.util.Collections;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.replication.ReplicationEndpoint;
+import org.apache.hadoop.hbase.replication.ReplicationQueueId;
+import org.apache.hadoop.hbase.testclassification.ReplicationTests;
+import org.apache.hadoop.hbase.testclassification.SmallTests;
+import org.apache.hadoop.hbase.wal.WAL.Entry;
+import org.apache.hadoop.hbase.wal.WALEdit;
+import org.apache.hadoop.hbase.wal.WALKeyImpl;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.mockito.InOrder;
+
+@Category({ ReplicationTests.class, SmallTests.class })
Review Comment:
Sure, I'll migrate the new test class to JUnit 5 style and replace the
remaining JUnit 4 constructs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r407853
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -134,6 +186,15 @@ public final void run() {
}
private void noMoreData() {
+try {
+ // Flush any delayed offset before finishing queue
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.error("Failed to persist final replication offset for
walGroupId={}", walGroupId, e);
+ abortAndRestart(e);
Review Comment:
Agreed. I will remove handling from here. Only run() will handle all persist
exceptions
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r402803
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,38 +103,77 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong(OFFSET_UPDATE_INTERVAL_MS_KEY,
DEFAULT_OFFSET_UPDATE_INTERVAL_MS);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong(OFFSET_UPDATE_SIZE_THRESHOLD_KEY,
DEFAULT_OFFSET_UPDATE_SIZE_THRESHOLD);
}
@Override
public final void run() {
setWorkerState(WorkerState.RUNNING);
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
-// Loop until we close down
-while (isActive()) {
- // Sleep until replication is enabled again
- if (!source.isPeerEnabled()) {
-// The peer enabled check is in memory, not expensive, so do not need
to increase the
-// sleep interval as it may cause a long lag when we enable the peer.
-sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
-continue;
- }
- try {
-WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
-LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
- entryBatch);
-if (entryBatch == null) {
+try {
+ // Loop until we close down
+ while (isActive()) {
+// Sleep until replication is enabled again
+if (!source.isPeerEnabled()) {
+ // The peer enabled check is in memory, not expensive, so do not
need to increase the
+ // sleep interval as it may cause a long lag when we enable the peer.
+ sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
continue;
}
-// the NO_MORE_DATA instance has no path so do not call shipEdits
-if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
- noMoreData();
-} else {
- shipEdits(entryBatch);
+try {
+ // check time-based offset persistence
+ if (shouldPersistLogPosition()) {
+persistLogPosition();
+ }
+
+ long pollTimeout = getEntriesTimeout;
+ if (offsetUpdateIntervalMs != Long.MAX_VALUE) {
+long elapsed = EnvironmentEdgeManager.currentTime() -
lastOffsetUpdateTime;
+long remaining = offsetUpdateIntervalMs - elapsed;
+if (remaining > 0) {
+ pollTimeout = Math.min(getEntriesTimeout, remaining);
+}
+ }
+ WALEntryBatch entryBatch = entryReader.poll(pollTimeout);
+ LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
+entryBatch);
+
+ if (entryBatch == null) {
+continue;
+ }
+ // the NO_MORE_DATA instance has no path so do not call shipEdits
+ if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
+noMoreData();
+ } else {
+shipEdits(entryBatch);
+ }
+} catch (InterruptedException | ReplicationRuntimeException e) {
+ LOG.warn("Interrupted while waiting for next replication entry
batch", e);
+ Thread.currentThread().interrupt();
+} catch (IOException ioe) {
+ // During source shutdown / peer removal we can see interrupted IOEs
+ // from replication queue updates. Do not restart in this case.
+ if (!source.isSourceActive() || isInterrupted() ||
!source.isPeerEnabled()) {
+LOG.info("Ignoring persist failure during shutdown for
walGroupId={}", walGroupId, ioe);
+break;
+ }
+ LOG.error("Shipper {} failed to persist offset, restarting",
walGroupId, ioe);
+ abortAndRestart(ioe);
+ break;
+}
+ }
+} finally {
+ // Flush buffered offsets before worker exits
+ if (!isFinished() && !abortingForRestart) {
+try {
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.warn("Failed persisting final offset during shutdown for
walGroupId={}", walGroupId,
Review Comment:
Intention here was that the worker is already exiting, so this final persist
is best-effort cleanup. I will add comment in code
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructu
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r397281
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,38 +103,77 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong(OFFSET_UPDATE_INTERVAL_MS_KEY,
DEFAULT_OFFSET_UPDATE_INTERVAL_MS);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong(OFFSET_UPDATE_SIZE_THRESHOLD_KEY,
DEFAULT_OFFSET_UPDATE_SIZE_THRESHOLD);
}
@Override
public final void run() {
setWorkerState(WorkerState.RUNNING);
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
-// Loop until we close down
-while (isActive()) {
- // Sleep until replication is enabled again
- if (!source.isPeerEnabled()) {
-// The peer enabled check is in memory, not expensive, so do not need
to increase the
-// sleep interval as it may cause a long lag when we enable the peer.
-sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
-continue;
- }
- try {
-WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
-LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
- entryBatch);
-if (entryBatch == null) {
+try {
+ // Loop until we close down
+ while (isActive()) {
+// Sleep until replication is enabled again
+if (!source.isPeerEnabled()) {
+ // The peer enabled check is in memory, not expensive, so do not
need to increase the
+ // sleep interval as it may cause a long lag when we enable the peer.
+ sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
continue;
}
-// the NO_MORE_DATA instance has no path so do not call shipEdits
-if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
- noMoreData();
-} else {
- shipEdits(entryBatch);
+try {
+ // check time-based offset persistence
+ if (shouldPersistLogPosition()) {
+persistLogPosition();
+ }
+
+ long pollTimeout = getEntriesTimeout;
+ if (offsetUpdateIntervalMs != Long.MAX_VALUE) {
+long elapsed = EnvironmentEdgeManager.currentTime() -
lastOffsetUpdateTime;
+long remaining = offsetUpdateIntervalMs - elapsed;
+if (remaining > 0) {
+ pollTimeout = Math.min(getEntriesTimeout, remaining);
+}
+ }
+ WALEntryBatch entryBatch = entryReader.poll(pollTimeout);
+ LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
+entryBatch);
+
+ if (entryBatch == null) {
+continue;
+ }
+ // the NO_MORE_DATA instance has no path so do not call shipEdits
+ if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
+noMoreData();
+ } else {
+shipEdits(entryBatch);
+ }
+} catch (InterruptedException | ReplicationRuntimeException e) {
+ LOG.warn("Interrupted while waiting for next replication entry
batch", e);
+ Thread.currentThread().interrupt();
+} catch (IOException ioe) {
+ // During source shutdown / peer removal we can see interrupted IOEs
+ // from replication queue updates. Do not restart in this case.
+ if (!source.isSourceActive() || isInterrupted() ||
!source.isPeerEnabled()) {
+LOG.info("Ignoring persist failure during shutdown for
walGroupId={}", walGroupId, ioe);
+break;
+ }
+ LOG.error("Shipper {} failed to persist offset, restarting",
walGroupId, ioe);
+ abortAndRestart(ioe);
+ break;
+}
+ }
+} finally {
+ // Flush buffered offsets before worker exits
+ if (!isFinished() && !abortingForRestart) {
Review Comment:
Yes, this can be reached whenever the shipper thread exits its main loop
without reaching the normal finished state, for example during peer removal,
source termination, peer disable, RS shutdown, or other worker shutdown paths.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4591709407 > And how do we deal with the failure for cleanupHFileRefs? IIRC i've comment on this part but I can not find the comment... For `cleanupHFileRefs`, try-catch block inside `shipEdits`() handles exception from `cleanupHFileRefs` as it was earlier and under retry logic for same batch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4591698528 > I still have the concern that, let's throw the persist exception to the top layer and process it there, so we do not need to add exiting logic hack everywhere in code? Good point. I have now removed persist failure handing from noMoreData, now noMoreData also throws IOException so run() only handles persist failure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3270997641
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -134,6 +186,15 @@ public final void run() {
}
private void noMoreData() {
+try {
+ // Flush any delayed offset before finishing queue
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.error("Failed to persist final replication offset for
walGroupId={}", walGroupId, e);
+ abortAndRestart(e);
Review Comment:
IIRC I've mentioned somewhere in this PR, we'd better throw the exception to
the top layer and then decide what to do, I'm afraid we call abortAndRestart
here but in the upper layer we still do some other things which mess up the
state and here we just return normally...
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,38 +103,77 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong(OFFSET_UPDATE_INTERVAL_MS_KEY,
DEFAULT_OFFSET_UPDATE_INTERVAL_MS);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong(OFFSET_UPDATE_SIZE_THRESHOLD_KEY,
DEFAULT_OFFSET_UPDATE_SIZE_THRESHOLD);
}
@Override
public final void run() {
setWorkerState(WorkerState.RUNNING);
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
-// Loop until we close down
-while (isActive()) {
- // Sleep until replication is enabled again
- if (!source.isPeerEnabled()) {
-// The peer enabled check is in memory, not expensive, so do not need
to increase the
-// sleep interval as it may cause a long lag when we enable the peer.
-sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
-continue;
- }
- try {
-WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
-LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
- entryBatch);
-if (entryBatch == null) {
+try {
+ // Loop until we close down
+ while (isActive()) {
+// Sleep until replication is enabled again
+if (!source.isPeerEnabled()) {
+ // The peer enabled check is in memory, not expensive, so do not
need to increase the
+ // sleep interval as it may cause a long lag when we enable the peer.
+ sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
continue;
}
-// the NO_MORE_DATA instance has no path so do not call shipEdits
-if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
- noMoreData();
-} else {
- shipEdits(entryBatch);
+try {
+ // check time-based offset persistence
+ if (shouldPersistLogPosition()) {
+persistLogPosition();
+ }
+
+ long pollTimeout = getEntriesTimeout;
+ if (offsetUpdateIntervalMs != Long.MAX_VALUE) {
+long elapsed = EnvironmentEdgeManager.currentTime() -
lastOffsetUpdateTime;
+long remaining = offsetUpdateIntervalMs - elapsed;
+if (remaining > 0) {
+ pollTimeout = Math.min(getEntriesTimeout, remaining);
+}
+ }
+ WALEntryBatch entryBatch = entryReader.poll(pollTimeout);
+ LOG.debug("Shipper from source {} got entry batch from reader: {}",
source.getQueueId(),
+entryBatch);
+
+ if (entryBatch == null) {
+continue;
+ }
+ // the NO_MORE_DATA instance has no path so do not call shipEdits
+ if (entryBatch == WALEntryBatch.NO_MORE_DATA) {
+noMoreData();
+ } else {
+shipEdits(entryBatch);
+ }
+} catch (InterruptedException | ReplicationRuntimeException e) {
+ LOG.warn("Interrupted while waiting for next replication entry
batch", e);
+ Thread.currentThread().interrupt();
+} catch (IOException ioe) {
+ // During source shutdown / peer removal we can see interrupted IOEs
+ // from replication queue updates. Do not restart in this case.
+ if (!source.isSourceActive() || isInterrupted() ||
!source.isPeerEnabled()) {
+LOG.info("Ignoring persist failure during shutdown for
walGroupId={}", walGroupId, ioe);
+break;
+ }
+ LOG.error("Shipper {} failed to persist offset, restarting",
walGroupId, ioe);
+ abortAndRestart(ioe);
+ break;
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4490171644 Hi @Apache9 @wchevreuil just a gentle reminder on review of this PR when you get a moment 🙂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4455309131 @wchevreuil @Apache9 Can we move on with this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4437420695 Hi @Apache9 @wchevreuil @anmolnar, please help me with review of this PR when you get a chance. I’ve addressed the review comments. Failing test is unrelated to changes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4399200980 Update: with the current PR, `TestBulkLoadReplicationHFileRefs` gets hung when the entire suite is run, although individual test runs are passing. So I suspect there is still some missing scenario/cleanup gap in the shipper restart lifecycle logic. I analysed the thread dump and the hang is happening during `RemovePeerProcedure`, specifically inside: `ReplicationPeerManager.removeAllQueuesAndHFileRefs()` → `TableReplicationQueueStorage.removeAllQueues()` The procedure appears to be waiting indefinitely while scanning/removing replication queues. One important observation from the dump is that there are no active `ReplicationSourceShipper` or `ReplicationSourceWALReader` threads at the time of the hang, which suggests this is likely not an active thread deadlock, but rather inconsistent/orphaned replication queue state left behind during the custom restart flow. Current suspicion is that restarting only the shipper thread is insufficient and we likely need coordinated restart/cleanup of the entire WAL group pipeline (reader + shipper + buffered batches + queue bookkeeping) before recreating the worker. I am currently validating this by adding a coordinated `restartWalGroup(...)` flow which: * stops reader + shipper together * clears pending WAL entry batches * releases buffer accounting correctly * removes old worker atomically * recreates a fresh reader/shipper pipeline Will update once I validate whether this resolves the suite-level hang. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4382186641 Thanks @ankitsol Could you please take a look at the build failures? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3142213317
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +275,9 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+throw new ReplicationRuntimeException(
Review Comment:
Why not just throw IOException out? This is a private method, we are free to
change the method signature.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +275,9 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+throw new ReplicationRuntimeException(
Review Comment:
OK, checked locally, cleanUpHFileRefs also throws IOException...
So here the implementation is incorrect.
Before the PR there is no critical exception thrown out(in updateLogPosition
we will call abort directly), so we just catch Exception and retry.
In the new logic, we need to make try catch section smaller, for example,
only wrap cleanUpHFileRefs and retry, and for persistLogPosition, just throw
the IOException out, and the upper layer decide what to do.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617: URL: https://github.com/apache/hbase/pull/7617#discussion_r3142204697 ## hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java: ## Review Comment: What I mean is that, is it possible that we missed some scenario where we need to restart the worker? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4288308867 Hi @Apache9 @wchevreuil, just a gentle ping on this PR when you get a chance. I’ve addressed the review comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4251464552 Hi @Apache9 @wchevreuil, just a gentle ping on this PR when you get a chance. I’ve addressed the review comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617: URL: https://github.com/apache/hbase/pull/7617#discussion_r3085931097 ## hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java: ## Review Comment: Yes, we intentionally do not restart here—this path can be reached during normal lifecycle transitions (e.g., config updates), while restarts are only triggered explicitly on failure via abortAndRestart(). I have address below comments as per your suggestions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3085912904
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +277,12 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this shipper since it
might come from
+// beforePersistingReplicationOffset. So abort and restart the
Shipper, and WAL reading
+// will resume from the last successfully persisted offset
+abortAndRestart(ioe);
Review Comment:
Good point, I’ve updated the code to rethrow the exception and handle
abortAndRestart in the top-level run() loop
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617: URL: https://github.com/apache/hbase/pull/7617#discussion_r3080258595 ## hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java: ## Review Comment: I might be missing something, but to my understanding we don't restart the worker when we enter here, because that's the "finished" branch. Worker is already in finished state, because there's no more data process. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3080238529
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
Review Comment:
Okay, we can leave this safety check here, but still unclear to me that if a
single `walGroupId` is strictly handled by a single thread in the worker pool,
how is it possible that single thread calls for restart two times?
I think what you say "multiple threads trigger restart around the same time"
doesn't seem to be possible, because what I mentioned above: exactly one thread
is dealing with a `walGroupId`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3080238529
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
Review Comment:
Okay, we can leave this safety check here, but still unclear to me that a
single `walGroupId` is strictly handled by a single thread in the worker pool.
How is it possible that single thread calls for restart two times?
I think what you say "multiple threads trigger restart around the same time"
doesn't seem to be possible, because what I mentioned above: exactly one thread
is dealing with a `walGroupId`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3074515880
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
Review Comment:
One scenario is concurrent restart attempts, very very rare chance though
imho. For example, if the same worker encounters failure paths close together
(or multiple threads trigger restart around the same time), one call to
restartShipper may already replace the worker in workerThreads. When the second
call executes, the current entry is no longer oldWorker, so we skip replacing
it again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3044610064
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
Review Comment:
What I can recall is that for some reasons like updating replication config,
we may terminate a replication source and recreate it before the replication
source finishes, so maybe we do not always need to restart the worker when
entering here, but I'm still a bit nervous that we miss some scenarios where we
need to restart the worker if we only call abortAndRestart when catching an
exception...
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +277,12 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this shipper since it
might come from
+// beforePersistingReplicationOffset. So abort and restart the
Shipper, and WAL reading
+// will resume from the last successfully persisted offset
+abortAndRestart(ioe);
Review Comment:
Better rethrow the exception to the top layer, and call abortAndRestart
there? After returning from this method, we may still have other logics in the
parent method and may cause inconsistency since the new shipper may already be
started...
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
+ }
+
+ LOG.warn("Restarting shipper for walGroupId={}", walGroupId);
+
+ try {
+ReplicationSourceShipper newWorker = createNewShipper(walGroupId);
+startShipper(newWorker);
+return newWorker;
+ } catch (Exception e) {
+LOG.error("Failed to restart shipper for walGroupId={}", walGroupId,
e);
+return current; // retry later
Review Comment:
Checked che code, createNewShipper and startShipper both do not throw any
exception, so do we really need to catch exception here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4210586624 > Looking at this. I don't have additional comments, but waiting to approve until existing comments are resolved. Thanks @charlesconnell , much appreciated. Patiently waiting for @Apache9 to comment on latest changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4210591233 @taklwu @wchevreuil Would you guys like to update your previous reviews? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3043017939
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +255,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this worker since it
might come from
Review Comment:
Hi @Apache9, just a gentle reminder to please take a look at the code review
when you get a chance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3016292503
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
Review Comment:
How is this possible?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
+ }
+
+ LOG.warn("Restarting shipper for walGroupId={}", walGroupId);
+
+ try {
+ReplicationSourceShipper newWorker = createNewShipper(walGroupId);
Review Comment:
You might want to reuse `tryStartNewShipper(walGroupId)` here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
taklwu commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3012835303
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepForRetries() {
return sleepForRetries;
}
+
+ void restartShipper(String walGroupId, ReplicationSourceShipper oldWorker) {
+workerThreads.compute(walGroupId, (key, current) -> {
+ if (current != oldWorker) {
+return current; // already replaced
+ }
+
+ LOG.warn("Restarting shipper for walGroupId={}", walGroupId);
+
+ try {
+ReplicationSourceShipper newWorker = createNewShipper(walGroupId);
+startShipper(newWorker);
+return newWorker;
+ } catch (Exception e) {
+LOG.error("Failed to restart shipper for walGroupId={}", walGroupId,
e);
+return current; // retry later
Review Comment:
can you check if this comment is right? I used the cursor AI editor also
confirmed this may be a potential bug.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Copilot commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3012357137
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +297,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (accumulatedSizeSinceLastUpdate == 0 || lastShippedBatch == null) {
+ return false;
+}
+
+// Default behaviour to update offset immediately after replicate()
+if (offsetUpdateSizeThresholdBytes == -1 && offsetUpdateIntervalMs ==
Long.MAX_VALUE) {
+ return true;
+}
+
+return (accumulatedSizeSinceLastUpdate >= offsetUpdateSizeThresholdBytes)
+ || (EnvironmentEdgeManager.currentTime() - lastOffsetUpdateTime >=
offsetUpdateIntervalMs);
Review Comment:
`shouldPersistLogPosition` treats `offsetUpdateSizeThresholdBytes == -1` as
a real threshold, so `accumulatedSizeSinceLastUpdate >= -1` will be true
whenever any data has been shipped. This makes it impossible to enable
*time-based only* persistence by setting just
`hbase.replication.shipper.offset.update.interval.ms` (you will still persist
immediately after every batch). Consider treating negative/zero thresholds as
"disabled" for size-based persistence and only applying the size comparison
when the configured threshold is > 0.
```suggestion
// Treat non-positive size thresholds as "disabled" for size-based
persistence.
boolean sizeThresholdReached = offsetUpdateSizeThresholdBytes > 0
&& accumulatedSizeSinceLastUpdate >= offsetUpdateSizeThresholdBytes;
boolean timeThresholdReached =
EnvironmentEdgeManager.currentTime() - lastOffsetUpdateTime >=
offsetUpdateIntervalMs;
return sizeThresholdReached || timeThresholdReached;
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -118,11 +154,24 @@ public final void run() {
} else {
shipEdits(entryBatch);
}
Review Comment:
With deferred offset persistence enabled (size/interval thresholds),
receiving `WALEntryBatch.NO_MORE_DATA` can transition the worker to `FINISHED`
without ever calling `persistLogPosition()` for the last shipped batch. This
can drop the final offset update and any deferred HFile-ref cleanup, causing
re-replication/recovery work. Consider persisting (if
`accumulatedSizeSinceLastUpdate > 0`) before calling `noMoreData()` / before
setting state to `FINISHED`.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +252,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+accumulatedSizeSinceLastUpdate += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
Review Comment:
`entriesForCleanUpHFileRefs.addAll(entries)` retains full `WAL.Entry`
objects across batches until the next offset persist. With large
thresholds/intervals this can hold a large amount of WAL data in memory
(including edits/cells) even though cleanup only needs bulk-load-related
metadata. Consider recording only the minimal information needed for cleanup
(e.g., extracted bulk-load store file refs) or only retaining entries that
actually contain `WALEdit.BULK_LOAD` cells.
```suggestion
// Only retain WAL entries that contain bulk load cells for later
HFile ref cleanup
for (Entry entry : entries) {
WALEdit edit = entry.getEdit();
if (edit == null) {
continue;
}
for (Cell cell : edit.getCells()) {
if (CellUtil.matchFamily(cell, WALEdit.BULK_LOAD)) {
entriesForCleanUpHFileRefs.add(entry);
break;
}
}
}
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,11 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ /**
+ * Hook invoked before persisting replication offsets. Eg: Buffered
endpoints can flush/close WALs
Review Comment:
Javadoc typo: use "e.g.," instead of "Eg:".
```suggestion
* Hook invoked before persisting replication offsets, e.g., buffered
endpoints can flush/close WALs
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -866,4 +866,24 @@ public long getTotalReplicatedEdits() {
long getSleepF
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r3011386961
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +255,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this worker since it
might come from
Review Comment:
Hi @Apache9 just a gentle reminder on this new commit when you get a moment.
Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4134778608 Failiing test `org.apache.hadoop.hbase.master.procedure.TestSnapshotProcedureRSCrashes.testRegionServerCrashWhileVerifyingSnapshot` is unrelated to this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2994837149
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,6 +97,10 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong("hbase.replication.shipper.offset.update.interval.ms",
Long.MAX_VALUE);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong("hbase.replication.shipper.offset.update.size.threshold",
-1L);
Review Comment:
Extracted these constants as a static final field and added defaults too
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2994846425
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -106,9 +117,25 @@ public final void run() {
continue;
}
try {
-WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
+// check time-based offset persistence
+if (shouldPersistLogPosition()) {
+ // Trigger offset persistence via existing retry/backoff mechanism
in shipEdits()
+ WALEntryBatch emptyBatch = createEmptyBatchForTimeBasedFlush();
+ if (emptyBatch != null) shipEdits(emptyBatch);
Review Comment:
I was thinking to add tests to the final PR against our continuous backup
feature branch once I get approval on code implementation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2994831828
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +255,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this worker since it
might come from
Review Comment:
> Better introduce a abortAndRestart method in ReplicationSourceShipper to
call a method in ReplicationSource to replace the current shipper with a new
one. This can avoid introducing a new monitor thread. And also, we can deal
with InterruptedException with the same way, as usually if we want to shutdown,
we will first change a running or close flag and then interrupt, so when we
want to reschedule, we can check the flag to determine whether we should quit
now.
@Apache9 @anmolnar @taklwu I have implemented this approach, please help me
with review these code changes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2943848541
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,6 +97,10 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong("hbase.replication.shipper.offset.update.interval.ms",
Long.MAX_VALUE);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong("hbase.replication.shipper.offset.update.size.threshold",
-1L);
Review Comment:
HConstants is IA.Public so generally we do not want to add new fields to it
unless we can confirm they should be accessed by user code directly. But I
agree that we should extract these constants as a static final field.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +255,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this worker since it
might come from
Review Comment:
Better introduce a abortAndRestart method in ReplicationSourceShipper to
call a method in ReplicationSource to replace the current shipper with a new
one. This can avoid introducing a new monitor thread. And also, we can deal
with InterruptedException with the same way, as usually if we want to shutdown,
we will first change a running or close flag and then interrupt, so when we
want to reschedule, we can check the flag to determine whether we should quit
now.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java:
##
@@ -148,6 +151,12 @@ public class ReplicationSource implements
ReplicationSourceInterface {
public static final int DEFAULT_WAIT_ON_ENDPOINT_SECONDS = 30;
private int waitOnEndpointSeconds = -1;
+ public static final String SHIPPER_MONITOR_INTERVAL =
Review Comment:
Since the code in shipper is all implemented by us, we can just reschedule a
new shipper when it is dead? I mean, we do not need a monitor thread to do this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
taklwu commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2943368507
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,6 +97,10 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+this.offsetUpdateIntervalMs =
+ conf.getLong("hbase.replication.shipper.offset.update.interval.ms",
Long.MAX_VALUE);
+this.offsetUpdateSizeThresholdBytes =
+ conf.getLong("hbase.replication.shipper.offset.update.size.threshold",
-1L);
Review Comment:
nit: why isn't this feature flag added in `HConstants` ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
taklwu commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2943109855
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -106,9 +117,25 @@ public final void run() {
continue;
}
try {
-WALEntryBatch entryBatch = entryReader.poll(getEntriesTimeout);
+// check time-based offset persistence
+if (shouldPersistLogPosition()) {
+ // Trigger offset persistence via existing retry/backoff mechanism
in shipEdits()
+ WALEntryBatch emptyBatch = createEmptyBatchForTimeBasedFlush();
+ if (emptyBatch != null) shipEdits(emptyBatch);
Review Comment:
nit better to keep bracket even if this is a one-liner
```suggestion
if (emptyBatch != null) {
shipEdits(emptyBatch);
}
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
@ankitsol is this align/update in the latest patch ?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -215,6 +255,13 @@ private void shipEdits(WALEntryBatch entryBatch) {
entryBatch.getNbOperations(), (endTimeNs - startTimeNs) / 100);
}
break;
+ } catch (IOException ioe) {
+// Offset-Persist failure is treated as fatal to this worker since it
might come from
+// beforePersistingReplicationOffset.
+// ReplicationSource will restart the Shipper, and WAL reading
+// will resume from the last successfully persisted offset
+throw new ReplicationRuntimeException(
Review Comment:
k, it looks like the previous updateLogPosition issue
https://github.com/apache/hbase/pull/7617/changes#r2802692158 has been handled
here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-4031807503 @ankitsol Have you checked the CI output? I see a lot of unit test failures related to replication. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-402300 @Apache9 just a gentle ping on my previous comment to check if I'm heading in the right direction. Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3996883509 @Apache9 @anmolnar I have now added a periodic shipper monitoring task in ReplicationSource (`shipperMonitorExecutor`) which checks for dead (non-FINISHED) shipper threads and recreates them. When the new shipper starts, it reads the replication offset from the persisted offset storage and resumes WAL reading from that point, so any WAL entries whose offset was not successfully persisted will be replicated again. So now if the endpoint (for example an S3-based implementation) fails during `beforePersistingReplicationOffset()` while committing/flushing data, `persistLogPosition()` throws an `IOException`. In `shipEdits()` this `IOException` is caught and rethrown as a `ReplicationRuntimeException`. This causes the ReplicationSourceShipper worker thread to exit (the run() method interrupts and terminates the thread). The monitor then detects the dead worker and recreates the shipper, which resumes replication from the last persisted offset. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2844329304
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
I'm not sure if the current logic in ReplicationSource can handle shipper
abort, but I think this is a possible way to deal with the problem.
When restarting, we read from the external replication offset storage and
restart from the offset.
I think this could also be used to deal with the updateLogPosition
exception, so we do not need to abort the whole region server.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2840781191
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
Hi @Apache9, just a gentle ping on my previous comment to check if I'm
heading in the right direction. Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2827260596
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
run() method has this catch block to handle `ReplicationRuntimeException`
```
catch (InterruptedException | ReplicationRuntimeException e) {
// It is interrupted and needs to quit.
LOG.warn("Interrupted while waiting for next replication entry
batch", e);
Thread.currentThread().interrupt();
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2827254931
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
Currently in `shipEdits()`, the only place where an IOException can be
raised is from `persistLogPosition()` (via
`beforePersistingReplicationOffset()` or `cleanUpHFileRefs()`).
Based on your earlier comment about restarting from the last persisted
offset and resending WAL entries, I am thinking of splitting the existing catch
block into two parts:
```
catch (IOException ioe) {
// Persist failure → fatal → restart worker → WAL replay from last
persisted offset
throw new ReplicationRuntimeException(
"Failed to persist replication offset; restarting worker to replay
WAL", ioe);
} catch (Exception ex) {
// Replication/transient failure → retry current batch (existing behaviour)
source.getSourceMetrics().incrementFailedBatches();
LOG.warn("{} threw unknown exception:",
source.getReplicationEndpoint().getClass().getName(), ex);
if (sleepForRetries("ReplicationEndpoint threw exception",
sleepForRetries, sleepMultiplier, maxRetriesMultiplier)) {
sleepMultiplier++;
}
}
```
The intention is that propagating `ReplicationRuntimeException` will cause
the shipper worker to exit, and ReplicationSource will recreate it, so WAL
reading resumes from the last persisted offset and all batches since then are
resent.
Does this approach match what you had in mind? @Apache9
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2819270897
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
Can we just abort this shipper?
Is `ReplicationSource` going to create a new one if it's aborted?
In that case it will pick up at the last persisted position and retry
correctly, won't it?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2819270897
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
Can we just abort this shipper?
Is `ReplicationSource` going to create a new one if it's aborted?
In that case it will pick up at last persisted position and retry correctly,
won't it?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2817632340
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +272,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (accumulatedSizeSinceLastUpdate == 0 || lastShippedBatch == null) {
+ return false;
+}
+
+// Default behaviour to update offset immediately after replicate()
+if (offsetUpdateSizeThresholdBytes == -1 && offsetUpdateIntervalMs ==
Long.MAX_VALUE) {
Review Comment:
Yes, this is not needed, I just added here to make this default behaviour
explicit
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2817553228
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,6 +99,22 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+ReplicationPeer peer = source.getPeer();
+if (peer != null && peer.getPeerConfig() != null) {
Review Comment:
Thank you for pointing this, in ReplicationSourceManager.java this merge of
peer config happens
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2802689035
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +272,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (accumulatedSizeSinceLastUpdate == 0 || lastShippedBatch == null) {
+ return false;
+}
+
+// Default behaviour to update offset immediately after replicate()
+if (offsetUpdateSizeThresholdBytes == -1 && offsetUpdateIntervalMs ==
Long.MAX_VALUE) {
Review Comment:
This is a bit strange...
If offsetUpdateSizeThresholdBytes is -1, then the below
accumulatedSizeSinceLastUpdate >= offsetUpdateSizeThresholdBytes will always
returns true, so we do not need this check here?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -105,10 +130,17 @@ public final void run() {
sleepForRetries("Replication is disabled", sleepForRetries, 1,
maxRetriesMultiplier);
continue;
}
+ // check time-based offset persistence
+ if (shouldPersistLogPosition()) {
Review Comment:
So this wants to solve the problem that, since now we may not always persist
the offset and shipping, it is possible that we shipped out a batch, and then
did not get a new batch for a very long time, so if we do not add a check here,
there is no way for us to persist the offset?
I think we also need to change the poll timeout below, we should not wait
longer than the remaining time of update offset interval?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -90,6 +99,22 @@ public ReplicationSourceShipper(Configuration conf, String
walGroupId, Replicati
this.conf.getInt("replication.source.getEntries.timeout",
DEFAULT_TIMEOUT);
this.shipEditsTimeout =
this.conf.getInt(HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT,
HConstants.REPLICATION_SOURCE_SHIPEDITS_TIMEOUT_DFAULT);
+ReplicationPeer peer = source.getPeer();
+if (peer != null && peer.getPeerConfig() != null) {
Review Comment:
The conf here already contains all the configurations in peer config. You
can see the implementation of ReplicationPeers for more details, we will create
a CompoundConfiguration and apply the peerConfig.getConfiguration() to it.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
Why move this here?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -154,15 +196,16 @@ protected void postFinish() {
private void shipEdits(WALEntryBatch entryBatch) {
List entries = entryBatch.getWalEntries();
int sleepMultiplier = 0;
-if (entries.isEmpty()) {
- updateLogPosition(entryBatch);
- return;
-}
int currentSize = (int) entryBatch.getHeapSize();
source.getSourceMetrics()
.setTimeStampNextToReplicate(entries.get(entries.size() -
1).getKey().getWriteTime());
while (isActive()) {
try {
+if (entries.isEmpty()) {
Review Comment:
OK, I guess I know why you move this here, in the past updateLogPosition can
not fail(it will lead to an abort so we can assume it never fails), but now
since we have introduced a callback method, it could fail.
Then I do not think this is the correct way to deal with this. Consider the
S3 based solution, if you fail to commit the file on S3, then the correct way
is to send the data again? But here we just retry committing... I think we
should restart from the previous persist offset and send data again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3894748231 Will take a look this afternoon. Please give me some time. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2789440736
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+long maxBufferSize = endpoint.getMaxBufferSize();
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+if (maxBufferSize == -1) {
+ return true;
+}
+return stagedWalSize >= maxBufferSize
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ private void persistLogPosition() throws IOException {
+if (lastShippedBatch == null) {
Review Comment:
We cannot. This method is not getting called if lastShippedBatch is null in
the current implementation. We can still keep this check here for safety
reasons.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3860012449
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 37s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 34s | | master passed |
| +1 :green_heart: | compile | 0m 46s | | master passed |
| +1 :green_heart: | javadoc | 0m 24s | | master passed |
| +1 :green_heart: | shadedjars | 4m 26s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 17s | | the patch passed |
| +1 :green_heart: | compile | 0m 47s | | the patch passed |
| +1 :green_heart: | javac | 0m 47s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 21s | | the patch passed |
| +1 :green_heart: | shadedjars | 4m 21s | | patch has no errors when
building our shaded downstream artifacts. |
_ Other Tests _ |
| -1 :x: | unit | 28m 34s |
[/patch-unit-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/artifact/yetus-jdk17-hadoop3-check/output/patch-unit-hbase-server.txt)
| hbase-server in the patch failed. |
| | | 46m 44s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux fda2279b6c5b 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 640308bf96121abbabd3f9fcad9ba094699af23d |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/testReport/
|
| Max. process+thread count | 1419 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3859957472
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 11s | | Docker mode activated. |
_ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files
found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available.
|
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain
any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any
anti-patterns. |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 55s | | master passed |
| +1 :green_heart: | compile | 2m 54s | | master passed |
| +1 :green_heart: | checkstyle | 0m 51s | | master passed |
| +1 :green_heart: | spotbugs | 1m 33s | | master passed |
| +1 :green_heart: | spotless | 0m 46s | | branch has no errors when
running spotless:check. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 30s | | the patch passed |
| +1 :green_heart: | compile | 2m 56s | | the patch passed |
| +1 :green_heart: | javac | 2m 56s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks
issues. |
| -0 :warning: | checkstyle | 0m 48s |
[/results-checkstyle-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/artifact/yetus-general-check/output/results-checkstyle-hbase-server.txt)
| hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total
(was 0) |
| +1 :green_heart: | spotbugs | 1m 42s | | the patch passed |
| +1 :green_heart: | hadoopcheck | 9m 21s | | Patch does not cause any
errors with Hadoop 3.3.6 3.4.1. |
| +1 :green_heart: | spotless | 0m 39s | | patch has no errors when
running spotless:check. |
_ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 10s | | The patch does not
generate ASF License warnings. |
| | | 33m 19s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/artifact/yetus-general-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell
detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 867472c88f7c 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 640308bf96121abbabd3f9fcad9ba094699af23d |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 85 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/8/console
|
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3858344435
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 32s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 5s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 38s | | master passed |
| +1 :green_heart: | compile | 1m 1s | | master passed |
| +1 :green_heart: | javadoc | 0m 33s | | master passed |
| +1 :green_heart: | shadedjars | 5m 57s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| -1 :x: | mvninstall | 1m 44s |
[/patch-mvninstall-root.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/patch-mvninstall-root.txt)
| root in the patch failed. |
| -1 :x: | compile | 0m 59s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| -0 :warning: | javac | 0m 59s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| +1 :green_heart: | javadoc | 0m 27s | | the patch passed |
| -1 :x: | shadedjars | 4m 29s | | patch has 16 errors when building
our shaded downstream artifacts. |
_ Other Tests _ |
| -1 :x: | unit | 1m 0s |
[/patch-unit-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/patch-unit-hbase-server.txt)
| hbase-server in the patch failed. |
| | | 21m 27s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux d235f58c2160 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May
23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / cfc9d908eb46fd943e83f0f6288d1eecc3ce16ad |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| shadedjars |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-jdk17-hadoop3-check/output/patch-shadedjars.txt
|
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/testReport/
|
| Max. process+thread count | 80 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3858309287
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 12s | | Docker mode activated. |
_ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files
found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available.
|
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain
any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any
anti-patterns. |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 15s | | master passed |
| +1 :green_heart: | compile | 2m 34s | | master passed |
| +1 :green_heart: | checkstyle | 0m 46s | | master passed |
| +1 :green_heart: | spotbugs | 1m 14s | | master passed |
| +1 :green_heart: | spotless | 0m 34s | | branch has no errors when
running spotless:check. |
_ Patch Compile Tests _ |
| -1 :x: | mvninstall | 1m 15s |
[/patch-mvninstall-root.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-mvninstall-root.txt)
| root in the patch failed. |
| -1 :x: | compile | 1m 24s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| -0 :warning: | javac | 1m 24s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks
issues. |
| -0 :warning: | checkstyle | 0m 45s |
[/results-checkstyle-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/results-checkstyle-hbase-server.txt)
| hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total
(was 0) |
| -1 :x: | spotbugs | 0m 38s |
[/patch-spotbugs-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-spotbugs-hbase-server.txt)
| hbase-server in the patch failed. |
| -1 :x: | hadoopcheck | 1m 28s | | The patch causes 16 errors with
Hadoop v3.3.6. |
| -1 :x: | hadoopcheck | 2m 57s | | The patch causes 16 errors with
Hadoop v3.4.1. |
| +1 :green_heart: | spotless | 0m 33s | | patch has no errors when
running spotless:check. |
_ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 6s | | The patch does not
generate ASF License warnings. |
| | | 16m 26s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell
detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 8c4a7b145330 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / cfc9d908eb46fd943e83f0f6288d1eecc3ce16ad |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| hadoopcheck |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-javac-3.3.6.txt
|
| hadoopcheck |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/artifact/yetus-general-check/output/patch-javac-3.4.1.txt
|
| Max. process+thread count | 85 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/7/console
|
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3853691977
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 11s | | Docker mode activated. |
_ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files
found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available.
|
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain
any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any
anti-patterns. |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 18s | | master passed |
| +1 :green_heart: | compile | 2m 38s | | master passed |
| +1 :green_heart: | checkstyle | 0m 46s | | master passed |
| +1 :green_heart: | spotbugs | 1m 14s | | master passed |
| +1 :green_heart: | spotless | 0m 35s | | branch has no errors when
running spotless:check. |
_ Patch Compile Tests _ |
| -1 :x: | mvninstall | 1m 15s |
[/patch-mvninstall-root.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-mvninstall-root.txt)
| root in the patch failed. |
| -1 :x: | compile | 1m 23s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| -0 :warning: | javac | 1m 23s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks
issues. |
| -0 :warning: | checkstyle | 0m 45s |
[/results-checkstyle-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/results-checkstyle-hbase-server.txt)
| hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total
(was 0) |
| -1 :x: | spotbugs | 0m 39s |
[/patch-spotbugs-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-spotbugs-hbase-server.txt)
| hbase-server in the patch failed. |
| -1 :x: | hadoopcheck | 1m 29s | | The patch causes 16 errors with
Hadoop v3.3.6. |
| -1 :x: | hadoopcheck | 2m 57s | | The patch causes 16 errors with
Hadoop v3.4.1. |
| +1 :green_heart: | spotless | 0m 33s | | patch has no errors when
running spotless:check. |
_ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 6s | | The patch does not
generate ASF License warnings. |
| | | 16m 27s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell
detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 5ca63016945b 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f7e81880f2fb14a4277b9ce69ca597e82e3b7854 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| hadoopcheck |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-javac-3.3.6.txt
|
| hadoopcheck |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-general-check/output/patch-javac-3.4.1.txt
|
| Max. process+thread count | 84 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/console
|
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3853686338
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 12s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 4s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 37s | | master passed |
| +1 :green_heart: | compile | 0m 47s | | master passed |
| +1 :green_heart: | javadoc | 0m 23s | | master passed |
| +1 :green_heart: | shadedjars | 4m 28s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| -1 :x: | mvninstall | 1m 16s |
[/patch-mvninstall-root.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/patch-mvninstall-root.txt)
| root in the patch failed. |
| -1 :x: | compile | 0m 46s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| -0 :warning: | javac | 0m 46s |
[/patch-compile-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/patch-compile-hbase-server.txt)
| hbase-server in the patch failed. |
| +1 :green_heart: | javadoc | 0m 21s | | the patch passed |
| -1 :x: | shadedjars | 3m 18s | | patch has 16 errors when building
our shaded downstream artifacts. |
_ Other Tests _ |
| -1 :x: | unit | 0m 46s |
[/patch-unit-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/patch-unit-hbase-server.txt)
| hbase-server in the patch failed. |
| | | 15m 39s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 542437302e36 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f7e81880f2fb14a4277b9ce69ca597e82e3b7854 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| shadedjars |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/artifact/yetus-jdk17-hadoop3-check/output/patch-shadedjars.txt
|
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/testReport/
|
| Max. process+thread count | 75 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/6/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3853556817 I have moved both size and time based configs to Replication Endpoint peerConfig and defaults in ReplicationSourceShipper I am now re-using ReplicationSourceShipper#shipEdits() retry mechanism for handling exceptions Please help me with the code review @Apache9 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2739954159
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
Sure, so does it make sense to handle this via Peer Configuration?
I can add the timeout/size logic to ReplicationSourceShipper, but have the
thresholds default to -1 (preserving existing behavior). For my custom
endpoint, I’ll pass specific values via the Peer Config.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736942086
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -98,6 +103,14 @@ public final void run() {
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
// Loop until we close down
while (isActive()) {
+ // Whether to persist replication offsets based on size/time thresholds
+ if (shouldPersistLogPosition()) {
+try {
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.warn("Exception while persisting replication state", e);
Review Comment:
In persistLogPosition, we will call the callback method in
ReplicationEndpoint, and for S3 based replication endpoint, you will close the
S3 file right? This operation could throw exceptions?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736930436
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+long maxBufferSize = endpoint.getMaxBufferSize();
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+if (maxBufferSize == -1) {
+ return true;
+}
+return stagedWalSize >= maxBufferSize
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ private void persistLogPosition() throws IOException {
+if (lastShippedBatch == null) {
Review Comment:
Ah, OK, you do not rese the lastShippedBatch when reading a new batch. But
it still makes me a bit nervous that how can we get here when lastShippedBatch
is null...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736924354
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
We only need this configuration in shipper, as persisting the log position
is the duty of shipper, not endpoint. And there is no correctness problem if
you implement the endpoint correctly since when persisting the log position,
you will flush the files in endpoint. The problem is that the performance may
be bad. So when using different replication endpoint implementation, you shoud
tune the configuration in shipper about the persistency interval.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736370914
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -98,6 +103,14 @@ public final void run() {
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
// Loop until we close down
while (isActive()) {
+ // Whether to persist replication offsets based on size/time thresholds
+ if (shouldPersistLogPosition()) {
+try {
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.warn("Exception while persisting replication state", e);
Review Comment:
persistLogPosition() throws Exception only from cleanUpHFileRefs() where
getBulkLoadDescriptor() can throw exception, hence I thought retry is not
necessary. I am thinking to keep this as it is, but move cleanUpHFileRefs()
towards the end of persistLogPosition() definition (after we update offset).
Please let me know your thoughts
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736351890
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
Ok, so we need to have these conf (time and size ones) at both
(ReplicationSourceShipper and ReplicationEndpoint) ends, so we can achieve the
tuning you mention in your previous comment:
https://github.com/apache/hbase/pull/7617#discussion_r2719322657
> For me, I do not think we need to expose this information to shipper?
>
> The design here is that, when using different ReplicationEndpoint, you
need to tune the shipper configuration by your own, as the parameters are not
only affected by ReplicationEndpoint, they also depend on the shipper side.
>
> For example, when you want to reduce the pressure on recording the offset,
you should increase the record interval, i.e, increase batch size, increase the
number of ship times between recording offset, etc. And if you want to reduce
the pressure on memory and the target receiver, you should decrease the batch
size, and for S3 based replication endpoint, there is also a trade off, if you
increase the flush interval, you can get better performance and less files on
S3, but failover will be more complicated as you need to start from long before.
>
> So, this should be in the documentation, just exposing some configuration
from ReplicationEndpoint can not handle all the above situations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736357408
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
@Apache9 Please let me know if my understanding is correct here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736351890
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
Ok, so we need to have these conf (time and size ones) at both
(ReplicationSourceShipper and ReplicationEndpoint) ends, so we can achieve the
tuning you mention in your previous comment:
https://github.com/apache/hbase/pull/7617#discussion_r2719322657
`The design here is that, when using different ReplicationEndpoint, you need
to tune the shipper configuration by your own, as the parameters are not only
affected by ReplicationEndpoint, they also depend on the shipper side.
For example, when you want to reduce the pressure on recording the offset,
you should increase the record interval, i.e, increase batch size, increase the
number of ship times between recording offset, etc. And if you want to reduce
the pressure on memory and the target receiver, you should decrease the batch
size, and for S3 based replication endpoint, there is also a trade off, if you
increase the flush interval, you can get better performance and less files on
S3, but failover will be more complicated as you need to start from long
before.`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736314130
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+long maxBufferSize = endpoint.getMaxBufferSize();
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+if (maxBufferSize == -1) {
+ return true;
+}
+return stagedWalSize >= maxBufferSize
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ private void persistLogPosition() throws IOException {
+if (lastShippedBatch == null) {
Review Comment:
As far as I understand, lastShippedBatch 'null' means no batch has been
replicated yet, so we don't need to update offset. Please correct me if I am
wrong here
lastShippedBatch is by default 'null' during ReplicationSourceShipper
initialisation and as soon as a batch is replicated it is updated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736093265
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,35 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ /**
+ * @return true if this endpoint buffers WAL entries and requires explicit
flush control before
+ * persisting replication offsets.
+ */
+ default boolean isBufferedReplicationEndpoint() {
+return false;
+ }
+
+ /**
+ * Maximum WAL size (bytes) to buffer before forcing a flush. Only
meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long getMaxBufferSize() {
+return -1L;
+ }
+
+ /**
+ * Maximum time (ms) to wait before forcing a flush. Only meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long maxFlushInterval() {
+return Long.MAX_VALUE;
+ }
+
+ /**
+ * Hook invoked before persisting replication offsets. Buffered endpoints
should flush/close WALs
+ * here.
+ */
+ default void beforePersistingReplicationOffset() {
Review Comment:
Ok, so maybe call it `beforeUpdatingLogPosition`? And let's be more detailed
in the javadoc comments, like saying this is called by the shipper when it's
about to move the offset forward in the wal reader and potentially make wal
files that were fully read eligible for deletion.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736079118
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+long maxBufferSize = endpoint.getMaxBufferSize();
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+if (maxBufferSize == -1) {
+ return true;
+}
+return stagedWalSize >= maxBufferSize
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ private void persistLogPosition() throws IOException {
+if (lastShippedBatch == null) {
+ return;
+}
+
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+endpoint.beforePersistingReplicationOffset();
+
+// Clean up hfile references
+for (Entry entry : entriesForCleanUpHFileRefs) {
+ cleanUpHFileRefs(entry.getEdit());
+ LOG.trace("shipped entry {}: ", entry);
Review Comment:
Please provide a more accurate message. This is not what really happened
here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2736040903
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
Review Comment:
Here we should use configuration values instead of getting from
ReplicationEndpoint. We can have default configuration values to keep the old
behavior for normal replication.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +247,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+long maxBufferSize = endpoint.getMaxBufferSize();
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+if (maxBufferSize == -1) {
+ return true;
+}
+return stagedWalSize >= maxBufferSize
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ private void persistLogPosition() throws IOException {
+if (lastShippedBatch == null) {
Review Comment:
Since we could cumulate different batches in the above loop, a null batch
does not mean we haven't shipped anything out? Why here we just return if
lastShippedBatch is null?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -98,6 +103,14 @@ public final void run() {
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
// Loop until we close down
while (isActive()) {
+ // Whether to persist replication offsets based on size/time thresholds
+ if (shouldPersistLogPosition()) {
+try {
+ persistLogPosition();
+} catch (IOException e) {
+ LOG.warn("Exception while persisting replication state", e);
Review Comment:
This is not enough for handling the exception? Typically we should restart
from the last persistent offset and replicate again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2735996225
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
> Are you referring to your original comment about passing the entire
entryBatch object?
No, I meant [this check
](https://github.com/apache/hbase/pull/7617/changes/8c828290cc7ab36b9db75a72916d6bbfd1dc4e47#diff-1907878c880a164eafc9eb5fb412a376d1d504bf54bfe91192e36693e60dfaf8L251)that
was being done inside "shouldPersistLogPosition", which would cause the buffer
size to be only considered for specific endpoint types.
@ankitsol has already addressed it on a recent commit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3807670294
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 30s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 33s | | master passed |
| +1 :green_heart: | compile | 1m 7s | | master passed |
| +1 :green_heart: | javadoc | 0m 31s | | master passed |
| +1 :green_heart: | shadedjars | 5m 58s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 6s | | the patch passed |
| +1 :green_heart: | compile | 0m 59s | | the patch passed |
| +1 :green_heart: | javac | 0m 59s | | the patch passed |
| -0 :warning: | javadoc | 0m 27s |
[/results-javadoc-javadoc-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/artifact/yetus-jdk17-hadoop3-check/output/results-javadoc-javadoc-hbase-server.txt)
| hbase-server generated 1 new + 65 unchanged - 0 fixed = 66 total (was 65) |
| +1 :green_heart: | shadedjars | 5m 52s | | patch has no errors when
building our shaded downstream artifacts. |
_ Other Tests _ |
| +1 :green_heart: | unit | 230m 20s | | hbase-server in the patch
passed. |
| | | 257m 36s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux a908a4083825 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May
23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / e74eee1698d3bff65b1d1b05e4b1982fe450c72b |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/testReport/
|
| Max. process+thread count | 4270 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3806617854
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 11s | | Docker mode activated. |
_ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files
found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available.
|
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain
any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any
anti-patterns. |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 15s | | master passed |
| +1 :green_heart: | compile | 2m 35s | | master passed |
| +1 :green_heart: | checkstyle | 0m 45s | | master passed |
| +1 :green_heart: | spotbugs | 1m 13s | | master passed |
| +1 :green_heart: | spotless | 0m 34s | | branch has no errors when
running spotless:check. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 17s | | the patch passed |
| +1 :green_heart: | compile | 2m 31s | | the patch passed |
| +1 :green_heart: | javac | 2m 31s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks
issues. |
| +1 :green_heart: | checkstyle | 0m 45s | | the patch passed |
| +1 :green_heart: | spotbugs | 1m 19s | | the patch passed |
| +1 :green_heart: | hadoopcheck | 8m 30s | | Patch does not cause any
errors with Hadoop 3.3.6 3.4.1. |
| +1 :green_heart: | spotless | 0m 34s | | patch has no errors when
running spotless:check. |
_ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 8s | | The patch does not
generate ASF License warnings. |
| | | 29m 3s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/artifact/yetus-general-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell
detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 3133a65ee296 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / e74eee1698d3bff65b1d1b05e4b1982fe450c72b |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 85 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/5/console
|
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2733050407
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ @Nullable
+ // Returns IOException instead of throwing so callers can decide
+ // whether to retry (shipEdits) or best-effort log (run()).
+ private IOException persistLogPosition() {
+if (lastShippedBatch == null) {
+ return null;
+}
+
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+
+if (endpoint.isBufferedReplicationEndpoint() && stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
Review Comment:
I think Duo's
[comment](https://github.com/apache/hbase/pull/7617#discussion_r2721491163)
applies here as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2733047546
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
> That would be fine for me, but then we would need to adjust the
shouldPersistLogPosition method accordingly.
Are you referring to your original comment about passing the entire
`entryBatch` object?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3805694243
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 11s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 23s | | master passed |
| +1 :green_heart: | compile | 0m 48s | | master passed |
| +1 :green_heart: | javadoc | 0m 21s | | master passed |
| +1 :green_heart: | shadedjars | 4m 25s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 17s | | the patch passed |
| +1 :green_heart: | compile | 0m 46s | | the patch passed |
| +1 :green_heart: | javac | 0m 46s | | the patch passed |
| -0 :warning: | javadoc | 0m 21s |
[/results-javadoc-javadoc-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/artifact/yetus-jdk17-hadoop3-check/output/results-javadoc-javadoc-hbase-server.txt)
| hbase-server generated 1 new + 65 unchanged - 0 fixed = 66 total (was 65) |
| +1 :green_heart: | shadedjars | 4m 25s | | patch has no errors when
building our shaded downstream artifacts. |
_ Other Tests _ |
| -1 :x: | unit | 216m 8s |
[/patch-unit-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/artifact/yetus-jdk17-hadoop3-check/output/patch-unit-hbase-server.txt)
| hbase-server in the patch failed. |
| | | 236m 1s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.52 ServerAPI=1.52 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 127c6853cf95 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 8c828290cc7ab36b9db75a72916d6bbfd1dc4e47 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/testReport/
|
| Max. process+thread count | 6323 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3804659205
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 11s | | Docker mode activated. |
_ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files
found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available.
|
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain
any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any
anti-patterns. |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 8s | | master passed |
| +1 :green_heart: | compile | 2m 52s | | master passed |
| +1 :green_heart: | checkstyle | 0m 53s | | master passed |
| +1 :green_heart: | spotbugs | 1m 26s | | master passed |
| +1 :green_heart: | spotless | 0m 43s | | branch has no errors when
running spotless:check. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 49s | | the patch passed |
| +1 :green_heart: | compile | 2m 47s | | the patch passed |
| +1 :green_heart: | javac | 2m 47s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks
issues. |
| +1 :green_heart: | checkstyle | 0m 51s | | the patch passed |
| +1 :green_heart: | spotbugs | 1m 30s | | the patch passed |
| +1 :green_heart: | hadoopcheck | 9m 26s | | Patch does not cause any
errors with Hadoop 3.3.6 3.4.1. |
| -1 :x: | spotless | 0m 32s |
[/patch-spotless.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/artifact/yetus-general-check/output/patch-spotless.txt)
| patch has 24 errors when running spotless:check, run spotless:apply to fix.
|
_ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 10s | | The patch does not
generate ASF License warnings. |
| | | 33m 9s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.53 ServerAPI=1.53 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/artifact/yetus-general-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell
detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 11fb3d023937 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 8c828290cc7ab36b9db75a72916d6bbfd1dc4e47 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 84 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/4/console
|
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2727222631
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
Oh, so your idea is to allow shipper to decide persist log position based on
time and/or stg usage by wals regarldess of the endpoint implementation? That
would be fine for me, but then we would need to adjust the
shouldPersistLogPosition method accordingly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2724276907
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
I think time based and size based persistency is enough for most cases? If
in the future we have some special endpoint which needs new type of decision
way, we can add new mechanism, no problem.
The problem here why we do not want to only trigger persistency from
endpoint is that, we have other considerations about when to persist the log
position, like the trade off between failover and pressure on replication
storage. So here I suggest that we introduce general mechanisms to control the
behavior of persistency of log position, users can tune it based on different
approach.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2721937683
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
IMHO, it doesn't look much cohesive. Shipper seems to be taking decisions
based on specific endpoint implementations. What if new endpoint impls with
different logic for updating log position are thought in the future, we would
need to revisit the shipper again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2721491163
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
We want to determine whether we need to persist the log position in shipper,
based on some configurations, not triggered by replication endpoint. Users can
choose different configuration values based on different replication endpoint
implementations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2721486401
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
Review Comment:
Please see my comment above to get more context
> For me, I do not think we need to expose this information to shipper?
>
>The design here is that, when using different ReplicationEndpoint, you need
to tune the shipper configuration by your own, as the parameters are not only
affected by ReplicationEndpoint, they also depend on the shipper side.
>
>For example, when you want to reduce the pressure on recording the offset,
you should increase the record interval, i.e, increase batch size, increase the
number of ship times between recording offset, etc. And if you want to reduce
the pressure on memory and the target receiver, you should decrease the batch
size, and for S3 based replication endpoint, there is also a trade off, if you
increase the flush interval, you can get better performance and less files on
S3, but failover will be more complicated as you need to start from long before.
>
>So, this should be in the documentation, just exposing some configuration
from ReplicationEndpoint can not handle all the above situations.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2721481232
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,35 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ /**
+ * @return true if this endpoint buffers WAL entries and requires explicit
flush control before
+ * persisting replication offsets.
+ */
+ default boolean isBufferedReplicationEndpoint() {
+return false;
+ }
+
+ /**
+ * Maximum WAL size (bytes) to buffer before forcing a flush. Only
meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long getMaxBufferSize() {
+return -1L;
+ }
+
+ /**
+ * Maximum time (ms) to wait before forcing a flush. Only meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long maxFlushInterval() {
+return Long.MAX_VALUE;
+ }
+
+ /**
+ * Hook invoked before persisting replication offsets. Buffered endpoints
should flush/close WALs
+ * here.
+ */
+ default void beforePersistingReplicationOffset() {
Review Comment:
Sorry, this is exactly what we want to avoid here. This method just tells
the endpoint what the shipper going to do, and the endpoint can do anything it
wants. For our normal replication framework, there is no flush and close
operations, as we will always send everything out and return until we get acks,
so basically we do not need to implement this method. And for S3 based
replication endpoint, we need to close the file to persist it on S3. Maybe in
the future we have other types of replication endpoint which may do other
works, so we do not want to name it `flushAndCloseWAL`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
wchevreuil commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2720780067
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ @Nullable
+ // Returns IOException instead of throwing so callers can decide
+ // whether to retry (shipEdits) or best-effort log (run()).
+ private IOException persistLogPosition() {
+if (lastShippedBatch == null) {
+ return null;
+}
+
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+
+if (endpoint.isBufferedReplicationEndpoint() && stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
Review Comment:
Also should be in the endpoint.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
Review Comment:
This should be in the Endpoint, as the decision varies according to the type
of endpoint. Today we have basically two types, buffered and non-buffered. If
we have new endpoint types in the future, we might again need to come here and
add the related logic.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,35 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ /**
+ * @return true if this endpoint buffers WAL entries and requires explicit
flush control before
+ * persisting replication offsets.
+ */
+ default boolean isBufferedReplicationEndpoint() {
+return false;
+ }
+
+ /**
+ * Maximum WAL size (bytes) to buffer before forcing a flush. Only
meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long getMaxBufferSize() {
+return -1L;
+ }
+
+ /**
+ * Maximum time (ms) to wait before forcing a flush. Only meaningful when
+ * isBufferedReplicationEndpoint() == true.
+ */
+ default long maxFlushInterval() {
+return Long.MAX_VALUE;
+ }
+
+ /**
+ * Hook invoked before persisting replication offsets. Buffered endpoints
should flush/close WALs
+ * here.
+ */
+ default void beforePersistingReplicationOffset() {
Review Comment:
From the endpoints view, this method is just a flush/close of the given wal,
so lets name it accordingly.
```suggestion
default void flushAndCloseWAL() {
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -190,13 +204,16 @@ private void shipEdits(WALEntryBatch entryBatch) {
} else {
sleepMultiplier = Math.max(sleepMultiplier - 1, 0);
}
-// Clean up hfile references
-for (Entry entry : entries) {
- cleanUpHFileRefs(entry.getEdit());
- LOG.trace("shipped entry {}: ", entry);
+
+stagedWalSize += currentSize;
+entriesForCleanUpHFileRefs.addAll(entries);
+lastShippedBatch = entryBatch;
+if (shouldPersistLogPosition()) {
Review Comment:
Rather than having these `stagedWalSize` and `lastShippedBatch` as global
variables, we should just pass the `entryBatch` along to
`shouldPersistLogPosition()` (which should be defined/implemented in the
endpoints, btw, see my other comment related) and `persistLogPosition()`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2719322657
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,35 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ /**
+ * @return true if this endpoint buffers WAL entries and requires explicit
flush control before
+ * persisting replication offsets.
+ */
+ default boolean isBufferedReplicationEndpoint() {
Review Comment:
For me, I do not think we need to expose this information to shipper?
The design here is that, when using different ReplicationEndpoint, you need
to tune the shipper configuration by your own, as the parameters are not only
affected by ReplicationEndpoint, they also depend on the shipper side.
For example, when you want to reduce the pressure on recording the offset,
you should increase the record interval, i.e, increase batch size, increase the
number of ship times between recording offset, etc. And if you want to reduce
the pressure on memory and the target receiver, you should decrease the batch
size, and for S3 based replication endpoint, there is also a trade off, if you
increase the flush interval, you can get better performance and less files on
S3, but failover will be more complicated as you need to start from long before.
So, this should be in the documentation, just exposing some configuration
from ReplicationEndpoint can not handle all the above situations.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -98,6 +104,13 @@ public final void run() {
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
// Loop until we close down
while (isActive()) {
+ // check if flush needed for WAL backup, this is need for timeout based
flush
Review Comment:
This is not designed for WAL backup only, I need to say. Here, in shipper,
we just follow the configured limit, i.e, time based or size based, to
determine whether we need to record the log position, and there is a callback
to ReplicationEndpoint before recording, the replication endpoint can use this
callback to do something, the shipper does not know the detail.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ @Nullable
+ // Returns IOException instead of throwing so callers can decide
+ // whether to retry (shipEdits) or best-effort log (run()).
+ private IOException persistLogPosition() {
Review Comment:
+1
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
taklwu commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2718075417
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ @Nullable
+ // Returns IOException instead of throwing so callers can decide
+ // whether to retry (shipEdits) or best-effort log (run()).
+ private IOException persistLogPosition() {
+if (lastShippedBatch == null) {
+ return null;
+}
+
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+
+if (endpoint.isBufferedReplicationEndpoint() && stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
+
+stagedWalSize = 0;
+lastStagedFlushTs = EnvironmentEdgeManager.currentTime();
+
+// Clean up hfile references
+try {
+ for (Entry entry : entriesForCleanUpHFileRefs) {
+cleanUpHFileRefs(entry.getEdit());
+LOG.trace("shipped entry {}: ", entry);
+ }
+} catch (IOException e) {
+ LOG.warn("{} threw exception while cleaning up hfile refs",
endpoint.getClass().getName(), e);
+ return e;
Review Comment:
```suggestion
throw e;
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -98,6 +104,13 @@ public final void run() {
LOG.info("Running ReplicationSourceShipper Thread for wal group: {}",
this.walGroupId);
// Loop until we close down
while (isActive()) {
+ // check if flush needed for WAL backup, this is need for timeout based
flush
+ if (shouldPersistLogPosition()) {
+IOException error = persistLogPosition();
+if (error != null) {
+ LOG.warn("Exception while persisting replication state", error);
+}
+ }
Review Comment:
nit: I don't think you need to change the return type as `IOException`,
instead you can just use `try-catch` if it's really needed to handle any
Exception differently.
```suggestion
if (shouldPersistLogPosition()) {
persistLogPosition();
}
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +246,53 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+ReplicationEndpoint endpoint = source.getReplicationEndpoint();
+if (!endpoint.isBufferedReplicationEndpoint()) {
+ // Non-buffering endpoints persist immediately
+ return true;
+}
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return stagedWalSize >= endpoint.getMaxBufferSize()
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs >=
endpoint.maxFlushInterval());
+ }
+
+ @Nullable
+ // Returns IOException instead of throwing so callers can decide
+ // whether to retry (shipEdits) or best-effort log (run()).
+ private IOException persistLogPosition() {
Review Comment:
```suggestion
private void persistLogPosition() throws IOException {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3786077391
:confetti_ball: **+1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 29s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 27s | | master passed |
| +1 :green_heart: | compile | 1m 6s | | master passed |
| +1 :green_heart: | javadoc | 0m 30s | | master passed |
| +1 :green_heart: | shadedjars | 5m 55s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 8s | | the patch passed |
| +1 :green_heart: | compile | 1m 5s | | the patch passed |
| +1 :green_heart: | javac | 1m 5s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 27s | | the patch passed |
| +1 :green_heart: | shadedjars | 5m 51s | | patch has no errors when
building our shaded downstream artifacts. |
_ Other Tests _ |
| +1 :green_heart: | unit | 241m 59s | | hbase-server in the patch
passed. |
| | | 269m 23s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/3/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux d4d0c9ba21ef 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May
23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / daded968719a4ba7ae0af58cfc0b2b1083f35ad7 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/3/testReport/
|
| Max. process+thread count | 3736 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/3/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3781194007 @wchevreuil @Apache9 Thank you guys for the suggestion. I'm happy to merge it either branch once it's accepted. Please also provide feedback on the patch itself. Does it match the approach that you suggested on the feature branch @Apache9 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache9 commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3777600615 > Why are we targeting this to master? Shouldn't it be on the feature branch? > > Also, as already mentioned by @anmolnar and @taklwu , we should refrain from adding logic that is specific to the continuous backup replication in the generic replication interfaces/classes. This is my suggestion that we should target to master branch to add this feature first, and then reimplement the continuous backup feature on top of this change. Of course, we do not need to merge this to master first, this is just for better reviewing, we can land this to a feature branch and then rebase the continus backup branch. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2709926770
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +239,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return (stagedWalSize >=
source.getReplicationEndpoint().getMaxBufferSize())
Review Comment:
I can agree with this comment too. You introduced two new properties:
`getMaxBufferSize()` and `maxFlushInterval()` with default values which should
mimic the original behavior. I think you documented it in `ReplicationEndpoint`
interface properly and it would be useful to explicitly check for it too.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,29 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_BACKUP_MAX_WAL_SIZE for
+ // ContinuousBackupReplicationEndpoint
+ // and -1 for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
Review Comment:
@ankitsol It seems that this comment is valid.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +239,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return (stagedWalSize >=
source.getReplicationEndpoint().getMaxBufferSize())
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs
+ >= source.getReplicationEndpoint().maxFlushInterval());
+ }
+
+ private void persistLogPosition() {
+if (lastShippedBatch == null) {
+ return;
+}
+if (stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
+stagedWalSize = 0;
+lastStagedFlushTs = EnvironmentEdgeManager.currentTime();
+
+// Clean up hfile references
+for (Entry entry : entriesForCleanUpHFileRefs) {
+ try {
+cleanUpHFileRefs(entry.getEdit());
+ } catch (IOException e) {
+LOG.warn("{} threw unknown exception:",
+ source.getReplicationEndpoint().getClass().getName(), e);
+ }
Review Comment:
Agree. Either the exception should be rethrown here or don't need to catch
at all.
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,29 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_BACKUP_MAX_WAL_SIZE for
+ // ContinuousBackupReplicationEndpoint
+ // and -1 for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
+ default long getMaxBufferSize() {
+return -1;
+ }
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_STAGED_WAL_FLUSH_INTERVAL for
+ // ContinuousBackupReplicationEndpoint
+ // and Long.MAX_VALUE for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
Review Comment:
Same here.
##
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceShipperBufferedFlush.java:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
anmolnar commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2709831534
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,29 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_BACKUP_MAX_WAL_SIZE for
+ // ContinuousBackupReplicationEndpoint
+ // and -1 for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
+ default long getMaxBufferSize() {
+return -1;
+ }
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
Review Comment:
@ankitsol I agree. Please remove these references from this patch.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3774098500 Hi @Apache9 @anmolnar @vinayakphegde just a gentle reminder on this PR when you get a moment 🙂 Thanks! This PR is followup to https://github.com/apache/hbase/pull/7591 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
ankitsol commented on PR #7617: URL: https://github.com/apache/hbase/pull/7617#issuecomment-3772997503 > can you also check if these two failures are related? > > TestHRegionWithInMemoryFlush.testParallelIncrementWithMemStoreFlush TestTags.testFlushAndCompactionwithCombinations These are passing locally -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
taklwu commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2684251451
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,29 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_BACKUP_MAX_WAL_SIZE for
+ // ContinuousBackupReplicationEndpoint
+ // and -1 for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
+ default long getMaxBufferSize() {
+return -1;
+ }
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
Review Comment:
nit: `ContinuousBackupReplicationEndpoint` is part of
https://github.com/apache/hbase/pull/7591/ and it's yet committing to master,
should we mention early in this change?
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +239,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return (stagedWalSize >=
source.getReplicationEndpoint().getMaxBufferSize())
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs
+ >= source.getReplicationEndpoint().maxFlushInterval());
+ }
+
+ private void persistLogPosition() {
+if (lastShippedBatch == null) {
+ return;
+}
+if (stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
+stagedWalSize = 0;
+lastStagedFlushTs = EnvironmentEdgeManager.currentTime();
+
+// Clean up hfile references
+for (Entry entry : entriesForCleanUpHFileRefs) {
+ try {
+cleanUpHFileRefs(entry.getEdit());
+ } catch (IOException e) {
+LOG.warn("{} threw unknown exception:",
+ source.getReplicationEndpoint().getClass().getName(), e);
+ }
Review Comment:
nit: will this be a behavior change because previously when
`cleanUpHFileRefs` failed, it's throwing thru the function but here we're
logging it only.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Copilot commented on code in PR #7617:
URL: https://github.com/apache/hbase/pull/7617#discussion_r2684440781
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +239,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return (stagedWalSize >=
source.getReplicationEndpoint().getMaxBufferSize())
Review Comment:
The condition checks if stagedWalSize is greater than or equal to
getMaxBufferSize(), but getMaxBufferSize() can return -1 for non-buffering
endpoints (as documented in ReplicationEndpoint). This means the comparison
'stagedWalSize >= -1' would always be true when stagedWalSize > 0, causing
immediate flushes for non-buffering endpoints. While this preserves existing
behavior, the intent is unclear and could be confusing. Consider explicitly
checking for -1 to make the logic more explicit.
```suggestion
long maxBufferSize = source.getReplicationEndpoint().getMaxBufferSize();
// For non-buffering endpoints, getMaxBufferSize() returns a negative
value (e.g., -1).
// In that case, we always trigger a flush based on size as soon as
there is staged data.
boolean sizeBasedFlush =
(maxBufferSize < 0) || (stagedWalSize >= maxBufferSize);
return sizeBasedFlush
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipper.java:
##
@@ -229,6 +239,41 @@ private void shipEdits(WALEntryBatch entryBatch) {
}
}
+ private boolean shouldPersistLogPosition() {
+if (stagedWalSize == 0 || lastShippedBatch == null) {
+ return false;
+}
+return (stagedWalSize >=
source.getReplicationEndpoint().getMaxBufferSize())
+ || (EnvironmentEdgeManager.currentTime() - lastStagedFlushTs
+ >= source.getReplicationEndpoint().maxFlushInterval());
+ }
+
+ private void persistLogPosition() {
+if (lastShippedBatch == null) {
+ return;
+}
+if (stagedWalSize > 0) {
+ source.getReplicationEndpoint().beforePersistingReplicationOffset();
+}
+stagedWalSize = 0;
+lastStagedFlushTs = EnvironmentEdgeManager.currentTime();
+
+// Clean up hfile references
+for (Entry entry : entriesForCleanUpHFileRefs) {
+ try {
+cleanUpHFileRefs(entry.getEdit());
+ } catch (IOException e) {
Review Comment:
The error handling for IOException has been changed to catch and log the
exception instead of propagating it. This silently suppresses IOException
failures during cleanup, which could hide serious issues like file system
problems. If cleanup failures should be non-fatal, this should be explicitly
documented, or consider at least incrementing a failure metric to track these
errors.
```suggestion
} catch (IOException e) {
// Cleanup failures are intentionally treated as non-fatal:
replication has already
// succeeded for these entries, so we log the failure and continue.
```
##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/ReplicationEndpoint.java:
##
@@ -283,4 +283,29 @@ public int getTimeout() {
* @throws IllegalStateException if this service's state isn't FAILED.
*/
Throwable failureCause();
+
+ // WAL entries are buffered in ContinuousBackupReplicationEndpoint before
flushing to WAL backup
+ // file. So we return config value CONF_BACKUP_MAX_WAL_SIZE for
+ // ContinuousBackupReplicationEndpoint
+ // and -1 for other ReplicationEndpoint since they don't buffer.
+ // For other ReplicationEndpoint, everytime a WALEntryBatch is shipped, we
update replication
+ // offset. Please check ReplicationSourceShipper#shouldFlushStagedWal()
Review Comment:
The comment references 'shouldFlushStagedWal()' but the actual method name
in ReplicationSourceShipper is 'shouldPersistLogPosition()'. This inconsistency
will confuse developers trying to understand the interaction between these
components.
##
hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestReplicationSourceShipperBufferedFlush.java:
##
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANT
Re: [PR] HBASE-29823 Control WAL flush and offset persistence from ReplicationSourceShipper [hbase]
Apache-HBase commented on PR #7617:
URL: https://github.com/apache/hbase/pull/7617#issuecomment-3740810146
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|::|--:|:|::|:---:|
| +0 :ok: | reexec | 0m 14s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s):
--brief-report-file --spotbugs-strict-precheck --author-ignore-list
--blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
_ Prechecks _ |
_ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 19s | | master passed |
| +1 :green_heart: | compile | 0m 59s | | master passed |
| +1 :green_heart: | javadoc | 0m 30s | | master passed |
| +1 :green_heart: | shadedjars | 6m 20s | | branch has no errors when
building our shaded downstream artifacts. |
_ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 2m 55s | | the patch passed |
| +1 :green_heart: | compile | 1m 0s | | the patch passed |
| +1 :green_heart: | javac | 1m 0s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 27s | | the patch passed |
| +1 :green_heart: | shadedjars | 6m 20s | | patch has no errors when
building our shaded downstream artifacts. |
_ Other Tests _ |
| -1 :x: | unit | 251m 26s |
[/patch-unit-hbase-server.txt](https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/2/artifact/yetus-jdk17-hadoop3-check/output/patch-unit-hbase-server.txt)
| hbase-server in the patch failed. |
| | | 277m 59s | | |
| Subsystem | Report/Notes |
|--:|:-|
| Docker | ClientAPI=1.52 ServerAPI=1.52 base:
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/2/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
|
| GITHUB PR | https://github.com/apache/hbase/pull/7617 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 82b2e4b8e507 6.14.0-1018-aws #18~24.04.1-Ubuntu SMP Mon Nov
24 19:46:27 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / 95000e024dbdf7543366f7e883db615c10d43d49 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/2/testReport/
|
| Max. process+thread count | 4718 (vs. ulimit of 3) |
| modules | C: hbase-server U: hbase-server |
| Console output |
https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7617/2/console
|
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |
This message was automatically generated.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
