[jira] [Resolved] (HBASE-28756) RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
[ https://issues.apache.org/jira/browse/HBASE-28756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-28756.
-----------------------------
Fix Version/s: 3.0.0-beta-2
               2.6.1
               2.5.11
Resolution: Fixed

> RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
> ------------------------------------------------------------------------------
>
> Key: HBASE-28756
> URL: https://issues.apache.org/jira/browse/HBASE-28756
> Project: HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 2.6.0, 3.0.0-beta-1, 2.5.10
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2, 2.6.1, 2.5.11
>
> RegionSizeCalculator only considers the size of StoreFile and ignores the size of MemStore. A new region that has only been written to MemStore and has not been flushed is therefore considered to have size 0.
> When we use TableInputFormat to read HBase table data in Spark:
> {code:java}
> spark.sparkContext.newAPIHadoopRDD(
>   conf,
>   classOf[TableInputFormat],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])
> {code}
> Spark ignores empty InputSplits by default, which is controlled by the configuration "spark.hadoopRDD.ignoreEmptySplits":
> {code:java}
> private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
>   ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
>     .internal()
>     .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
>     .version("2.3.0")
>     .booleanConf
>     .createWithDefault(true)
> {code}
> Together, these two behaviors cause Spark to miss data. So we should consider both the size of the StoreFile and the MemStore in the RegionSizeCalculator.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HBASE-28756) RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
[ https://issues.apache.org/jira/browse/HBASE-28756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868881#comment-17868881 ]

Sun Xin commented on HBASE-28756:
---------------------------------
Thanks [~stoty] for the reminder. Based on what you mentioned in SPARK-37660:
{quote}I have encountered this. There are several issues:
- Hbase returns the HBase Region size, instead of the split size, which may not be the same.
- HBase rounds the size to Megabytes.
- Even if it didn't round to Megabytes, I suspect that it only tallies HFiles, so for new tables the size may still be zero until the first HFile is written.{quote}
This issue doesn't solve the problem completely. When we fetch data from HBase in Spark, the remaining options are to use scan directly instead of newAPIHadoopRDD, or to set spark.hadoopRDD.ignoreEmptySplits to false.
[jira] [Commented] (HBASE-28756) RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
[ https://issues.apache.org/jira/browse/HBASE-28756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17868879#comment-17868879 ]

Sun Xin commented on HBASE-28756:
---------------------------------
Pushed to all active branches. Thanks [~zhangduo] [~ashwinpankaj] for reviewing.
[jira] [Created] (HBASE-28756) RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
Sun Xin created HBASE-28756:
----------------------------
Summary: RegionSizeCalculator ignored the size of memstore, which leads Spark miss data
Key: HBASE-28756
URL: https://issues.apache.org/jira/browse/HBASE-28756
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 2.5.10, 3.0.0-beta-1, 2.6.0
Reporter: Sun Xin
Assignee: Sun Xin

RegionSizeCalculator only considers the size of StoreFile and ignores the size of MemStore. A new region that has only been written to MemStore and has not been flushed is therefore considered to have size 0.
When we use TableInputFormat to read HBase table data in Spark:
{code:java}
spark.sparkContext.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
{code}
Spark ignores empty InputSplits by default, which is controlled by the configuration "spark.hadoopRDD.ignoreEmptySplits":
{code:java}
private[spark] val HADOOP_RDD_IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")
    .internal()
    .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
    .version("2.3.0")
    .booleanConf
    .createWithDefault(true)
{code}
Together, these two behaviors cause Spark to miss data. So we should consider both the size of the StoreFile and the MemStore in the RegionSizeCalculator.
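The interaction described in HBASE-28756 can be illustrated with a small standalone sketch. It does not use the HBase or Spark APIs: the class RegionSizeSketch, its method names, and the choice of megabyte inputs are all hypothetical and exist only to show why a MemStore-only region produces a zero-length split that an "ignore empty splits" filter then drops.

```java
/**
 * Standalone sketch (not HBase or Spark code): models why a region that lives
 * only in the MemStore yields a zero-length split, and how counting the
 * MemStore size changes the outcome.
 */
final class RegionSizeSketch {
    private static final long MEGABYTE = 1024L * 1024L;

    /** Behaviour described in the issue: only the StoreFile size is counted. */
    static long storeFileOnlyBytes(long storeFileMb, long memStoreMb) {
        return storeFileMb * MEGABYTE;
    }

    /** Behaviour the issue asks for: StoreFile size plus MemStore size. */
    static long regionSizeBytes(long storeFileMb, long memStoreMb) {
        return (storeFileMb + memStoreMb) * MEGABYTE;
    }

    /** Mimics an "ignore empty splits" filter: a split of length 0 is dropped. */
    static boolean keepSplit(long splitLengthBytes, boolean ignoreEmptySplits) {
        return !ignoreEmptySplits || splitLengthBytes > 0;
    }

    public static void main(String[] args) {
        long storeFileMb = 0; // nothing has been flushed yet
        long memStoreMb = 3;  // data exists only in memory

        long before = storeFileOnlyBytes(storeFileMb, memStoreMb);
        long after = regionSizeBytes(storeFileMb, memStoreMb);

        System.out.println("split kept, StoreFile only:       " + keepSplit(before, true));
        System.out.println("split kept, StoreFile + MemStore: " + keepSplit(after, true));
    }
}
```

With the filter switched off (the second argument of keepSplit set to false), the split is kept in both cases, which corresponds to setting spark.hadoopRDD.ignoreEmptySplits to false.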
[jira] [Resolved] (HBASE-28749) Remove the duplicate configurations named hbase.wal.batch.size
[ https://issues.apache.org/jira/browse/HBASE-28749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-28749.
-----------------------------
Resolution: Fixed

> Remove the duplicate configurations named hbase.wal.batch.size
> --------------------------------------------------------------
>
> Key: HBASE-28749
> URL: https://issues.apache.org/jira/browse/HBASE-28749
> Project: HBase
> Issue Type: Improvement
> Components: wal
> Affects Versions: 3.0.0-beta-1
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2
>
> The following code appears in two places: AsyncFSWAL and AbstractFSWAL
> {code:java}
> public static final String WAL_BATCH_SIZE = "hbase.wal.batch.size";
> public static final long DEFAULT_WAL_BATCH_SIZE = 64L * 1024;
> {code}
[jira] [Commented] (HBASE-28749) Remove the duplicate configurations named hbase.wal.batch.size
[ https://issues.apache.org/jira/browse/HBASE-28749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867921#comment-17867921 ]

Sun Xin commented on HBASE-28749:
---------------------------------
Thanks [~zhangduo] and [~pankajkumar] for reviewing.
[jira] [Commented] (HBASE-28749) Remove the duplicate configurations named hbase.wal.batch.size
[ https://issues.apache.org/jira/browse/HBASE-28749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867920#comment-17867920 ]

Sun Xin commented on HBASE-28749:
---------------------------------
Pushed to master and branch-3.
[jira] [Created] (HBASE-28749) Remove the duplicate configurations named hbase.wal.batch.size
Sun Xin created HBASE-28749:
----------------------------
Summary: Remove the duplicate configurations named hbase.wal.batch.size
Key: HBASE-28749
URL: https://issues.apache.org/jira/browse/HBASE-28749
Project: HBase
Issue Type: Improvement
Components: wal
Affects Versions: 3.0.0-beta-1
Reporter: Sun Xin
Assignee: Sun Xin
Fix For: 3.0.0-beta-2

The following code appears in two places: AsyncFSWAL and AbstractFSWAL
{code:java}
public static final String WAL_BATCH_SIZE = "hbase.wal.batch.size";
public static final long DEFAULT_WAL_BATCH_SIZE = 64L * 1024;
{code}
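One way to remove this kind of duplication is to declare the constants once in the shared base class and let the subclass inherit them. The sketch below only illustrates that idea: BaseWalSketch and AsyncWalSketch are hypothetical stand-ins, a plain Map replaces the HBase Configuration, and nothing here is taken from the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a shared WAL base class; not the real AbstractFSWAL. */
abstract class BaseWalSketch {
    // Declared exactly once; subclasses see these through inheritance.
    public static final String WAL_BATCH_SIZE = "hbase.wal.batch.size";
    public static final long DEFAULT_WAL_BATCH_SIZE = 64L * 1024;

    /** Looks up the batch size in a plain map that stands in for a Configuration. */
    static long batchSize(Map<String, String> conf) {
        String value = conf.get(WAL_BATCH_SIZE);
        return value == null ? DEFAULT_WAL_BATCH_SIZE : Long.parseLong(value);
    }
}

/** Hypothetical subclass; it reuses the inherited constants instead of redeclaring them. */
final class AsyncWalSketch extends BaseWalSketch {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println("default:    " + batchSize(conf));

        conf.put(WAL_BATCH_SIZE, "131072");
        System.out.println("overridden: " + batchSize(conf));
    }
}
```

Because the key and the default live in one place, a later change to either cannot leave the two classes out of sync.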
[jira] [Resolved] (HBASE-28330) TestUnknownServers.testListUnknownServers is flaky in branch-2
[ https://issues.apache.org/jira/browse/HBASE-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-28330.
-----------------------------
Fix Version/s: 2.6.0
               2.5.8
Resolution: Fixed

Pushed to branch-2, branch-2.5, branch-2.6.
Thanks for the review [~zhangduo]

> TestUnknownServers.testListUnknownServers is flaky in branch-2
> --------------------------------------------------------------
>
> Key: HBASE-28330
> URL: https://issues.apache.org/jira/browse/HBASE-28330
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 2.5.7
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
> Fix For: 2.6.0, 2.5.8
>
> {code:java}
> [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.913 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestUnknownServers
> [ERROR] org.apache.hadoop.hbase.master.TestUnknownServers.testListUnknownServers Time elapsed: 0.204 s <<< FAILURE!
> java.lang.AssertionError: expected:<1> but was:<2>
> {code}
> The value of TestUnknownServers.SLAVES is different between [branch-2|https://github.com/apache/hbase/blob/68bc533f7116cedc681704b82319e5793b827621/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestUnknownServers.java#L44] and [master|https://github.com/apache/hbase/blob/b87b05c847f00c292664d894c21f83c73d48460d/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestUnknownServers.java#L43]. It is 1 in master but 2 in branch-2.
> The RegionServer marked UNKNOWN_SERVER is the one that *holds regions* but is not tracked by the ServerManager.
> Please see HMaster.getUnknownServers
> {code:java}
> private List<ServerName> getUnknownServers() {
>   if (serverManager != null) {
>     final Set<ServerName> serverNames = getAssignmentManager().getRegionStates().getRegionStates()
>       .stream().map(RegionState::getServerName).collect(Collectors.toSet());
>     final List<ServerName> unknownServerNames = serverNames.stream()
>       .filter(sn -> sn != null && serverManager.isServerUnknown(sn)).collect(Collectors.toList());
>     return unknownServerNames;
>   }
>   return null;
> }
> {code}
> In the unit test TestUnknownServers.testListUnknownServers, we start an HBase cluster with 2 RegionServers. If all regions are assigned to ONE server, then only that server is reported as UNKNOWN_SERVER, and the UT fails.
[jira] [Updated] (HBASE-28330) TestUnknownServers.testListUnknownServers is flaky in branch-2
[ https://issues.apache.org/jira/browse/HBASE-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin updated HBASE-28330:
----------------------------
Description:
{code:java}
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.913 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestUnknownServers
[ERROR] org.apache.hadoop.hbase.master.TestUnknownServers.testListUnknownServers Time elapsed: 0.204 s <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<2>
{code}
The value of TestUnknownServers.SLAVES is different between [branch-2|https://github.com/apache/hbase/blob/68bc533f7116cedc681704b82319e5793b827621/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestUnknownServers.java#L44] and [master|https://github.com/apache/hbase/blob/b87b05c847f00c292664d894c21f83c73d48460d/hbase-server/src/test/java/org/apache/hadoop/hbase/master/TestUnknownServers.java#L43]. It is 1 in master but 2 in branch-2.
The RegionServer marked UNKNOWN_SERVER is the one that *holds regions* but is not tracked by the ServerManager.
Please see HMaster.getUnknownServers
{code:java}
private List<ServerName> getUnknownServers() {
  if (serverManager != null) {
    final Set<ServerName> serverNames = getAssignmentManager().getRegionStates().getRegionStates()
      .stream().map(RegionState::getServerName).collect(Collectors.toSet());
    final List<ServerName> unknownServerNames = serverNames.stream()
      .filter(sn -> sn != null && serverManager.isServerUnknown(sn)).collect(Collectors.toList());
    return unknownServerNames;
  }
  return null;
}
{code}
In the unit test TestUnknownServers.testListUnknownServers, we start an HBase cluster with 2 RegionServers. If all regions are assigned to ONE server, then only that server is reported as UNKNOWN_SERVER, and the UT fails.

was:
{code:java}
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.913 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestUnknownServers
[ERROR] org.apache.hadoop.hbase.master.TestUnknownServers.testListUnknownServers Time elapsed: 0.204 s <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<2>
{code}
The value of TestUnknownServers.SLAVES is different between branch-2 and master. It is 1 in master but 2 in branch-2.
The RegionServer marked UNKNOWN_SERVER is the one that *holds regions* but is not tracked by the ServerManager.
Please see HMaster.getUnknownServers
{code:java}
private List<ServerName> getUnknownServers() {
  if (serverManager != null) {
    final Set<ServerName> serverNames = getAssignmentManager().getRegionStates().getRegionStates()
      .stream().map(RegionState::getServerName).collect(Collectors.toSet());
    final List<ServerName> unknownServerNames = serverNames.stream()
      .filter(sn -> sn != null && serverManager.isServerUnknown(sn)).collect(Collectors.toList());
    return unknownServerNames;
  }
  return null;
}
{code}
In the unit test TestUnknownServers.testListUnknownServers, we start an HBase cluster with 2 RegionServers. If all regions are assigned to ONE server, then only that server is reported as UNKNOWN_SERVER, and the UT fails.
[jira] [Created] (HBASE-28330) TestUnknownServers.testListUnknownServers is flaky in branch-2
Sun Xin created HBASE-28330:
----------------------------
Summary: TestUnknownServers.testListUnknownServers is flaky in branch-2
Key: HBASE-28330
URL: https://issues.apache.org/jira/browse/HBASE-28330
Project: HBase
Issue Type: Bug
Components: test
Affects Versions: 2.5.7
Reporter: Sun Xin
Assignee: Sun Xin

{code:java}
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.913 s <<< FAILURE! - in org.apache.hadoop.hbase.master.TestUnknownServers
[ERROR] org.apache.hadoop.hbase.master.TestUnknownServers.testListUnknownServers Time elapsed: 0.204 s <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<2>
{code}
The value of TestUnknownServers.SLAVES is different between branch-2 and master. It is 1 in master but 2 in branch-2.
The RegionServer marked UNKNOWN_SERVER is the one that *holds regions* but is not tracked by the ServerManager.
Please see HMaster.getUnknownServers
{code:java}
private List<ServerName> getUnknownServers() {
  if (serverManager != null) {
    final Set<ServerName> serverNames = getAssignmentManager().getRegionStates().getRegionStates()
      .stream().map(RegionState::getServerName).collect(Collectors.toSet());
    final List<ServerName> unknownServerNames = serverNames.stream()
      .filter(sn -> sn != null && serverManager.isServerUnknown(sn)).collect(Collectors.toList());
    return unknownServerNames;
  }
  return null;
}
{code}
In the unit test TestUnknownServers.testListUnknownServers, we start an HBase cluster with 2 RegionServers. If all regions are assigned to ONE server, then only that server is reported as UNKNOWN_SERVER, and the UT fails.
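The flakiness in HBASE-28330 follows from the fact that the unknown-server list is derived from region assignments, so a server that holds no regions never appears in it. The sketch below reproduces only that filtering logic with plain strings. UnknownServersSketch, the region and server names, and the "every server is unknown" predicate are hypothetical; the real method works on RegionState and ServerName objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Standalone sketch of the "only servers that hold regions are counted" logic. */
final class UnknownServersSketch {

    /**
     * Collects the distinct servers that hold at least one region and that the
     * supplied predicate classifies as unknown.
     */
    static List<String> unknownServers(Map<String, String> regionToServer,
                                       Predicate<String> isServerUnknown) {
        return regionToServer.values().stream()
            .filter(Objects::nonNull)   // unassigned regions have no server
            .distinct()
            .filter(isServerUnknown)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Both servers are treated as unknown in this sketch.
        Predicate<String> everyServerUnknown = server -> true;

        Map<String, String> spread = new HashMap<>();
        spread.put("region-a", "rs1");
        spread.put("region-b", "rs2");
        System.out.println("regions on both servers: " + unknownServers(spread, everyServerUnknown));

        Map<String, String> skewed = new HashMap<>();
        skewed.put("region-a", "rs1");
        skewed.put("region-b", "rs1");
        System.out.println("regions on one server:   " + unknownServers(skewed, everyServerUnknown));
    }
}
```

In the skewed case only one server is reported even though two exist, which is the situation in which a test expecting one entry per started RegionServer would fail.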
[jira] [Resolved] (HBASE-28324) TestRegionNormalizerWorkQueue#testTake is flaky
[ https://issues.apache.org/jira/browse/HBASE-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-28324.
-----------------------------
Fix Version/s: 2.6.0
               2.4.18
               2.5.8
               3.0.0-beta-2
Resolution: Fixed

Pushed to all active branches.
Thanks for the review [~zhangduo]

> TestRegionNormalizerWorkQueue#testTake is flaky
> -----------------------------------------------
>
> Key: HBASE-28324
> URL: https://issues.apache.org/jira/browse/HBASE-28324
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 3.0.0-beta-1, 2.5.7
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
[jira] [Created] (HBASE-28324) TestRegionNormalizerWorkQueue#testTake is flaky
Sun Xin created HBASE-28324:
----------------------------
Summary: TestRegionNormalizerWorkQueue#testTake is flaky
Key: HBASE-28324
URL: https://issues.apache.org/jira/browse/HBASE-28324
Project: HBase
Issue Type: Bug
Components: test
Affects Versions: 2.5.7, 3.0.0-beta-1
Reporter: Sun Xin
Assignee: Sun Xin
[jira] [Resolved] (HBASE-27469) IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
[ https://issues.apache.org/jira/browse/HBASE-27469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-27469.
-----------------------------
Fix Version/s: 2.5.2
               2.4.16
               (was: 2.6.0)
Resolution: Fixed

> IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
> ---------------------------------------------------------------------------------------------
>
> Key: HBASE-27469
> URL: https://issues.apache.org/jira/browse/HBASE-27469
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Affects Versions: 3.0.0-alpha-3, 2.5.1
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
> Fix For: 3.0.0-alpha-4, 2.5.2, 2.4.16
>
> If the scan snapshot feature is enabled and permissions on a table and a namespace are granted to the same user, an IllegalArgumentException is thrown when dropping tables.
[jira] [Commented] (HBASE-27469) IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
[ https://issues.apache.org/jira/browse/HBASE-27469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634166#comment-17634166 ]

Sun Xin commented on HBASE-27469:
---------------------------------
Thanks [~zhangduo] for reviewing.
Pushed to branch-2, branch-2.4, branch-2.5.
[jira] [Created] (HBASE-27476) Recovered replication may be blocked if enabled hbase.separate.oldlogdir.by.regionserver
Sun Xin created HBASE-27476:
----------------------------
Summary: Recovered replication may be blocked if enabled hbase.separate.oldlogdir.by.regionserver
Key: HBASE-27476
URL: https://issues.apache.org/jira/browse/HBASE-27476
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.4.15, 3.0.0-alpha-3
Reporter: Sun Xin
Assignee: Sun Xin

In another PR, I got a failed UT:
{code:java}
[ERROR] Failures:
[ERROR] org.apache.hadoop.hbase.replication.TestReplicationKillMasterRSWithSeparateOldWALs.killOneMasterRS
[ERROR] Run 1: TestReplicationKillMasterRSWithSeparateOldWALs>TestReplicationKillMasterRS.killOneMasterRS:47->TestReplicationKillRS.loadTableAndKillRS:84 Waited too much time for queueFailover replication. Waited 61065ms.
[ERROR] Run 2: TestReplicationKillMasterRSWithSeparateOldWALs>TestReplicationKillMasterRS.killOneMasterRS:47->TestReplicationKillRS.loadTableAndKillRS:84 Waited too much time for queueFailover replication. Waited 58864ms.
[ERROR] Run 3: TestReplicationKillMasterRSWithSeparateOldWALs>TestReplicationKillMasterRS.killOneMasterRS:47->TestReplicationKillRS.loadTableAndKillRS:84 Waited too much time for queueFailover replication. Waited 57103ms.
{code}
This appears to be caused by a bug. If _hbase.separate.oldlogdir.by.regionserver_ is enabled, old WALs are moved into a separate directory per RegionServer name, like root/oldWALs/server1/wal1. Recovered replication cannot convert a WAL path (like root/oldWALs/wal1) into such a path, and throws FileNotFoundException.
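The path mismatch in HBASE-27476 can be pictured with a small standalone sketch that probes both layouts: the flat oldWALs directory and the per-RegionServer subdirectory. This is only an illustration under stated assumptions. OldWalPathSketch, candidates and locate are hypothetical names, a Set of strings stands in for the file system, and the sketch is not the fix that was applied in HBase.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Standalone sketch: look for an archived WAL in both possible oldWALs layouts. */
final class OldWalPathSketch {

    /** Candidate locations: flat layout first, then the per-server layout. */
    static List<String> candidates(String oldLogDir, String serverName, String walName) {
        return List.of(
            oldLogDir + "/" + walName,
            oldLogDir + "/" + serverName + "/" + walName);
    }

    /** Returns the first candidate that exists in the (simulated) file system. */
    static Optional<String> locate(Set<String> existingPaths, String oldLogDir,
                                   String serverName, String walName) {
        return candidates(oldLogDir, serverName, walName).stream()
            .filter(existingPaths::contains)
            .findFirst();
    }

    public static void main(String[] args) {
        // Simulated file system with the per-server layout enabled.
        Set<String> fileSystem = Set.of("root/oldWALs/server1/wal1");

        // Probing only the flat path would miss the file; probing both finds it.
        System.out.println(locate(fileSystem, "root/oldWALs", "server1", "wal1"));
        System.out.println(locate(fileSystem, "root/oldWALs", "server1", "wal2"));
    }
}
```

A lookup that only ever builds the flat path corresponds to the first candidate alone, which is the case that ends in FileNotFoundException when the per-server layout is in use.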
[jira] [Work started] (HBASE-27469) IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
[ https://issues.apache.org/jira/browse/HBASE-27469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HBASE-27469 started by Sun Xin.
---------------------------------------
[jira] [Created] (HBASE-27469) IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
Sun Xin created HBASE-27469:
----------------------------
Summary: IllegalArgumentException is thrown by SnapshotScannerHDFSAclController when dropping a table
Key: HBASE-27469
URL: https://issues.apache.org/jira/browse/HBASE-27469
Project: HBase
Issue Type: Bug
Components: snapshots
Affects Versions: 2.5.1, 3.0.0-alpha-3
Reporter: Sun Xin
Assignee: Sun Xin
Fix For: 2.6.0, 3.0.0-alpha-4

If the scan snapshot feature is enabled and permissions on a table and a namespace are granted to the same user, an IllegalArgumentException is thrown when dropping tables.
[jira] [Updated] (HBASE-27354) EOF thrown by WALEntryStream causes replication blocking
[ https://issues.apache.org/jira/browse/HBASE-27354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin updated HBASE-27354:
----------------------------
Description:
In [WALEntryStream#readNextEntryAndRecordReaderPosition|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L257], it is possible that we read uncommitted data. If we read beyond the committed file length, then reopen inputStream and seek back.
In our use, we found that the position where seek back may be exactly the length of the file being written, which may cause EOF.
The thrown EOF is finally caught in [ReplicationSourceWALReader.run|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L158], but [totalBufferUsed|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L78] is not cleaned up.
After a long run, all peers will go slow and eventually block completely.

was:
In [WALEntryStream#readNextEntryAndRecordReaderPosition|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L257], it is possible that we read uncommitted data. If we read beyond the committed file length, then reopen the inputStream and seek back.
In our use, we found that the position where seek back may be exactly the length of the file being written, which may cause EOF.
The thrown EOF is finally caught in [ReplicationSourceWALReader.run|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L158], but [totalBufferUsed|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L78] is not cleaned up.
After a long run, all peers will go slow and eventually block completely.
[jira] [Created] (HBASE-27354) EOF thrown by WALEntryStream causes replication blocking
Sun Xin created HBASE-27354:
----------------------------
Summary: EOF thrown by WALEntryStream causes replication blocking
Key: HBASE-27354
URL: https://issues.apache.org/jira/browse/HBASE-27354
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.4.14, 3.0.0-alpha-3, 2.5.0, 2.6.0
Reporter: Sun Xin
Assignee: Sun Xin

In [WALEntryStream#readNextEntryAndRecordReaderPosition|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L257], it is possible that we read uncommitted data. If we read beyond the committed file length, we reopen the inputStream and seek back.
In our usage, we found that the position we seek back to may be exactly the length of the file being written, which may cause an EOF.
The thrown EOF is finally caught in [ReplicationSourceWALReader.run|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L158], but [totalBufferUsed|https://github.com/apache/hbase/blob/308cd729d23329e6d8d4b9c17a645180374b5962/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L78] is not cleaned up.
After running for a long time, all peers slow down and eventually block completely.
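The slow build-up described in HBASE-27354 can be modelled as a shared counter that is incremented when a batch is read but never decremented when the read ends in an EOF. The sketch below is a simplified model under that assumption. BufferQuotaSketch, readOnce and the releaseOnFailure switch are hypothetical, and the real accounting in ReplicationSourceWALReader is more involved.

```java
import java.io.EOFException;
import java.util.concurrent.atomic.AtomicLong;

/** Standalone model of a shared buffer quota that leaks when a read fails. */
final class BufferQuotaSketch {
    /** Stands in for the shared usage counter; all readers add to it. */
    final AtomicLong totalBufferUsed = new AtomicLong();

    /**
     * Reserves quota for one batch and simulates the read.
     * Returns true if the batch was read, false if the read hit an EOF.
     */
    boolean readOnce(long batchBytes, boolean failWithEof, boolean releaseOnFailure) {
        totalBufferUsed.addAndGet(batchBytes); // quota is reserved while the batch is built
        try {
            if (failWithEof) {
                throw new EOFException("simulated EOF at the end of a file being written");
            }
            return true; // on success the quota would be released later, after shipping
        } catch (EOFException e) {
            if (releaseOnFailure) {
                totalBufferUsed.addAndGet(-batchBytes); // give the reservation back
            }
            return false;
        }
    }

    public static void main(String[] args) {
        BufferQuotaSketch leaky = new BufferQuotaSketch();
        BufferQuotaSketch safe = new BufferQuotaSketch();
        for (int i = 0; i < 10; i++) {
            leaky.readOnce(100, true, false);
            safe.readOnce(100, true, true);
        }
        System.out.println("without release: " + leaky.totalBufferUsed.get());
        System.out.println("with release:    " + safe.totalBufferUsed.get());
    }
}
```

In the leaky variant the counter only grows, so any reader that checks it against a fixed quota would eventually stall, which matches the reported symptom of all peers slowing down and then blocking.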
[jira] [Resolved] (HBASE-26956) ExportSnapshot tool supports removing TTL
[ https://issues.apache.org/jira/browse/HBASE-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin resolved HBASE-26956.
-----------------------------
Fix Version/s: 2.5.0
               2.6.0
Resolution: Done

Pushed to branch-2 and branch-2.5

> ExportSnapshot tool supports removing TTL
> -----------------------------------------
>
> Key: HBASE-26956
> URL: https://issues.apache.org/jira/browse/HBASE-26956
> Project: HBase
> Issue Type: New Feature
> Components: snapshots
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
> Fix For: 2.5.0, 2.6.0, 3.0.0-alpha-3
>
> In our scenario, we use ExportSnapshot to copy snapshots to cold storage like S3. But when a snapshot is restored back to the HBase cluster, it is deleted immediately because its TTL is set.
> So we need the ExportSnapshot tool to support removing the TTL.

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Reopened] (HBASE-26956) ExportSnapshot tool supports removing TTL
[ https://issues.apache.org/jira/browse/HBASE-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin reopened HBASE-26956: - Will close this issue after porting to branch-2.x. > ExportSnapshot tool supports removing TTL > - > > Key: HBASE-26956 > URL: https://issues.apache.org/jira/browse/HBASE-26956 > Project: HBase > Issue Type: New Feature >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-3 > > > In our scenario, we use ExportSnapshot to copy snapshots to cold storage like > S3. But when we restored back to HBase cluster, it will be deleted directly > because TTL is set. > So we need ExportSnapshot tool support removing TTL. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-26956) ExportSnapshot tool supports removing TTL
[ https://issues.apache.org/jira/browse/HBASE-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554518#comment-17554518 ] Sun Xin commented on HBASE-26956: - Using this issue is OK, I submitted PR to port to branch-2. > ExportSnapshot tool supports removing TTL > - > > Key: HBASE-26956 > URL: https://issues.apache.org/jira/browse/HBASE-26956 > Project: HBase > Issue Type: New Feature >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-3 > > > In our scenario, we use ExportSnapshot to copy snapshots to cold storage like > S3. But when we restored back to HBase cluster, it will be deleted directly > because TTL is set. > So we need ExportSnapshot tool support removing TTL. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HBASE-26956) ExportSnapshot tool supports removing TTL
[ https://issues.apache.org/jira/browse/HBASE-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554439#comment-17554439 ] Sun Xin commented on HBASE-26956: - [~zhangduo] We need, I'll port to branch-2.x later > ExportSnapshot tool supports removing TTL > - > > Key: HBASE-26956 > URL: https://issues.apache.org/jira/browse/HBASE-26956 > Project: HBase > Issue Type: New Feature >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-3 > > > In our scenario, we use ExportSnapshot to copy snapshots to cold storage like > S3. But when we restored back to HBase cluster, it will be deleted directly > because TTL is set. > So we need ExportSnapshot tool support removing TTL. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (HBASE-26956) ExportSnapshot tool supports removing TTL
[ https://issues.apache.org/jira/browse/HBASE-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-26956. - Fix Version/s: 3.0.0-alpha-4 Release Note: The ExportSnapshot tool supports removing the TTL of a snapshot. If we use the ExportSnapshot tool to recover a snapshot with a TTL from cold storage to an HBase cluster, we can set `-reset-ttl` to prevent the snapshot from being deleted immediately. Resolution: Done Thanks for the review, [~zhangduo]. > ExportSnapshot tool supports removing TTL > - > > Key: HBASE-26956 > URL: https://issues.apache.org/jira/browse/HBASE-26956 > Project: HBase > Issue Type: New Feature >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-4 > > > In our scenario, we use ExportSnapshot to copy snapshots to cold storage like > S3. But when we restore a snapshot back to an HBase cluster, it is deleted immediately > because a TTL is set. > So we need the ExportSnapshot tool to support removing the TTL. -- This message was sent by Atlassian Jira (v8.20.7#820007)
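Why a restored snapshot disappears immediately, and why resetting the TTL helps, can be sketched as follows. This is an illustration under stated assumptions, not the HBase implementation: the class `SnapshotTtlSketch` and its methods are hypothetical, and the sketch assumes that a TTL of 0 means "never expire" and that expiry is measured from the snapshot's original creation time.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical model of snapshot TTL expiry; all names here are illustrative. */
final class SnapshotTtlSketch {
  // Assumption for this sketch: a TTL of 0 seconds means the snapshot never expires.
  static final long NO_TTL = 0L;

  /** A snapshot is expired once its creation time plus its TTL lies in the past. */
  static boolean isExpired(long ttlSeconds, long creationTimeMs, long nowMs) {
    return ttlSeconds > NO_TTL
        && creationTimeMs + TimeUnit.SECONDS.toMillis(ttlSeconds) < nowMs;
  }

  /** Models the effect of a reset-TTL option: the exported copy carries no TTL. */
  static long effectiveTtl(long ttlSeconds, boolean resetTtl) {
    return resetTtl ? NO_TTL : ttlSeconds;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long createdTenDaysAgo = now - TimeUnit.DAYS.toMillis(10);
    long oneDayTtl = TimeUnit.DAYS.toSeconds(1);

    // Restored from cold storage with the original TTL: already expired.
    System.out.println(isExpired(effectiveTtl(oneDayTtl, false), createdTenDaysAgo, now));
    // Restored with the TTL reset: kept.
    System.out.println(isExpired(effectiveTtl(oneDayTtl, true), createdTenDaysAgo, now));
  }
}
```

The creation time travels with the snapshot, so a snapshot that sat in cold storage longer than its TTL is already expired on arrival unless the TTL is cleared during the copy.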
[jira] [Created] (HBASE-26956) ExportSnapshot tool supports removing TTL
Sun Xin created HBASE-26956: --- Summary: ExportSnapshot tool supports removing TTL Key: HBASE-26956 URL: https://issues.apache.org/jira/browse/HBASE-26956 Project: HBase Issue Type: New Feature Reporter: Sun Xin Assignee: Sun Xin In our scenario, we use ExportSnapshot to copy snapshots to cold storage like S3. But when we restore a snapshot back to an HBase cluster, it is deleted immediately because a TTL is set. So we need the ExportSnapshot tool to support removing the TTL. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HBASE-26406) Can not add peer replicating to non-HBase
[ https://issues.apache.org/jira/browse/HBASE-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-26406. - Fix Version/s: 2.4.9 3.0.0-alpha-2 Resolution: Fixed Pushed to master and 2.x branchs. Thank all for reviewing. > Can not add peer replicating to non-HBase > - > > Key: HBASE-26406 > URL: https://issues.apache.org/jira/browse/HBASE-26406 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.4.0 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-2, 2.4.9 > > > Failed to add a peer replicating to non-HBase(like MQ) by implementing custom > ReplicationEndpoint, got exception like this in my UT: > {code:java} > 2021-10-29T15:14:47,632 INFO [RPCClient-NioEventLoopGroup-5-3] > client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: > ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not > replicate to itself for > HBaseInterClusterReplicationEndpoint2021-10-29T15:14:47,632 INFO > [RPCClient-NioEventLoopGroup-5-3] > client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: > ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not > replicate to itself for HBaseInterClusterReplicationEndpoint > org.apache.hadoop.hbase.DoNotRetryIOException: Invalid cluster key: , should > not replicate to itself for HBaseInterClusterReplicationEndpoint > at java.lang.Thread.getStackTrace(Thread.java:1559) at > org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:130) > at org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:149) at > org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) at > org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1948) at > org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1936) at > 
org.apache.hadoop.hbase.replication.TestNonHBaseReplicationEndpoint.test(TestNonHBaseReplicationEndpoint.java:97) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at > org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) > at > 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.lang.Thread.run(Thread.java:748) at Future.get(Unknown > Source) at > org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkClusterId(ReplicationPeerManager.java:527) > at > org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkPeerConfig(ReplicationPeerManager.java:367) > at > org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.preAddPeer(ReplicationPeerManager.java:123) > at > org.apache.hadoop.hbase.master.replication.AddPeerProcedure.prePeerModification(AddPeerProcedure.java:101) > at > org.apache.hadoop.hbase.master.replication.ModifyPeerProcedure.executeFromState(ModifyPeerProcedure.java:162) > at >
[jira] [Updated] (HBASE-26406) Can not add peer replicating to non-HBase
[ https://issues.apache.org/jira/browse/HBASE-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-26406: Description: Failed to add a peer replicating to non-HBase(like MQ) by implementing custom ReplicationEndpoint, got exception like this in my UT: {code:java} 2021-10-29T15:14:47,632 INFO [RPCClient-NioEventLoopGroup-5-3] client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint2021-10-29T15:14:47,632 INFO [RPCClient-NioEventLoopGroup-5-3] client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint org.apache.hadoop.hbase.DoNotRetryIOException: Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint at java.lang.Thread.getStackTrace(Thread.java:1559) at org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:130) at org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:149) at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) at org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1948) at org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1936) at org.apache.hadoop.hbase.replication.TestNonHBaseReplicationEndpoint.test(TestNonHBaseReplicationEndpoint.java:97) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) at Future.get(Unknown Source) at org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkClusterId(ReplicationPeerManager.java:527) at org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkPeerConfig(ReplicationPeerManager.java:367) at org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.preAddPeer(ReplicationPeerManager.java:123) at 
org.apache.hadoop.hbase.master.replication.AddPeerProcedure.prePeerModification(AddPeerProcedure.java:101) at org.apache.hadoop.hbase.master.replication.ModifyPeerProcedure.executeFromState(ModifyPeerProcedure.java:162) at org.apache.hadoop.hbase.master.replication.ModifyPeerProcedure.executeFromState(ModifyPeerProcedure.java:43) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:190) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981) {code} HBASE-24743 ignored this situation and
[jira] [Created] (HBASE-26406) Can not add peer replicating to non-HBase
Sun Xin created HBASE-26406: --- Summary: Can not add peer replicating to non-HBase Key: HBASE-26406 URL: https://issues.apache.org/jira/browse/HBASE-26406 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.4.0, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Failed to add a peer replicating to non-HBase(like MQ) by implementing custom ReplicationEndpoint, got exception like this in my UT: {code:java} 2021-10-29T15:14:47,632 INFO [RPCClient-NioEventLoopGroup-5-3] client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint2021-10-29T15:14:47,632 INFO [RPCClient-NioEventLoopGroup-5-3] client.RawAsyncHBaseAdmin$ReplicationProcedureBiConsumer(2761): Operation: ADD_REPLICATION_PEER, peerId: 1 failed with Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint org.apache.hadoop.hbase.DoNotRetryIOException: Invalid cluster key: , should not replicate to itself for HBaseInterClusterReplicationEndpoint at java.lang.Thread.getStackTrace(Thread.java:1559) at org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:130) at org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:149) at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) at org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1948) at org.apache.hadoop.hbase.client.Admin.addReplicationPeer(Admin.java:1936) at org.apache.hadoop.hbase.replication.TestNonHBaseReplicationEndpoint.test(TestNonHBaseReplicationEndpoint.java:97) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.apache.hadoop.hbase.SystemExitRule$1.evaluate(SystemExitRule.java:38) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) at Future.get(Unknown Source) at org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkClusterId(ReplicationPeerManager.java:527) at org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.checkPeerConfig(ReplicationPeerManager.java:367) at 
org.apache.hadoop.hbase.master.replication.ReplicationPeerManager.preAddPeer(ReplicationPeerManager.java:123) at org.apache.hadoop.hbase.master.replication.AddPeerProcedure.prePeerModification(AddPeerProcedure.java:101) at org.apache.hadoop.hbase.master.replication.ModifyPeerProcedure.executeFromState(ModifyPeerProcedure.java:162) at org.apache.hadoop.hbase.master.replication.ModifyPeerProcedure.executeFromState(ModifyPeerProcedure.java:43) at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:190) at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1667) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414) at
[jira] [Resolved] (HBASE-25773) TestSnapshotScannerHDFSAclController.setupBeforeClass is flaky
[ https://issues.apache.org/jira/browse/HBASE-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25773. - Resolution: Fixed Pushed to branch-2 and master, thanks [~zhangduo] for reviewing. > TestSnapshotScannerHDFSAclController.setupBeforeClass is flaky > -- > > Key: HBASE-25773 > URL: https://issues.apache.org/jira/browse/HBASE-25773 > Project: HBase > Issue Type: Improvement >Reporter: Xiaolin Ha >Assignee: Sun Xin >Priority: Major > Fix For: 2.5.0, 3.0.0-alpha-2 > > > [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3140/2/testReport/org.apache.hadoop.hbase.security.access/TestSnapshotScannerHDFSAclController/precommit_checks___yetus_jdk8_Hadoop3_checks__/] > SnapshotScannerHDFSAclController.postStartMaster alters hbase:acl to add a > new cf "m", but > `TestSnapshotScannerHDFSAclController.setupBeforeClass(TestSnapshotScannerHDFSAclController.java:101)` > fails before the disable and enable hbase:acl complete. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25773) TestSnapshotScannerHDFSAclController.setupBeforeClass is flaky
[ https://issues.apache.org/jira/browse/HBASE-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17407131#comment-17407131 ] Sun Xin commented on HBASE-25773: - [~Xiaolin Ha] Is this issue still being worked on? If not, could you assign it to me so I can try? > TestSnapshotScannerHDFSAclController.setupBeforeClass is flaky > -- > > Key: HBASE-25773 > URL: https://issues.apache.org/jira/browse/HBASE-25773 > Project: HBase > Issue Type: Improvement >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Major > Fix For: 2.5.0, 3.0.0-alpha-2 > > > [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3140/2/testReport/org.apache.hadoop.hbase.security.access/TestSnapshotScannerHDFSAclController/precommit_checks___yetus_jdk8_Hadoop3_checks__/] > SnapshotScannerHDFSAclController.postStartMaster alters hbase:acl to add a > new cf "m", but > `TestSnapshotScannerHDFSAclController.setupBeforeClass(TestSnapshotScannerHDFSAclController.java:101)` > fails before the disable and enable of hbase:acl complete. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-26194) Introduce a ReplicationServerSourceManager to simplify HReplicationServer
[ https://issues.apache.org/jira/browse/HBASE-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-26194. - Resolution: Done Merged. Thank [~stack] for reviewing. > Introduce a ReplicationServerSourceManager to simplify HReplicationServer > - > > Key: HBASE-26194 > URL: https://issues.apache.org/jira/browse/HBASE-26194 > Project: HBase > Issue Type: Sub-task > Components: Replication >Affects Versions: 3.0.0-alpha-2 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26194) Introduce a ReplicationServerSourceManager to simplify HReplicationServer
Sun Xin created HBASE-26194: --- Summary: Introduce a ReplicationServerSourceManager to simplify HReplicationServer Key: HBASE-26194 URL: https://issues.apache.org/jira/browse/HBASE-26194 Project: HBase Issue Type: Sub-task Components: Replication Affects Versions: 3.0.0-alpha-2 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-26084) Add owner of replication queue for ReplicationQueueInfo
[ https://issues.apache.org/jira/browse/HBASE-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-26084. - Fix Version/s: 3.0.0-alpha-2 Resolution: Done Merged. Thank [~stack] [~zhangduo] for reviewing. > Add owner of replication queue for ReplicationQueueInfo > --- > > Key: HBASE-26084 > URL: https://issues.apache.org/jira/browse/HBASE-26084 > Project: HBase > Issue Type: Sub-task >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-2 > > > The current ReplicationQueueInfo only has queueId, which is not enough to > distinguish queues in ReplicationServer, so we need to add the RS holding > the queue for ReplicationQueueInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26084) Add owner of replication queue for ReplicationQueueInfo
Sun Xin created HBASE-26084: --- Summary: Add owner of replication queue for ReplicationQueueInfo Key: HBASE-26084 URL: https://issues.apache.org/jira/browse/HBASE-26084 Project: HBase Issue Type: Sub-task Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 The current ReplicationQueueInfo only has queueId, which is not enough to distinguish queues in ReplicationServer, so we need to add the RS holding the queue for ReplicationQueueInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25110) Add heartbeat for ReplicationServer and dispatch replication sources to ReplicationServer
[ https://issues.apache.org/jira/browse/HBASE-25110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377923#comment-17377923 ] Sun Xin commented on HBASE-25110: - This issue has been split into two sub-tasks: HBASE-26077 and HBASE-26078. > Add heartbeat for ReplicationServer and dispatch replication sources to > ReplicationServer > - > > Key: HBASE-25110 > URL: https://issues.apache.org/jira/browse/HBASE-25110 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25110) Add heartbeat for ReplicationServer and dispatch replication sources to ReplicationServer
[ https://issues.apache.org/jira/browse/HBASE-25110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25110. - Release Note: This issue has been split into two sub-tasks: HBASE-26077 and HBASE-26078. Resolution: Incomplete > Add heartbeat for ReplicationServer and dispatch replication sources to > ReplicationServer > - > > Key: HBASE-25110 > URL: https://issues.apache.org/jira/browse/HBASE-25110 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26078) Dispatch replication sources to ReplicationServer
Sun Xin created HBASE-26078: --- Summary: Dispatch replication sources to ReplicationServer Key: HBASE-26078 URL: https://issues.apache.org/jira/browse/HBASE-26078 Project: HBase Issue Type: Sub-task Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-26077) Add heartbeat for ReplicationServer
Sun Xin created HBASE-26077: --- Summary: Add heartbeat for ReplicationServer Key: HBASE-26077 URL: https://issues.apache.org/jira/browse/HBASE-26077 Project: HBase Issue Type: Sub-task Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25807) Move method reportProcedureDone from RegionServerStatus.proto to Master.proto
[ https://issues.apache.org/jira/browse/HBASE-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25807. - Fix Version/s: 3.0.0-alpha-1 Resolution: Done Merged. Thank [~zhangduo] for reviewing. > Move method reportProcedureDone from RegionServerStatus.proto to Master.proto > - > > Key: HBASE-25807 > URL: https://issues.apache.org/jira/browse/HBASE-25807 > Project: HBase > Issue Type: Sub-task >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > We next need use the procedure mechanism to implement enable/disable/refresh > peer, and ReplicationServer also needs reportProcedureDone to master, so I > hope to move method reportProcedureDone to Master.proto from > RegionServerStatus.proto. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25820) Find a way to know whether logQueue goes empty when ReplicationSource is running on ReplicationServer
Sun Xin created HBASE-25820: --- Summary: Find a way to know whether logQueue goes empty when ReplicationSource is running on ReplicationServer Key: HBASE-25820 URL: https://issues.apache.org/jira/browse/HBASE-25820 Project: HBase Issue Type: Sub-task Reporter: Sun Xin In HBASE-25110 we chose to use ZK to notify ReplicationServer that a new WAL was generated, and this notification is asynchronous. This creates a problem: the shipper thread and the WAL reader thread may terminate because logQueue goes empty before the notification of the new WAL arrives. So we now need to find a way to know whether logQueue is really empty after the last WAL in logQueue is consumed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-24737) Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten problem
[ https://issues.apache.org/jira/browse/HBASE-24737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-24737. - Resolution: Done > Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten > problem > > > Key: HBASE-24737 > URL: https://issues.apache.org/jira/browse/HBASE-24737 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > > Now we use WALFileLengthProvider#getLogFileSizeIfBeingWritten to get the > synced wal length and prevent replicating unacked log entries. But after > offloading ReplicationSource to the new ReplicationServer, we need a new way to > resolve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24737) Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten problem
[ https://issues.apache.org/jira/browse/HBASE-24737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332899#comment-17332899 ] Sun Xin commented on HBASE-24737: - Thanks [~zhangduo] for reviewing. The failed UTs are unrelated; I will fix them in HBASE-25110. > Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten > problem > > > Key: HBASE-24737 > URL: https://issues.apache.org/jira/browse/HBASE-24737 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > > Now we use WALFileLengthProvider#getLogFileSizeIfBeingWritten to get the > synced wal length and prevent replicating unacked log entries. But after > offloading ReplicationSource to the new ReplicationServer, we need a new way to > resolve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HBASE-25807) Move method reportProcedureDone from RegionServerStatus.proto to Master.proto
[ https://issues.apache.org/jira/browse/HBASE-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin reassigned HBASE-25807: --- Assignee: Sun Xin > Move method reportProcedureDone from RegionServerStatus.proto to Master.proto > - > > Key: HBASE-25807 > URL: https://issues.apache.org/jira/browse/HBASE-25807 > Project: HBase > Issue Type: Sub-task >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > We next need use the procedure mechanism to implement enable/disable/refresh > peer, and ReplicationServer also needs reportProcedureDone to master, so I > hope to move method reportProcedureDone to Master.proto from > RegionServerStatus.proto. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25807) Move method reportProcedureDone from RegionServerStatus.proto to Master.proto
Sun Xin created HBASE-25807: --- Summary: Move method reportProcedureDone from RegionServerStatus.proto to Master.proto Key: HBASE-25807 URL: https://issues.apache.org/jira/browse/HBASE-25807 Project: HBase Issue Type: Sub-task Reporter: Sun Xin Next we need to use the procedure mechanism to implement enable/disable/refresh peer, and ReplicationServer also needs to call reportProcedureDone on the master, so I propose moving the method reportProcedureDone from RegionServerStatus.proto to Master.proto. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25562) ReplicationSourceWALReader log and handle exception immediately without retrying
[ https://issues.apache.org/jira/browse/HBASE-25562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25562. - Fix Version/s: 2.4.3 2.3.5 3.0.0-alpha-1 Resolution: Fixed > ReplicationSourceWALReader log and handle exception immediately without > retrying > > > Key: HBASE-25562 > URL: https://issues.apache.org/jira/browse/HBASE-25562 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.5, 2.4.3 > > > In [this piece of code about retrying in > ReplicationSourceWALReader#run|https://github.com/apache/hbase/blob/0353909bc268e3ff3def098963d021e973f1f153/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L151], > the sleep time increases with the number of retries. If an exception occurs > that cannot recover on its own, error logs appear only after about 12 hours > (300 retries by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
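The 12-hour figure in HBASE-25562 can be checked with a short calculation. The sketch below is illustrative only: the class `RetryBackoffSketch` is hypothetical, and it assumes a base sleep of 1 second that is multiplied by the attempt number (linear growth) for each of the 300 retries mentioned in the issue. Only the retry count comes from the report; the base sleep and the growth rule are assumptions.

```java
/** Hypothetical calculation of the total wait under a linearly growing retry sleep. */
final class RetryBackoffSketch {
  /**
   * Sums baseSleepMillis * attempt for attempt = 1..maxRetries.
   * Assumption: the sleep before retry k is k times the base sleep.
   */
  static long totalSleepMillis(long baseSleepMillis, int maxRetries) {
    long total = 0;
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      total += baseSleepMillis * attempt;
    }
    return total;
  }

  public static void main(String[] args) {
    long totalMs = totalSleepMillis(1000L, 300);
    // 1 s * (1 + 2 + ... + 300) = 45150 s, a little over 12.5 hours.
    System.out.println("total seconds: " + totalMs / 1000);
    System.out.println("whole hours:   " + totalMs / (1000L * 3600L));
  }
}
```

Under these assumptions the reader sleeps 1 + 2 + ... + 300 = 45150 seconds in total, which is roughly 12.5 hours before the unrecoverable error is finally logged.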
[jira] [Commented] (HBASE-25590) Bulkload replication HFileRefs cannot be cleared in some cases where set exclude-namespace/exclude-table-cfs
[ https://issues.apache.org/jira/browse/HBASE-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308406#comment-17308406 ] Sun Xin commented on HBASE-25590: - {quote}is there anything needs to be done for this jira? Can it be resolved? {quote} Thanks [~huaxiangsun] for noticing this. I haven't closed this jira yet, as the PR backporting to branch-2.2 still needs review. > Bulkload replication HFileRefs cannot be cleared in some cases where set > exclude-namespace/exclude-table-cfs > > > Key: HBASE-25590 > URL: https://issues.apache.org/jira/browse/HBASE-25590 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.5 > > > In > [ReplicationSource#addHFileRefs|https://github.com/apache/hbase/blob/ed90a14995acd87111d2b9849f07d84418ca43d4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L264], > we may add unwanted hfiles to the _HFileRefs_ if a peer has > _replicate_all_ set to true and also sets _exclude-namespace/exclude-table-cfs_. > These unwanted _HFileRefs_ will neither be replicated to the remote cluster nor > be cleared. > Two problems are caused by this bug: > # The metric sizeOfHFileRefsQueue cannot be zeroed. > # Referenced HFiles cannot be deleted by _ReplicationHFileCleaner._ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25683) Simplify UTs using DummyServer
Sun Xin created HBASE-25683: --- Summary: Simplify UTs using DummyServer Key: HBASE-25683 URL: https://issues.apache.org/jira/browse/HBASE-25683 Project: HBase Issue Type: Test Components: test Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25638) The master local region is constantly major compact
[ https://issues.apache.org/jira/browse/HBASE-25638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25638. - Resolution: Not A Problem > The master local region is constantly major compact > --- > > Key: HBASE-25638 > URL: https://issues.apache.org/jira/browse/HBASE-25638 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > In > [MasterRegionFlusherAndCompactor.compact|https://github.com/apache/hbase/blob/830d2895b27fa0cf39a28d3af9673a4126ea8258/hbase-server/src/main/java/org/apache/hadoop/hbase/master/region/MasterRegionFlusherAndCompactor.java#L164], > we call region.compact(true) constantly like recursion. This caused a lot of > logs to be flushed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25638) The master local region is constantly major compact
Sun Xin created HBASE-25638: --- Summary: The master local region is constantly major compact Key: HBASE-25638 URL: https://issues.apache.org/jira/browse/HBASE-25638 Project: HBase Issue Type: Bug Affects Versions: 2.4.1, 2.3.4, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin In [MasterRegionFlusherAndCompactor.compact|https://github.com/apache/hbase/blob/830d2895b27fa0cf39a28d3af9673a4126ea8258/hbase-server/src/main/java/org/apache/hadoop/hbase/master/region/MasterRegionFlusherAndCompactor.java#L164], we call region.compact(true) constantly like recursion. This caused a lot of logs to be flushed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HBASE-24737) Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten problem
[ https://issues.apache.org/jira/browse/HBASE-24737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin reassigned HBASE-24737: --- Assignee: Sun Xin > Find a way to resolve WALFileLengthProvider#getLogFileSizeIfBeingWritten > problem > > > Key: HBASE-24737 > URL: https://issues.apache.org/jira/browse/HBASE-24737 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > > Now we use WALFileLengthProvider#getLogFileSizeIfBeingWritten to get the > synced wal length and prevent replicating unacked log entries. But after > offload ReplicationSource to new ReplicationServer, we need a new way to > resolve this problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25598) TestFromClientSide5.testScanMetrics is flaky
[ https://issues.apache.org/jira/browse/HBASE-25598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25598. - Fix Version/s: 2.4.2 2.3.5 2.2.7 3.0.0-alpha-1 Resolution: Fixed Thanks [~zhangduo] for reviewing. Merged to master and all active branch-2.x. > TestFromClientSide5.testScanMetrics is flaky > > > Key: HBASE-25598 > URL: https://issues.apache.org/jira/browse/HBASE-25598 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.3.5, 2.4.2 > > > In some PRs, I got the following errors in UT results. > {code:java} > [ERROR] Errors: > [ERROR] org.apache.hadoop.hbase.client.TestFromClientSide5.testScanMetrics[0] > [ERROR] Run 1: TestFromClientSide5.testScanMetrics:1018 Did not count the > result bytes expected:<60> but was:<120> > [ERROR] Run 2: TestFromClientSide5.testScanMetrics:1036 Did not count the > result bytes expected:<60> but was:<180> > [ERROR] Run 3: TestFromClientSide5.testScanMetrics:951 » > MasterRegistryFetch Exception making... > [INFO] > [ERROR] > org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor5.testScanMetrics[1] > [ERROR] Run 1: > TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:1036 > Did not count the result bytes expected:<60> but was:<120> > [ERROR] Run 2: > TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:951 » > IO > [ERROR] Run 3: > TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:951 » > IO > [INFO] > {code} > I read the code further and found that this UT is flaky. 
> {code:java} > // check byte counters > scan2 = new Scan(); > scan2.setScanMetricsEnabled(true); > scan2.setCaching(1); > try (ResultScanner scanner = ht.getScanner(scan2)) { > int numBytes = 0; > for (Result result : scanner.next(1)) { > for (Cell cell : result.listCells()) { > numBytes += PrivateCellUtil.estimatedSerializedSizeOf(cell); > } > } > scanner.close(); > ScanMetrics scanMetrics = scanner.getScanMetrics(); > assertEquals("Did not count the result bytes", numBytes, > scanMetrics.countOfBytesInResults.get()); > } > {code} > In the code above, it is to check scanMetrics.countOfBytesInResults, but just > get only ONE row by scanner.next(1) . A total of 3 rows are inserted into the > table, and scanner prefetch from server in advance until maxCacheSize is > exceeded, see > [here|https://github.com/apache/hbase/blob/5fa15cfde3d77e77ffb1f09d60dce4db264f3831/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncTableResultScanner.java#L94]. > So if scanner prefetch more than one row before closing scanner, the UT > fails. we can reproduce this problem steadily by sleeping before > scanner.close(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
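The failure mode described above can be illustrated with a toy model. This is a sketch only: it does not use the HBase client API, every name in it is hypothetical, and the fixed 60 bytes per row is taken from the "expected:<60>" in the error log. The caller consumes one row, while the metric counts every row the scanner managed to prefetch before it was closed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class PrefetchCounterDemo {

  static final int ROW_BYTES = 60; // size of one row, taken from the error log above

  /**
   * Toy model of a prefetching scanner.
   * Returns {bytes seen by the caller, bytes counted by the metric}.
   */
  static long[] scanOneRow(int totalRows, int rowsPrefetchedBeforeClose) {
    Queue<Integer> cache = new ArrayDeque<>();
    long metricBytes = 0;
    // The scanner fills its cache in the background; the metric counts each fetched row.
    int fetched = Math.min(totalRows, rowsPrefetchedBeforeClose);
    for (int i = 0; i < fetched; i++) {
      cache.add(ROW_BYTES);
      metricBytes += ROW_BYTES;
    }
    // The test only consumes a single row, like scanner.next(1).
    long callerBytes = cache.isEmpty() ? 0 : cache.poll();
    return new long[] {callerBytes, metricBytes};
  }

  public static void main(String[] args) {
    for (int prefetched = 1; prefetched <= 3; prefetched++) {
      long[] r = scanOneRow(3, prefetched);
      System.out.println("expected:<" + r[0] + "> but was:<" + r[1] + ">");
    }
  }
}
```

With one prefetched row the two numbers agree; with two or three they diverge to 120 and 180, matching the values in the reported failures.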
[jira] [Assigned] (HBASE-25110) Add heartbeat for ReplicationServer and dispatch replication sources to ReplicationServer
[ https://issues.apache.org/jira/browse/HBASE-25110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin reassigned HBASE-25110: --- Assignee: Sun Xin (was: Guanghao Zhang) > Add heartbeat for ReplicationServer and dispatch replication sources to > ReplicationServer > - > > Key: HBASE-25110 > URL: https://issues.apache.org/jira/browse/HBASE-25110 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25598) TestFromClientSide5.testScanMetrics is flaky
Sun Xin created HBASE-25598:
---
Summary: TestFromClientSide5.testScanMetrics is flaky
Key: HBASE-25598
URL: https://issues.apache.org/jira/browse/HBASE-25598
Project: HBase
Issue Type: Bug
Affects Versions: 2.4.1, 2.3.4, 3.0.0-alpha-1
Reporter: Sun Xin
Assignee: Sun Xin

In some PRs, I got the following errors in UT results.
{code:java}
[ERROR] Errors:
[ERROR] org.apache.hadoop.hbase.client.TestFromClientSide5.testScanMetrics[0]
[ERROR] Run 1: TestFromClientSide5.testScanMetrics:1018 Did not count the result bytes expected:<60> but was:<120>
[ERROR] Run 2: TestFromClientSide5.testScanMetrics:1036 Did not count the result bytes expected:<60> but was:<180>
[ERROR] Run 3: TestFromClientSide5.testScanMetrics:951 » MasterRegistryFetch Exception making...
[INFO]
[ERROR] org.apache.hadoop.hbase.client.TestFromClientSideWithCoprocessor5.testScanMetrics[1]
[ERROR] Run 1: TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:1036 Did not count the result bytes expected:<60> but was:<120>
[ERROR] Run 2: TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:951 » IO
[ERROR] Run 3: TestFromClientSideWithCoprocessor5>TestFromClientSide5.testScanMetrics:951 » IO
[INFO]
{code}
I read the code further and found that this UT is flaky.
{code:java}
// check byte counters
scan2 = new Scan();
scan2.setScanMetricsEnabled(true);
scan2.setCaching(1);
try (ResultScanner scanner = ht.getScanner(scan2)) {
  int numBytes = 0;
  for (Result result : scanner.next(1)) {
    for (Cell cell : result.listCells()) {
      numBytes += PrivateCellUtil.estimatedSerializedSizeOf(cell);
    }
  }
  scanner.close();
  ScanMetrics scanMetrics = scanner.getScanMetrics();
  assertEquals("Did not count the result bytes", numBytes,
    scanMetrics.countOfBytesInResults.get());
}
{code}
The code above checks scanMetrics.countOfBytesInResults, but reads only ONE row via scanner.next(1). A total of 3 rows are inserted into the table, and the scanner prefetches from the server in advance until maxCacheSize is exceeded, see [here|https://github.com/apache/hbase/blob/5fa15cfde3d77e77ffb1f09d60dce4db264f3831/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncTableResultScanner.java#L94]. So if the scanner prefetches more than one row before it is closed, the UT fails. We can reproduce this problem reliably by sleeping before scanner.close().

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25590) Bulkload replication HFileRefs cannot be cleared in some cases where set exclude-namespace/exclude-table-cfs
Sun Xin created HBASE-25590: --- Summary: Bulkload replication HFileRefs cannot be cleared in some cases where set exclude-namespace/exclude-table-cfs Key: HBASE-25590 URL: https://issues.apache.org/jira/browse/HBASE-25590 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.4.1, 2.3.4, 2.2.6, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin In [ReplicationSource#addHFileRefs|https://github.com/apache/hbase/blob/ed90a14995acd87111d2b9849f07d84418ca43d4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L264], we may add unwanted hfiles to the _HFileRefs_ if a peer has _replicate_all_ set to true and also sets _exclude-namespace/exclude-table-cfs_. These unwanted _HFileRefs_ will neither be replicated to the remote cluster nor be cleared. Two problems are caused by this bug: # The metric sizeOfHFileRefsQueue cannot be zeroed. # Referenced HFiles cannot be deleted by _ReplicationHFileCleaner._ -- This message was sent by Atlassian Jira (v8.3.4#803005)
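One way to picture the missing check is a filter applied before an hfile reference is recorded. The sketch below is not the actual patch and does not use HBase classes: the class, method, and parameter names are hypothetical, the exclusion rules are simplified to plain strings, and treating an empty family set as "the whole table is excluded" is an assumption made for illustration.

```java
import java.util.Map;
import java.util.Set;

public class HFileRefFilterSketch {

  /**
   * For a peer with replicate_all = true, decides whether an hfile of the given
   * namespace/table/family should be tracked in the HFileRefs queue.
   * An empty family set for a table is taken to mean "whole table excluded".
   */
  static boolean shouldTrack(Set<String> excludeNamespaces,
                             Map<String, Set<String>> excludeTableCfs,
                             String namespace, String table, String family) {
    if (excludeNamespaces.contains(namespace)) {
      return false; // whole namespace excluded
    }
    Set<String> excludedCfs = excludeTableCfs.get(table);
    if (excludedCfs == null) {
      return true; // table is not mentioned in the exclusions
    }
    if (excludedCfs.isEmpty()) {
      return false; // whole table excluded
    }
    return !excludedCfs.contains(family); // only the listed families are excluded
  }
}
```

References that fail such a check would never enter the queue, so they could not keep sizeOfHFileRefsQueue above zero or block the cleaner.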
[jira] [Commented] (HBASE-25562) ReplicationSourceWALReader log and handle exception immediately without retrying
[ https://issues.apache.org/jira/browse/HBASE-25562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17287473#comment-17287473 ] Sun Xin commented on HBASE-25562: - Thanks all for reviewing. Merged to master and branch-2.4. > ReplicationSourceWALReader log and handle exception immediately without > retrying > > > Key: HBASE-25562 > URL: https://issues.apache.org/jira/browse/HBASE-25562 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > In [this piece of code about retrying in > ReplicationSourceWALReader#run|https://github.com/apache/hbase/blob/0353909bc268e3ff3def098963d021e973f1f153/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L151], > sleep time increases with the number of retries, if an exception happens > that cannot be recovered by itself, error logs will appear after 12 hours > (300 retries by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HBASE-25559) Terminate threads of oldsources while RS is closing
[ https://issues.apache.org/jira/browse/HBASE-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281654#comment-17281654 ] Sun Xin edited comment on HBASE-25559 at 2/9/21, 9:29 AM: -- Merged to master and all active branch-2.x. Thanks [~wchevreuil] [~vjasani] [~stack] for reviewing. was (Author: ddupg): Merged to master and all active branch-2.x. > Terminate threads of oldsources while RS is closing > --- > > Key: HBASE-25559 > URL: https://issues.apache.org/jira/browse/HBASE-25559 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.3.5, 2.4.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25559) Terminate threads of oldsources while RS is closing
[ https://issues.apache.org/jira/browse/HBASE-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25559. - Fix Version/s: 2.4.2 2.3.5 2.2.7 3.0.0-alpha-1 Resolution: Fixed Merged to master and all active branch-2.x. > Terminate threads of oldsources while RS is closing > --- > > Key: HBASE-25559 > URL: https://issues.apache.org/jira/browse/HBASE-25559 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.3.5, 2.4.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25562) ReplicationSourceWALReader log and handle exception immediately without retrying
[ https://issues.apache.org/jira/browse/HBASE-25562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25562: Description: In [this piece of code about retrying in ReplicationSourceWALReader#run|https://github.com/apache/hbase/blob/0353909bc268e3ff3def098963d021e973f1f153/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L151], sleep time increases with the number of retries, if an exception happens that cannot be recovered by itself, error logs will appear after 12 hours (300 retries by default). (was: In this piece of code about retrying in ReplicationSourceWALReader#run, sleep time increases with the number of retries, if an exception happens that cannot be recovered by itself, error logs will appear after 12 hours (300 retries by default).) > ReplicationSourceWALReader log and handle exception immediately without > retrying > > > Key: HBASE-25562 > URL: https://issues.apache.org/jira/browse/HBASE-25562 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.2.6, 2.3.4, 2.4.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > In [this piece of code about retrying in > ReplicationSourceWALReader#run|https://github.com/apache/hbase/blob/0353909bc268e3ff3def098963d021e973f1f153/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReader.java#L151], > sleep time increases with the number of retries, if an exception happens > that cannot be recovered by itself, error logs will appear after 12 hours > (300 retries by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25562) ReplicationSourceWALReader log and handle exception immediately without retrying
Sun Xin created HBASE-25562: --- Summary: ReplicationSourceWALReader log and handle exception immediately without retrying Key: HBASE-25562 URL: https://issues.apache.org/jira/browse/HBASE-25562 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.4.1, 2.3.4, 2.2.6, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin In this piece of code about retrying in ReplicationSourceWALReader#run, the sleep time increases with the number of retries. If an exception happens that cannot be recovered from automatically, error logs will only appear after 12 hours (300 retries by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25560) Remove unused parameter named peerId in the constructor method of CatalogReplicationSourcePeer
[ https://issues.apache.org/jira/browse/HBASE-25560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25560: Affects Version/s: 2.4.2 3.0.0-alpha-1 > Remove unused parameter named peerId in the constructor method of > CatalogReplicationSourcePeer > -- > > Key: HBASE-25560 > URL: https://issues.apache.org/jira/browse/HBASE-25560 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.4.2 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25560) Remove unused parameter named peerId in the constructor method of CatalogReplicationSourcePeer
Sun Xin created HBASE-25560: --- Summary: Remove unused parameter named peerId in the constructor method of CatalogReplicationSourcePeer Key: HBASE-25560 URL: https://issues.apache.org/jira/browse/HBASE-25560 Project: HBase Issue Type: Bug Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25559) Terminate threads of oldsources while RS is closing
Sun Xin created HBASE-25559: --- Summary: Terminate threads of oldsources while RS is closing Key: HBASE-25559 URL: https://issues.apache.org/jira/browse/HBASE-25559 Project: HBase Issue Type: Bug Affects Versions: 2.4.1, 2.3.4, 2.2.6, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-25553) It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String
[ https://issues.apache.org/jira/browse/HBASE-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-25553. - Resolution: Fixed > It is better for ReplicationTracker.getListOfRegionServers to return > ServerName instead of String > - > > Key: HBASE-25553 > URL: https://issues.apache.org/jira/browse/HBASE-25553 > Project: HBase > Issue Type: Umbrella >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.5.0, 2.3.5, 2.4.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25553) It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String
[ https://issues.apache.org/jira/browse/HBASE-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280469#comment-17280469 ] Sun Xin commented on HBASE-25553: - Merged to master and all active branch-2.x. Thanks [~wchevreuil] [~vjasani] for reviewing. > It is better for ReplicationTracker.getListOfRegionServers to return > ServerName instead of String > - > > Key: HBASE-25553 > URL: https://issues.apache.org/jira/browse/HBASE-25553 > Project: HBase > Issue Type: Umbrella >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.5.0, 2.3.5, 2.4.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25553) It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String
[ https://issues.apache.org/jira/browse/HBASE-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25553: Fix Version/s: 2.4.2 2.3.5 2.5.0 2.2.7 3.0.0-alpha-1 > It is better for ReplicationTracker.getListOfRegionServers to return > ServerName instead of String > - > > Key: HBASE-25553 > URL: https://issues.apache.org/jira/browse/HBASE-25553 > Project: HBase > Issue Type: Umbrella >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1, 2.2.7, 2.5.0, 2.3.5, 2.4.2 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25553) It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String
[ https://issues.apache.org/jira/browse/HBASE-25553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25553: Issue Type: Umbrella (was: Bug) > It is better for ReplicationTracker.getListOfRegionServers to return > ServerName instead of String > - > > Key: HBASE-25553 > URL: https://issues.apache.org/jira/browse/HBASE-25553 > Project: HBase > Issue Type: Umbrella >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25553) It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String
Sun Xin created HBASE-25553: --- Summary: It is better for ReplicationTracker.getListOfRegionServers to return ServerName instead of String Key: HBASE-25553 URL: https://issues.apache.org/jira/browse/HBASE-25553 Project: HBase Issue Type: Bug Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25309) Support start/stop replication server by scripts
Sun Xin created HBASE-25309: --- Summary: Support start/stop replication server by scripts Key: HBASE-25309 URL: https://issues.apache.org/jira/browse/HBASE-25309 Project: HBase Issue Type: Sub-task Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25305) Add master UI to show ReplicationServer
Sun Xin created HBASE-25305: --- Summary: Add master UI to show ReplicationServer Key: HBASE-25305 URL: https://issues.apache.org/jira/browse/HBASE-25305 Project: HBase Issue Type: Sub-task Components: Replication Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-25289) [testing] Clean up resources after tests in rsgroup_shell_test.rb
[ https://issues.apache.org/jira/browse/HBASE-25289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234290#comment-17234290 ] Sun Xin commented on HBASE-25289: - {quote}But there are many conflict when cherry-pick to branch-2. [~Ddupg] Can you help to submit a new PR for branch-2/branch-2.3/branch-2.2? {quote} Thanks [~zghao] for reviewing. I've submitted a new PR for branch-2. > [testing] Clean up resources after tests in rsgroup_shell_test.rb > - > > Key: HBASE-25289 > URL: https://issues.apache.org/jira/browse/HBASE-25289 > Project: HBase > Issue Type: Improvement > Components: rsgroup, test >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > In rsgroup_shell_test.rb, some tests don't remove rsgroups and drop tables, > messing up adding new tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25300) 'Unknown table hbase:quota' happens when desc table in shell if quota disabled
Sun Xin created HBASE-25300: --- Summary: 'Unknown table hbase:quota' happens when desc table in shell if quota disabled Key: HBASE-25300 URL: https://issues.apache.org/jira/browse/HBASE-25300 Project: HBase Issue Type: Bug Components: shell Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25289) [testing] Clean up resources after tests in rsgroup_shell_test.rb
Sun Xin created HBASE-25289: --- Summary: [testing] Clean up resources after tests in rsgroup_shell_test.rb Key: HBASE-25289 URL: https://issues.apache.org/jira/browse/HBASE-25289 Project: HBase Issue Type: Improvement Components: rsgroup, test Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 In rsgroup_shell_test.rb, some tests don't remove rsgroups and drop tables, messing up adding new tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25171) Remove ZNodePaths.namespaceZNode
Sun Xin created HBASE-25171: --- Summary: Remove ZNodePaths.namespaceZNode Key: HBASE-25171 URL: https://issues.apache.org/jira/browse/HBASE-25171 Project: HBase Issue Type: Improvement Components: Zookeeper Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 HBASE-21154 removed the dependency on ZNodePaths.namespaceZNode, so this field can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HBASE-24813) ReplicationSource should clear buffer usage on ReplicationSourceManager upon termination
[ https://issues.apache.org/jira/browse/HBASE-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210572#comment-17210572 ] Sun Xin edited comment on HBASE-24813 at 10/9/20, 8:19 AM: --- Please take a look at HBASE-25117, that may fix this problem? was (Author: ddupg): Using isActive() instead of isAlive() in this [PR|https://github.com/apache/hbase/pull/2191/files#], that may work? !image-2020-10-09-10-50-00-372.png! > ReplicationSource should clear buffer usage on ReplicationSourceManager upon > termination > > > Key: HBASE-24813 > URL: https://issues.apache.org/jira/browse/HBASE-24813 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.1, 2.2.6 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.3, 2.4.0, 2.2.7 > > Attachments: TestReplicationSyncUpTool.log, > image-2020-10-09-10-50-00-372.png > > > Following investigations on the issue described by [~elserj] on HBASE-24779, > we found out that once a peer is removed, thus killing the peer's related > *ReplicationSource* instance, it may leave > *ReplicationSourceManager.totalBufferUsed* inconsistent. This can happen if > *ReplicationSourceWALReader* had put some entries on its queue to be > processed by *ReplicationSourceShipper*, but the peer removal killed the > shipper before it could process the pending entries. When the > *ReplicationSourceWALReader* thread adds entries to the queue, it increments > *ReplicationSourceManager.totalBufferUsed* with the sum of the entry sizes. > When those entries are read by *ReplicationSourceShipper*, > *ReplicationSourceManager.totalBufferUsed* is then decreased. We should also > decrease *ReplicationSourceManager.totalBufferUsed* when *ReplicationSource* > is terminated, otherwise the size of those unprocessed entries would keep consuming > *ReplicationSourceManager.totalBufferUsed* indefinitely, unless the RS gets > restarted.
This may be a problem for deployments with multiple peers, or if > new peers are added. -- This message was sent by Atlassian Jira (v8.3.4#803005)
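The accounting described in the issue can be sketched in isolation. This is a simplified model and not HBase code: the class and method names are hypothetical, entries are reduced to their sizes, and only the counter name totalBufferUsed is borrowed from the description. The point is that whatever is still queued when the source terminates must be subtracted from the shared counter.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BufferAccountingSketch {

  // Shared usage counter, standing in for ReplicationSourceManager.totalBufferUsed.
  final AtomicLong totalBufferUsed = new AtomicLong();
  // Queue between the reader and the shipper; each element is an entry size in bytes.
  final LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>();

  /** Reader side: enqueue an entry and charge its size to the shared counter. */
  void readerEnqueue(long sizeBytes) {
    queue.add(sizeBytes);
    totalBufferUsed.addAndGet(sizeBytes);
  }

  /** Shipper side: take one entry and release its size. Returns false if the queue is empty. */
  boolean shipperShipOne() {
    Long size = queue.poll();
    if (size == null) {
      return false;
    }
    totalBufferUsed.addAndGet(-size);
    return true;
  }

  /** On termination: drain whatever the shipper never processed and release it too. */
  long terminate() {
    long released = 0;
    Long size;
    while ((size = queue.poll()) != null) {
      released += size;
    }
    totalBufferUsed.addAndGet(-released);
    return released;
  }
}
```

Without the terminate() step, the sizes of the drained entries would stay charged to the counter until the process restarts, which is the leak the issue describes.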
[jira] [Updated] (HBASE-25117) ReplicationSourceShipper thread can not be finished
[ https://issues.apache.org/jira/browse/HBASE-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun Xin updated HBASE-25117:
---
Description:
See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], some replication UTs failed because of timeouts.

In [HBaseInterClusterReplicationEndpoint.sleepForRetries|https://github.com/apache/hbase/blob/78ae1f176d4215dcc34067ed25d786a4fcd4d888/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L203], InterruptedException is caught but not processed further, so the interrupted status of the current thread stays cleared.

Below is the code comment of Thread.sleep.
{code:java}
/**
 * ...
 *
 * @throws InterruptedException
 *         if any thread has interrupted the current thread. The
 *         interrupted status of the current thread is
 *         cleared when this exception is thrown.
 */
public static native void sleep(long millis) throws InterruptedException;
{code}
So InterruptedException must be processed, otherwise the ReplicationSourceShipper thread cannot be terminated in some cases.

was:
See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], some replication UTs failed because of timeouts.

> ReplicationSourceShipper thread can not be finished
> ---
>
> Key: HBASE-25117
> URL: https://issues.apache.org/jira/browse/HBASE-25117
> Project: HBase
> Issue Type: Bug
> Reporter: Sun Xin
> Assignee: Sun Xin
> Priority: Major
>
> See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], some replication UTs failed because of timeouts.
> In [HBaseInterClusterReplicationEndpoint.sleepForRetries|https://github.com/apache/hbase/blob/78ae1f176d4215dcc34067ed25d786a4fcd4d888/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L203], InterruptedException is caught but not processed further, so the interrupted status of the current thread stays cleared.
> Below is the code comment of Thread.sleep.
> {code:java}
> /**
>  * ...
>  *
>  * @throws InterruptedException
>  *         if any thread has interrupted the current thread. The
>  *         interrupted status of the current thread is
>  *         cleared when this exception is thrown.
>  */
> public static native void sleep(long millis) throws InterruptedException;
> {code}
> So InterruptedException must be processed, otherwise the ReplicationSourceShipper thread cannot be terminated in some cases.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
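The effect described above is easy to reproduce outside HBase. The sketch below is a standalone demonstration with hypothetical names. It shows one common way of "processing" the exception, namely restoring the interrupt flag, which is not necessarily what the actual fix does.

```java
public class InterruptRestoreDemo {

  /**
   * Sleeps with an interrupt already pending, then reports whether the
   * thread still looks interrupted afterwards.
   */
  static boolean interruptedAfterSleep(boolean restoreFlag) {
    Thread.currentThread().interrupt();        // simulate a terminate request
    try {
      Thread.sleep(10);                        // throws at once and clears the flag
    } catch (InterruptedException e) {
      if (restoreFlag) {
        Thread.currentThread().interrupt();    // put the flag back for the caller
      }
      // otherwise the interrupt is silently swallowed
    }
    return Thread.interrupted();               // reads and clears, leaving the thread clean
  }

  public static void main(String[] args) {
    System.out.println("swallowed: " + interruptedAfterSleep(false)); // prints "swallowed: false"
    System.out.println("restored: " + interruptedAfterSleep(true));   // prints "restored: true"
  }
}
```

A loop that decides whether to exit by checking the interrupted status only sees the termination request in the second case, which is why a swallowed InterruptedException can keep a thread alive.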
[jira] [Updated] (HBASE-25117) ReplicationSourceShipper thread can not be finished
[ https://issues.apache.org/jira/browse/HBASE-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25117: Description: See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], some UTs about replication failed cause timeout. was:See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], some UTs about replication failed cause timeout. > ReplicationSourceShipper thread can not be finished > --- > > Key: HBASE-25117 > URL: https://issues.apache.org/jira/browse/HBASE-25117 > Project: HBase > Issue Type: Bug >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > See [Flaky > Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console], > some UTs about replication failed cause timeout. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24813) ReplicationSource should clear buffer usage on ReplicationSourceManager upon termination
[ https://issues.apache.org/jira/browse/HBASE-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-24813: Attachment: image-2020-10-09-10-50-00-372.png > ReplicationSource should clear buffer usage on ReplicationSourceManager upon > termination > > > Key: HBASE-24813 > URL: https://issues.apache.org/jira/browse/HBASE-24813 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.1, 2.2.6 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.3, 2.4.0, 2.2.7 > > Attachments: TestReplicationSyncUpTool.log, > image-2020-10-09-10-50-00-372.png > > > Following investigations on the issue described by [~elserj] on HBASE-24779, > we found out that once a peer is removed, thus killing the peer's related > *ReplicationSource* instance, it may leave > *ReplicationSourceManager.totalBufferUsed* inconsistent. This can happen if > *ReplicationSourceWALReader* had put some entries on its queue to be > processed by *ReplicationSourceShipper*, but the peer removal killed the > shipper before it could process the pending entries. When the > *ReplicationSourceWALReader* thread adds entries to the queue, it increments > *ReplicationSourceManager.totalBufferUsed* with the sum of the entry sizes. > When those entries are read by *ReplicationSourceShipper*, > *ReplicationSourceManager.totalBufferUsed* is then decreased. We should also > decrease *ReplicationSourceManager.totalBufferUsed* when *ReplicationSource* > is terminated, otherwise the size of those unprocessed entries would keep consuming > *ReplicationSourceManager.totalBufferUsed* indefinitely, unless the RS gets > restarted. This may be a problem for deployments with multiple peers, or if > new peers are added. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24813) ReplicationSource should clear buffer usage on ReplicationSourceManager upon termination
[ https://issues.apache.org/jira/browse/HBASE-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210572#comment-17210572 ] Sun Xin commented on HBASE-24813: - Would using isActive() instead of isAlive() in this [PR|https://github.com/apache/hbase/pull/2191/files#] work? !image-2020-10-09-10-50-00-372.png! > ReplicationSource should clear buffer usage on ReplicationSourceManager upon > termination > > > Key: HBASE-24813 > URL: https://issues.apache.org/jira/browse/HBASE-24813 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.1, 2.2.6 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Major > Fix For: 3.0.0-alpha-1, 2.3.3, 2.4.0, 2.2.7 > > Attachments: TestReplicationSyncUpTool.log, > image-2020-10-09-10-50-00-372.png > > > Following investigations on the issue described by [~elserj] on HBASE-24779, > we found out that once a peer is removed, thus killing the peer's related > *ReplicationSource* instance, it may leave > *ReplicationSourceManager.totalBufferUsed* inconsistent. This can happen if > *ReplicationSourceWALReader* had put some entries on its queue to be > processed by *ReplicationSourceShipper*, but the peer removal killed the > shipper before it could process the pending entries. When the > *ReplicationSourceWALReader* thread adds entries to the queue, it increments > *ReplicationSourceManager.totalBufferUsed* by the sum of the entry sizes. > When those entries are read by *ReplicationSourceShipper*, > *ReplicationSourceManager.totalBufferUsed* is then decreased. We should also > decrease *ReplicationSourceManager.totalBufferUsed* when *ReplicationSource* > is terminated; otherwise the size of those unprocessed entries would keep consuming > *ReplicationSourceManager.totalBufferUsed* indefinitely, unless the RS gets > restarted. This may be a problem for deployments with multiple peers, or if > new peers are added. -- This message was sent by Atlassian Jira (v8.3.4#803005)
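The buffer accounting described in HBASE-24813 can be sketched in a few lines. This is a minimal, single-threaded illustration and not HBase code: the classes `Manager` and `Source`, and the methods `readerEnqueue`, `shipOne` and `terminate`, are hypothetical stand-ins for ReplicationSourceManager, ReplicationSource, the WAL reader and the shipper. It only shows why a shared counter stays inflated when a source is terminated with entries still queued, and how releasing the pending sizes on termination restores it.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

public class BufferQuotaSketch {
    // Hypothetical stand-in for the manager that owns the shared totalBufferUsed counter.
    static class Manager {
        final AtomicLong totalBufferUsed = new AtomicLong();
    }

    // Hypothetical stand-in for one replication source and its reader queue.
    // Single-threaded on purpose, so a plain ArrayDeque is enough here.
    static class Source {
        private final Manager manager;
        private final Queue<Long> pendingEntrySizes = new ArrayDeque<>();

        Source(Manager manager) {
            this.manager = manager;
        }

        // Reader side: queue an entry and charge its size to the shared counter.
        void readerEnqueue(long entrySize) {
            pendingEntrySizes.add(entrySize);
            manager.totalBufferUsed.addAndGet(entrySize);
        }

        // Shipper side: take one entry off the queue and release its size.
        boolean shipOne() {
            Long size = pendingEntrySizes.poll();
            if (size == null) {
                return false;
            }
            manager.totalBufferUsed.addAndGet(-size);
            return true;
        }

        // Termination: drop whatever is still queued. With releasePending == false
        // the dropped sizes stay charged to the counter (the leak in the report);
        // with releasePending == true they are given back.
        void terminate(boolean releasePending) {
            Long size;
            while ((size = pendingEntrySizes.poll()) != null) {
                if (releasePending) {
                    manager.totalBufferUsed.addAndGet(-size);
                }
            }
        }
    }

    public static void main(String[] args) {
        Manager leaky = new Manager();
        Source a = new Source(leaky);
        a.readerEnqueue(100);
        a.readerEnqueue(50);
        a.shipOne();          // releases 100
        a.terminate(false);   // 50 is never released
        System.out.println("without release on terminate: " + leaky.totalBufferUsed.get());

        Manager fixed = new Manager();
        Source b = new Source(fixed);
        b.readerEnqueue(100);
        b.readerEnqueue(50);
        b.shipOne();
        b.terminate(true);    // pending 50 is released
        System.out.println("with release on terminate: " + fixed.totalBufferUsed.get());
    }
}
```

In the leaky case the counter stays at the size of the unshipped entry until the process restarts, which matches the "unless the RS gets restarted" remark in the description.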
[jira] [Created] (HBASE-25117) ReplicationSourceShipper thread can not be finished
Sun Xin created HBASE-25117: --- Summary: ReplicationSourceShipper thread can not be finished Key: HBASE-25117 URL: https://issues.apache.org/jira/browse/HBASE-25117 Project: HBase Issue Type: Bug Reporter: Sun Xin Assignee: Sun Xin See [Flaky Tests|https://ci-hadoop.apache.org/job/HBase/job/HBase-Flaky-Tests/job/master/161/console]; some replication UTs failed because of timeouts. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25113) [testing] HBaseCluster support ReplicationServer for UTs
Sun Xin created HBASE-25113: --- Summary: [testing] HBaseCluster support ReplicationServer for UTs Key: HBASE-25113 URL: https://issues.apache.org/jira/browse/HBASE-25113 Project: HBase Issue Type: Sub-task Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25100) conf and conn are assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint
[ https://issues.apache.org/jira/browse/HBASE-25100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25100: Description: In [HBaseReplicationEndpoint.init()|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/HBaseReplicationEndpoint.java#L109] and [HBaseInterClusterReplicationEndpoint.init|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L145] (the latter is a subclass of the former), conf and conn are assigned twice. was: In [HBaseReplicationEndpoint.init()|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/HBaseReplicationEndpoint.java#L109] and [HBaseInterClusterReplicationEndpoint.init|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L145] (the latter is a subclass of the former), conn is assigned twice. > conf and conn are assigned twice in HBaseReplicationEndpoint and > HBaseInterClusterReplicationEndpoint > - > > Key: HBASE-25100 > URL: https://issues.apache.org/jira/browse/HBASE-25100 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > In > [HBaseReplicationEndpoint.init()|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/HBaseReplicationEndpoint.java#L109] > and > [HBaseInterClusterReplicationEndpoint.init|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L145] > (the latter is a subclass of the former), conf and conn are assigned twice. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
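The pattern reported in HBASE-25100 can be illustrated with a small sketch. It is not the HBase source: `Context`, `BaseEndpoint`, `RedundantEndpoint` and `FixedEndpoint` are hypothetical names, and the assumption that the second assignment opens a second connection is made only to make the cost of the duplication visible.

```java
public class DoubleInitSketch {
    // Hypothetical context that hands out a configuration and counts connections.
    static class Context {
        final String conf = "conf";
        int connectionsCreated = 0;

        String createConnection() {
            connectionsCreated++;
            return "conn-" + connectionsCreated;
        }
    }

    // Hypothetical base endpoint: init() assigns both fields once.
    static class BaseEndpoint {
        protected String conf;
        protected String conn;

        void init(Context ctx) {
            this.conf = ctx.conf;
            this.conn = ctx.createConnection();
        }
    }

    // Subclass that repeats the assignments already done by super.init().
    static class RedundantEndpoint extends BaseEndpoint {
        @Override
        void init(Context ctx) {
            super.init(ctx);
            this.conf = ctx.conf;               // second assignment of conf
            this.conn = ctx.createConnection(); // second assignment of conn
        }
    }

    // Subclass that relies on the fields set by the parent.
    static class FixedEndpoint extends BaseEndpoint {
        @Override
        void init(Context ctx) {
            super.init(ctx);
            // subclass-specific setup would go here
        }
    }

    public static void main(String[] args) {
        Context c1 = new Context();
        new RedundantEndpoint().init(c1);
        System.out.println("redundant init created " + c1.connectionsCreated + " connections");

        Context c2 = new Context();
        new FixedEndpoint().init(c2);
        System.out.println("fixed init created " + c2.connectionsCreated + " connections");
    }
}
```

Under the stated assumption, the first connection in the redundant case is overwritten and never used; keeping the assignments in one place avoids that.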
[jira] [Updated] (HBASE-25100) conf and conn are assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint
[ https://issues.apache.org/jira/browse/HBASE-25100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25100: Summary: conf and conn are assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint (was: conn is assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint) > conf and conn are assigned twice in HBaseReplicationEndpoint and > HBaseInterClusterReplicationEndpoint > - > > Key: HBASE-25100 > URL: https://issues.apache.org/jira/browse/HBASE-25100 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > In > [HBaseReplicationEndpoint.init()|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/HBaseReplicationEndpoint.java#L109] > and > [HBaseInterClusterReplicationEndpoint.init|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L145] > (the latter is a subclass of the former), conn is assigned twice. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25100) conn is assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint
Sun Xin created HBASE-25100: --- Summary: conn is assigned twice in HBaseReplicationEndpoint and HBaseInterClusterReplicationEndpoint Key: HBASE-25100 URL: https://issues.apache.org/jira/browse/HBASE-25100 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 In [HBaseReplicationEndpoint.init()|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/HBaseReplicationEndpoint.java#L109] and [HBaseInterClusterReplicationEndpoint.init|https://github.com/apache/hbase/blob/c312760819ed185cab3a0717a1ea0ff6e8c47a23/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L145] (the latter is a subclass of the former), conn is assigned twice. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (HBASE-25098) ReplicationStatisticsChore runs in wrong time unit
[ https://issues.apache.org/jira/browse/HBASE-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-25098 started by Sun Xin. --- > ReplicationStatisticsChore runs in wrong time unit > -- > > Key: HBASE-25098 > URL: https://issues.apache.org/jira/browse/HBASE-25098 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25098) ReplicationStatisticsChore runs in wrong time unit
Sun Xin created HBASE-25098: --- Summary: ReplicationStatisticsChore runs in wrong time unit Key: HBASE-25098 URL: https://issues.apache.org/jira/browse/HBASE-25098 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (HBASE-25014) ScheduledChore is never triggered when initalDelay > 1.5*period
[ https://issues.apache.org/jira/browse/HBASE-25014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-25014 started by Sun Xin. --- > ScheduledChore is never triggered when initalDelay > 1.5*period > --- > > Key: HBASE-25014 > URL: https://issues.apache.org/jira/browse/HBASE-25014 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0-alpha-1, 2.2.3, 2.2.4, 2.2.5 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > In our recent tests, ScheduledChore is never triggered when initialDelay > > 1.5 * period. > The cause of the bug is the following: > The trigger time for a ScheduledChore must fall within an acceptable time window > of 1.5 * period; see > [here|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ScheduledChore.java#L234]. > timeOfLastRun and timeOfThisRun are two variables that record two adjacent > trigger times. [The first initialization of > timeOfThisRun|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ScheduledChore.java#L273] > happens when the ScheduledChore is created, so it is not a real trigger time. > If we set initialDelay > 1.5 * period, then after initialDelay the first trigger of the > chore has already exceeded the allowed window. The chore is then [cancelled > and scheduled > again|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ChoreService.java#L176]. > So it is stuck in a loop when initialDelay > 1.5 * period: > 1. Initialize timeOfThisRun at the wrong time. > 2. Wait for initialDelay. > 3. The chore triggers, but outside the allowed window. > 4. Cancel the chore and schedule it again. > 5. Go to step 1. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25014) ScheduledChore is never triggered when initalDelay > 1.5*period
Sun Xin created HBASE-25014: --- Summary: ScheduledChore is never triggered when initalDelay > 1.5*period Key: HBASE-25014 URL: https://issues.apache.org/jira/browse/HBASE-25014 Project: HBase Issue Type: Bug Affects Versions: 2.2.5, 2.2.4, 2.2.3, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 In our recent tests, ScheduledChore is never triggered when initialDelay > 1.5 * period. The cause of the bug is the following: The trigger time for a ScheduledChore must fall within an acceptable time window of 1.5 * period; see [here|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ScheduledChore.java#L234]. timeOfLastRun and timeOfThisRun are two variables that record two adjacent trigger times. [The first initialization of timeOfThisRun|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ScheduledChore.java#L273] happens when the ScheduledChore is created, so it is not a real trigger time. If we set initialDelay > 1.5 * period, then after initialDelay the first trigger of the chore has already exceeded the allowed window. The chore is then [cancelled and scheduled again|https://github.com/apache/hbase/blob/e5ca9adc54f9f580f85d21d38217afa97aa79d68/hbase-common/src/main/java/org/apache/hadoop/hbase/ChoreService.java#L176]. So it is stuck in a loop when initialDelay > 1.5 * period: 1. Initialize timeOfThisRun at the wrong time. 2. Wait for initialDelay. 3. The chore triggers, but outside the allowed window. 4. Cancel the chore and schedule it again. 5. Go to step 1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
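The five-step loop in HBASE-25014 can be reproduced with a simplified model. The sketch below is not the ScheduledChore or ChoreService implementation: `withinWindow` and `simulate` are hypothetical helpers, time is a plain counter in milliseconds, and the window check is reduced to "gap between two recorded trigger times is at most 1.5 * period", as the description states.

```java
public class ChoreWindowSketch {
    // Simplified form of the window check: the gap between the previous
    // recorded trigger time and this one must not exceed 1.5 * period.
    static boolean withinWindow(long timeOfLastRun, long timeOfThisRun, long period) {
        return (timeOfThisRun - timeOfLastRun) <= (long) (1.5 * period);
    }

    // Returns 1 if the chore gets to run within the given number of
    // scheduling attempts, 0 if every attempt is rejected and rescheduled.
    static int simulate(long initialDelay, long period, int attempts) {
        long now = 0;
        for (int i = 0; i < attempts; i++) {
            long timeOfThisRun = now;            // step 1: set at creation/reschedule, not a real trigger
            now += initialDelay;                 // step 2: wait for initialDelay
            long timeOfLastRun = timeOfThisRun;  // step 3: the chore triggers and shifts the timestamps
            timeOfThisRun = now;
            if (withinWindow(timeOfLastRun, timeOfThisRun, period)) {
                return 1;                        // the chore actually runs
            }
            // step 4: window missed, cancel and schedule again; step 5: back to step 1
        }
        return 0;
    }

    public static void main(String[] args) {
        // initialDelay <= 1.5 * period: the first trigger is accepted.
        System.out.println(simulate(1000, 1000, 5));
        // initialDelay > 1.5 * period: every trigger is rejected.
        System.out.println(simulate(2000, 1000, 5));
    }
}
```

Because each reschedule resets the reference timestamp and then waits the full initialDelay again, the gap is always initialDelay, so the check fails on every attempt once initialDelay exceeds 1.5 * period.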
[jira] [Updated] (HBASE-25012) HBASE-24359 causes replication missed log of some RemoteException
[ https://issues.apache.org/jira/browse/HBASE-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25012: Description: HBASE-24359 broke the exception-handling logic. In branch-2, it even causes some RemoteExceptions to go unlogged. [File changed|https://github.com/apache/hbase/pull/1855/files#diff-1e3f171b19474698601a0752b618af0eL435] in branch-2. !image-2020-09-11-14-30-27-898.png! was:[HBASE-24359|https://issues.apache.org/jira/browse/HBASE-24359] broke the exception-handling logic. In branch-2, it even causes some RemoteExceptions to go unlogged. > HBASE-24359 causes replication missed log of some RemoteException > - > > Key: HBASE-25012 > URL: https://issues.apache.org/jira/browse/HBASE-25012 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.0, 2.3.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > Attachments: image-2020-09-11-14-30-27-898.png > > > HBASE-24359 broke the exception-handling logic. In branch-2, it even > causes some RemoteExceptions to go unlogged. > [File > changed|https://github.com/apache/hbase/pull/1855/files#diff-1e3f171b19474698601a0752b618af0eL435] > in branch-2. > !image-2020-09-11-14-30-27-898.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-25012) HBASE-24359 causes replication missed log of some RemoteException
[ https://issues.apache.org/jira/browse/HBASE-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-25012: Attachment: image-2020-09-11-14-30-27-898.png > HBASE-24359 causes replication missed log of some RemoteException > - > > Key: HBASE-25012 > URL: https://issues.apache.org/jira/browse/HBASE-25012 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.0, 2.3.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > Attachments: image-2020-09-11-14-30-27-898.png > > > [HBASE-24359|https://issues.apache.org/jira/browse/HBASE-24359] broke the > exception-handling logic. In branch-2, it even causes some RemoteExceptions > to go unlogged. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (HBASE-25012) HBASE-24359 causes replication missed log of some RemoteException
[ https://issues.apache.org/jira/browse/HBASE-25012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-25012 started by Sun Xin. --- > HBASE-24359 causes replication missed log of some RemoteException > - > > Key: HBASE-25012 > URL: https://issues.apache.org/jira/browse/HBASE-25012 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 3.0.0-alpha-1, 2.3.0, 2.3.1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > [HBASE-24359|https://issues.apache.org/jira/browse/HBASE-24359] broke the > exception-handling logic. In branch-2, it even causes some RemoteExceptions > to go unlogged. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-25012) HBASE-24359 causes replication missed log of some RemoteException
Sun Xin created HBASE-25012: --- Summary: HBASE-24359 causes replication missed log of some RemoteException Key: HBASE-25012 URL: https://issues.apache.org/jira/browse/HBASE-25012 Project: HBase Issue Type: Bug Components: Replication Affects Versions: 2.3.1, 2.3.0, 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin Fix For: 3.0.0-alpha-1 [HBASE-24359|https://issues.apache.org/jira/browse/HBASE-24359] broke the exception-handling logic. In branch-2, it even causes some RemoteExceptions to go unlogged. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (HBASE-24999) Master manages ReplicationServers
[ https://issues.apache.org/jira/browse/HBASE-24999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-24999 started by Sun Xin. --- > Master manages ReplicationServers > - > > Key: HBASE-24999 > URL: https://issues.apache.org/jira/browse/HBASE-24999 > Project: HBase > Issue Type: Sub-task > Components: Replication >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > > [HBASE-24683|https://issues.apache.org/jira/browse/HBASE-24683] added an > isolated ReplicationServer. > This issue will: > # Have ReplicationServer report to Master periodically. > # Add a basic ReplicationServerManager in Master to manage ReplicationServers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24999) Master manages ReplicationServers
Sun Xin created HBASE-24999: --- Summary: Master manages ReplicationServers Key: HBASE-24999 URL: https://issues.apache.org/jira/browse/HBASE-24999 Project: HBase Issue Type: Sub-task Components: Replication Affects Versions: 3.0.0-alpha-1 Reporter: Sun Xin Assignee: Sun Xin [HBASE-24683|https://issues.apache.org/jira/browse/HBASE-24683] added an isolated ReplicationServer. This issue will: # Have ReplicationServer report to Master periodically. # Add a basic ReplicationServerManager in Master to manage ReplicationServers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24982) Disassemble the method replicateWALEntry from AdminService to a new interface ReplicationServerService
[ https://issues.apache.org/jira/browse/HBASE-24982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-24982: Summary: Disassemble the method replicateWALEntry from AdminService to a new interface ReplicationServerService (was: Disassemble the method replicateWALEntry from AdminService to a new interface ReplicationSinkService) > Disassemble the method replicateWALEntry from AdminService to a new interface > ReplicationServerService > -- > > Key: HBASE-24982 > URL: https://issues.apache.org/jira/browse/HBASE-24982 > Project: HBase > Issue Type: Sub-task > Components: Replication >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-24982) Disassemble the method replicateWALEntry from AdminService to a new interface ReplicationSinkService
Sun Xin created HBASE-24982: --- Summary: Disassemble the method replicateWALEntry from AdminService to a new interface ReplicationSinkService Key: HBASE-24982 URL: https://issues.apache.org/jira/browse/HBASE-24982 Project: HBase Issue Type: Sub-task Components: Replication Reporter: Sun Xin Assignee: Sun Xin -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-24683) Add a basic ReplicationServer which only implement ReplicationSink Service
[ https://issues.apache.org/jira/browse/HBASE-24683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin resolved HBASE-24683. - Resolution: Resolved > Add a basic ReplicationServer which only implement ReplicationSink Service > -- > > Key: HBASE-24683 > URL: https://issues.apache.org/jira/browse/HBASE-24683 > Project: HBase > Issue Type: Sub-task >Reporter: Guanghao Zhang >Assignee: Sun Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HBASE-24759) Refuse to update configuration of default group
[ https://issues.apache.org/jira/browse/HBASE-24759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190576#comment-17190576 ] Sun Xin commented on HBASE-24759: - Thanks [~zghao] for reviewing, I've opened a PR for branch-2. > Refuse to update configuration of default group > --- > > Key: HBASE-24759 > URL: https://issues.apache.org/jira/browse/HBASE-24759 > Project: HBase > Issue Type: Bug > Components: rsgroup >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > Currently, we don't store the default rsgroup information. But > HBASE-24431 added a config map, which needs to be persisted to > avoid losing the configuration of the default rsgroup. > So refuse to update the configuration of the default group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24759) Refuse to update configuration of default group
[ https://issues.apache.org/jira/browse/HBASE-24759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-24759: Summary: Refuse to update configuration of default group (was: Persisting configuration of default rsgroup) > Refuse to update configuration of default group > --- > > Key: HBASE-24759 > URL: https://issues.apache.org/jira/browse/HBASE-24759 > Project: HBase > Issue Type: Bug > Components: rsgroup >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > Currently, we don't store the default rsgroup information. But > HBASE-24431 added a config map, which needs to be persisted to > avoid losing the configuration of the default rsgroup. > So refuse to update the configuration of the default group. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HBASE-24759) Persisting configuration of default rsgroup
[ https://issues.apache.org/jira/browse/HBASE-24759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Xin updated HBASE-24759: Description: Currently, we don't store the default rsgroup information. But HBASE-24431 added a config map, which needs to be persisted to avoid losing the configuration of the default rsgroup. So refuse to update the configuration of the default group. was:Currently, we don't store the default rsgroup information. But HBASE-24431 added a config map, which needs to be persisted to avoid losing the configuration of the default rsgroup. > Persisting configuration of default rsgroup > --- > > Key: HBASE-24759 > URL: https://issues.apache.org/jira/browse/HBASE-24759 > Project: HBase > Issue Type: Bug > Components: rsgroup >Affects Versions: 3.0.0-alpha-1 >Reporter: Sun Xin >Assignee: Sun Xin >Priority: Major > Fix For: 3.0.0-alpha-1 > > > Currently, we don't store the default rsgroup information. But > HBASE-24431 added a config map, which needs to be persisted to > avoid losing the configuration of the default rsgroup. > So refuse to update the configuration of the default group. -- This message was sent by Atlassian Jira (v8.3.4#803005)