[jira] [Resolved] (HBASE-15539) HBase Client region location is expensive
[ https://issues.apache.org/jira/browse/HBASE-15539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-15539. --- Resolution: Later No progress. Resolving as 'Later'. > HBase Client region location is expensive > -- > > Key: HBASE-15539 > URL: https://issues.apache.org/jira/browse/HBASE-15539 > Project: HBase > Issue Type: Sub-task > Components: Client >Reporter: Vladimir Rodionov >Assignee: Mikhail Antonov >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > ConnectionImplementation.locateRegion and MetaCache.getTableLocations are hot > spots in a client. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23676) Address feedback on HBASE-23055 Alter hbase:meta.
[ https://issues.apache.org/jira/browse/HBASE-23676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23676. --- Resolution: Won't Fix The parent HBASE-23055 was recast. This issue is no longer relevant (the feedback was addressed in the new patch attached to HBASE-23055). > Address feedback on HBASE-23055 Alter hbase:meta. > - > > Key: HBASE-23676 > URL: https://issues.apache.org/jira/browse/HBASE-23676 > Project: HBase > Issue Type: Bug > Components: meta >Affects Versions: 2.3.0 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > Good feedback on HBASE-23055 came in after merge from [~zhangduo]. Opening > this issue to address it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-18326) Fix and reenable TestMasterProcedureWalLease
[ https://issues.apache.org/jira/browse/HBASE-18326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-18326. --- Assignee: Szabolcs Bukros Resolution: Fixed Resolving because test was removed. Thanks [~bszabolcs]. Assigned ticket to you as you did research. > Fix and reenable TestMasterProcedureWalLease > > > Key: HBASE-18326 > URL: https://issues.apache.org/jira/browse/HBASE-18326 > Project: HBase > Issue Type: Sub-task > Components: test >Reporter: Michael Stack >Assignee: Szabolcs Bukros >Priority: Blocker > Fix For: 3.0.0, 2.3.0 > > > Fix and reenable flakey important test. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23612) Update pom.xml to use another 2.5.0 protoc as external protobuf
[ https://issues.apache.org/jira/browse/HBASE-23612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23612. --- Fix Version/s: 3.0.0 Hadoop Flags: Reviewed Assignee: zhao bo Resolution: Fixed Merged. Thanks for the patch [~bzhaoopenstack] > Update pom.xml to use another 2.5.0 protoc as external protobuf > --- > > Key: HBASE-23612 > URL: https://issues.apache.org/jira/browse/HBASE-23612 > Project: HBase > Issue Type: Sub-task > Components: build >Reporter: zhao bo >Assignee: zhao bo >Priority: Major > Fix For: 3.0.0 > > > Currently, there is no protoc 2.5.0 release for ARM [1], so we can build an > ARM-specific one to make sure the build works on ARM. > We will introduce a new ARM artifact for protoc; its group_id is > org.openlabtesting.protobuf. It is used only by protobuf-maven-plugin to > compile .proto files. The 3.X versions of protoc already support ARM, so > this won't affect the internal protoc usage, which is 3.5.1-1 now. > > [1][https://github.com/protocolbuffers/protobuf/issues/3844#issuecomment-343355946] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23690) Checkstyle plugin complains about our checkstyle.xml format; doc how to resolve mismatched version
[ https://issues.apache.org/jira/browse/HBASE-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23690. --- Fix Version/s: 3.0.0 Assignee: Michael Stack Resolution: Fixed Resolving. Made a sub-issue to address the update of the checkstyle plugin and to do Nick's nice suggestion above. Thanks for reviews. > Checkstyle plugin complains about our checkstyle.xml format; doc how to > resolve mismatched version > -- > > Key: HBASE-23690 > URL: https://issues.apache.org/jira/browse/HBASE-23690 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Trivial > Fix For: 3.0.0 > > > Trying to add the checkstyle.xml to the intellij checkstyle plugin after > reading HBASE-23688, it complains with the following when it reads in the > config file: > {code} > com.puppycrawl.tools.checkstyle.api.CheckstyleException: cannot initialize > module TreeWalker - TreeWalker is not allowed as a parent of LineLength > Please review 'Parent Module' section for this Check in web documentation if > Check is standard. 
> at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:473) > at > com.puppycrawl.tools.checkstyle.api.AutomaticBean.configure(AutomaticBean.java:198) > at > org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:61) > at > org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:26) > at > org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.executeCommand(CheckstyleActionsImpl.java:130) > at > org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:60) > at > org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:51) > at > org.infernus.idea.checkstyle.checker.CheckerFactoryWorker.run(CheckerFactoryWorker.java:46) > Caused by: com.puppycrawl.tools.checkstyle.api.CheckstyleException: > TreeWalker is not allowed as a parent of LineLength Please review 'Parent > Module' section for this Check in web documentation if Check is standard. > at > com.puppycrawl.tools.checkstyle.TreeWalker.setupChild(TreeWalker.java:147) > at > com.puppycrawl.tools.checkstyle.api.AutomaticBean.configure(AutomaticBean.java:198) > at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:468) > ... 7 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23706) Update checkstyle plugin (and update checkstyle.xml to match)
Michael Stack created HBASE-23706: - Summary: Update checkstyle plugin (and update checkstyle.xml to match) Key: HBASE-23706 URL: https://issues.apache.org/jira/browse/HBASE-23706 Project: HBase Issue Type: Sub-task Reporter: Michael Stack In the parent issue, it's suggested we update our checkstyle plugin to at least match the IntelliJ plugin default. This will need checkstyle.xml changes, else it fails to parse (see notes in parent by [~bharathv] on what needs doing). For extra points, do the [~ndimiduk] suggestion: "...It would be nice also if we could commit the .idea/checkstyle-idea.xml file with the checkstyle version used by the plugin pinned to the same version as we're using in maven." -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23705) Add CellComparator to HFileContext
Michael Stack created HBASE-23705: - Summary: Add CellComparator to HFileContext Key: HBASE-23705 URL: https://issues.apache.org/jira/browse/HBASE-23705 Project: HBase Issue Type: Sub-task Components: io Reporter: Michael Stack Assignee: Michael Stack Fix For: 3.0.0, 2.3.0 The HFileContext is present when reading and writing files. It is populated at read time using HFile trailer content and file metadata. At write time, we create it up front. Interestingly, though CellComparator is written to the HFile trailer, and parsing the trailer creates an HFileInfo which builds the HFileContext at read time, the HFileContext does not expose which CellComparator to use when decoding and seeking. Around the codebase there are various compensations made for this lack, with decoders that actually have a decoding context (with a reference to the HFileContext) hard-coding use of the default CellComparator. StoreFileInfo will use the default if not passed a comparator (even though we'd just read the trailer) and HFile itself is similar. Let me fix this situation, removing the ambiguity. It will also fix bugs in the parent issue where UTs are failing because the wrong CellComparator is being used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23697) Document new RegionProcedureStore operation and migration
Michael Stack created HBASE-23697: - Summary: Document new RegionProcedureStore operation and migration Key: HBASE-23697 URL: https://issues.apache.org/jira/browse/HBASE-23697 Project: HBase Issue Type: Sub-task Components: documentation Affects Versions: 2.3.0 Reporter: Michael Stack Add a few notes to the refguide on the new RegionProcedureStore, how it works, how it differs from WALPS, and note it auto-migrates and there should be new issue moving on to the new store. Mention the configuration. Mention it is on WALFS even though it is a 'Region', etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23696) Stop WALProcedureStore after migration finishes
[ https://issues.apache.org/jira/browse/HBASE-23696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23696. --- Resolution: Duplicate Resolving as dupe of HBASE-23694 > Stop WALProcedureStore after migration finishes > --- > > Key: HBASE-23696 > URL: https://issues.apache.org/jira/browse/HBASE-23696 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > WALProcedureStore is left up with its sync thread running in the background > though we are done with it after starting it inside the migration method. Add > a stop when done. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23696) Stop WALProcedureStore after migration finishes
Michael Stack created HBASE-23696: - Summary: Stop WALProcedureStore after migration finishes Key: HBASE-23696 URL: https://issues.apache.org/jira/browse/HBASE-23696 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack Assignee: Michael Stack Fix For: 3.0.0, 2.3.0 WALProcedureStore is left up with its sync thread running in the background though we are done with it after starting it inside the migration method. Add a stop when done. -- This message was sent by Atlassian Jira (v8.3.4#803005)
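A minimal sketch of the "add stop when done" idea, not the actual patch. The names ProcStore, InMemoryStore, MigrationSketch and migrate are hypothetical stand-ins invented for illustration; they are not the HBase WALProcedureStore API. The only point shown is that the old store is stopped in a finally block so nothing it started outlives the migration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a procedure store; NOT the HBase WALProcedureStore API. */
interface ProcStore {
  void start();
  List<String> load();
  void stop();
}

/** Toy store that only tracks whether it is running (assumption, for illustration). */
final class InMemoryStore implements ProcStore {
  private boolean running;

  @Override public void start() { running = true; }
  @Override public List<String> load() { return Arrays.asList("proc-1", "proc-2"); }
  @Override public void stop() { running = false; }

  boolean isRunning() { return running; }
}

final class MigrationSketch {
  private MigrationSketch() {}

  /** Starts the old store, copies its content out, and always stops it afterwards. */
  static List<String> migrate(ProcStore oldStore) {
    List<String> migrated = new ArrayList<>();
    oldStore.start();
    try {
      migrated.addAll(oldStore.load());
    } finally {
      // Stop even if load() throws, so no background work is left running.
      oldStore.stop();
    }
    return migrated;
  }
}
```

Calling MigrationSketch.migrate(new InMemoryStore()) returns the loaded entries and leaves the store stopped.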
[jira] [Created] (HBASE-23690) Checkstyle plugin complains about our checkstyle.xml format
Michael Stack created HBASE-23690: - Summary: Checkstyle plugin complains about our checkstyle.xml format Key: HBASE-23690 URL: https://issues.apache.org/jira/browse/HBASE-23690 Project: HBase Issue Type: Bug Reporter: Michael Stack Trying to add the checkstyle.xml to the intellij checkstyle plugin after reading HBASE-23688, it complains with the following when it reads in the config file: {code} com.puppycrawl.tools.checkstyle.api.CheckstyleException: cannot initialize module TreeWalker - TreeWalker is not allowed as a parent of LineLength Please review 'Parent Module' section for this Check in web documentation if Check is standard. at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:473) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.configure(AutomaticBean.java:198) at org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:61) at org.infernus.idea.checkstyle.service.cmd.OpCreateChecker.execute(OpCreateChecker.java:26) at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.executeCommand(CheckstyleActionsImpl.java:130) at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:60) at org.infernus.idea.checkstyle.service.CheckstyleActionsImpl.createChecker(CheckstyleActionsImpl.java:51) at org.infernus.idea.checkstyle.checker.CheckerFactoryWorker.run(CheckerFactoryWorker.java:46) Caused by: com.puppycrawl.tools.checkstyle.api.CheckstyleException: TreeWalker is not allowed as a parent of LineLength Please review 'Parent Module' section for this Check in web documentation if Check is standard. at com.puppycrawl.tools.checkstyle.TreeWalker.setupChild(TreeWalker.java:147) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.configure(AutomaticBean.java:198) at com.puppycrawl.tools.checkstyle.Checker.setupChild(Checker.java:468) ... 7 more {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23689) Bookmark for github PR to jira redirection
[ https://issues.apache.org/jira/browse/HBASE-23689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23689. --- Fix Version/s: 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Resolving on merge. Nice one [~bharathv]. > Bookmark for github PR to jira redirection > --- > > Key: HBASE-23689 > URL: https://issues.apache.org/jira/browse/HBASE-23689 > Project: HBase > Issue Type: Sub-task > Components: tooling >Affects Versions: master >Reporter: Bharath Vissapragada >Assignee: Bharath Vissapragada >Priority: Minor > Fix For: 3.0.0 > > > Following is a simple js snippet that redirects from any HBase PR to its > corresponding jira. Without this, one has to copy the jira ID from the PR, > construct a jira URL manually and paste it in the browser URL bar. Saves a > bunch of clicks. > {code:javascript} > javascript:location.href='https://issues.apache.org/jira/browse/'+document.getElementsByClassName("js-issue-title")[0].innerHTML.match(/HBASE-\d+/)[0];{code} > Particularly helpful for reviewers who'd like to read the jira contents often > when reviewing a PR. > For chrome: > - Right Click on the bookmarks bar > - Click on Add page. Fill in the following details: > Name: HBase jira redirect (or any other that you prefer) > URL: – {{snippet from above}}-- > - Click Save > Now you should see "HBase jira redirect" (or any other name you gave) > bookmark on the bar. > Go to any Github PR, click on this button and it redirects to the > corresponding jira. -- This message was sent by Atlassian Jira (v8.3.4#803005)
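The extraction step the bookmarklet relies on (find the first HBASE-<digits> key in a PR title and append it to the Jira browse URL) can also be sketched outside the browser. The class and method names below (JiraLinkSketch, jiraUrl) are made up for illustration, and the input is assumed to be a plain PR title string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JiraLinkSketch {
  // One or more digits after "HBASE-", so the whole issue number is captured.
  private static final Pattern KEY = Pattern.compile("HBASE-\\d+");
  private static final String BROWSE = "https://issues.apache.org/jira/browse/";

  private JiraLinkSketch() {}

  /** Returns the Jira URL for the first issue key in the title, or empty if none is found. */
  static Optional<String> jiraUrl(String prTitle) {
    Matcher m = KEY.matcher(prTitle);
    return m.find() ? Optional.of(BROWSE + m.group()) : Optional.empty();
  }
}
```

Returning Optional.empty() for a title without a key avoids the failure the one-line bookmarklet would hit when match() finds nothing.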
[jira] [Resolved] (HBASE-23687) DEBUG logging cleanup
[ https://issues.apache.org/jira/browse/HBASE-23687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23687. --- Fix Version/s: 2.3.0 3.0.0 Hadoop Flags: Reviewed Assignee: Michael Stack Resolution: Fixed Merged to branch-2 and master. Thanks for review [~janh] > DEBUG logging cleanup > - > > Key: HBASE-23687 > URL: https://issues.apache.org/jira/browse/HBASE-23687 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Trivial > Fix For: 3.0.0, 2.3.0 > > > Minor cleanup of annoying loggings. For example, over an hour, we logged this > 200k times: > {code}2020-01-14 11:06:00,287 DEBUG > org.apache.hadoop.hbase.master.cleaner.LogCleaner: Exiting > {code} > There is no corresponding 'Starting'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23687) DEBUG logging cleanup
Michael Stack created HBASE-23687: - Summary: DEBUG logging cleanup Key: HBASE-23687 URL: https://issues.apache.org/jira/browse/HBASE-23687 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack Minor cleanup of annoying loggings. For example, over an hour, we logged this 200k times: {code}2020-01-14 11:06:00,287 DEBUG org.apache.hadoop.hbase.master.cleaner.LogCleaner: Exiting {code} There is no corresponding 'Starting'. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23685) indicates a last flushed sequence id ... that is less than the previous last flushed sequence id
Michael Stack created HBASE-23685: - Summary: indicates a last flushed sequence id ... that is less than the previous last flushed sequence id Key: HBASE-23685 URL: https://issues.apache.org/jira/browse/HBASE-23685 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack I'm getting loads of the below in Master log running tests against branch-2. It is heavily loaded but generally keeping up though there is backlog in WAL files > 32 ... around the cluster so forced flushes happening. I'll see the below for a Region even though it seems like we've since flushed out a sequenceid on the RS-side that is larger than what the Master is seeing. Two column family table. {code} 2020-01-13 23:33:18,455 WARN org.apache.hadoop.hbase.master.ServerManager: RegionServer hbasedn030.sp07.siri.apple.com,16020,1578934813139 indicates a last flushed sequence id (1593644) that is less than the previous last flushed sequence id (1593649) for region t1,f9371d5,1576227377175. 3e41deae849d25f0a2f1d654f482d73a. Ignoring. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
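The warning above reads like a monotonicity check: a reported last-flushed sequence id is kept only if it is not smaller than the one already recorded for that region. Below is a minimal sketch of such a check under that assumption. FlushedSeqIdTracker and its methods are hypothetical names and are not the ServerManager implementation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical tracker; illustrates the "ignore smaller sequence id" rule only. */
final class FlushedSeqIdTracker {
  private final ConcurrentMap<String, Long> lastFlushed = new ConcurrentHashMap<>();

  /**
   * Records seqId for the region unless it is smaller than the recorded value.
   * Returns true if the report was accepted, false if it was ignored.
   */
  boolean report(String regionName, long seqId) {
    boolean[] accepted = {true};
    lastFlushed.merge(regionName, seqId, (previous, reported) -> {
      if (reported < previous) {
        accepted[0] = false; // stale report: keep the larger, previously seen id
        return previous;
      }
      return reported;
    });
    return accepted[0];
  }

  /** Returns the recorded id for the region, or -1 if nothing was reported yet. */
  long get(String regionName) {
    return lastFlushed.getOrDefault(regionName, -1L);
  }
}
```

With the ids from the log line, reporting 1593649 and then 1593644 for the same region leaves 1593649 recorded, and the second call returns false.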
[jira] [Resolved] (HBASE-23286) Improve MTTR: Split WAL to HFile
[ https://issues.apache.org/jira/browse/HBASE-23286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23286. --- Resolution: Fixed Resolving. No problem. Will file issues elsewhere (e.g. HBASE-23684). > Improve MTTR: Split WAL to HFile > > > Key: HBASE-23286 > URL: https://issues.apache.org/jira/browse/HBASE-23286 > Project: HBase > Issue Type: Improvement > Components: MTTR >Affects Versions: 3.0.0, 2.3.0 >Reporter: Guanghao Zhang >Assignee: Guanghao Zhang >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > After HBASE-20724, the compaction event marker is no longer used on > failover. So our new proposal is to split WAL to HFile to improve MTTR. It has 3 > steps: > # Read WAL and write HFile to region’s column family’s recovered.hfiles > directory. > # Open region. > # Bulkload the recovered.hfiles for every column family. > The design doc is attached as a Google doc. Any suggestions are welcome. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23684) NPE HFilesOutputSink
Michael Stack created HBASE-23684: - Summary: NPE HFilesOutputSink Key: HBASE-23684 URL: https://issues.apache.org/jira/browse/HBASE-23684 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.3.0 Reporter: Michael Stack Ran into this after enabling hfile splitter: {code} 2020-01-13 17:37:08,204 INFO org.apache.hadoop.hbase.wal.OutputSink: 3 split writer threads finished 2020-01-13 17:37:08,233 INFO org.apache.hadoop.hbase.wal.WALSplitter: Processed 1007 edits across 0 regions cost 284 ms; edits skipped=76; WAL=hdfs://nameservice1/hbase/genie/WALs/hbasedn101.example.org,16020,1578934806382-splitting/hbasedn101.example.org%2C16020%2C1578934806382.1578937008832, size=128.5 M, length=134708720, corrupted=false, progress failed=true 2020-01-13 17:37:08,234 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of WALs/hbasedn101.example.org,16020,1578934806382-splitting/hbasedn101.example.org%2C16020%2C1578934806382.1578937008832 failed, returning error java.io.IOException: java.lang.NullPointerException at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.writeRemainingEntryBuffers(BoundedRecoveredHFilesOutputSink.java:173) at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.close(BoundedRecoveredHFilesOutputSink.java:140) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:339) at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:181) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:105) at org.apache.hadoop.hbase.regionserver.SplitLogWorker.lambda$new$0(SplitLogWorker.java:84) at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at 
java.base/java.lang.Thread.run(Thread.java:834) Caused by: java.lang.NullPointerException at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.configContextForNonMetaWriter(BoundedRecoveredHFilesOutputSink.java:225) at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.createRecoveredHFileWriter(BoundedRecoveredHFilesOutputSink.java:213) at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.append(BoundedRecoveredHFilesOutputSink.java:117) at org.apache.hadoop.hbase.wal.BoundedRecoveredHFilesOutputSink.lambda$writeRemainingEntryBuffers$3(BoundedRecoveredHFilesOutputSink.java:155) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) {code} It is a bit odd because log says there were zero regions. Not sure what that was about. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HBASE-23055) Alter hbase:meta
[ https://issues.apache.org/jira/browse/HBASE-23055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack reopened HBASE-23055: --- Reopening (again). Reverted the commit. Good back and forth going on here, in PR, and in sub-issue. [~zhangduo] is concerned that being able to disable hbase:meta is a step too far and proposes alter w/o disabling as a means to achieve this issue's objective (also suggests a guard against an operator accidentally deleting fundamental hbase:meta column families). Let me go his suggested route. > Alter hbase:meta > > > Key: HBASE-23055 > URL: https://issues.apache.org/jira/browse/HBASE-23055 > Project: HBase > Issue Type: Task > Components: meta >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > hbase:meta is currently hardcoded. Its schema cannot be changed. > This issue is about allowing edits to hbase:meta schema. It will allow us > to set encodings such as the block-with-indexes which will help > quell CPU usage on the host carrying hbase:meta. A dynamic hbase:meta is the first > step on the road to being able to split meta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23680) RegionProcedureStore missing cleaning of hfile archive
Michael Stack created HBASE-23680: - Summary: RegionProcedureStore missing cleaning of hfile archive Key: HBASE-23680 URL: https://issues.apache.org/jira/browse/HBASE-23680 Project: HBase Issue Type: Sub-task Affects Versions: 2.3.0 Reporter: Michael Stack Fix For: 2.3.0 See tail of parent issue. The new RegionProcedureStore accumulates deleted hfiles in its local archive dir. Needs a cleaner like the one that watches over /hbase/archive. Is there a problem cleaning the new $masterproc$ files from the oldWALs too? These seem to stick around also. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23668) Master log start filling with "Flush journal status" messages
[ https://issues.apache.org/jira/browse/HBASE-23668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23668. --- Hadoop Flags: Reviewed Assignee: Michael Stack Resolution: Fixed Merged to branch-2+. Thanks for review [~zhangduo] (reopen to add your UT?). > Master log start filling with "Flush journal status" messages > - > > Key: HBASE-23668 > URL: https://issues.apache.org/jira/browse/HBASE-23668 > Project: HBase > Issue Type: Improvement > Components: proc-v2, RegionProcedureStore >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > Takes a while to get into this condition. Not easy to tell how because all > logs have rolled off and I only have logs filled w/ below: > {code} > 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Flush status journal: > Acquiring readlock on region at 1578553261723 > Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, > failureReason:Nothing to flush,flush seq id45226854 at 1578553261723 > 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Flush status journal: > Acquiring readlock on region at 1578553261723 > Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, > failureReason:Nothing to flush,flush seq id45226855 at 1578553261723 > 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Flush status journal: > Acquiring readlock on region at 1578553261723 > Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, > failureReason:Nothing to flush,flush seq id45226856 at 1578553261723 > 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Flush status journal: > Acquiring readlock on region at 1578553261723 > Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, > failureReason:Nothing to flush,flush seq id45226857 at 1578553261723 > {code} > ... I added the printing of flushresult... i.e. cannot flush because store is > empty. > Digging. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23676) Address feedback on HBASE-23055 Alter hbase:meta.
Michael Stack created HBASE-23676: - Summary: Address feedback on HBASE-23055 Alter hbase:meta. Key: HBASE-23676 URL: https://issues.apache.org/jira/browse/HBASE-23676 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack Assignee: Michael Stack Fix For: 2.3.0 Good feedback on HBASE-23055 came in after merge from [~zhangduo]. Opening this issue to address it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23055) Alter hbase:meta
[ https://issues.apache.org/jira/browse/HBASE-23055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23055. --- Resolution: Fixed Pushed on master branch too. Thanks for reviews [~bharathv] > Alter hbase:meta > > > Key: HBASE-23055 > URL: https://issues.apache.org/jira/browse/HBASE-23055 > Project: HBase > Issue Type: Task >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > hbase:meta is currently hardcoded. Its schema cannot be changed. > This issue is about allowing edits to hbase:meta schema. It will allow us > to set encodings such as the block-with-indexes which will help > quell CPU usage on the host carrying hbase:meta. A dynamic hbase:meta is the first > step on the road to being able to split meta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23672) 350+ lossy-count threads running
Michael Stack created HBASE-23672: - Summary: 350+ lossy-count threads running Key: HBASE-23672 URL: https://issues.apache.org/jira/browse/HBASE-23672 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack Looking at a server under load (branch-2), I see 350 instances of lossy-count threads running. They look like this: {code} "lossy-count-0" #11672 daemon prio=5 os_prio=0 cpu=0.09ms elapsed=281.33s tid=0x7f1baee76800 nid=0x2411 waiting on condition [0x7f1b78793000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@11.0.4/Native Method) - parking to wait for <0x910a91e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(java.base@11.0.4/LockSupport.java:194) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.4/AbstractQueuedSynchronizer.java:2081) at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.4/LinkedBlockingQueue.java:433) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.4/ThreadPoolExecutor.java:1054) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.4/ThreadPoolExecutor.java:1114) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.4/ThreadPoolExecutor.java:628) at java.lang.Thread.run(java.base@11.0.4/Thread.java:834) {code} Why do we need 350 threads? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HBASE-23286) Improve MTTR: Split WAL to HFile
[ https://issues.apache.org/jira/browse/HBASE-23286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack reopened HBASE-23286: --- Reopening. Feature doesn't seem to be working on tip of branch-2. > Improve MTTR: Split WAL to HFile > > > Key: HBASE-23286 > URL: https://issues.apache.org/jira/browse/HBASE-23286 > Project: HBase > Issue Type: Improvement > Components: MTTR >Affects Versions: 3.0.0, 2.3.0 >Reporter: Guanghao Zhang >Assignee: Guanghao Zhang >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > After HBASE-20724, the compaction event marker is no longer used on > failover. So our new proposal is to split WAL to HFile to improve MTTR. It has 3 > steps: > # Read WAL and write HFile to region’s column family’s recovered.hfiles > directory. > # Open region. > # Bulkload the recovered.hfiles for every column family. > The design doc is attached as a Google doc. Any suggestions are welcome. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23378) Clean Up FSUtil setClusterId
[ https://issues.apache.org/jira/browse/HBASE-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23378. --- Fix Version/s: 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Merged. Thanks for the patch [~belugabehr] > Clean Up FSUtil setClusterId > > > Key: HBASE-23378 > URL: https://issues.apache.org/jira/browse/HBASE-23378 > Project: HBase > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > Fix For: 3.0.0 > > > * Use try-with-resources > * Remove bad practice of catching one's own Exceptions > * Method signature 'wait' should be of type long to match JDK API > * Add additional debugging > * Do not swallow Interrupt status of thread > * General cleanup -- This message was sent by Atlassian Jira (v8.3.4#803005)
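Two of the cleanup items listed above (use try-with-resources, and do not swallow the interrupt status of the thread) can be illustrated with a short sketch. This is not the FSUtils.setClusterId patch. CleanupSketch, writeId and pause are hypothetical names, and an in-memory writer stands in for the filesystem stream.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class CleanupSketch {
  private CleanupSketch() {}

  /** try-with-resources closes the writer even if write() throws. */
  static String writeId(String clusterId) {
    StringWriter sink = new StringWriter(); // in-memory stand-in for a file stream
    try (Writer w = sink) {
      w.write(clusterId);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toString();
  }

  /**
   * Sleeps for waitMillis (a long, matching Thread.sleep). If interrupted, the
   * interrupt flag is restored so callers can still observe it. Returns false
   * when the sleep was cut short by an interrupt.
   */
  static boolean pause(long waitMillis) {
    try {
      Thread.sleep(waitMillis);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // do not swallow the interrupt status
      return false;
    }
  }
}
```

Restoring the flag in pause matters for retry loops: a caller that checks Thread.currentThread().isInterrupted() can stop retrying instead of sleeping again.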
[jira] [Resolved] (HBASE-23103) Survey incidence of table state queries
[ https://issues.apache.org/jira/browse/HBASE-23103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23103. --- Assignee: Michael Stack Resolution: Information Provided Ok. The discussion here has good info but is moot in light of the form the parent issue took on commit. The parent no longer goes via master to find table state. That decision/development has been put off for now. Instead master issue adds meta table state as a method in Registry and adds an implementation to ZKAsyncRegistry that does lookup into zk, bypassing Master. For now resolving as 'Information Provided'. We'll be back to this topic after Master-based Registry lands. This latter project looks to change nature of Master from background janitorial player to active participant in data > Survey incidence of table state queries > --- > > Key: HBASE-23103 > URL: https://issues.apache.org/jira/browse/HBASE-23103 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Blocker > Fix For: 3.0.0 > > > Task that comes of parent issue. Parent makes it so we go via Master to > figure state of a table. It is the authority and since the parent issue adds > being able to enable/disable hbase:meta, table state is now in two places -- > in hbase:meta table... and elsewhere for the hbase:meta's state. Rather than > have client go to two locations depending on which table is being asked > about, parent made it so we went to master. Parent allows that this puts more > load on the Master. [~zhangduo] brings up the valid concern that it might be > too much or that depending on the Master for state puts the Master too much > in-line with read/writes. > This issue is a survey to figure how much load and how much state-in-master > could mess up inline read/writes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23668) Master log start filling with "Flush journal status" messages
Michael Stack created HBASE-23668: - Summary: Master log start filling with "Flush journal status" messages Key: HBASE-23668 URL: https://issues.apache.org/jira/browse/HBASE-23668 Project: HBase Issue Type: Improvement Components: proc-v2 Reporter: Michael Stack Fix For: 2.3.0 Takes a while to get into this condition. Not easy to tell how because all logs have rolled off and I only have logs filled w/ below: {code} 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush status journal: Acquiring readlock on region at 1578553261723 Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, failureReason:Nothing to flush,flush seq id45226854 at 1578553261723 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush status journal: Acquiring readlock on region at 1578553261723 Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, failureReason:Nothing to flush,flush seq id45226855 at 1578553261723 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush status journal: Acquiring readlock on region at 1578553261723 Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, failureReason:Nothing to flush,flush seq id45226856 at 1578553261723 2020-01-09 07:01:01,723 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush status journal: Acquiring readlock on region at 1578553261723 Flush successful flush result:CANNOT_FLUSH_MEMSTORE_EMPTY, failureReason:Nothing to flush,flush seq id45226857 at 1578553261723 {code} ... I added the printing of flushresult... i.e. cannot flush because store is empty. Digging. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23369) Auto-close 'unknown' Regions reported as OPEN on RegionServers
[ https://issues.apache.org/jira/browse/HBASE-23369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23369. --- Fix Version/s: 2.3.0 3.0.0 Hadoop Flags: Incompatible change Release Note: If a RegionServer reports a Region as OPEN in disagreement with Master's status on the Region, the Master now tells the RegionServer to silently close the Region. Assignee: Michael Stack Resolution: Fixed Merged to branch-2 and master branch. I think this belongs in branch-2.2 too. Shout and I'll pull it back. > Auto-close 'unknown' Regions reported as OPEN on RegionServers > -- > > Key: HBASE-23369 > URL: https://issues.apache.org/jira/browse/HBASE-23369 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > In old days, if a RegionServer reported a variance that didn't agree w/ > Master view of the cluster, we'd kill the RegionServer. > Lately, in tests that overrun a cluster, after a sustained high-load, Master > can start failing its updates against Meta (CallQueueTooBigException <= More > on this later). It then can lose proper accounting of all Region members. One > variant has a RegionServer reporting its list of open Regions to the Master > and the Master doesn't 'know' of a particular Region or the Master may know > the Region but expects it open on another RegionServer. > Here is an example of how it looks each time RS reports: > {code} > 2019-12-03 07:07:00,757 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: No > t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode > but reported ONLINE at server.example.org,16020,1575354666245 > (inServerRegionList=false). > 2019-12-03 07:07:03,793 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: No > t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5.
RegionStateNode > but reported ONLINE at server.example.org,16020,1575354666245 > (inServerRegionList=false). > {code} > Will also show as an 'inconsistency' in the 'HBCK' tab on the Master UI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
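The check described in the release note above (a RegionServer reports a Region as OPEN, the Master either has no record of it or expects it on another server, so the Master asks for a close) might be sketched roughly like this. The class, method, and the use of plain strings for region and server names are all hypothetical simplifications; this is not the AssignmentManager code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reconciliation idea; not the HBASE-23369 patch.
final class RegionReportCheckSketch {

  /**
   * Returns the regions the reporting server should be told to close: those
   * the master has no state for, or that it expects open on another server.
   *
   * @param masterView region name mapped to the server the master expects it on
   */
  static List<String> regionsToClose(String reportingServer, List<String> reportedOpen,
      Map<String, String> masterView) {
    List<String> toClose = new ArrayList<>();
    for (String region : reportedOpen) {
      String expected = masterView.get(region); // null means the master knows nothing of it
      if (expected == null || !expected.equals(reportingServer)) {
        toClose.add(region);
      }
    }
    return toClose;
  }
}
```

Regions that the master does expect on the reporting server are left alone; only the disagreements come back in the list.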
[jira] [Resolved] (HBASE-23585) MetricsRegionServerWrapperImpl.getL1CacheHitCount always returns 200
[ https://issues.apache.org/jira/browse/HBASE-23585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23585. --- Fix Version/s: 1.6.1 Hadoop Flags: Reviewed Resolution: Fixed Pushed to branch-1. It seems to be a problem on this branch only (correct me if I am wrong [~DeanZ]). Thanks to reviewers [~binlijin] and [~janh] > MetricsRegionServerWrapperImpl.getL1CacheHitCount always returns 200 > > > Key: HBASE-23585 > URL: https://issues.apache.org/jira/browse/HBASE-23585 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 1.4.12 >Reporter: Baiqiang Zhao >Assignee: Baiqiang Zhao >Priority: Major > Fix For: 1.6.1 > > > Looks like it was copied from a UT class and forgot to change it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23632) DeadServer cleanup
[ https://issues.apache.org/jira/browse/HBASE-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23632. --- Fix Version/s: 2.3.0 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Pushed to branch-2+ (Applied it to branch-2.2 but then reverted because it is just an improvement and logging is changed mildly). > DeadServer cleanup > -- > > Key: HBASE-23632 > URL: https://issues.apache.org/jira/browse/HBASE-23632 > Project: HBase > Issue Type: Improvement >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Minor > Fix For: 3.0.0, 2.3.0 > > > Cleanup of DeadServer class shutting down access, undoing duplication, adding > doc., and removing unused code. > One change is that we do not remove a server from 'processing' list when we > 'remove' deadservers; we let SCP do it since it owns processing list (Saw > issue where on fast restart of a server, the server was removed from > deadserver and from processing list though the SCP for the dead server was > still running -- no repercussions that I could see but a little confusing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23596) HBCKServerCrashProcedure can double assign
[ https://issues.apache.org/jira/browse/HBASE-23596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23596. --- Fix Version/s: (was: 2.2.4) 2.2.3 Hadoop Flags: Reviewed Release Note: Makes it so the recently added HBCKServerCrashProcedure -- the SCP that gets invoked when an operator schedules an SCP via hbck2 scheduleRecoveries command -- now works the same as SCP EXCEPT if master knows nothing of the scheduled servername. In this latter case, HBCKSCP will do a full scan of hbase:meta looking for instances of the passed servername. If any found it will attempt cleanup of hbase:meta references by reassigning any found OPEN or OPENING and by closing any in CLOSING state. Used to fix instances of what the 'HBCK Report' page shows as 'Unknown Servers'. Resolution: Fixed Merged to branch-2.2+ > HBCKServerCrashProcedure can double assign > -- > > Key: HBASE-23596 > URL: https://issues.apache.org/jira/browse/HBASE-23596 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > The new SCP that does SCP plus cleanup 'Unknown Servers' with mentions in > hbase:meta added by the below can make for double assignments. > {code} > commit c238891a26734e1e4276b6b1677a58cf83de5dc4 > Author: stack > Date: Wed Nov 13 22:36:26 2019 -0800 > HBASE-23282 HBCKServerCrashProcedure for 'Unknown Servers' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23628) Remove Apache Commons Digest Base64
[ https://issues.apache.org/jira/browse/HBASE-23628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23628. --- Fix Version/s: 3.0.0 Release Note: From the PR: "Yes. The two create the same output... I just wrote a small test suite to increase my confidence on that. I generated many tens of millions of random byte patterns and compared the output of the two algorithms. They came back identical every time. "Just in case any inquiring minds would like to know, there is no longer an encoding required when generating the strings. The JDK implementation specifically specifies that strings returned are StandardCharsets.ISO_8859_1. This does not change anything because UTF8 and ISO_8859 overlap for the limited character set (64 characters) the encoding uses." Resolution: Fixed Thanks for the patch [~belugabehr] and the work done to verify the change. > Remove Apache Commons Digest Base64 > --- > > Key: HBASE-23628 > URL: https://issues.apache.org/jira/browse/HBASE-23628 > Project: HBase > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > Fix For: 3.0.0 > > > Use the native JDK Base64 implementation instead. Most places are using the > JDK version, but a couple of spots were missed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
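The charset point in the release note above can be checked with a small sketch using only java.util.Base64. The class and method names are made up for illustration, and this is not the test suite mentioned in the PR; it just shows that the JDK encoder needs no charset argument and that decoding the encoded bytes as UTF-8 yields the same string, since the Base64 alphabet is plain ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical sketch; names are illustrative only.
final class Base64CharsetSketch {

  /** Encodes with the JDK encoder; no charset has to be supplied. */
  static String encode(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  /** Same encoding, but the encoded bytes are turned into text explicitly as UTF-8. */
  static String encodeViaUtf8(byte[] raw) {
    return new String(Base64.getEncoder().encode(raw), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] raw = {0, (byte) 0xFF, 0x10, 0x7F, (byte) 0x80};
    String text = encode(raw);
    // Both routes agree, and decoding gives back the original bytes.
    System.out.println(text.equals(encodeViaUtf8(raw)));
    System.out.println(Arrays.equals(raw, Base64.getDecoder().decode(text)));
  }
}
```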
[jira] [Created] (HBASE-23632) DeadServer cleanup
Michael Stack created HBASE-23632: - Summary: DeadServer cleanup Key: HBASE-23632 URL: https://issues.apache.org/jira/browse/HBASE-23632 Project: HBase Issue Type: Improvement Reporter: Michael Stack Assignee: Michael Stack -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23238) Additional test and checks for null references on ScannerCallableWithReplicas
[ https://issues.apache.org/jira/browse/HBASE-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23238. --- Resolution: Fixed [~bharathv] I pushed the addendum on branch-2.1+. The patch wouldn't go back to branch-1. Should it? Thanks. > Additional test and checks for null references on ScannerCallableWithReplicas > - > > Key: HBASE-23238 > URL: https://issues.apache.org/jira/browse/HBASE-23238 > Project: HBase > Issue Type: Improvement >Affects Versions: 1.2.12 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0, 2.2.3, 2.1.8 > > Attachments: HBASE-23238.branch-2.patch > > > One of our customers running a 1.2 based version is facing NPE when scanning > data from a MR job. It happens when the map task is finalising: > {noformat} > ... > 2019-09-10 14:17:22,238 INFO [main] org.apache.hadoop.mapred.MapTask: > Ignoring exception during close for > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader@3a5b7d7e > java.lang.NullPointerException > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.setClose(ScannerCallableWithReplicas.java:99) > at > org.apache.hadoop.hbase.client.ClientScanner.close(ClientScanner.java:730) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.close(TableRecordReaderImpl.java:178) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReader.close(TableRecordReader.java:89) > at > org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase$1.close(MultiTableInputFormatBase.java:112) > at > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:529) > at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:2039) > ... 
> 2019-09-10 14:18:24,601 FATAL [IPC Server handler 5 on 35745] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: > attempt_1566832111959_6047_m_00_3 - exited : > java.lang.NullPointerException > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.setClose(ScannerCallableWithReplicas.java:99) > at > org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:264) > at > org.apache.hadoop.hbase.client.ClientScanner.possiblyNextScanner(ClientScanner.java:248) > at > org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:542) > at > org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:371) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:222) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147) > at > org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase$1.nextKeyValue(MultiTableInputFormatBase.java:139) > ... > {noformat} > After some investigation, we found out that 1.2 based deployments will > consistently face this problem under the following conditions: > 1) The sum of all the given row KVs size targeted to be returned in the scan > is larger than *max result size*; > 2) At same time, the scan filter has exhausted, so that all remaining KVs > should be filtered and not returned; > We could simulate this with the UT being proposed in this PR. When checking > newer branches, though, I could verify this specific problem is not present > on newer branches, I believe it was indirectly sorted by changes from > HBASE-17489. > Nevertheless, I think it would still be useful to have this extra test and > checks added as a safeguard measure. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (HBASE-23238) Additional test and checks for null references on ScannerCallableWithReplicas
[ https://issues.apache.org/jira/browse/HBASE-23238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack reopened HBASE-23238: --- Reopening to apply addendum. > Additional test and checks for null references on ScannerCallableWithReplicas > - > > Key: HBASE-23238 > URL: https://issues.apache.org/jira/browse/HBASE-23238 > Project: HBase > Issue Type: Improvement >Affects Versions: 1.2.12 >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Fix For: 3.0.0, 2.3.0, 1.6.0, 2.1.8, 2.2.3 > > Attachments: HBASE-23238.branch-2.patch > > > One of our customers running a 1.2 based version is facing NPE when scanning > data from a MR job. It happens when the map task is finalising: > {noformat} > ... > 2019-09-10 14:17:22,238 INFO [main] org.apache.hadoop.mapred.MapTask: > Ignoring exception during close for > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader@3a5b7d7e > java.lang.NullPointerException > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.setClose(ScannerCallableWithReplicas.java:99) > at > org.apache.hadoop.hbase.client.ClientScanner.close(ClientScanner.java:730) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.close(TableRecordReaderImpl.java:178) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReader.close(TableRecordReader.java:89) > at > org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase$1.close(MultiTableInputFormatBase.java:112) > at > org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:529) > at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:2039) > ... 
> 2019-09-10 14:18:24,601 FATAL [IPC Server handler 5 on 35745] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: > attempt_1566832111959_6047_m_00_3 - exited : > java.lang.NullPointerException > at > org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.setClose(ScannerCallableWithReplicas.java:99) > at > org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:264) > at > org.apache.hadoop.hbase.client.ClientScanner.possiblyNextScanner(ClientScanner.java:248) > at > org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:542) > at > org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:371) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:222) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147) > at > org.apache.hadoop.hbase.mapreduce.MultiTableInputFormatBase$1.nextKeyValue(MultiTableInputFormatBase.java:139) > ... > {noformat} > After some investigation, we found out that 1.2 based deployments will > consistently face this problem under the following conditions: > 1) The sum of all the given row KVs size targeted to be returned in the scan > is larger than *max result size*; > 2) At same time, the scan filter has exhausted, so that all remaining KVs > should be filtered and not returned; > We could simulate this with the UT being proposed in this PR. When checking > newer branches, though, I could verify this specific problem is not present > on newer branches, I believe it was indirectly sorted by changes from > HBASE-17489. > Nevertheless, I think it would still be useful to have this extra test and > checks added as a safeguard measure. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23614) Revert miscommit "Add status when fixing hole" subsequently fixed by HBASE-23313
[ https://issues.apache.org/jira/browse/HBASE-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23614. --- Resolution: Not A Problem Resolving as not a problem. Wellington overwrote my change with his commit. > Revert miscommit "Add status when fixing hole" subsequently fixed by > HBASE-23313 > > > Key: HBASE-23614 > URL: https://issues.apache.org/jira/browse/HBASE-23614 > Project: HBase > Issue Type: Bug >Affects Versions: 2.3.0 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > > The mis-commit was found by [~zhangduo]. > I started in on trying to fix setRegionState but then [~wchevreuil] fixed it > with > commit 70bbc38aaefa7af336e274296766d4f3ece4646e > Author: Wellington Ramos Chevreuil > Date: Wed Nov 27 08:41:23 2019 + > HBASE-23313 [hbck2] setRegionState should update Master in-memory sta… > (#864) > Signed-off-by: Mingliang Liu > Signed-off-by: stack > My commit was mistakenly pushed to branch-2. Reverting. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23614) Revert miscommit "Add status when fixing hole" subsequently fixed by HBASE-23313
Michael Stack created HBASE-23614: - Summary: Revert miscommit "Add status when fixing hole" subsequently fixed by HBASE-23313 Key: HBASE-23614 URL: https://issues.apache.org/jira/browse/HBASE-23614 Project: HBase Issue Type: Bug Affects Versions: 2.3.0 Reporter: Michael Stack Assignee: Michael Stack The mis-commit was found by [~zhangduo]. I started in on trying to fix setRegionState but then [~wchevreuil] fixed it with commit 70bbc38aaefa7af336e274296766d4f3ece4646e Author: Wellington Ramos Chevreuil Date: Wed Nov 27 08:41:23 2019 + HBASE-23313 [hbck2] setRegionState should update Master in-memory sta… (#864) Signed-off-by: Mingliang Liu Signed-off-by: stack My commit was mistakenly pushed to branch-2. Reverting. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-20103) [pv2] AssignmentProcedure is too coarse grained
[ https://issues.apache.org/jira/browse/HBASE-20103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-20103. --- Resolution: Won't Fix Stale. > [pv2] AssignmentProcedure is too coarse grained > --- > > Key: HBASE-20103 > URL: https://issues.apache.org/jira/browse/HBASE-20103 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Critical > Fix For: 3.0.0 > > > Comes of work on HBASE-20100 but in particular, in feedback from [~Apache9] > https://mail.google.com/mail/u/0/#inbox/161d8e41054be406 > The AP is too coarse-grained. There is precheck+start, then transform state > is edit meta setting state to OPENING and then dispatch (rpc) Finish is > edit of meta and setting internal state. The edit of meta should be distinct > step at least. > Would save on duplicated ops -- e.g. re-editing hbase:meta and dispatching > another RPC -- if we fail going into finishing. [~Apache9] brings up our > perhaps masking other state change hiccups when steps are so coarse-grained. > Do same for unassignprocedure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-20266) Review current set of ignored tests
[ https://issues.apache.org/jira/browse/HBASE-20266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-20266. --- Resolution: Later Stale. Later. > Review current set of ignored tests > --- > > Key: HBASE-20266 > URL: https://issues.apache.org/jira/browse/HBASE-20266 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Critical > Fix For: 3.0.0, 2.3.0 > > > [~Apache9] turned up a list of currently ignored tests. At first blush, it's > fine to ignore some such as TestHTraceHooks and TestRegionsOnMaster but > others could do with a review. This issue is about looking at the list to make > sure nothing important was missed for hbase2 and that we for sure marked why a > test was ignored with comment and that there is a follow-on to enable JIRA. > {code} > TestRpcHandlerException > TestRSKilledWhenInitializing > TestHTraceHooks > TestAsyncTableGetMultiThreadedWithEagerCompaction > TestStochasticBalancerJmxMetrics > TestReplicator > TestQuotaThrottle > TestFavoredStochasticLoadBalancer > TestAsyncTableGetMultiThreadedWithBasicCompaction > TestRegionPlacement > TestMasterTransitions > TestMemstoreLABWithoutPool > TestRegionsOnMasterOptions > TestRestoreSnapshotFromClientWithRegionReplicas > TestMasterBalanceThrottling > TestMasterProcedureWalLease > TestRegionServerReadRequestMetrics > TestHttpServerLifecycle > TestHRegionServerBulkLoadWithOldSecureEndpoint > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-20412) Update our compliance-checker from 2.1 to 2.4
[ https://issues.apache.org/jira/browse/HBASE-20412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-20412. --- Resolution: Implemented commit 0a6aec49813e794d7ed5d6608e0ee4fab5587ccd Author: Mike Drob Date: Tue Jun 12 13:23:13 2018 -0500 HBASE-19377 Update Java API CC version Compatibility checker complaining about hash collisions, newer versions use longer id strings. > Update our compliance-checker from 2.1 to 2.4 > - > > Key: HBASE-20412 > URL: https://issues.apache.org/jira/browse/HBASE-20412 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Major > Attachments: update.txt > > > I thought we had an issue to do this already but I can't find it. > The newer compatibility-checker has the filtering on annotation added by one > of us (or at least asked-for by one-of-us). > I tried it yesterday. Seems to work nicely. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-21167) Master killed after IOE in FanOutOneBlockAsyncDFSOutput on log roll
[ https://issues.apache.org/jira/browse/HBASE-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-21167. --- Resolution: Later Haven't seen this in a while. > Master killed after IOE in FanOutOneBlockAsyncDFSOutput on log roll > --- > > Key: HBASE-21167 > URL: https://issues.apache.org/jira/browse/HBASE-21167 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Michael Stack >Priority: Major > > Logging this in case we see it again. I had a Master working furiously. It > had assigned over 400k regions on startup. Then this happened which knocked > the hard-working server out: > {code} > 2018-09-06 07:50:18,983 ERROR org.apache.hadoop.hbase.master.HMaster: Master > server abort: loaded coprocessors are: > [org.apache.hadoop.hbase.security.access.AccessController, > com.cloudera.navigator.audit.hbase.MasterAuditCoProcessor] > 2018-09-06 07:50:18,983 ERROR org.apache.hadoop.hbase.master.HMaster: * > ABORTING master vc0207.halxg.cloudera.com,22001,1536173228913: IOE in log > roller * > java.io.IOException: Connection to 10.17.208.34/10.17.208.34:20002 closed > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:289) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:236) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:99) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:288) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) > at > 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224) > at > org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224) > at > org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377) > at > org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224) > at > org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) > at > org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) > at > 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224) > at > org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) > at > org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHan
[jira] [Resolved] (HBASE-21350) Forward-port HBASE-21242 [amv2] Miscellaneous minor log and assign procedure create improvements
[ https://issues.apache.org/jira/browse/HBASE-21350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-21350. --- Resolution: Won't Fix Stale. Context is different now. > Forward-port HBASE-21242 [amv2] Miscellaneous minor log and assign procedure > create improvements > > > Key: HBASE-21350 > URL: https://issues.apache.org/jira/browse/HBASE-21350 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > Sub-issue to forward port the parent. It's acting up and the parent has been > open long enough. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-21308) Forward-port to branch-2 "HBASE-21259 [amv2] Revived deadservers; recreated serverstatenode"
[ https://issues.apache.org/jira/browse/HBASE-21308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-21308. --- Resolution: Won't Fix Stale. Context is different now. > Forward-port to branch-2 "HBASE-21259 [amv2] Revived deadservers; recreated > serverstatenode" > > > Key: HBASE-21308 > URL: https://issues.apache.org/jira/browse/HBASE-21308 > Project: HBase > Issue Type: Sub-task >Reporter: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0 > > > TODO: Recast HBASE-21259 so it works for branch-2; stuff is different in > branch-2. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-21613) Up/Down arrows on home page listing regions can look like smudges; needs definition
[ https://issues.apache.org/jira/browse/HBASE-21613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-21613. --- Resolution: Duplicate Resolving as duplicate of HBASE-21403 > Up/Down arrows on home page listing regions can look like smudges; needs > definition > --- > > Key: HBASE-21613 > URL: https://issues.apache.org/jira/browse/HBASE-21613 > Project: HBase > Issue Type: Bug > Components: UI >Reporter: Michael Stack >Priority: Major > > This has come up a few times now on branch-2.1 votes: > http://mail-archives.apache.org/mod_mbox/hbase-dev/201810.mbox/%3ccaaayanpjcjzsb+ynefeczp4pk9xgxxecw+pnpppvu6_ogoe...@mail.gmail.com%3E > and then later by our [~dbist13] voting on a 2.1.2RC (Artem added an image > here: https://photos.app.goo.gl/PziWBMAzXCwbZqmF8). The UI got a bit more > detail added around region transitioning and it cluttered the UI such that > the up/down sort arrows can look crowded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-9779) IntegrationTestLoadAndVerify fails deleting IntegrationTestLoadAndVerify table
[ https://issues.apache.org/jira/browse/HBASE-9779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-9779. -- Resolution: Won't Fix Stale. Context is different now. > IntegrationTestLoadAndVerify fails deleting IntegrationTestLoadAndVerify > table > --- > > Key: HBASE-9779 > URL: https://issues.apache.org/jira/browse/HBASE-9779 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 0.96.0 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Critical > Attachments: 9779part.txt > > > As part of the test, we want to delete the created table to restore cluster > state. Interestingly we can disable the table successfully but then > immediately after we fail the delete because we cannot get the table > descriptor -- getting the file descriptor is used to test if table is present. > The test for getDescriptor is kinda broke because it throws base IOE which > causes clients to retry over and over again as though the descriptor was > going to come back. > This bug is kinda ugly because in at least one case it caused our > long-running hbase-it suite run to fail so would be good to fix. > Here is sample from a test run: > {code} > Disabling table IntegrationTestLoadAndVerify 2013-10-11 18:27:53,485 INFO > [main] client.HBaseAdmin: Started disable of IntegrationTestLoadAndVerify > 2013-10-11 18:27:53,526 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=9 > watcher=catalogtracker-on-hconnection-0x5a7e666f > 2013-10-11 18:27:53,527 INFO [main] zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x5a7e666f connecting to ZooKeeper > ensemble=a1805.halxg.cloudera.com:2181 > 2013-10-11 18:27:53,527 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Opening socket connection to server > a1805.halxg.cloudera.com/10.20.200.105:2181. 
Will not attempt to authenticate > using SASL (unknown error) > 2013-10-11 18:27:53,527 DEBUG [main] catalog.CatalogTracker: Starting catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@4ace08a5 > 2013-10-11 18:27:53,529 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket > connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, > initiating session > 2013-10-11 18:27:53,539 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Session establishment complete on server > a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = 0x1412d47f53a5c70, > negotiated timeout = 4 > 2013-10-11 18:27:53,602 DEBUG [main] catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@4ace08a5 > 2013-10-11 18:27:53,662 INFO [main] zookeeper.ZooKeeper: Session: > 0x1412d47f53a5c70 closed > 2013-10-11 18:27:53,662 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down > .2013-10-11 18:27:54,666 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessionTimeout=9 > watcher=catalogtracker-on-hconnection-0x5a7e666f > 2013-10-11 18:27:54,667 INFO [main] zookeeper.RecoverableZooKeeper: Process > identifier=catalogtracker-on-hconnection-0x5a7e666f connecting to ZooKeeper > ensemble=a1805.halxg.cloudera.com:2181 > 2013-10-11 18:27:54,667 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Opening socket connection to server > a1805.halxg.cloudera.com/10.20.200.105:2181. 
Will not attempt to authenticate > using SASL (unknown error) > 2013-10-11 18:27:54,667 DEBUG [main] catalog.CatalogTracker: Starting catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@692c0c5d > 2013-10-11 18:27:54,667 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: Socket > connection established to a1805.halxg.cloudera.com/10.20.200.105:2181, > initiating session > 2013-10-11 18:27:54,696 INFO > [main-SendThread(a1805.halxg.cloudera.com:2181)] zookeeper.ClientCnxn: > Session establishment complete on server > a1805.halxg.cloudera.com/10.20.200.105:2181, sessionid = 0x1412d47f53a5c71, > negotiated timeout = 4 > 2013-10-11 18:27:54,821 DEBUG [main] catalog.CatalogTracker: Stopping catalog > tracker org.apache.hadoop.hbase.catalog.CatalogTracker@692c0c5d > 2013-10-11 18:27:54,871 INFO [main] zookeeper.ZooKeeper: Session: > 0x1412d47f53a5c71 closed > 2013-10-11 18:27:54,871 INFO [main-EventThread] zookeeper.ClientCnxn: > EventThread shut down > .2013-10-11 18:27:55,890 INFO [main] zookeeper.ZooKeeper: Initiating client > connection, connectString=a1805.halxg.cloudera.com:2181 sessio
[jira] [Resolved] (HBASE-9403) Build improvements
[ https://issues.apache.org/jira/browse/HBASE-9403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-9403. -- Resolution: Won't Fix Stale. Context is different now. > Build improvements > -- > > Key: HBASE-9403 > URL: https://issues.apache.org/jira/browse/HBASE-9403 > Project: HBase > Issue Type: Task > Components: build >Reporter: Michael Stack >Priority: Major > > Here are some improvements we could make to the build. Will list them as I > think of them. Can do them individually as subtasks of this one. > When I undo the hbase-X.Y.Z-hadoop1-bin.tar.gz tarball and look at the doc, > the version is hbase-X.Y.Z-hadoop1 in the javadoc. Should be just > hbase-X.Y.Z. This is so in the xref src and in javadoc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-9059) Address HBASE-8764 'Some MasterMonitorCallable should retry' review
[ https://issues.apache.org/jira/browse/HBASE-9059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-9059. -- Resolution: Won't Fix Stale. Context is different now. > Address HBASE-8764 'Some MasterMonitorCallable should retry' review > --- > > Key: HBASE-9059 > URL: https://issues.apache.org/jira/browse/HBASE-9059 > Project: HBase > Issue Type: Bug > Components: master >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > > Jesse came in w/ some review post-commit. Let me address in this followup. > Let me paste from our offlist correspondence: > {quote} > +++ > b/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RegionOfflineException.java > @@ -24,7 +24,7 @@ import org.apache.hadoop.hbase.exceptions.RegionException; > > /** Thrown when a table can not be located */ > @InterfaceAudience.Public > -@InterfaceStability.Stable > +@InterfaceStability.Evolving > Really? Same patch? Come on man - you are doing similar cleanup all over the > place (shakes head)... :) > +@InterfaceStability.Stable > +public class RpcRetryingCaller { > Calling this stable as the first time it's going in seems a bit presumptuous... > +this.startTime = EnvironmentEdgeManager.currentTimeMillis(); > +int remaining = (int)(callTimeout - (this.startTime - > this.globalStartTime)); > +if (remaining < MIN_RPC_TIMEOUT) { > + // If there is no time left, we're trying anyway. It's too late. > + // 0 means no timeout, and it's not the intent here. So we secure both > cases by > + // resetting to the minimum. > + remaining = MIN_RPC_TIMEOUT; > +} > +RpcClient.setRpcTimeout(remaining); > Looks like some new logic... seems reasonable to me, so I'll let it slide > this time :) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
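For readers following the HBASE-9059 review: the timeout logic quoted in that message can be read as a clamp on the time left in the overall call budget. The sketch below is only an illustration of that arithmetic. The class name `RpcTimeoutSketch`, the method `remainingTimeout`, and the 2000 ms floor are assumptions made for the example; the thread does not give the real value of `MIN_RPC_TIMEOUT`.

```java
/** Hypothetical helper mirroring the quoted patch; not HBase's actual class. */
final class RpcTimeoutSketch {
    /** Assumed floor in milliseconds; the real MIN_RPC_TIMEOUT value is not given in the thread. */
    static final int MIN_RPC_TIMEOUT = 2000;

    /**
     * Time left for this attempt out of the overall call budget.
     * Never returns less than MIN_RPC_TIMEOUT, because 0 would mean "no timeout".
     */
    static int remainingTimeout(int callTimeout, long globalStartTime, long now) {
        int remaining = (int) (callTimeout - (now - globalStartTime));
        return Math.max(remaining, MIN_RPC_TIMEOUT);
    }

    public static void main(String[] args) {
        // 10 seconds used out of a 60 second budget.
        System.out.println(remainingTimeout(60_000, 1_000L, 11_000L));
        // Budget already exhausted; the floor applies.
        System.out.println(remainingTimeout(60_000, 1_000L, 70_000L));
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the method, is a choice made here so the arithmetic can be checked in isolation.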
[jira] [Resolved] (HBASE-9378) TestRegionFavoredNodes.testFavoredNodes
[ https://issues.apache.org/jira/browse/HBASE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-9378. -- Resolution: Won't Fix Stale. Context is different now. > TestRegionFavoredNodes.testFavoredNodes > --- > > Key: HBASE-9378 > URL: https://issues.apache.org/jira/browse/HBASE-9378 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Assignee: Devaraj Das >Priority: Major > > https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/700/testReport/org.apache.hadoop.hbase.regionserver/TestRegionFavoredNodes/testFavoredNodes/ > {code} > org.apache.hadoop.hbase.regionserver.TestRegionFavoredNodes.testFavoredNodes > Failing for the past 1 build (Since Failed#700 ) > Took 61 ms. > add description > Error Message > Block location 127.0.0.1:51233 not a favored node > Stacktrace > java.lang.AssertionError: Block location 127.0.0.1:51233 not a favored node > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.hadoop.hbase.regionserver.TestRegionFavoredNodes.testFavoredNodes(TestRegionFavoredNodes.java:159) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at org.junit.runners.Suite.runChild(Suite.java:127) > at org.junit.runners.Suite.runChild(Suite.java:26) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > {code} > Any chance of your taking a looksee [~devaraj]? What you reckon? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-9053) Reenable TestHFileOutputFormat.testExcludeAllFromMinorCompaction
[ https://issues.apache.org/jira/browse/HBASE-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-9053. -- Resolution: Won't Fix > Reenable TestHFileOutputFormat.testExcludeAllFromMinorCompaction > > > Key: HBASE-9053 > URL: https://issues.apache.org/jira/browse/HBASE-9053 > Project: HBase > Issue Type: Bug > Components: test >Reporter: Michael Stack >Priority: Major > > Reenable TestHFileOutputFormat.testExcludeAllFromMinorCompaction after making > it so it is no longer flakey. Was disabled over in HBASE-9051 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8990) Reenable TestFromClientSideWithCoprocessor.testClientPoolThreadLocal
[ https://issues.apache.org/jira/browse/HBASE-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8990. -- Resolution: Won't Fix Stale. Context is different now. > Reenable TestFromClientSideWithCoprocessor.testClientPoolThreadLocal > > > Key: HBASE-8990 > URL: https://issues.apache.org/jira/browse/HBASE-8990 > Project: HBase > Issue Type: Task > Components: test >Reporter: Michael Stack >Priority: Major > > Look at HBASE-8989 and figure why it is flakey then reenable this test. It > came in as part of HBASE-2939 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8958) Sometimes we refer to the single .META. table region as ".META.,,1" and other times as ".META.,,1.1028785192"
[ https://issues.apache.org/jira/browse/HBASE-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8958. -- Resolution: Won't Fix Stale. Context is different now. > Sometimes we refer to the single .META. table region as ".META.,,1" and other > times as ".META.,,1.1028785192" > -- > > Key: HBASE-8958 > URL: https://issues.apache.org/jira/browse/HBASE-8958 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Major > > See here how we say in a log: > {code} > 2013-07-15 22:32:53,805 INFO [main] regionserver.HRegion(4176): Open > {ENCODED => 1028785192, NAME => '.META.,,1', STARTKEY => '', ENDKEY => ''} > {code} > but when we open other regions we do: > {code} > 764 2013-07-15 22:40:10,867 INFO [RS_OPEN_REGION-durruti:61987-0] > regionserver.HRegion: Open {ENCODED => 93dad2bbf6ff5ea0d7477f504b303346, NAME > => 'x,,1373953210791.93dad2bbf6ff5ea0d7477f504b303346.', ... > {code} > Note how in the second, the name includes the encoded name. > We'll also do : > {code} > 2013-07-15 22:32:53,810 INFO [main] regionserver.HRegion(629): Onlined > 1028785192/.META.; next sequenceid=1 > {code} > vs > {code} > 785 2013-07-15 22:40:10,885 INFO [AM.ZK.Worker-pool-2-thread-7] > master.RegionStates: Onlined 93dad2bbf6ff5ea0d7477f504b303346 on > durruti,61987,1373947581222 > {code} > ... where we print the encoded name. > Master web UI shows ".META.,,1.1028785192" > Benoit originally noticed this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8748) Be able to accomodate zookeeper going away for a minute or two -- or more
[ https://issues.apache.org/jira/browse/HBASE-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8748. -- Resolution: Won't Fix Stale. Context is different now. > Be able to accomodate zookeeper going away for a minute or two -- or more > - > > Key: HBASE-8748 > URL: https://issues.apache.org/jira/browse/HBASE-8748 > Project: HBase > Issue Type: Brainstorming > Components: Zookeeper >Reporter: Michael Stack >Priority: Major > > I was talking w/ Christophe Taton yesterday and he asked what happens if > zookeeper goes away for a minute or two -- say a network or ensemble hiccup > of some type -- then what happens? > Unless the ensemble comes back inside the zk session timeout, the cluster > will go down. > To my knowledge, zk has hiccuped a few times. There was the bug where > sequence numbers rolled around the top causing the ensemble to blip (fixed in > a newer zk). There was another event where some combination of > a leader election and accumulated log files (>100k) caused the > ensemble blip at SU. > At FB apparently the zk session is way up -- > 5minutes -- in case a > top-of-the-rack switch reboots partitioning the network separating nodes from > the zk ensemble and rather than rely on presence of ephemeral nodes, rather, > they depend on heartbeats to determine presence or not of a regionserver (w/ > some smarts so that if all members of a rack disappear at the same time, it > is not likely they all crashed at same time). > I am stating the obvious I know but the base presumption that zk will just > always be there is lazy on our part and we should not be acting as though it > were. > Marking this a brainstorming issue because will need a bit of > discussion/design undoing our current presumption. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8717) ui is inconsistent in its use of server names
[ https://issues.apache.org/jira/browse/HBASE-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8717. -- Resolution: Implemented Fixed elsewhere. > ui is inconsistent in its use of server names > - > > Key: HBASE-8717 > URL: https://issues.apache.org/jira/browse/HBASE-8717 > Project: HBase > Issue Type: Bug > Components: UI, Usability >Reporter: Michael Stack >Priority: Major > > In main master screen, the regionservers are listed showing their hostname > only though the heading is 'ServerName': sss-4 rather than > sss-4,60020,1369949440012. Should we list ServerName here? Would have port > on it. > The dead servers tab shows full ServerName as in sss-4,60020,1369949440012. > The column is named ServerName. This looks right. > The name on the master main page is 'Master: sss-1'. Should it be 'Master: > sss-1,60020,1369949440012'; i.e. full ServerName? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8601) Make ROW bloom work w/ .META.
[ https://issues.apache.org/jira/browse/HBASE-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8601. -- Resolution: Won't Fix Stale. Context is different now. > Make ROW bloom work w/ .META. > - > > Key: HBASE-8601 > URL: https://issues.apache.org/jira/browse/HBASE-8601 > Project: HBase > Issue Type: Bug > Components: Performance >Reporter: Michael Stack >Priority: Major > > I just tried enabling ROW blooms globally but tests against meta were failing > doing getClosestOrBefore. If I waited, they worked. Something odd is going > on here. Would be good having blooms on on .META. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-8009) Fix and reenable the hbase-example unit tests.
[ https://issues.apache.org/jira/browse/HBASE-8009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-8009. -- Resolution: Won't Fix Stale. Context is different now. > Fix and reenable the hbase-example unit tests. > -- > > Key: HBASE-8009 > URL: https://issues.apache.org/jira/browse/HBASE-8009 > Project: HBase > Issue Type: Task > Components: test >Reporter: Michael Stack >Priority: Critical > > The unit tests pass locally for me repeatedly but fail from time to time up > on jenkins. HBASE-7994 disabled them. This issue is about spending the time > to make sure they pass up on jenkins again. They have been disabled because > unit tests have been failing way more often than they have been passing over > the last few months and we want to establish passing tests as the precedent > again. Once that is in place, we can work on bringing back examples. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-7909) How to figure if a Cell is deep or shallow.
[ https://issues.apache.org/jira/browse/HBASE-7909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-7909. -- Resolution: Won't Fix Stale. Context is different now. > How to figure if a Cell is deep or shallow. > --- > > Key: HBASE-7909 > URL: https://issues.apache.org/jira/browse/HBASE-7909 > Project: HBase > Issue Type: Task >Reporter: Michael Stack >Priority: Minor > > The CellScanner interface is how you iterate scanners. It is more bare bones > than java Iterator, explicitly so, to minimize the need for retaining > references to the current Cell. > The Interface currently has get/current to pull the Cell that is currently > loaded in the breech. It also has (had) another method getDeepCopy. This > latter was removed by hbase-7899 "Tools to build cell blocks" as suggested by > [~mcorgan] (and seconded by other reviewers in that they found it > problematic). > So, how then to determine if the Cell you have is a deep or shallow copy? > On the one hand, should we even be concerned? The whole point of our Cell > retrofit, in part, is to force us to disconnect from how the Cell is implemented > so maybe we should just do away w/ this notion of deepCopy altogether and > hope that in action, we don't actually need it and that our fixation is > only because deepCopies are all we ever had when we were exclusively KeyValue. > Or, do we need to add a means of asking a Cell "Are you deep?" or having > deepCopies implement a subInterface -- StableCell or StandaloneCell? > This issue raises the problem but I do not think it critical we deal with it > just now. At least, I do not see imminent need, at least not currently where > we are still Cell backed by "deepCopy" KeyValues. Maybe later when we have > different implementations this issue will come to the fore. Until then, am > fine leaving it as minor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-7025) Metric for how many WAL files a regionserver is carrying
[ https://issues.apache.org/jira/browse/HBASE-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-7025. -- Resolution: Implemented This is implemented I believe (at least I can see a graph that shows WAL counts per host... where I'm sitting) > Metric for how many WAL files a regionserver is carrying > > > Key: HBASE-7025 > URL: https://issues.apache.org/jira/browse/HBASE-7025 > Project: HBase > Issue Type: Improvement > Components: metrics >Reporter: Michael Stack >Priority: Major > > A metric that shows how many WAL files a regionserver is carrying at any one > time would be useful for fingering those servers that are always over the > upper bounds and in need of attention -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-7023) Forward-port HBASE-6727 size-based HBaseServer callQueue throttle from 0.89fb branch
[ https://issues.apache.org/jira/browse/HBASE-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-7023. -- Resolution: Won't Fix Stale. Context is different now. > Forward-port HBASE-6727 size-based HBaseServer callQueue throttle from 0.89fb > branch > > > Key: HBASE-7023 > URL: https://issues.apache.org/jira/browse/HBASE-7023 > Project: HBase > Issue Type: Improvement > Components: IPC/RPC >Reporter: Michael Stack >Assignee: Ted Yu >Priority: Major > Labels: beginner > Attachments: 6727-fb.txt > > > Forward port the size-based throttle that is out in the 0.89fb branch. It's nicer > than what we have in trunk where we just count queue items. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-6902) Add doc and unit test of the various checksum settings
[ https://issues.apache.org/jira/browse/HBASE-6902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-6902. -- Resolution: Won't Fix Stale. Context is different now. > Add doc and unit test of the various checksum settings > -- > > Key: HBASE-6902 > URL: https://issues.apache.org/jira/browse/HBASE-6902 > Project: HBase > Issue Type: Bug > Components: documentation >Affects Versions: 0.95.2 >Reporter: Michael Stack >Priority: Critical > > See HBASE-6868. Doc the options, their pluses and negatives as well as the > bugs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-6154) Pom cleanups; move verion properties above their use, add NO-MVN-MAN-VER, eclipse fixes
[ https://issues.apache.org/jira/browse/HBASE-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-6154. -- Resolution: Won't Fix Stale. Context is different now. > Pom cleanups; move verion properties above their use, add NO-MVN-MAN-VER, > eclipse fixes > --- > > Key: HBASE-6154 > URL: https://issues.apache.org/jira/browse/HBASE-6154 > Project: HBase > Issue Type: Task > Components: pom >Reporter: Michael Stack >Priority: Minor > Labels: delete > > See Jesse comments over in hbase-6145. Good stuff on changes to improve poms. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-5868) jmx bean layout makes no sense
[ https://issues.apache.org/jira/browse/HBASE-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-5868. -- Resolution: Won't Fix Stale. Context is different now. > jmx bean layout makes no sense > -- > > Key: HBASE-5868 > URL: https://issues.apache.org/jira/browse/HBASE-5868 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Critical > Labels: delete > > Top level is 'hadoop' Under 'hadoop', there is 'HBase' MBean. BESIDE this > MBean is one named Master and another named RegionServer. It makes no sense. > Top level should be org.apache.hbase. Inside there should be an MBean per > running server. It should be the server's ServerName, not 'Master' or > 'RegionServer'. > Under RegionServer there is a RegionServer bean [sic], then beside it a > RegionServerStatistics and a RegionServerDynamicStatistics. > I'd think that as they are, they are unusable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-6199) Change PENDING_OPEN scope from pre-rpc open to OPENING to just post-rpc open to OPENING
[ https://issues.apache.org/jira/browse/HBASE-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-6199. -- Resolution: Won't Fix Stale. Context is different now. > Change PENDING_OPEN scope from pre-rpc open to OPENING to just post-rpc open > to OPENING > --- > > Key: HBASE-6199 > URL: https://issues.apache.org/jira/browse/HBASE-6199 > Project: HBase > Issue Type: Improvement >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Attachments: 6199v4.txt, pending_open.txt, pending_open2.txt, > pending_open3.txt > > > PENDING_OPEN currently is a murky state. It's a master in-memory state with > no corresponding znode state that sits between OFFLINE and OPENING states. > The OFFLINE state is set by the master when it goes to open a region. > OPENING is set by the regionserver after it's assumed control of a region and > is moving it through the OPENING process. PENDING_OPEN currently spans the > open rpc invocation. This state is in place pre-open-rpc-invocation, during > open-rpc-invocation, and post-rpc-invocation until we get the OPENING > callback. That PENDING_OPEN covers this many different conditions effectively > makes it unactionable. > This issue proposes PENDING_OPEN only be in place post-rpc-invocation. Now > its meaning is clear as the space between rpc-open-invocation and our > receiving the callback which sets RegionState to OPENING. PENDING_OPEN > becomes actionable too in that if a regionserver dies post > rpc-open-invocation, we know that we can reassign the region. > See > https://issues.apache.org/jira/browse/HBASE-6060?focusedCommentId=13292646&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13292646 > for more discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
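As a reading aid for the HBASE-6199 proposal: the narrowed PENDING_OPEN scope can be modelled as a small state machine in which the state is entered only after the open RPC has returned. This is a minimal sketch under that assumption. `PendingOpenSketch` and its method names are hypothetical and are not HBase's `RegionState` or `AssignmentManager` API; only the state names come from the issue text.

```java
/** Hypothetical model of the states discussed in HBASE-6199; not HBase's RegionState class. */
final class PendingOpenSketch {
    enum State { OFFLINE, PENDING_OPEN, OPENING }

    private State state = State.OFFLINE;

    /** Called once the open RPC has returned; only then does the region become PENDING_OPEN. */
    void openRpcReturned() {
        if (state != State.OFFLINE) {
            throw new IllegalStateException("expected OFFLINE, was " + state);
        }
        state = State.PENDING_OPEN;
    }

    /** Called when the regionserver's OPENING callback arrives. */
    void openingCallback() {
        if (state != State.PENDING_OPEN) {
            throw new IllegalStateException("expected PENDING_OPEN, was " + state);
        }
        state = State.OPENING;
    }

    /**
     * With the narrowed scope, PENDING_OPEN always means "RPC sent, callback not yet seen",
     * so a regionserver death in this state is a clear signal to reassign.
     */
    boolean canReassignOnServerDeath() {
        return state == State.PENDING_OPEN;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        PendingOpenSketch region = new PendingOpenSketch();
        region.openRpcReturned();
        System.out.println(region.state() + " reassignable=" + region.canReassignOnServerDeath());
    }
}
```

The sketch deliberately says nothing about what should happen on a server death in the other states, since the issue only argues for the PENDING_OPEN case.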
[jira] [Resolved] (HBASE-5275) Create migration for hbase-2600 meta table rejigger so regions denoted by end row
[ https://issues.apache.org/jira/browse/HBASE-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-5275. -- Resolution: Won't Fix Stale. Context is different now. > Create migration for hbase-2600 meta table rejigger so regions denoted by end > row > - > > Key: HBASE-5275 > URL: https://issues.apache.org/jira/browse/HBASE-5275 > Project: HBase > Issue Type: Task >Reporter: Michael Stack >Priority: Major > > Chatting with Alex, we'd do as was done previous where we'll can data from > 0.92 and then have a test that unbundles this canned data, migrates it and > then makes sure all still works. Migration test would include verification > of idempotency; i.e. if migration fails midway, we should be able to rerun it. > Canned data should include a meta with splits and WALs to split (migrations > usually require clean shutdown so no WALs should be in place but just in > case... And replication is reading logs) > We were thinking that on startup, we'd check hbase.version file. If not > updated, we'd rewrite .META. offline before starting up. > In offline mode -- open of the .META. regions -- we'd do a rewrite per row > changing the HRegionInfo version from VERSION=1 to VERSION=2. > VERSION=2 is the new format HRegionInfo. > VERSION=2 will use endrow but it will keep its current encoded name (though > it was generated with startrow as input) so we don't have to move stuff > around in filesystem. > New HRIs subsequent to the migration will be written out as VERSION=3. A > VERSION=3 has endrow in its name but the encoded name will be made using > startrow+endrow+regionid+tablename rather than just > startrow+regionid+tablename as in VERSION=1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-3628) Add upper bound on threads for TThreadPoolServer; too many have run into the OOME can't create native thread because thrift spawns w/o bound
[ https://issues.apache.org/jira/browse/HBASE-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-3628. -- Resolution: Won't Fix Stale. Context is different now. > Add upper bound on threads for TThreadPoolServer; too many have run into the > OOME can't create native thread because thrift spawns w/o bound > > > Key: HBASE-3628 > URL: https://issues.apache.org/jira/browse/HBASE-3628 > Project: HBase > Issue Type: Bug > Components: Thrift >Reporter: Michael Stack >Priority: Major > Labels: thrift, thrift2 > > See tail of this thread: > http://search-hadoop.com/m/Ooyif0dZ89/major+hdfs+issues&subj=Re+major+hdfs+issues > We need to hack in something like the below: > {code} > diff --git a/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java > b/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java > index 06621ab..74856af 100644 > --- a/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java > +++ b/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServer.java > @@ -69,6 +69,7 @@ import org.apache.hadoop.hbase.thrift.generated.TRegionInfo; > import org.apache.hadoop.hbase.thrift.generated.TRowResult; > import org.apache.hadoop.hbase.util.Bytes; > import org.apache.thrift.TException; > +import org.apache.thrift.TProcessorFactory; > import org.apache.thrift.protocol.TBinaryProtocol; > import org.apache.thrift.protocol.TCompactProtocol; > import org.apache.thrift.protocol.TProtocolFactory; > @@ -911,9 +912,25 @@ public class ThriftServer { >} else { > transportFactory = new TTransportFactory(); >} > - > - LOG.info("starting HBase ThreadPool Thrift server on " + listenAddress > + ":" + Integer.toString(listenPort)); > - server = new TThreadPoolServer(processor, serverTransport, > transportFactory, protocolFactory); > + TThreadPoolServer.Options poolServerOptions = > +new TThreadPoolServer.Options(); > + int maxWorkerThreads = Integer.MAX_VALUE; > + if (cmd.hasOption("maxWorkerThreads")) { > +try { 
> + maxWorkerThreads = > +Integer.parseInt(cmd.getOptionValue("maxWorkerThreads", "" + > Integer.MAX_VALUE)); > +} catch (NumberFormatException e) { > + LOG.error("Could not parse maxWorkerThreads option", e); > + printUsageAndExit(options, -1); > +} > + } > + poolServerOptions.maxWorkerThreads = maxWorkerThreads; > + LOG.info("starting HBase ThreadPool Thrift server on " + listenAddress > + > +":" + Integer.toString(listenPort) + > +", maxWorkerThreads=" + maxWorkerThreads); > + server = new TThreadPoolServer(processor, serverTransport, > +transportFactory, transportFactory, protocolFactory, protocolFactory, > +poolServerOptions); > } > {code} > ...only with better factoring AND exposing other options in Options; they > look useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
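The idea behind the HBASE-3628 diff, an explicit ceiling on worker threads so that a burst of connections is rejected instead of exhausting native threads, can be shown with the JDK alone. The sketch below is an assumption-laden illustration, not the Thrift API: it uses `java.util.concurrent.ThreadPoolExecutor` in place of `TThreadPoolServer`, and the class and method names are made up for the example. Unlike the quoted patch, which prints usage and exits on a malformed value, this version falls back to a default.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of bounding worker threads using only the JDK. */
final class BoundedWorkersSketch {
    /** Parses a maxWorkerThreads value; returns the fallback when absent, malformed, or not positive. */
    static int parseMaxWorkerThreads(String raw, int fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            int parsed = Integer.parseInt(raw.trim());
            return parsed > 0 ? parsed : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /**
     * Pool that hands each task straight to a thread and never grows past maxWorkerThreads.
     * Once every worker is busy, further submissions are rejected rather than spawning threads.
     */
    static ThreadPoolExecutor newBoundedPool(int maxWorkerThreads) {
        return new ThreadPoolExecutor(
            0, maxWorkerThreads, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
    }

    public static void main(String[] args) {
        int max = parseMaxWorkerThreads(args.length > 0 ? args[0] : null, 16);
        ThreadPoolExecutor pool = newBoundedPool(max);
        System.out.println("maxWorkerThreads=" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Rejecting work when the pool is full is one policy among several; a bounded queue in front of the pool would trade rejection for latency.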
[jira] [Resolved] (HBASE-3675) hbase.hlog.split.skip.errors is false by default but we don't act properly when its true; can make for inconsistent view
[ https://issues.apache.org/jira/browse/HBASE-3675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-3675. -- Resolution: Won't Fix Stale. Context is different now. > hbase.hlog.split.skip.errors is false by default but we don't act properly > when its true; can make for inconsistent view > > > Key: HBASE-3675 > URL: https://issues.apache.org/jira/browse/HBASE-3675 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Critical > > So, by default hbase.hlog.split.skip.errors is false so we should not be > skipping errors (What should we do, abort?). > Anyways, see https://issues.apache.org/jira/browse/HBASE-3674. It has a > checksum error on the near-to-last log BUT it writes out the recovered.edits gotten > so far. We then go and assign the regions anyways, applying the edits gotten so > far, though there are edits behind the checksum error still to be recovered. > Not good. -- This message was sent by Atlassian Jira (v8.3.4#803005)
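To make the HBASE-3675 complaint concrete: when errors are not to be skipped, a failed split should not leave behind partial recovered edits that then get applied at region assignment. The sketch below shows one possible policy under stated assumptions. The issue itself leaves the right behaviour open ("What should we do, abort?"), so discarding partial output here is an assumption, and `SplitSkipErrorsSketch`, `Result`, and the use of `null` to stand for an unreadable edit are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a log split honouring a skip-errors flag; not HBase's splitter. */
final class SplitSkipErrorsSketch {
    /** Outcome of a split attempt. */
    static final class Result {
        final List<String> recovered;
        final int skipped;
        final boolean safeToAssign;

        Result(List<String> recovered, int skipped, boolean safeToAssign) {
            this.recovered = recovered;
            this.skipped = skipped;
            this.safeToAssign = safeToAssign;
        }
    }

    /** A null entry stands in for an edit that cannot be read (for example, a checksum failure). */
    static Result split(List<String> edits, boolean skipErrors) {
        List<String> recovered = new ArrayList<>();
        int skipped = 0;
        for (String edit : edits) {
            if (edit == null) {
                if (!skipErrors) {
                    // Assumed policy: drop partial output and block assignment.
                    return new Result(new ArrayList<>(), skipped, false);
                }
                skipped++;
                continue;
            }
            recovered.add(edit);
        }
        return new Result(recovered, skipped, true);
    }

    public static void main(String[] args) {
        List<String> edits = new ArrayList<>();
        edits.add("put-1");
        edits.add(null);
        edits.add("put-2");
        System.out.println("strict safeToAssign=" + split(edits, false).safeToAssign);
        System.out.println("lenient recovered=" + split(edits, true).recovered.size());
    }
}
```

With the flag set to true the operator has opted in to losing the unreadable edit, so the sketch reports how many were skipped and lets assignment go ahead.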
[jira] [Resolved] (HBASE-3638) If a FS bootstrap, need to also ensure ZK is cleaned
[ https://issues.apache.org/jira/browse/HBASE-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-3638. -- Resolution: Won't Fix Stale. Context is different now. > If a FS bootstrap, need to also ensure ZK is cleaned > > > Key: HBASE-3638 > URL: https://issues.apache.org/jira/browse/HBASE-3638 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Minor > Labels: beginner > > In a test environment where a cycle of start, operation, kill hbase (repeat), > noticed that we were doing a bootstrap on startup but then we were picking up > the previous cycles zk state. It made for a mess in the test. > Last thing seen on previous cycle was: > {code} > 2011-03-11 06:33:36,708 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_OPENING, server=X.X.X.60020,1299853933073, > region=1028785192/.META. > {code} > Then, in the messed up cycle I saw: > {code} > 2011-03-11 06:42:48,530 INFO org.apache.hadoop.hbase.master.MasterFileSystem: > BOOTSTRAP: creating ROOT and first META regions > . > {code} > Then after setting watcher on .META., we get a > {code} > 2011-03-11 06:42:58,301 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Processing region > .META.,,1.1028785192 in state RS_ZK_REGION_OPENED > 2011-03-11 06:42:58,302 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Region in transition > 1028785192 references a server no longer up X.X.X; letting RIT timeout so > will be assigned elsewhere > {code} > We're all confused. > Should at least clear our zk if a bootstrap happened. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-2958) When hbase.hlog.split.skip.errors is set to false, we fail the split but thats it
[ https://issues.apache.org/jira/browse/HBASE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-2958. -- Resolution: Won't Fix Stale. Context is different now. > When hbase.hlog.split.skip.errors is set to false, we fail the split but > thats it > - > > Key: HBASE-2958 > URL: https://issues.apache.org/jira/browse/HBASE-2958 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Major > Labels: delete > > When hbase.hlog.split.skip.errors is set to false, if we encounter an error > splitting, splitting stops and exception is let propagate up the stack. I > see that its caught in the new MasterFileSystem class and logged, but thats > it. It would seem processing continues BUT we've dropped the edits in the > split. We need to do better (default is hbase.hlog.split.skip.errors set to > false -- i.e. skip errors but keep going). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-2236) Upper bound of outstanding WALs can be overrun; take 2 (take 1 was hbase-2053)
[ https://issues.apache.org/jira/browse/HBASE-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-2236. -- Resolution: Won't Fix Still an issue but context is different now. Resolving this one. > Upper bound of outstanding WALs can be overrun; take 2 (take 1 was hbase-2053) > -- > > Key: HBASE-2236 > URL: https://issues.apache.org/jira/browse/HBASE-2236 > Project: HBase > Issue Type: Bug > Components: regionserver, wal >Reporter: Michael Stack >Priority: Critical > Labels: moved_from_0_20_5 > > So hbase-2053 is not aggressive enough. WALs can still overwhelm the upper > limit on log count. While the code added by HBASE-2053, when done, will > ensure we let go of the oldest WAL, to do it, we might have to flush many > regions. E.g: > {code} > 2010-02-15 14:20:29,351 INFO org.apache.hadoop.hbase.regionserver.HLog: Too > many hlogs: logs=45, maxlogs=32; forcing flush of 5 regions(s): > test1,193717,1266095474624, test1,194375,1266108228663, > test1,195690,1266095539377, test1,196348,1266095539377, > test1,197939,1266069173999 > {code} > This takes time. If we are taking on edits at a furious rate, we might have > rolled the log again, meantime, maybe more than once. > Also log rolls happen inline with a put/delete as soon as it hits the 64MB > (default) boundary whereas the necessary flushing is done in background by a > single thread and the memstore can overrun the (default) 64MB size. Flushes > needed to release logs will be mixed in with "natural" flushes as memstores > fill. Flushes may take longer than the writing of an HLog because they can > be larger. > So, on an RS that is struggling the tendency would seem to be for a slight > rise in WALs. Only if the RS gets a breather will the flusher catch up. > If HBASE-2087 happens, then the count of WALs gets a boost.
> Ideas to fix this for good would be : > + Priority queue for queuing up flushes with those that are queued to free up > WALs having priority > + Improve the HBASE-2053 code so that it will free more than just the last > WAL, maybe even queuing flushes so we clear all WALs such that we are back > under the maximum WALS threshold again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
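The first idea in HBASE-2236 above (a priority queue in which flushes queued to free up WALs go ahead of "natural" flushes) could be sketched roughly as follows. This is only an illustration under stated assumptions: `FlushRequest` and `FlushQueueSketch` are hypothetical names, not HBase classes, and the real flusher tracks far more state than a region name.

```java
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical flush request; a stand-in for illustration, not an HBase class. */
final class FlushRequest implements Comparable<FlushRequest> {
  final String regionName;
  final boolean freesWal; // true when the flush is needed to let go of an old WAL
  final long sequence;    // arrival order, used as a FIFO tie-breaker

  FlushRequest(String regionName, boolean freesWal, long sequence) {
    this.regionName = regionName;
    this.freesWal = freesWal;
    this.sequence = sequence;
  }

  @Override
  public int compareTo(FlushRequest other) {
    // WAL-freeing flushes sort ahead of natural (memstore-full) flushes.
    if (freesWal != other.freesWal) {
      return freesWal ? -1 : 1;
    }
    // Within the same class, keep arrival order.
    return Long.compare(sequence, other.sequence);
  }
}

/** Hypothetical queue feeding a single background flusher thread. */
final class FlushQueueSketch {
  private final PriorityBlockingQueue<FlushRequest> queue = new PriorityBlockingQueue<>();
  private long nextSequence = 0;

  synchronized void requestFlush(String regionName, boolean freesWal) {
    queue.add(new FlushRequest(regionName, freesWal, nextSequence++));
  }

  /** Returns the next region to flush, or null when nothing is queued. */
  String pollNextRegion() {
    FlushRequest next = queue.poll();
    return next == null ? null : next.regionName;
  }
}
```

With this ordering, a flush requested because of "Too many hlogs" is picked up before earlier natural flushes, which is the behavior the idea asks for; whether that starves natural flushes under sustained load is a design question the sketch does not address.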
[jira] [Resolved] (HBASE-1872) Dirty, fast, kill table script
[ https://issues.apache.org/jira/browse/HBASE-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-1872. -- Resolution: Won't Fix Stale > Dirty, fast, kill table script > -- > > Key: HBASE-1872 > URL: https://issues.apache.org/jira/browse/HBASE-1872 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Major > Labels: beginner > Attachments: kill_table.rb > > > Some fellas embedding hbase want to be able to kill tables quickly between > tests; they don't want to have to wait on enable/disable stuff. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-1611) Have shell output binary hex-encoded rather than octal-encoded
[ https://issues.apache.org/jira/browse/HBASE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-1611. -- Resolution: Won't Fix Stale > Have shell output binary hex-encoded rather than octal-encoded > -- > > Key: HBASE-1611 > URL: https://issues.apache.org/jira/browse/HBASE-1611 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Priority: Major > Labels: beginner > > Native Ruby String dump and inspect output unprintables in octal. Don't seem > to be able to change that fact. Figure way to do them as hex to match > binaries in UI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23572) In 'HBCK Report', distinguish between live, dead, and unknown servers
[ https://issues.apache.org/jira/browse/HBASE-23572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23572. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Hadoop Flags: Reviewed Assignee: Michael Stack Resolution: Fixed Merged manually to branch-2.2+. Thanks for review [~busbey] > In 'HBCK Report', distinguish between live, dead, and unknown servers > - > > Key: HBASE-23572 > URL: https://issues.apache.org/jira/browse/HBASE-23572 > Project: HBase > Issue Type: Bug >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Trivial > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > Debugging, when viewing 'HBCK Report' sections, it helps if we know if > referenced server is online, dead, or unknown. > Add ornamentation so that when we mention a servername in 'HBCK Report', if > live, then show the server as link (to live server), if dead, show it in > italics, and if unknown, show it plain text. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23600) Improve chances of edits landing into hbase:meta even when high load
Michael Stack created HBASE-23600: - Summary: Improve chances of edits landing into hbase:meta even when high load Key: HBASE-23600 URL: https://issues.apache.org/jira/browse/HBASE-23600 Project: HBase Issue Type: Improvement Components: rpc Reporter: Michael Stack Of late I've been testing clusters under high load to study failures and to figure how to effect recovery if cluster is unable to recover on its own. One interesting case is a RS that is struggling mostly because writes to HDFS are backed up and sync calls are running very slow taking a long time to complete. The RPC backs up with waiting requests, and eventually goes over one or more bounds. The RS then starts throwing CallQueueTooBigExceptions. This struggling state can last a good while. We throw CQTBEs whatever the priority of the incoming request. We throw CQTBE in two places; on original parse of the request before we dispatch it on a handler -- here we check size of all queues and if over the threshold (default 1G), throw the exception -- and then later when we dispatch the request to internal queues, we'll count items in queue and if over default in any one queue (default is 10 * handler count), we'll fail dispatch and again throw CQTBE. We shouldn't be running w/ big queues. We should be rejecting Requests we know we'll never process in time before client loses interest (See the CoDel thesis and the implementations added a good while back). TODO. Meantime I was looking to see if having read a high-priority request, if rather than dropping it on the floor, instead, what would happen if I let it through even if above thresholds? My main concern is edits to hbase:meta. When sustained, saturated load on the RS carrying hbase:meta, edits may not land. The result is incomplete Procedures and a disorientated Master. I was playing w/ trying to put off the corruption as long as possible, experimenting (CoDel doesn't do priority at first blush; we probably want to add this). 
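The two rejection points described above, plus the experiment of letting high-priority requests through, might look roughly like this. It is a minimal sketch and not the RPC server code: `CallAdmissionSketch`, its method names, and the `highPriorityThreshold` parameter are assumptions made for illustration; only the 1G total-size default and the "10 * handler count" per-queue default come from the description.

```java
/** Hypothetical two-stage admission check; names are illustrative, not HBase API. */
final class CallAdmissionSketch {
  /** "default 1G" total size across all call queues, per the description. */
  static final long DEFAULT_MAX_QUEUE_BYTES = 1024L * 1024 * 1024;

  private final long maxQueueBytes;
  private final int maxCallsPerQueue;
  private final int highPriorityThreshold; // assumption: priority >= this bypasses limits

  CallAdmissionSketch(long maxQueueBytes, int handlerCount, int highPriorityThreshold) {
    this.maxQueueBytes = maxQueueBytes;
    this.maxCallsPerQueue = 10 * handlerCount; // "default is 10 * handler count"
    this.highPriorityThreshold = highPriorityThreshold;
  }

  /** First check, made when the request is parsed and before dispatch to a handler. */
  boolean admitOnParse(long queuedBytes, long callBytes, int priority) {
    if (priority >= highPriorityThreshold) {
      return true; // the experiment: do not drop high-priority calls on the floor
    }
    return queuedBytes + callBytes <= maxQueueBytes;
  }

  /** Second check, made when the request is dispatched to an internal queue. */
  boolean admitOnDispatch(int callsInQueue, int priority) {
    if (priority >= highPriorityThreshold) {
      return true;
    }
    return callsInQueue < maxCallsPerQueue;
  }
}
```

A caller would answer `false` from either method with a CallQueueTooBigException. Letting high-priority calls bypass both limits means the queues are no longer strictly bounded, so this only puts off trouble; it does not replace rejecting requests that cannot be served in time.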
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23596) HBCKServerCrashProcedure can double assign
Michael Stack created HBASE-23596: - Summary: HBCKServerCrashProcedure can double assign Key: HBASE-23596 URL: https://issues.apache.org/jira/browse/HBASE-23596 Project: HBase Issue Type: Bug Components: proc-v2 Reporter: Michael Stack Fix For: 2.2.3 The new SCP that does SCP plus cleanup 'Unknown Servers' with mentions in hbase:meta added by the below can make for double assignments. {code} commit c238891a26734e1e4276b6b1677a58cf83de5dc4 Author: stack Date: Wed Nov 13 22:36:26 2019 -0800 HBASE-23282 HBCKServerCrashProcedure for 'Unknown Servers' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23593) Stalled SCP Assigns
Michael Stack created HBASE-23593: - Summary: Stalled SCP Assigns Key: HBASE-23593 URL: https://issues.apache.org/jira/browse/HBASE-23593 Project: HBase Issue Type: Bug Components: proc-v2 Affects Versions: 2.2.3 Reporter: Michael Stack I'm stuck on this one so doing a write up here in case anyone else has ideas. Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem but from time to time I'll see the SCP stuck waiting on an Assign to finish. The assign seems stuck at the queuing of the OpenRegionProcedure. We've stored the procedure but then not a peek thereafter. Later we'll see complaint that the region is STUCK. Doesn't recover. Doesn't run. Basic story is as follows: Server dies: {code} 2019-12-17 11:10:42,002 INFO org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [s011.example.org,16020,1576561318119] 2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added s011.example.org,16020,1576561318119; numProcessing=1 ... 2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: Started processing s011.example.org,16020,1576561318119; numProcessing=1 {code} The dead server restarts which purges the old server from dead server and processing lists: {code} 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it from the dead servers list {code} even though we are still processing logs in the SCP of the old server... 
{code} 2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: Archived processed log hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 to hdfs://nameservice1/hbase/oldWALs/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 {code} I thought early purge of deadserver was a problem but I don't think so after study. WALS split took two minutes to split and server was removed from dead servers... three minutes earlier... {code} 2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log files in [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] in 143236ms {code} Almost immediately we get this: {code} 2019-12-17 11:14:08,649 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119, table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4 {code} For this region assign, I see the SCP proc making an assign for this region which then makes a subtask to OpenRegionProcedure. This is where it gets stuck. No progress after this. The procedure does not come alive to run.
Here are logs for the ORP pid=421761: {code} 2019-12-17 11:38:34,761 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] 2019-12-17 11:38:34,765 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure 2019-12-17 11:38:34,770 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=421761, ppid=402475, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th rollback step {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-22920) github pr testing job should use dev-support script for gathering machine info
[ https://issues.apache.org/jira/browse/HBASE-22920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-22920. --- Fix Version/s: 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Beata Sudi... please sign into JIRA so we can credit you this issue. Meanwhile resolving w/o an assignee. > github pr testing job should use dev-support script for gathering machine info > -- > > Key: HBASE-22920 > URL: https://issues.apache.org/jira/browse/HBASE-22920 > Project: HBase > Issue Type: Improvement > Components: community, test >Reporter: Sean Busbey >Priority: Major > Labels: beginner > Fix For: 3.0.0 > > > the PR tester {{Jenkinsfile_GitHub}} has its own set of commands for > gathering information about the build environment it runs in. Instead it > should rely on the {{dev-support/gather_machine_environment.sh}} that gets > used by nightly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23572) In 'HBCK Report', distinguish between live, dead, and unknown servers
Michael Stack created HBASE-23572: - Summary: In 'HBCK Report', distinguish between live, dead, and unknown servers Key: HBASE-23572 URL: https://issues.apache.org/jira/browse/HBASE-23572 Project: HBase Issue Type: Bug Reporter: Michael Stack Debugging, when viewing 'HBCK Report' sections, it helps if we know if referenced server is online, dead, or unknown. Add ornamentation so that when we mention a servername in 'HBCK Report', if live, then show the server as link (to live server), if dead, show it in italics, and if unknown, show it plain text. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23570) Point users to the async-profiler home page if diagrams are coming up blank
[ https://issues.apache.org/jira/browse/HBASE-23570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23570. --- Fix Version/s: 3.0.0 Resolution: Fixed Merged these one-liners to master. > Point users to the async-profiler home page if diagrams are coming up blank > --- > > Key: HBASE-23570 > URL: https://issues.apache.org/jira/browse/HBASE-23570 > Project: HBase > Issue Type: Bug > Components: profiler >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Trivial > Fix For: 3.0.0 > > > Add minor note on servlet and to doc pointing folks to async-profiler home > page if diagrams are coming up blank -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23570) Point users to the async-profiler home page if diagrams are coming up blank
Michael Stack created HBASE-23570: - Summary: Point users to the async-profiler home page if diagrams are coming up blank Key: HBASE-23570 URL: https://issues.apache.org/jira/browse/HBASE-23570 Project: HBase Issue Type: Bug Components: profiler Reporter: Michael Stack Assignee: Michael Stack Add minor note on servlet and to doc pointing folks to async-profiler home page if diagrams are coming up blank -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23555) TestQuotaThrottle is broken
[ https://issues.apache.org/jira/browse/HBASE-23555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23555. --- Fix Version/s: 2.3.0 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Merged to master and backported to branch-2. Thanks for the fix [~meiyi] > TestQuotaThrottle is broken > --- > > Key: HBASE-23555 > URL: https://issues.apache.org/jira/browse/HBASE-23555 > Project: HBase > Issue Type: Bug >Reporter: Yi Mei >Assignee: Yi Mei >Priority: Minor > Fix For: 3.0.0, 2.3.0 > > > TestQuotaThrottle is broken now. And it is annotated as Ignore because it's > flakey so the Jenkins test can not report it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23554) Encoded regionname to regionname utility
[ https://issues.apache.org/jira/browse/HBASE-23554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23554. --- Hadoop Flags: Reviewed Resolution: Fixed Merged to branch-2.2+. Thanks for reviews [~busbey] and [~zhangduo] > Encoded regionname to regionname utility > > > Key: HBASE-23554 > URL: https://issues.apache.org/jira/browse/HBASE-23554 > Project: HBase > Issue Type: Bug > Components: shell >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > Debugging I keep wanting to look at region state/transition in meta but all I > have is encoded region name gleaned from log or from some parts of the UI. I > find myself doing dump of the meta table to a text file just to search > especially if region replicas are enabled (their encoded name is not even > mentioned in hbase:meta). Utility that let me lookup regionname using encoded > regionname would be handy. > This actually exists already... almost. The Admin Service has a > getRegionInfo. Usually it just returns RegionInfo if passed a region name. It > can add a bit more info if it is a MOB Region and the query is against Master or > if the query is against the hosting RegionServer, it can tack on some > compaction state detail. Wouldn't take much to extend this existing facility > so one could query w/ encoded name. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23556) Minor ChoreService Cleanup
[ https://issues.apache.org/jira/browse/HBASE-23556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23556. --- Fix Version/s: master Hadoop Flags: Reviewed Resolution: Fixed Pushed on Master. Thanks for patch [~belugabehr] > Minor ChoreService Cleanup > -- > > Key: HBASE-23556 > URL: https://issues.apache.org/jira/browse/HBASE-23556 > Project: HBase > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > Fix For: master > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23561) Look up of Region in Master by encoded region name is O(n)
Michael Stack created HBASE-23561: - Summary: Look up of Region in Master by encoded region name is O(n) Key: HBASE-23561 URL: https://issues.apache.org/jira/browse/HBASE-23561 Project: HBase Issue Type: Bug Reporter: Michael Stack
{{
public RegionState getRegionState(final String encodedRegionName) {
  // TODO: Need a map but it is just dispatch merge...
  for (RegionStateNode node: regionsMap.values()) {
    if (node.getRegionInfo().getEncodedName().equals(encodedRegionName)) {
      return node.toRegionState();
    }
  }
  return null;
}
}}
It is not used much, so marking this as trivial. -- This message was sent by Atlassian Jira (v8.3.4#803005)
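The TODO in the snippet above asks for a map. One way to get a constant-time lookup, sketched here with hypothetical stand-in types rather than the Master's real RegionStates and RegionStateNode classes, is to maintain a second index keyed by encoded name alongside the existing one:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a secondary index; not the actual HBase implementation. */
final class EncodedNameIndexSketch {
  /** Minimal stand-in for a region state node. */
  static final class Node {
    final String regionName;
    final String encodedName;
    final String state;

    Node(String regionName, String encodedName, String state) {
      this.regionName = regionName;
      this.encodedName = encodedName;
      this.state = state;
    }
  }

  private final Map<String, Node> byRegionName = new ConcurrentHashMap<>();
  private final Map<String, Node> byEncodedName = new ConcurrentHashMap<>();

  /** Registers the node under both keys. */
  void put(Node node) {
    byRegionName.put(node.regionName, node);
    byEncodedName.put(node.encodedName, node);
  }

  /** Removes the node from both indexes so they stay in step. */
  void remove(String regionName) {
    Node removed = byRegionName.remove(regionName);
    if (removed != null) {
      // Two-argument remove only drops the entry if it still maps to this node.
      byEncodedName.remove(removed.encodedName, removed);
    }
  }

  /** Hash lookup instead of scanning every node; returns null when unknown. */
  Node getByEncodedName(String encodedName) {
    return byEncodedName.get(encodedName);
  }
}
```

The cost is a second map that every add and remove must keep consistent, which may be why the issue notes the method is little used and treats the linear scan as low priority.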
[jira] [Created] (HBASE-23554) Encoded regionname to regionname utility
Michael Stack created HBASE-23554: - Summary: Encoded regionname to regionname utility Key: HBASE-23554 URL: https://issues.apache.org/jira/browse/HBASE-23554 Project: HBase Issue Type: Bug Components: shell Reporter: Michael Stack Assignee: Michael Stack Fix For: 3.0.0, 2.3.0, 2.2.3 Debugging I keep wanting to look at region state/transition in meta but all I have is encoded region name gleaned from log or from some parts of the UI. I find myself doing dump of the meta table to a text file just to search especially if region replicas are enabled (their encoded name is not even mentioned in hbase:meta). Utility that let me lookup regionname using encoded regionname would be handy. This actually exists already... almost. The Admin Service has a getRegionInfo. Usually it just returns RegionInfo if passed a region name. It can add a bit more info if it is a MOB Region and the query is against Master or if the query is against the hosting RegionServer, it can tack on some compaction state detail. Wouldn't take much to extend this existing facility so one could query w/ encoded name. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23369) Auto-close 'unknown' Regions reported as OPEN on RegionServers
Michael Stack created HBASE-23369: - Summary: Auto-close 'unknown' Regions reported as OPEN on RegionServers Key: HBASE-23369 URL: https://issues.apache.org/jira/browse/HBASE-23369 Project: HBase Issue Type: Bug Reporter: Michael Stack In old days, if a RegionServer reported a variance that didn't agree w/ Master view of the cluster, we'd kill the RegionServer. Lately, in tests that overrun a cluster, after a sustained high-load, Master can start failing its updates against Meta (CallQueueTooBigException <= More on this later). It then can lose proper accounting of all Region members. One variant has a RegionServer reporting its list of open Regions to the Master and the Master doesn't 'know' of a particular Region or the Master may know the Region but expects it open on another RegionServer. Here is an example of how it looks each time RS reports: {code} 2019-12-03 07:07:00,757 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: No t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode but reported ONLINE at server.example.org,16020,1575354666245 (inServerRegionList=false). 2019-12-03 07:07:03,793 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: No t1,08f5c285,1573094375485.ee78a0c951c1c902d8f3f3912394a0e5. RegionStateNode but reported ONLINE at server.example.org,16020,1575354666245 (inServerRegionList=false). {code} Will also show as an 'inconsistency' in the 'HBCK' tab on the Master UI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23117) Bad enum in hbase:meta info:state column can fail loadMeta and stop startup
[ https://issues.apache.org/jira/browse/HBASE-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23117. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Pushed to branch-2.2+ . Thanks for the fix [~sandeep.pal] (thanks too to the reviewers). > Bad enum in hbase:meta info:state column can fail loadMeta and stop startup > --- > > Key: HBASE-23117 > URL: https://issues.apache.org/jira/browse/HBASE-23117 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Michael Stack >Assignee: Sandeep Pal >Priority: Minor > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > Had a bad value in info:state field in meta and it made it so couldn't start > up the cluster; loadMeta would not succeed. If a bad state, should note it, > compensate, and move on. > The bad entry was an own goal that happened while trying to fix other issues > in a pre-hbck2 cluster. > Here was the exception: > {code} > java.lang.IllegalArgumentException: No enum constant > org.apache.hadoop.hbase.master.RegionState.State.1 > at java.lang.Enum.valueOf(Enum.java:238) > at > org.apache.hadoop.hbase.master.RegionState$State.valueOf(RegionState.java:37) > at > org.apache.hadoop.hbase.master.assignment.RegionStateStore.getRegionState(RegionStateStore.java:338) > at > org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMetaEntry(RegionStateStore.java:116) > at > org.apache.hadoop.hbase.master.assignment.RegionStateStore.access$100(RegionStateStore.java:59) > at > org.apache.hadoop.hbase.master.assignment.RegionStateStore$1.visit(RegionStateStore.java:87) > at > org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:769) > at > org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:734) > at > org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:690) > at > org.apache.hadoop.hbase.MetaTableAccessor.fullScanRegions(MetaTableAccessor.java:220) > at > 
org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMeta(RegionStateStore.java:77) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.loadMeta(AssignmentManager.java:1248) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.joinCluster(AssignmentManager.java:1209) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:998) > at > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2260) > at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:583) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
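The stack trace above comes from `Enum.valueOf` throwing IllegalArgumentException on the value `1`. The "note it, compensate, and move on" behavior the issue asks for could be sketched as below. This is an illustration only: `LenientStateParserSketch` is a hypothetical name, its `State` enum is a small stand-in for `RegionState.State`, and the committed fix may well differ.

```java
import java.util.Optional;

/** Hypothetical lenient parser; a sketch, not the committed HBASE-23117 patch. */
final class LenientStateParserSketch {
  /** Illustrative subset standing in for RegionState.State. */
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  /**
   * Parses the raw info:state value. A bad value is logged and reported as
   * empty so the caller can compensate instead of failing the whole meta scan.
   */
  static Optional<State> parse(String raw) {
    if (raw == null) {
      return Optional.empty();
    }
    try {
      return Optional.of(State.valueOf(raw.trim()));
    } catch (IllegalArgumentException e) {
      // Note the bad entry and move on rather than aborting startup.
      System.err.println("Ignoring unparseable region state '" + raw + "'");
      return Optional.empty();
    }
  }
}
```

What the caller does with an empty result (treat the region as offline, skip the row, or flag it for the operator) is a policy choice the sketch leaves open.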
[jira] [Resolved] (HBASE-23332) [HBCKReport] Split Regions shown as Overlaps in 'Overlap' section
[ https://issues.apache.org/jira/browse/HBASE-23332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23332. --- Resolution: Cannot Reproduce Resolving. Lost logs. Seems like root cause is corrupt procedure. Spent time verifying we don't drop 'split/offline' flags when serializing to hbase:meta and that seems fine. Resolving because unable to debug. > [HBCKReport] Split Regions shown as Overlaps in 'Overlap' section > - > > Key: HBASE-23332 > URL: https://issues.apache.org/jira/browse/HBASE-23332 > Project: HBase > Issue Type: Bug > Components: hbck2, UI >Reporter: Michael Stack >Priority: Major > > The new 'HBCK Report' page has to be exacting else makes for wild goose chase > or worse, operator damage of running cluster. > I just came across instances where split parents are reported as overlapping > their daughters: > {code} > {ENCODED => 22776817918e40d0ba93eb48314d65a1, NAME => > 't1,2ac082e1,1572669261019.22776817918e40d0ba93eb48314d65a1.', STARTKEY => > '2ac082e1', ENDKEY => '2b020c18'} {ENCODED => > 8cbe15b2f59d69974357e8800a0bfbbc, NAME => > 't1,2ac082e1,1574362260851.8cbe15b2f59d69974357e8800a0bfbbc.', STARTKEY => > '2ac082e1', ENDKEY => '2ae3529d-1d72-4250-9bd8-4e9b9959284f'} > {ENCODED => 22776817918e40d0ba93eb48314d65a1, NAME => > 't1,2ac082e1,1572669261019.22776817918e40d0ba93eb48314d65a1.', STARTKEY => > '2ac082e1', ENDKEY => '2b020c18'} {ENCODED => > bd062ce8e9c99a6988f0a8223168e028, NAME => > 't1,2ae3529d-1d72-4250-9bd8-4e9b9959284f,1574362260851.bd062ce8e9c99a6988f0a8223168e028.', > STARTKEY => '2ae3529d-1d72-4250-9bd8-4e9b9959284f', ENDKEY => > '2b020c18'} > {code} > Need to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23332) [HBCKReport] Split Regions shown as Overlaps in 'Overlap' section
Michael Stack created HBASE-23332: - Summary: [HBCKReport] Split Regions shown as Overlaps in 'Overlap' section Key: HBASE-23332 URL: https://issues.apache.org/jira/browse/HBASE-23332 Project: HBase Issue Type: Bug Components: hbck2, UI Reporter: Michael Stack The new 'HBCK Report' page has to be exacting else makes for wild goose chase or worse, operator damage of running cluster. I just came across instances where split parents are reported as overlapping their daughters: {code} {ENCODED => 22776817918e40d0ba93eb48314d65a1, NAME => 't1,2ac082e1,1572669261019.22776817918e40d0ba93eb48314d65a1.', STARTKEY => '2ac082e1', ENDKEY => '2b020c18'} {ENCODED => 8cbe15b2f59d69974357e8800a0bfbbc, NAME => 't1,2ac082e1,1574362260851.8cbe15b2f59d69974357e8800a0bfbbc.', STARTKEY => '2ac082e1', ENDKEY => '2ae3529d-1d72-4250-9bd8-4e9b9959284f'} {ENCODED => 22776817918e40d0ba93eb48314d65a1, NAME => 't1,2ac082e1,1572669261019.22776817918e40d0ba93eb48314d65a1.', STARTKEY => '2ac082e1', ENDKEY => '2b020c18'} {ENCODED => bd062ce8e9c99a6988f0a8223168e028, NAME => 't1,2ae3529d-1d72-4250-9bd8-4e9b9959284f,1574362260851.bd062ce8e9c99a6988f0a8223168e028.', STARTKEY => '2ae3529d-1d72-4250-9bd8-4e9b9959284f', ENDKEY => '2b020c18'} {code} Need to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23280) Purge rep_barrier:seqnumDuringOpen on delete of Region
[ https://issues.apache.org/jira/browse/HBASE-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23280. --- Resolution: Not A Problem Resolving as 'Not a problem' any more after subtask which runs the ReplicationBarrierCleaner when hbck2 fixMeta is invoked and because of HBASE-23294 which fixed a bug in RBC. > Purge rep_barrier:seqnumDuringOpen on delete of Region > -- > > Key: HBASE-23280 > URL: https://issues.apache.org/jira/browse/HBASE-23280 > Project: HBase > Issue Type: Bug > Components: Replication >Reporter: Michael Stack >Priority: Major > > The Region GC Procedure only cleans the 'info' column family. We also write > a rep_barrier column family as of HBASE-20115 . HBASE-20117 adds a chore to > clean them up after-the-fact. I've not studied how rep_barrier works (There > is a comment in MetaTableAccessor to add explanation). > This issue is about adding the deletion of the rep_barrier content on region > delete ([~zhangduo] will this mess up serial replication?). > I want to clean out these rows. They occasionally can be misinterpreted in > places such as the hbck report as 'Orphan Regions' or in simple loading tools, we'll > find the rep_barrier row and then fail because no accompanying > info:regioninfo. > Perhaps removing rep_barrier column family promptly is the wrong thing to > do... we need the lag for replication to catch up. Let me know [~zhangduo]. > Here is what they look like: > {code} > hbase(main):050:0> get 'hbase:meta', > ',22d0e538,1572669183985.6aa8710020b8a4f9ea290539fc254a76.' > COLUMN > CELL > rep_barrier:seqnumDuringOpen > timestamp=1573272944262, value=\x00\x00\x00\x00\x00\x00\x00\x02 > {code} > They get updated on split and when location moves. I don't seem to be able to > disable this facility -- it is on always. It is also called 'unused' in title of > HBASE-20117. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23307) Add running of ReplicationBarrierCleaner to hbck2 fixMeta invocation
[ https://issues.apache.org/jira/browse/HBASE-23307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23307. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Merged to branch-2.2+. Thanks for review [~binlijin]. Confirmed this works out on loaded cluster. > Add running of ReplicationBarrierCleaner to hbck2 fixMeta invocation > > > Key: HBASE-23307 > URL: https://issues.apache.org/jira/browse/HBASE-23307 > Project: HBase > Issue Type: Sub-task > Components: hbck2 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > Run the ReplicationBarrierCleaner chore when hbck2 invokes fixMeta. It will > clean up stale rep_barrier entries in hbase:meta which can help if trying to > do a restore of hbase:meta to good state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23328) info:regioninfo goes wrong when region replicas enabled
[ https://issues.apache.org/jira/browse/HBASE-23328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23328. --- Fix Version/s: 2.1.9 2.2.3 2.3.0 3.0.0 Hadoop Flags: Reviewed Resolution: Fixed Merged to branch-2.1+. Thanks for reviews [~gxcheng] and [~ramkrishna] > info:regioninfo goes wrong when region replicas enabled > --- > > Key: HBASE-23328 > URL: https://issues.apache.org/jira/browse/HBASE-23328 > Project: HBase > Issue Type: Bug > Components: read replicas >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0, 2.2.3, 2.1.9 > > > Noticed that the info:regioninfo content in hbase:meta can become that of a > serialized replica. I think it mostly harmless but accounting especially > debugging is frustrated because hbase:meta row name does not match the > info:regioninfo. > Here is an example: > {code} > t1,c6e977ef,1572669121340.0b455b2d57f91c153d5088533205c268. > column=info:regioninfo, timestamp=1574367093772, value={ENCODED => > 5199f7826c340ba944517e97c6ebaf04, NAME => > 't1,c6e977ef,1572669121340_0001.5199f7826c340ba944517e97c6ebaf04.', STARTKEY > => 'c6e977ef', ENDKEY => 'c72b0126', REPLICA_ID => 1} > {code} > Notice how hbase:meta row name is like that of the info:regioninfo content > only we are listing REPLICA_ID content and the encoded name is different (as > it factors replicaid). > The original Region Replica design describes how the info:regioninfo is > supposed to have the default HRI serialized only. See comment on HRI changes > in > https://issues.apache.org/jira/secure/attachment/12627276/hbase-10347_redo_v8.patch > -Going back over history, this may have been a bug since Region Replicas came > in.- <= No. Looking at an old cluster w/ region replicas, it doesn't have > this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23328) info:regioninfo goes wrong when region replicas enabled
Michael Stack created HBASE-23328: - Summary: info:regioninfo goes wrong when region replicas enabled Key: HBASE-23328 URL: https://issues.apache.org/jira/browse/HBASE-23328 Project: HBase Issue Type: Bug Components: read replicas Reporter: Michael Stack Assignee: Michael Stack Noticed that the info:regioninfo content in hbase:meta can become that of a serialized replica. I think it is mostly harmless, but accounting, and especially debugging, is frustrated because the hbase:meta row name does not match the info:regioninfo. Here is an example: {code} t1,c6e977ef,1572669121340.0b455b2d57f91c153d5088533205c268. column=info:regioninfo, timestamp=1574367093772, value={ENCODED => 5199f7826c340ba944517e97c6ebaf04, NAME => 't1,c6e977ef,1572669121340_0001.5199f7826c340ba944517e97c6ebaf04.', STARTKEY => 'c6e977ef', ENDKEY => 'c72b0126', REPLICA_ID => 1} {code} Notice how the hbase:meta row name is like the name in the info:regioninfo content, only the info:regioninfo lists a REPLICA_ID and its encoded name is different (as it factors in the replica id). The original Region Replica design describes how the info:regioninfo is supposed to have the default HRI serialized only. See comment on HRI changes in https://issues.apache.org/jira/secure/attachment/12627276/hbase-10347_redo_v8.patch Going back over history, this may have been a bug since Region Replicas came in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23321) [hbck2] fixHoles of fixMeta doesn't update in-memory state
[ https://issues.apache.org/jira/browse/HBASE-23321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23321. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Release Note: If holes in hbase:meta, hbck2 fixMeta now will update Master in-memory state so you do not need to restart master just so you can assign the new hole-bridging regions. Resolution: Fixed Merged to branch-2.2+ > [hbck2] fixHoles of fixMeta doesn't update in-memory state > -- > > Key: HBASE-23321 > URL: https://issues.apache.org/jira/browse/HBASE-23321 > Project: HBase > Issue Type: Improvement > Components: hbck2 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Minor > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > If hbase:meta has holes, you can run fixMeta from hbck2. This will close the > holes but you have to restart the Master for it to notice the new region > additions. Also, we were plugging holes by adding regions but no state for > the region which makes it awkward to subsequently assign. Fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
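For readers unfamiliar with what a "hole" in hbase:meta means in HBASE-23321, here is a toy sketch of the idea: regions sorted by start key should tile the key space, each end key equal to the next start key. This is an illustration over plain tuples, not how fixMeta is implemented; `find_holes` and the `(start_key, end_key)` representation are assumptions, with the empty string standing for the open start or end of the table.

```python
from typing import List, Tuple

# (start_key, end_key); "" marks the open start or end of the table.
Region = Tuple[str, str]


def find_holes(regions: List[Region]) -> List[Region]:
    """Return the key ranges not covered by any region."""
    holes: List[Region] = []
    covered_to = ""        # the table begins at the empty key
    reached_end = False
    for start, end in sorted(regions):
        if reached_end:
            break
        if start > covered_to:
            # Nothing covers [covered_to, start): that is a hole.
            holes.append((covered_to, start))
        if end == "":
            reached_end = True     # this region runs to the end of the table
        elif end > covered_to:
            covered_to = end
    if not reached_end:
        # No region runs to the end of the table.
        holes.append((covered_to, ""))
    return holes


regions = [("", "b"), ("b", "d"), ("f", "")]
print(find_holes(regions))
```

Plugging a hole then amounts to adding a region that spans the missing range, which per the issue should also come with region state and an update to Master in-memory state so the new region can be assigned without a Master restart.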
[jira] [Created] (HBASE-23322) [hbck2] Simplification on HBCKSCP scheduling
Michael Stack created HBASE-23322: - Summary: [hbck2] Simplification on HBCKSCP scheduling Key: HBASE-23322 URL: https://issues.apache.org/jira/browse/HBASE-23322 Project: HBase Issue Type: Sub-task Components: hbck2 Reporter: Michael Stack Assignee: Michael Stack I can make the scheduling of HBCKSCP simpler. I can also fix a bug in the parent issue that I noticed after exercising it a bunch on a cluster. The bug is that 'Unknown Servers' seem to be retained in the Map of reporting servers. They are usually cleared just before an SCP is scheduled, but scheduling HBCKSCP doesn't go the usual route. The patch here forces HBCKSCP via the usual SCP route; only at scheduling time does context dictate whether to run SCP or the scouring HBCKSCP. Let me put up a patch and will test in the meantime. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23308) Review of NullPointerExceptions
[ https://issues.apache.org/jira/browse/HBASE-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23308. --- Hadoop Flags: Reviewed Resolution: Fixed Merged to branch-2 and master branch. Thanks for the patch [~belugabehr] > Review of NullPointerExceptions > --- > > Key: HBASE-23308 > URL: https://issues.apache.org/jira/browse/HBASE-23308 > Project: HBase > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23321) [hbck2] fixHoles of fixMeta doesn't update in-memory state
Michael Stack created HBASE-23321: - Summary: [hbck2] fixHoles of fixMeta doesn't update in-memory state Key: HBASE-23321 URL: https://issues.apache.org/jira/browse/HBASE-23321 Project: HBase Issue Type: Improvement Components: hbck2 Reporter: Michael Stack Assignee: Michael Stack If hbase:meta has holes, you can run fixMeta from hbck2. This will close the holes but you have to restart the Master for it to notice the new region additions. Also, we were plugging holes by adding regions but no state for the region which makes it awkward to subsequently assign. Fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23315) Miscellaneous HBCK Report page cleanup
[ https://issues.apache.org/jira/browse/HBASE-23315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23315. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Assignee: Michael Stack Resolution: Fixed Merged to branch-2.2+. > Miscellaneous HBCK Report page cleanup > -- > > Key: HBASE-23315 > URL: https://issues.apache.org/jira/browse/HBASE-23315 > Project: HBase > Issue Type: Improvement >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Minor > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > A bunch of touch-ups on the hbck report page: > * Add a bit of javadoc around SerialReplicationChecker. > * Minuscule edit to the profiler jsp page and then a bit of doc on how to > make it work that might help. > * Add some detail if NPE getting BitSetNode to help w/ debug. > * Change HbckChore to log region names instead of encoded names; helps doing > diagnostics; can take region name and query in shell to find out all about > the region according to hbase:meta. > * Add some fix-it help inline in the HBCK Report page -- how to fix. > * Add counts in procedures page so can see if making progress; move listing > of WALs to end of the page. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23315) Miscellaneous HBCK Report page cleanup
Michael Stack created HBASE-23315: - Summary: Miscellaneous HBCK Report page cleanup Key: HBASE-23315 URL: https://issues.apache.org/jira/browse/HBASE-23315 Project: HBase Issue Type: Improvement Reporter: Michael Stack A bunch of touch-ups on the hbck report page: * Add a bit of javadoc around SerialReplicationChecker. * Minuscule edit to the profiler jsp page and then a bit of doc on how to make it work that might help. * Add some detail if NPE getting BitSetNode to help w/ debug. * Change HbckChore to log region names instead of encoded names; helps doing diagnostics; can take region name and query in shell to find out all about the region according to hbase:meta. * Add some fix-it help inline in the HBCK Report page -- how to fix. * Add counts in procedures page so can see if making progress; move listing of WALs to end of the page. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'
[ https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Stack resolved HBASE-23282. --- Fix Version/s: 2.2.3 2.3.0 3.0.0 Release Note: hbck2 scheduleRecoveries will now run a SCP that also looks in hbase:meta for any references to the scheduled server -- not just consult Master in-memory state -- just in case vestiges of the server are leftover in hbase:meta Assignee: Michael Stack Resolution: Fixed > HBCKServerCrashProcedure for 'Unknown Servers' > -- > > Key: HBASE-23282 > URL: https://issues.apache.org/jira/browse/HBASE-23282 > Project: HBase > Issue Type: Bug > Components: hbck2, proc-v2 >Affects Versions: 2.2.2 >Reporter: Michael Stack >Assignee: Michael Stack >Priority: Major > Fix For: 3.0.0, 2.3.0, 2.2.3 > > > With an overdriving, sustained load, I can fairly easily manufacture an > hbase:meta table that references servers that are no longer in the live list > nor are members of deadservers; i.e. 'Unknown Servers'. The new 'HBCK > Report' UI in Master has a section where it lists 'Unknown Servers' if any in > hbase:meta. > Once in this state, the repair is awkward. Our assign/unassign Procedure is > particularly dogged about insisting that we confirm close/open of Regions > when it is going about its business, which is well and good if the server is in > the live/dead sets, but when it is an 'Unknown Server', we invariably end up trying to > confirm against a no-longer-present server (More on this in follow-on > issues). > What is wanted is queuing of a ServerCrashProcedure for each 'Unknown > Server'. It would split any WALs (there shouldn't be any if server was > restarted) and ideally it would cancel out any assigns and reassign regions > off the 'Unknown Server'. But the 'normal' SCP consults the in-memory > cluster state figuring what Regions were on the crashed server... And > 'Unknown Servers' don't have state in the Master's in-memory Maps of Servers to > Regions or in the DeadServers list, which works fine for the usual case. > Suggestion here is that hbck2 be able to drive in a special SCP, one which > would get list of Regions by scanning hbase:meta rather than asking Master > memory; an HBCKSCP. -- This message was sent by Atlassian Jira (v8.3.4#803005)
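As a rough illustration of the 'Unknown Servers' definition in HBASE-23282 and of why an HBCKSCP has to scan hbase:meta, here is a toy model using plain Python collections. Nothing here is HBase API: the dict standing in for hbase:meta, the server names, and both function names are hypothetical.

```python
from typing import Dict, List, Set


def find_unknown_servers(meta: Dict[str, str],
                         live: Set[str],
                         dead: Set[str]) -> Set[str]:
    """Servers referenced by hbase:meta that are neither live nor dead."""
    referenced = set(meta.values())
    return referenced - live - dead


def regions_on_server(meta: Dict[str, str], server: str) -> List[str]:
    """What an HBCKSCP-style scan would collect: the regions that
    hbase:meta says are hosted on the given server."""
    return sorted(region for region, host in meta.items() if host == server)


# Toy stand-in for hbase:meta: region name -> hosting server (made-up names).
meta = {
    "t1,,1": "rs1.example.org,16020,1574000000000",
    "t1,b,1": "rs2.example.org,16020,1574000000001",
    "t1,d,1": "rs3.example.org,16020,1573000000000",
}
live = {"rs1.example.org,16020,1574000000000"}
dead = {"rs2.example.org,16020,1574000000001"}

for server in sorted(find_unknown_servers(meta, live, dead)):
    # A normal SCP would ask Master memory and find nothing for this server.
    print(server, "->", regions_on_server(meta, server))
```

The point of the sketch is only that the region list for an unknown server is recoverable from hbase:meta even when the Master's in-memory maps hold no entry for it.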
[jira] [Created] (HBASE-23313) [hbck2] setRegionState should update Master in-memory state too
Michael Stack created HBASE-23313: - Summary: [hbck2] setRegionState should update Master in-memory state too Key: HBASE-23313 URL: https://issues.apache.org/jira/browse/HBASE-23313 Project: HBase Issue Type: Bug Reporter: Michael Stack setRegionState changes the hbase:meta table info:state column. It does not alter the Master's in-memory state. This means you have to kill the Master and have another assume the Active Master role for a state change to be noticed. Better if setRegionState just went via the Master and updated both the Master and hbase:meta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HBASE-23307) Add running of ReplicationBarrierCleaner to hbck2 fixMeta invocation
Michael Stack created HBASE-23307: - Summary: Add running of ReplicationBarrierCleaner to hbck2 fixMeta invocation Key: HBASE-23307 URL: https://issues.apache.org/jira/browse/HBASE-23307 Project: HBase Issue Type: Sub-task Components: hbck2 Reporter: Michael Stack Assignee: Michael Stack Run the ReplicationBarrierCleaner chore when hbck2 invokes fixMeta. It will clean up stale rep_barrier entries in hbase:meta which can help if trying to do a restore of hbase:meta to good state. -- This message was sent by Atlassian Jira (v8.3.4#803005)