Re: [DISCUSSION] Upgrading core dependencies
Minor point, but I maintain we don't want to make coprocessors like OSGi or built on OSGi. I think we still want to scope them as extension mixins, not an inner platform. We see the limitations (limited API compatibility guarantees for internals, by definition) over on Phoenix, but it's the right trade-off for HBase in my opinion. We can still help implementors by refactoring to stable supported interfaces as motivated on a case by case basis, like what we did with HRegion -> Region. Let's get rid of all Guava types in any public or LP API.

> On Feb 7, 2017, at 9:31 PM, Stack wrote:
>
> Thanks Nick and Duo.
>
> See below.
>
>> On Tue, Feb 7, 2017 at 6:50 PM, Nick Dimiduk wrote:
>>
>> For the client: I'm a fan of shaded client modules by default and
>> minimizing the exposure of that surface area of 3rd party libs (none, if
>> possible). For example, Elasticsearch has a similar set of challenges; they
>> solve it by advocating users shade from step 1. It's addressed first thing
>> in the docs for their client libs. We could take it a step further by
>> making the shaded client the default client (o.a.hbase:hbase-client)
>> artifact and internally consume an hbase-client-unshaded. Turns the whole
>> thing on its head in a way that's better for the naive user.
>>
> I like this idea. Let me try it out. Our shaded thingies are not 'air
> tight' enough yet I suspect, but maybe we can fix this. Making it so clients
> don't have to include hbase-server too will be a little harder (will try
> flipping this too so it is always shaded by default).
>
>> For MR/Spark/etc connectors: We're probably stuck as-is until necessary
>> classes can be extracted from hbase-server. I haven't looked into this
>> lately, so I hesitate to give a prescription.
>>
> This was the last attempt, and the contributor did a good job at sizing the
> effort: HBASE-11843.
>
>> For coprocessors: They forfeit their right to 3rd party library dependency
>> stability by entering our process space.
>> Maybe in 3.0 or 4.0 we can rebuild
>> on Jigsaw or OSGi, but for today I think the best we should do is provide
>> relatively stable internal APIs. I also find it unlikely that we'd want to
>> spend loads of cycles optimizing for this use case. There's other, bigger
>> fish, IMHO.
>>
> Agree.
>
>> For size/compile time: I think these ultimately matter less than user
>> experience. Let's find a solution that sucks less for downstreamers and
>> work backward on reducing bloat.
>>
> I like how you put it.
>
>> On the point of leaning heavily on Guava: their pace is traditionally too
>> fast for us to expose in any public API. Maybe that's changing, in which
>> case we could reconsider for 3.0. Better to start using the new APIs
>> available in Java 8...
>>
> I like what Duo says here, that we just not expose these libs in our API.
>
> Yeah, we can do the JDK 8 new APIs, but Guava is something else (there is some
> small overlap in functional idioms -- we can favor JDK 8 here -- but Guava
> has a bunch more it'd be good to make use of).
>
> Anyways, I was using Guava as illustration of a larger issue.
>
> Thanks again for the input you two,
> S
>
>> Thanks for taking this up, Stack.
>> -n
>>
>>> On Tue, Feb 7, 2017 at 12:22 PM Stack wrote:
>>>
>>> Here's an old thorny issue that won't go away. I'd like to hear what folks
>>> are thinking these days.
>>>
>>> My immediate need is that I want to upgrade Guava [1]. I want to move us to
>>> guava 21.0, the latest release [2]. We currently depend on guava 12.0.
>>> Hadoop's guava -- 11.0 -- is also on our CLASSPATH (three times). We could
>>> just do it in an hbase-2.0.0, a major version release, but then
>>> downstreamers and coprocessors that may have been a little lazy and that
>>> have transitively come to depend on our versions of libs will break [3].
>>> Then there is the murky area around the running of YARN/MR/Spark jobs where
>>> the ordering of libs on the CLASSPATH gets interesting, where fat-jarring or
>>> command-line antics can get you over (most) problems if you persevere.
>>>
>>> Multiply the above by netty, jackson, and a few other favorites.
>>>
>>> Our proffered solution to the above is the shaded hbase artifact project;
>>> have applications and tasks refer to the shaded hbase client instead.
>>> Because we've not done the work to narrow the surface area we expose to
>>> downstreamers, most consumers of our API -- certainly in a Spark/MR context,
>>> since our MR utility is buried in the hbase-server module still -- need both
>>> the shaded hbase client and server on their CLASSPATH (i.e. near all of
>>> hbase).
>>>
>>> Leaving aside for the moment that our shaded client and server need
>>> untangling, getting folks up on the shaded artifacts takes effort
>>> evangelizing. We also need to be doing work to make sure our shading
>>> doesn't leak dependencies, that
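The shading discussed in this thread works by relocating third-party packages under a project-private prefix at build time, so a downstream application's own Guava or Netty can never collide with HBase's copy. A minimal stdlib-only Java sketch of what relocation does to class names; the prefix shown is illustrative of the convention used by the hbase-shaded modules, not copied from their actual build configuration:

```java
public class RelocationDemo {
    // Illustrative shaded prefix; the real value lives in the shade
    // plugin configuration of the hbase-shaded-* modules.
    static final String PREFIX = "org.apache.hadoop.hbase.shaded.";

    // Shade-style relocation: third-party class names are rewritten under
    // the private prefix, while the project's own classes stay put.
    static String relocate(String className) {
        if (className.startsWith("org.apache.hadoop.hbase.")) {
            return className;
        }
        return PREFIX + className;
    }

    public static void main(String[] args) {
        // Third-party class gets relocated out of the way of the app's copy.
        System.out.println(relocate("com.google.common.base.Preconditions"));
        // HBase public API class is unchanged.
        System.out.println(relocate("org.apache.hadoop.hbase.client.Get"));
    }
}
```

The point of the thread's "make shaded the default" proposal is that a naive user linking `o.a.hbase:hbase-client` would get these relocated internals automatically, without fat-jar or classpath-ordering antics.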
[jira] [Created] (HBASE-17637) Update progress more frequently in IntegrationTestBigLinkedList.Generator.persist
Andrew Purtell created HBASE-17637:
--
Summary: Update progress more frequently in IntegrationTestBigLinkedList.Generator.persist
Key: HBASE-17637
URL: https://issues.apache.org/jira/browse/HBASE-17637
Project: HBase
Issue Type: Improvement
Reporter: Andrew Purtell
Priority: Minor

In underpowered or loaded environments (like a Docker based virtual cluster), the MR framework may time out a lagging task because chaos has made everything very slow. A simple adjustment to progress reporting can avoid false failures.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
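The adjustment described here boils down to invoking the framework's progress callback often enough that a slow but live task is never mistaken for a hung one. A stdlib-only sketch of the pattern; the `ProgressReporter` interface and `REPORT_EVERY` constant are hypothetical stand-ins for Hadoop's `TaskAttemptContext#progress()` and whatever interval the actual patch chose:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PeriodicProgress {
    // Hypothetical stand-in for Hadoop's TaskAttemptContext#progress().
    interface ProgressReporter {
        void progress();
    }

    // Report every N records so the framework never sees a silent task;
    // the value is illustrative, not from the patch.
    static final int REPORT_EVERY = 100;

    // Returns how many progress reports were made, for inspection.
    static int persist(int totalRecords, ProgressReporter reporter) {
        int reports = 0;
        for (int i = 0; i < totalRecords; i++) {
            // ... write the record (elided) ...
            if ((i + 1) % REPORT_EVERY == 0) {
                reporter.progress();
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        persist(1000, calls::incrementAndGet);
        System.out.println(calls.get()); // one report per 100 records
    }
}
```

The key property is that the report interval is tied to work done, not wall time, so a slow environment simply reports at a slower absolute rate while still beating the task timeout.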
Successful: HBase Generate Website
Build status: Successful

If successful, the website and docs have been generated. To update the live site, follow the instructions below. If failed, skip to the bottom of this email.

Use the following commands to download the patch and apply it to a clean branch based on origin/asf-site. If you prefer to keep the hbase-site repo around permanently, you can skip the clone step.

git clone https://git-wip-us.apache.org/repos/asf/hbase-site.git
cd hbase-site
wget -O- https://builds.apache.org/job/hbase_generate_website/486/artifact/website.patch.zip | funzip > a05abd83effd8e4c0abff7acf1b5f33f8609295f.patch
git fetch
git checkout -b asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f origin/asf-site
git am --whitespace=fix a05abd83effd8e4c0abff7acf1b5f33f8609295f.patch

At this point, you can preview the changes by opening index.html or any of the other HTML pages in your local asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f branch.

There are lots of spurious changes, such as timestamps and CSS styles in tables, so a generic git diff is not very useful. To see a list of files that have been added, deleted, renamed, changed type, or are otherwise interesting, use the following command:

git diff --name-status --diff-filter=ADCRTXUB origin/asf-site

To see only files that had 100 or more lines changed:

git diff --stat origin/asf-site | grep -E '[1-9][0-9]{2,}'

When you are satisfied, publish your changes to origin/asf-site using these commands:

git commit --allow-empty -m "Empty commit" # to work around a current ASF INFRA bug
git push origin asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f:asf-site
git checkout asf-site
git branch -D asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f

Changes take a couple of minutes to be propagated. You can verify whether they have been propagated by looking at the Last Published date at the bottom of http://hbase.apache.org/. It should match the date in the index.html on the asf-site branch in Git.
As a courtesy, reply-all to this email to let other committers know you pushed the site.

If failed, see https://builds.apache.org/job/hbase_generate_website/486/console
[jira] [Created] (HBASE-17636) Fix speling [sic] error in enable replication script output
Lars George created HBASE-17636:
--
Summary: Fix speling [sic] error in enable replication script output
Key: HBASE-17636
URL: https://issues.apache.org/jira/browse/HBASE-17636
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 1.3.1
Reporter: Lars George

When enabling the replication for a table:

{noformat}
hbase(main):012:0> enable_table_replication 'repltest'
0 row(s) in 7.6080 seconds
The replication swith of table 'repltest' successfully enabled
{noformat}

See {{swith}} as opposed to {{switch}}. Also, that sentence is somewhat too complicated. Maybe better is {{Replication for table successfully enabled.}}?
[jira] [Created] (HBASE-17635) enable_table_replication script cannot handle replication scope
Lars George created HBASE-17635:
--
Summary: enable_table_replication script cannot handle replication scope
Key: HBASE-17635
URL: https://issues.apache.org/jira/browse/HBASE-17635
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 1.3.1
Reporter: Lars George

When you add a peer, then enable a table for replication using {{enable_table_replication}}, the script will create the table on the peer cluster, but with one difference:

_Master Cluster_:
{noformat}
hbase(main):027:0> describe 'testtable'
Table testtable is ENABLED
testtable
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '1', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.0700 seconds
{noformat}

_Peer Cluster_:
{noformat}
hbase(main):003:0> describe 'testtable'
Table testtable is ENABLED
testtable
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1260 seconds
{noformat}

Note that the replication scope is different. Removing the peer, adding it again, and enabling the table gives this now:

{noformat}
hbase(main):026:0> enable_table_replication 'testtable'
ERROR: Table testtable exists in peer cluster 1, but the table descriptors are not same when compared with source cluster. Thus can not enable the table's replication switch.
{noformat}

That is dumb, as it was the same script that enabled the replication scope in the first place. It should skip that particular attribute when comparing the cluster schemas.
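The fix the report suggests amounts to comparing the two table descriptors while ignoring the replication scope attribute, since the script itself is the reason that attribute differs. A stdlib-only sketch of the idea, modelling a column-family descriptor as a plain `Map`; the real code would compare `HTableDescriptor`/`HColumnDescriptor` objects, so the names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class DescriptorCompare {
    /**
     * Compare two column-family descriptors while skipping the attribute
     * that is expected to differ: REPLICATION_SCOPE is '1' on the source
     * table but '0' on the table the script created on the peer.
     */
    static boolean sameIgnoringScope(Map<String, String> source, Map<String, String> peer) {
        Map<String, String> a = new HashMap<>(source);
        Map<String, String> b = new HashMap<>(peer);
        a.remove("REPLICATION_SCOPE");
        b.remove("REPLICATION_SCOPE");
        return a.equals(b);
    }

    public static void main(String[] args) {
        Map<String, String> master = new HashMap<>();
        master.put("NAME", "cf1");
        master.put("REPLICATION_SCOPE", "1");
        master.put("VERSIONS", "1");

        Map<String, String> peer = new HashMap<>(master);
        peer.put("REPLICATION_SCOPE", "0");

        System.out.println(sameIgnoringScope(master, peer)); // scope-only diff: same
        peer.put("VERSIONS", "3");
        System.out.println(sameIgnoringScope(master, peer)); // real diff: not same
    }
}
```

With a comparison like this, the "table descriptors are not same" error would only fire for differences the operator actually needs to reconcile.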
[jira] [Created] (HBASE-17634) Clean up the usage of Result.isPartial
Duo Zhang created HBASE-17634:
--
Summary: Clean up the usage of Result.isPartial
Key: HBASE-17634
URL: https://issues.apache.org/jira/browse/HBASE-17634
Project: HBase
Issue Type: Improvement
Affects Versions: 2.0.0, 1.4.0
Reporter: Duo Zhang
Fix For: 2.0.0, 1.4.0

We have marked Result.isPartial as deprecated in HBASE-17599. This issue aims to remove the isPartial usage in our code base.
[jira] [Created] (HBASE-17633) Update unflushed sequence id in SequenceIdAccounting after flush with the minimum sequence id in memstore
Duo Zhang created HBASE-17633:
--
Summary: Update unflushed sequence id in SequenceIdAccounting after flush with the minimum sequence id in memstore
Key: HBASE-17633
URL: https://issues.apache.org/jira/browse/HBASE-17633
Project: HBase
Issue Type: Improvement
Reporter: Duo Zhang

Now the tracking work is done by SequenceIdAccounting, and it is a little tricky when dealing with flush. We should remove the mapping for the given stores of a region from lowestUnflushedSequenceIds, so that we have space to store the new lowest unflushed sequence id after flush. But we still need to keep the old sequence ids in another map, as we still need these values when reporting to master to prevent data loss (think of the scenario where we report the new lowest unflushed sequence id to master and then crash before actually flushing the data to disk).

And when reviewing HBASE-17407, I found that for CompactingMemStore, we have to record the minimum sequence id in memstore. We could just update the mappings in SequenceIdAccounting after flush. This means we do not need to update the lowest unflushed sequence id in SequenceIdAccounting, do not need to make space for the new lowest unflushed id in startCacheFlush, and do not need the extra map to store the old mappings. This could simplify our logic a lot.

But this is a fundamental change, so I need some time to implement it, especially for modifying tests... And I also need some time to check whether I missed something.
[jira] [Created] (HBASE-17632) Modify example health script to work on CentOS 6 etc.
Lars George created HBASE-17632:
--
Summary: Modify example health script to work on CentOS 6 etc.
Key: HBASE-17632
URL: https://issues.apache.org/jira/browse/HBASE-17632
Project: HBase
Issue Type: Bug
Components: master, regionserver
Reporter: Lars George
[jira] [Created] (HBASE-17631) Canary interval too low
Lars George created HBASE-17631:
--
Summary: Canary interval too low
Key: HBASE-17631
URL: https://issues.apache.org/jira/browse/HBASE-17631
Project: HBase
Issue Type: Bug
Components: canary
Affects Versions: 1.3.1
Reporter: Lars George

The interval currently is {{6000}} milliseconds, or six seconds, which makes little sense to test that often in succession. We should set the default to at least 60 seconds, or even every 5 minutes?
[jira] [Created] (HBASE-17630) Health Script not shutting down server process with certain script behavior
Lars George created HBASE-17630:
--
Summary: Health Script not shutting down server process with certain script behavior
Key: HBASE-17630
URL: https://issues.apache.org/jira/browse/HBASE-17630
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 1.3.1
Reporter: Lars George

As discussed on dev@... I tried the supplied {{healthcheck.sh}}, but did not have {{snmpd}} running. That caused the script to take a long time to error out, which exceeded the 10 seconds the check was meant to run in. That resets the check, and it keeps reporting the error but never stops the servers:

{noformat}
2017-02-04 05:55:08,962 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthCheckChore: Health Check Chore runs every 10sec
2017-02-04 05:55:08,975 INFO [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020] hbase.HealthChecker: HealthChecker initialized with script at /opt/hbase/bin/healthcheck.sh, timeout=6
...
2017-02-04 05:55:50,435 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:55:50,436 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-02-04 05:55:50,437 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore missed its start time
2017-02-04 05:55:50,438 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:20,522 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:56:20,523 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:56:50,600 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:57:20,681 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:50,763 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:57:50,764 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec : ERROR check link, OK: disks ok,
2017-02-04 05:58:20,844 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec : ERROR check link, OK: disks ok,
2017-02-04 05:58:50,923 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1] hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:59:21,017 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec : ERROR check link, OK: disks ok,
2017-02-04 05:59:21,018 INFO [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2] hbase.ScheduledChore: Chore: HealthChecker missed its start time
{noformat}

We need to fix the handling of the timeout of the health check script, and how the chore treats that when deciding to shut down the server process. The current settings of check frequency and timeout overlap and cause the above.
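The overlap described here can be shown with plain numbers: if the failure count only accumulates failures that land inside a fixed window, a script that takes longer than the chore period to fail pushes successive failures apart until the window keeps expiring and the count resets. A stdlib-only simulation of that effect; the threshold/window model is a simplification for illustration, not HBase's actual health-check configuration or code:

```java
public class HealthWindowSim {
    /**
     * Simulate a health chore that should stop the server after
     * `threshold` consecutive failures inside `windowMillis`.
     * Returns true if the server would be stopped within `runs` checks.
     */
    static boolean wouldStop(long chorePeriodMillis, long scriptRuntimeMillis,
                             int threshold, long windowMillis, int runs) {
        int failures = 0;
        long firstFailureTime = -1;
        long now = 0;
        for (int i = 0; i < runs; i++) {
            // The next chore run cannot start until the script has returned.
            now += Math.max(chorePeriodMillis, scriptRuntimeMillis);
            if (firstFailureTime < 0) {
                firstFailureTime = now;
            }
            // Failures older than the window are discarded: the count resets.
            if (now - firstFailureTime > windowMillis) {
                failures = 0;
                firstFailureTime = now;
            }
            failures++;
            if (failures >= threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Script fails fast: three failures land inside the 30s window, server stops.
        System.out.println(wouldStop(10_000, 1_000, 3, 30_000, 10));
        // Script takes ~40s to time out: the window expires between runs, server never stops.
        System.out.println(wouldStop(10_000, 40_000, 3, 30_000, 10));
    }
}
```

This is exactly the reporter's observation: once the snmpwalk was fixed to fail quickly, the failures clustered inside the window and the server stopped as expected.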
Re: Health Script does not stop region server
Will do, thanks

On Sun, Feb 5, 2017 at 3:57 PM, Ted Yu wrote:
> Yesterday I tried snmpwalk on CentOS as well - same behavior.
>
> Lars:
> Can you file a JIRA to fix the bug?
>
> Thanks
>
> On Sun, Feb 5, 2017 at 2:22 AM, Lars George wrote:
>
>> Hi Ted,
>>
>> This does not work on Mac as provided. I tried on a CentOS 6 machine,
>> and had to install net-snmp and net-snmp-utils, plus start the snmpd
>> to make it time out quicker. But even there the snmpwalk returns
>> nothing, making the script fail.
>>
>> Anyhow, the snmpwalk failing after the retries is just an example of
>> what can happen if the health check script takes too long to fail. The
>> bottom line is that it does _not_ stop the server as expected, as our
>> check in the code is reset because of the chore's delay. That is a bug,
>> methinks.
>>
>> Or, in other words, when I fixed the snmpwalk to come back quickly as
>> explained above, the error was caught in time and the server stopped
>> as expected.
>>
>> Makes sense?
>>
>> Lars
>>
>> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu wrote:
>> > Running the command from the script locally (on Mac):
>> >
>> > $ /usr/bin/snmpwalk -t 5 -Oe -Oq -Os -v 1 -c public localhost if
>> > Timeout: No Response from localhost
>> > $ echo $?
>> > 1
>> >
>> > Looks like the script should parse the output from snmpwalk and provide
>> > some hint if an unexpected result is reported.
>> >
>> > Cheers
>> >
>> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George wrote:
>> >
>> >> Hi,
>> >>
>> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> >> That caused the script to take a long time to error out, which exceeded
>> >> the 10 seconds the check was meant to run.
>> >> That resets the check and
>> >> it keeps reporting the error, but never stops the servers:
>> >>
>> >> 2017-02-04 05:55:08,962 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> >> 2017-02-04 05:55:08,975 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthChecker: HealthChecker initialized with script at
>> >> /opt/hbase/bin/healthcheck.sh, timeout=6
>> >>
>> >> ...
>> >>
>> >> 2017-02-04 05:55:50,435 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:55:50,436 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> >> 2017-02-04 05:55:50,437 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore:
>> >> slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
>> >> missed its start time
>> >> 2017-02-04 05:55:50,438 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:20,522 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:20,523 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:50,763 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:50,764 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:20,844 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
>> >> ERROR check
Re: Canary Test Tool and write sniffing
Please keep in mind we are talking about two issues here: 1) the short default interval time, and 2) the issue that the canary table regions might not be on all servers.

Anyone here that tried write sniffing on a current cluster with the SLB and saw it work?

Best,
Lars

On Mon, Feb 6, 2017 at 10:38 PM, Enis Söztutar wrote:
> Open an issue?
> Enis
>
> On Mon, Feb 6, 2017 at 9:39 AM, Stack wrote:
>
>> On Sun, Feb 5, 2017 at 2:25 AM, Lars George wrote:
>>
>> > The next example is wrong too, claiming to show 60 secs, while it
>> > shows 600 secs (the default value as well).
>> >
>> > The question is still, what is a good value for intervals? Anyone here
>> > that uses the Canary that would like to chime in?
>> >
>> I was hanging out with a user where, on a mid-sized cluster with Canary
>> running with defaults, the regionserver carrying meta was at 100% CPU because
>> of all the requests from Canary doing repeated full-table Scans.
>>
>> 6 seconds is too short. Seems like a typo that should be 60 seconds. It is
>> not as though the Canary is going to do anything about it if it finds
>> something wrong.
>>
>> S
>>
>> > On Sat, Feb 4, 2017 at 5:40 PM, Ted Yu wrote:
>> > > Brief search on HBASE-4393 didn't reveal why the interval was shortened.
>> > >
>> > > If you read the first paragraph of:
>> > > http://hbase.apache.org/book.html#_run_canary_test_as_daemon_mode
>> > >
>> > > possibly the reasoning was that the canary would exit upon seeing some error
>> > > (the first time).
>> > >
>> > > BTW There was a mismatch in the description for this command: (5 seconds
>> > > vs. 5 milliseconds)
>> > >
>> > > ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
>> > >
>> > > On Sat, Feb 4, 2017 at 8:21 AM, Lars George wrote:
>> > >
>> > >> Oh right, Ted. An earlier patch attached to the JIRA had 60 secs, the
>> > >> last one has 6 secs. Am I reading this right? It hands 6000 into the
>> > >> Thread.sleep() call, which takes millisecs.
>> > >> So that makes 6 secs
>> > >> between checks, which seems super short, no? I might just be dull here.
>> > >>
>> > >> On Sat, Feb 4, 2017 at 5:00 PM, Ted Yu wrote:
>> > >> > For the default interval, if you were looking at:
>> > >> >
>> > >> > private static final long DEFAULT_INTERVAL = 6000;
>> > >> >
>> > >> > The above was from:
>> > >> >
>> > >> > HBASE-4393 Implement a canary monitoring program
>> > >> >
>> > >> > which was integrated on Tue Apr 24 07:20:16 2012
>> > >> >
>> > >> > FYI
>> > >> >
>> > >> > On Sat, Feb 4, 2017 at 4:06 AM, Lars George wrote:
>> > >> >
>> > >> >> Also, the default interval used to be 60 secs, but is now 6 secs. Does
>> > >> >> that make sense? Seems awfully short for a default, assuming you have
>> > >> >> many regions or servers.
>> > >> >>
>> > >> >> On Sat, Feb 4, 2017 at 11:54 AM, Lars George <lars.geo...@gmail.com> wrote:
>> > >> >> > Hi,
>> > >> >> >
>> > >> >> > Looking at the Canary tool, it tries to ensure that all canary test
>> > >> >> > table regions are spread across all region servers. If that is not the
>> > >> >> > case, it calls:
>> > >> >> >
>> > >> >> > if (numberOfCoveredServers < numberOfServers) {
>> > >> >> >   admin.balancer();
>> > >> >> > }
>> > >> >> >
>> > >> >> > I doubt this will help with the StochasticLoadBalancer, which is known
>> > >> >> > to consider per-table balancing as one of many factors. In practice,
>> > >> >> > the SLB will most likely _not_ distribute the canary regions
>> > >> >> > sufficiently, leaving a gap in the check. Switching on the per-table
>> > >> >> > option is discouraged, to let it do its thing.
>> > >> >> >
>> > >> >> > Just pointing it out for vetting.
>> > >> >> >
>> > >> >> > Lars
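The coverage check quoted in the thread boils down to counting the distinct servers that host at least one canary region and comparing that against the live server count. A stdlib-only sketch of that computation; the region-to-server map stands in for what the tool reads from the cluster, and the names are illustrative rather than the Canary's actual fields:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CanaryCoverage {
    /**
     * Count how many distinct servers host at least one canary-table region.
     * If this is below the total server count, write sniffing has blind
     * spots, and as the thread notes, calling admin.balancer() may not
     * close the gap under the StochasticLoadBalancer.
     */
    static int coveredServers(Map<String, String> regionToServer) {
        Set<String> servers = new HashSet<>(regionToServer.values());
        return servers.size();
    }

    public static void main(String[] args) {
        // Three canary regions, but two land on the same server.
        Map<String, String> assignment = Map.of(
            "region-a", "rs1",
            "region-b", "rs1",
            "region-c", "rs2");
        int numberOfServers = 3;
        int numberOfCoveredServers = coveredServers(assignment);
        // Coverage gap: rs3 hosts no canary region, so it is never sniffed.
        System.out.println(numberOfCoveredServers < numberOfServers);
    }
}
```

This makes the thread's concern concrete: the condition detects the gap correctly, but the remedy (a generic balancer run) is not guaranteed to move canary regions onto the uncovered servers.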