Re: [DISCUSSION] Upgrading core dependencies

2017-02-11 Thread Andrew Purtell
Minor point, but I maintain we don't want to make coprocessors like OSGi or 
built on OSGi. I think we still want to scope them as extension mixins, not an 
inner platform. We see the limitations (limited API compatibility guarantees 
for internals, by definition) over on Phoenix, but it's the right trade-off for 
HBase in my opinion. We can still help implementors by refactoring to stable, 
supported interfaces as motivated on a case-by-case basis, like what we did 
with HRegion -> Region. 

Let's get rid of all Guava types in any public or LimitedPrivate (LP) API. 
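
A hypothetical illustration of the kind of change that implies (the interface
below is made up for the example and is not an actual HBase API): swap Guava
types at the API boundary for their JDK 8 equivalents and keep Guava
internal-only.

  import java.util.Optional;
  import java.util.concurrent.CompletableFuture;

  // Hypothetical interface, not an actual HBase API: the "before" version
  // returned com.google.common.base.Optional wrapped in a Guava
  // ListenableFuture; the JDK 8 equivalents keep Guava off the public surface.
  public interface ServerLookup {
    CompletableFuture<Optional<String>> findServer(String host);
  }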


> On Feb 7, 2017, at 9:31 PM, Stack  wrote:
> 
> Thanks Nick and Duo.
> 
> See below.
> 
>> On Tue, Feb 7, 2017 at 6:50 PM, Nick Dimiduk  wrote:
>> 
>> For the client: I'm a fan of shaded client modules by default and
>> minimizing the exposure of that surface area of 3rd party libs (none, if
>> possible). For example, Elasticsearch has a similar set of challenges; they
>> solve it by advocating users shade from step 1. It's addressed first thing
>> in the docs for their client libs. We could take it a step further by
>> making the shaded client the default client (o.a.hbase:hbase-client)
>> artifact and internally consume an hbase-client-unshaded. Turns the whole
>> thing on its head in a way that's better for the naive user.
>> 
>> 
> I like this idea. Let me try it out. Our shaded thingies are not 'air
> tight' enough yet, I suspect, but maybe we can fix this. Making it so clients
> don't have to include hbase-server too will be a little harder (will try
> flipping this too so it is always shaded by default).
> 
> 
>> For MR/Spark/etc connectors: We're probably stuck as it is until necessary
>> classes can be extracted from hbase-server. I haven't looked into this
>> lately, so I hesitate to give a prescription.
>> 
>> 
> This was the last attempt, and the contributor did a good job of sizing the
> effort: HBASE-11843.
> 
> 
>> For coprocessors: They forfeit their right to 3rd party library dependency
>> stability by entering our process space. Maybe in 3.0 or 4.0 we can rebuild
>> on jigsaw or OSGi, but for today I think the best we should do is provide
>> relatively stable internal APIs. I also find it unlikely that we'd want to
>> spend loads of cycles optimizing for this usecase. There's other, bigger
>> fish, IMHO.
>> 
>> 
> Agree.
> 
> 
>> For size/compile time: I think these ultimately matter less than user
>> experience. Let's find a solution that sucks less for downstreamers and
>> work backward on reducing bloat.
>> 
>> I like how you put it.
> 
> 
>> On the point of leaning heavily on Guava: their pace is traditionally too
>> fast for us to expose in any public API. Maybe that's changing, in which
>> case we could reconsider for 3.0. Better to start using the new APIs
>> available in Java 8...
>> 
>> 
> I like what Duo says here that we just not expose these libs in our API.
> 
> Yeah, we can do jdk8 new APIs but guava is something else (there is some
> small overlap in functional idioms -- we can favor jdk8 here -- but guava
> has a bunch more it'd be good to make use of).
> 
> Anyways, I was using Guava as illustration of a larger issue.
> 
> Thanks again for the input you two,
> S
> 
> 
> 
> 
> 
> 
>> Thanks for taking this up, Stack.
>> -n
>> 
>>> On Tue, Feb 7, 2017 at 12:22 PM Stack  wrote:
>>> 
>>> Here's an old thorny issue that won't go away. I'd like to hear what
>>> folks are thinking these days.
>>> 
>>> My immediate need is that I want to upgrade Guava [1]. I want to move us
>>> to Guava 21.0, the latest release [2]. We currently depend on Guava 12.0.
>>> Hadoop's Guava -- 11.0 -- is also on our CLASSPATH (three times). We could
>>> just do it in an hbase-2.0.0, a major version release, but then
>>> downstreamers and coprocessors that may have been a little lazy and that
>>> have transitively come to depend on our versions of libs will break [3].
>>> Then there is the murky area around the running of YARN/MR/Spark jobs
>>> where the ordering of libs on the CLASSPATH gets interesting; fat-jarring
>>> or command-line antics can get you over (most) problems if you persevere.
>>> 
>>> Multiply the above by netty, jackson, and a few other favorites.
>>> 
>>> Our proffered solution to the above is the shaded hbase artifact project;
>>> have applications and tasks refer to the shaded hbase client instead.
>>> Because we've not done the work to narrow the surface area we expose to
>>> downstreamers, most consumers of our API -- certainly in a Spark/MR
>>> context, since our MR utility is still buried in the hbase-server module --
>>> need both the shaded hbase client and server on their CLASSPATH (i.e. near
>>> all of hbase).
>>> 
>>> Leaving aside for the moment that our shaded client and server need
>>> untangling, getting folks up on the shaded artifacts takes effort
>>> evangelizing. We also need to be doing work to make sure our shading
>>> doesn't leak dependencies, that 

[jira] [Created] (HBASE-17637) Update progress more frequently in IntegrationTestBigLinkedList.Generator.persist

2017-02-11 Thread Andrew Purtell (JIRA)
Andrew Purtell created HBASE-17637:
--

 Summary: Update progress more frequently in 
IntegrationTestBigLinkedList.Generator.persist
 Key: HBASE-17637
 URL: https://issues.apache.org/jira/browse/HBASE-17637
 Project: HBase
  Issue Type: Improvement
Reporter: Andrew Purtell
Priority: Minor


In underpowered or loaded environments (like a Docker-based virtual cluster), 
the MR framework may time out a task that is lagging because chaos has made 
everything very slow. A simple adjustment to progress reporting can avoid false 
failures. 
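
A minimal sketch of the kind of adjustment meant here (illustrative only; the
class and method names below are made up and are not the actual
IntegrationTestBigLinkedList.Generator.persist code):

{noformat}
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative sketch: heartbeat the framework on every loop iteration so a
// slow or chaotic environment never goes mapreduce.task.timeout milliseconds
// without a progress update.
final class ProgressHelper {
  static void reportProgress(TaskAttemptContext context, long count, long reportEvery) {
    if (reportEvery > 0 && count % reportEvery == 0) {
      context.setStatus("persisted " + count + " nodes"); // shows in the task UI
    }
    context.progress(); // cheap; tells the framework the task is still alive
  }
}
{noformat}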





Successful: HBase Generate Website

2017-02-11 Thread Apache Jenkins Server
Build status: Successful

If successful, the website and docs have been generated. To update the live 
site, follow the instructions below. If failed, skip to the bottom of this 
email.

Use the following commands to download the patch and apply it to a clean branch 
based on origin/asf-site. If you prefer to keep the hbase-site repo around 
permanently, you can skip the clone step.

  git clone https://git-wip-us.apache.org/repos/asf/hbase-site.git

  cd hbase-site
  wget -O- https://builds.apache.org/job/hbase_generate_website/486/artifact/website.patch.zip | funzip > a05abd83effd8e4c0abff7acf1b5f33f8609295f.patch
  git fetch
  git checkout -b asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f origin/asf-site
  git am --whitespace=fix a05abd83effd8e4c0abff7acf1b5f33f8609295f.patch

At this point, you can preview the changes by opening index.html or any of the 
other HTML pages in your local 
asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f branch.

There are lots of spurious changes, such as timestamps and CSS styles in 
tables, so a generic git diff is not very useful. To see a list of files that 
have been added, deleted, renamed, changed type, or are otherwise interesting, 
use the following command:

  git diff --name-status --diff-filter=ADCRTXUB origin/asf-site

To see only files that had 100 or more lines changed:

  git diff --stat origin/asf-site | grep -E '[1-9][0-9]{2,}'

When you are satisfied, publish your changes to origin/asf-site using these 
commands:

  git commit --allow-empty -m "Empty commit" # to work around a current ASF INFRA bug
  git push origin asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f:asf-site
  git checkout asf-site
  git branch -D asf-site-a05abd83effd8e4c0abff7acf1b5f33f8609295f

Changes take a couple of minutes to be propagated. You can verify whether they 
have been propagated by looking at the Last Published date at the bottom of 
http://hbase.apache.org/. It should match the date in the index.html on the 
asf-site branch in Git.

As a courtesy, reply-all to this email to let other committers know you pushed 
the site.



If failed, see https://builds.apache.org/job/hbase_generate_website/486/console

[jira] [Created] (HBASE-17636) Fix speling [sic] error in enable replication script output

2017-02-11 Thread Lars George (JIRA)
Lars George created HBASE-17636:
---

 Summary: Fix speling [sic] error in enable replication script 
output
 Key: HBASE-17636
 URL: https://issues.apache.org/jira/browse/HBASE-17636
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 1.3.1
Reporter: Lars George


When enabling the replication for a table:

{noformat}
hbase(main):012:0> enable_table_replication 'repltest'
0 row(s) in 7.6080 seconds
The replication swith of table 'repltest' successfully enabled
{noformat}

See {{swith}} as opposed to {{switch}}. Also, that sentence is somewhat too 
complicated. Maybe better: {{Replication for table  successfully enabled.}}?






[jira] [Created] (HBASE-17635) enable_table_replication script cannot handle replication scope

2017-02-11 Thread Lars George (JIRA)
Lars George created HBASE-17635:
---

 Summary: enable_table_replication script cannot handle replication 
scope
 Key: HBASE-17635
 URL: https://issues.apache.org/jira/browse/HBASE-17635
 Project: HBase
  Issue Type: Bug
  Components: Replication
Affects Versions: 1.3.1
Reporter: Lars George


When you add a peer, then enable a table for replication using 
{{enable_table_replication}}, the script will create the table on the peer 
cluster, but with one difference:

_Master Cluster_:

{noformat}
hbase(main):027:0> describe 'testtable'
Table testtable is ENABLED
testtable
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '1', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.0700 seconds
{noformat}

_Peer Cluster_:
{noformat}
hbase(main):003:0> describe 'testtable'
Table testtable is ENABLED
testtable
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', TTL => 'FOREVER', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1260 seconds
{noformat}

Note that the replication scope is different. Removing the peer, adding it 
again, and enabling the table now gives this:

{noformat}
hbase(main):026:0> enable_table_replication 'testtable'

ERROR: Table testtable exists in peer cluster 1, but the table descriptors are 
not same when compared with source cluster. Thus can not enable the table's 
replication switch.
{noformat}

That is dumb, as it was the same script that enabled the replication scope in 
the first place. It should skip that particular attribute when comparing the 
cluster schemas.
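
A rough sketch of a comparison that tolerates the scope difference
(illustrative only, written against the 1.x HTableDescriptor/HColumnDescriptor
API; this is not the actual replication admin code):

{noformat}
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

// Illustrative sketch: compare source and peer descriptors with
// REPLICATION_SCOPE normalized away, since enable_table_replication itself
// is what flips the scope on the source side.
final class DescriptorCompare {
  static boolean sameIgnoringScope(HTableDescriptor source, HTableDescriptor peer) {
    HTableDescriptor a = new HTableDescriptor(source); // defensive copies
    HTableDescriptor b = new HTableDescriptor(peer);
    for (HColumnDescriptor cf : a.getColumnFamilies()) {
      cf.setScope(0); // normalize before comparing
    }
    for (HColumnDescriptor cf : b.getColumnFamilies()) {
      cf.setScope(0);
    }
    return a.equals(b);
  }
}
{noformat}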






[jira] [Created] (HBASE-17634) Clean up the usage of Result.isPartial

2017-02-11 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-17634:
-

 Summary: Clean up the usage of Result.isPartial
 Key: HBASE-17634
 URL: https://issues.apache.org/jira/browse/HBASE-17634
 Project: HBase
  Issue Type: Improvement
Affects Versions: 2.0.0, 1.4.0
Reporter: Duo Zhang
 Fix For: 2.0.0, 1.4.0


We have marked Result.isPartial as deprecated in HBASE-17599. This issue aims 
to remove the isPartial usage in our code base.
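
For reference, the kind of call-site change this implies, assuming
{{mayHaveMoreCellsInRow()}} is the replacement introduced alongside the
deprecation in HBASE-17599 (illustrative class, not an actual call site from
the code base):

{noformat}
import org.apache.hadoop.hbase.client.Result;

// Illustrative sketch: call sites move off the deprecated isPartial().
final class PartialResultUsage {
  static boolean rowIsComplete(Result r) {
    // before: return !r.isPartial();
    return !r.mayHaveMoreCellsInRow();
  }
}
{noformat}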





[jira] [Created] (HBASE-17633) Update unflushed sequence id in SequenceIdAccounting after flush with the minimum sequence id in memstore

2017-02-11 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-17633:
-

 Summary: Update unflushed sequence id in SequenceIdAccounting 
after flush with the minimum sequence id in memstore
 Key: HBASE-17633
 URL: https://issues.apache.org/jira/browse/HBASE-17633
 Project: HBase
  Issue Type: Improvement
Reporter: Duo Zhang


Now the tracking work is done by SequenceIdAccounting, and it is a little 
tricky when dealing with flush. We should remove the mapping for the given 
stores of a region from lowestUnflushedSequenceIds so that we have space to 
store the new lowest unflushed sequence id after flush. But we still need to 
keep the old sequence ids in another map, as we still need those values when 
reporting to the master to prevent data loss (think of the scenario where we 
report the new lowest unflushed sequence id to the master and then crash before 
actually flushing the data to disk).

And when reviewing HBASE-17407, I found that for CompactingMemStore we have 
to record the minimum sequence id in the memstore. We could just update the 
mappings in SequenceIdAccounting after flush. This means we do not need to 
update the lowest unflushed sequence id in SequenceIdAccounting before the 
flush completes, do not need to make space for the new lowest unflushed value 
at startCacheFlush time, and do not need the extra map to store the old mappings.

This could simplify our logic a lot. But this is a fundamental change, so I 
need some time to implement it, especially for modifying tests... And I also 
need some time to check whether I have missed something.
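
A very rough sketch of the proposed accounting (names and structures below are
illustrative only and are not the actual SequenceIdAccounting fields):

{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: keep a single map and overwrite a store's entry after
// the flush completes, using the minimum sequence id still in the memstore,
// instead of removing the entry at startCacheFlush and parking the old value
// in a second map for master reports.
final class LowestUnflushedSketch {
  // "encodedRegionName/store" -> lowest unflushed sequence id
  private final Map<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

  void onFlushCompleted(String regionStore, long minSeqIdStillInMemstore) {
    if (minSeqIdStillInMemstore == Long.MAX_VALUE) {
      lowestUnflushed.remove(regionStore); // nothing left in the memstore
    } else {
      lowestUnflushed.put(regionStore, minSeqIdStillInMemstore);
    }
  }

  // The value reported to the master never moves past edits that are still
  // only in the memstore, so a crash before the flush persists cannot lose data.
  Long lowestFor(String regionStore) {
    return lowestUnflushed.get(regionStore);
  }
}
{noformat}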





[jira] [Created] (HBASE-17632) Modify example health script to work on CentOS 6 etc.

2017-02-11 Thread Lars George (JIRA)
Lars George created HBASE-17632:
---

 Summary: Modify example health script to work on CentOS 6 etc.
 Key: HBASE-17632
 URL: https://issues.apache.org/jira/browse/HBASE-17632
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Reporter: Lars George








[jira] [Created] (HBASE-17631) Canary interval too low

2017-02-11 Thread Lars George (JIRA)
Lars George created HBASE-17631:
---

 Summary: Canary interval too low
 Key: HBASE-17631
 URL: https://issues.apache.org/jira/browse/HBASE-17631
 Project: HBase
  Issue Type: Bug
  Components: canary
Affects Versions: 1.3.1
Reporter: Lars George


The interval currently is {{6000}} milliseconds, or six seconds; it makes 
little sense to test that often in succession. We should set the default to at 
least 60 seconds, or even every 5 minutes?
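
If the {{DEFAULT_INTERVAL = 6000}} constant quoted in the dev thread later in
this digest is the knob in question, the change could be as small as the
following (illustrative; the value is handed to Thread.sleep(), i.e. it is in
milliseconds):

{noformat}
// Illustrative sketch of the proposed default bump.
final class CanaryDefaults {
  // before: 6000L -- only 6 seconds between check rounds
  static final long DEFAULT_INTERVAL = 60000L; // 60 seconds
}
{noformat}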





[jira] [Created] (HBASE-17630) Health Script not shutting down server process with certain script behavior

2017-02-11 Thread Lars George (JIRA)
Lars George created HBASE-17630:
---

 Summary: Health Script not shutting down server process with 
certain script behavior
 Key: HBASE-17630
 URL: https://issues.apache.org/jira/browse/HBASE-17630
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 1.3.1
Reporter: Lars George


As discussed on dev@...

I tried the supplied {{healthcheck.sh}}, but did not have {{snmpd}} running.
That caused the script to take a long time to error out, which exceeded
the 10 seconds the check was meant to run within. That resets the check, and
it keeps reporting the error but never stops the servers:

{noformat}
2017-02-04 05:55:08,962 INFO
[regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
hbase.HealthCheckChore: Health Check Chore runs every 10sec
2017-02-04 05:55:08,975 INFO
[regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
hbase.HealthChecker: HealthChecker initialized with script at
/opt/hbase/bin/healthcheck.sh, timeout=6

...

2017-02-04 05:55:50,435 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
ERROR check link, OK: disks ok,

2017-02-04 05:55:50,436 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2017-02-04 05:55:50,437 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore:
slave-1.internal.larsgeorge.com,16020,1486216506007-MemstoreFlusherChore
missed its start time
2017-02-04 05:55:50,438 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:20,522 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
ERROR check link, OK: disks ok,

2017-02-04 05:56:20,523 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:56:50,600 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
ERROR check link, OK: disks ok,

2017-02-04 05:56:50,600 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:20,681 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
ERROR check link, OK: disks ok,

2017-02-04 05:57:20,681 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:57:50,763 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
ERROR check link, OK: disks ok,

2017-02-04 05:57:50,764 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:20,844 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
ERROR check link, OK: disks ok,

2017-02-04 05:58:20,844 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:58:50,923 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 50sec :
ERROR check link, OK: disks ok,

2017-02-04 05:58:50,923 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
2017-02-04 05:59:21,017 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.HealthCheckChore: Health status at 412837hrs, 59mins, 21sec :
ERROR check link, OK: disks ok,

2017-02-04 05:59:21,018 INFO
[slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
hbase.ScheduledChore: Chore: HealthChecker missed its start time
{noformat}

We need to fix the handling of the health check script's timeout and how 
the chore treats it when deciding to shut down the server process. The current 
settings of check frequency and timeout overlap and cause the above.
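
A simplified sketch of the failure-window accounting that can produce this
behavior (field names below are illustrative, not the actual HealthCheckChore
code): when each failed run takes longer than the chore period, runs get
skipped, the window elapses before the threshold is reached, and the counter
keeps resetting, so the abort never fires.

{noformat}
// Illustrative sketch only.
final class HealthWindowSketch {
  private final int failureThreshold = 3;     // failures needed before aborting
  private final long chorePeriodMs = 10_000;  // how often the check should run
  private final long failureWindowMs = failureThreshold * chorePeriodMs; // 30s

  private int failures;
  private long firstFailureTime = -1;

  /** Returns true when the server should be stopped. */
  boolean onCheckFailed(long nowMs) {
    if (firstFailureTime < 0 || nowMs - firstFailureTime > failureWindowMs) {
      // Window already elapsed -> start counting from scratch. If the script
      // timeout is >= the chore period, every failure lands here, so the
      // threshold is never reached and the server is never stopped.
      firstFailureTime = nowMs;
      failures = 1;
      return false;
    }
    return ++failures >= failureThreshold;
  }
}
{noformat}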





Re: Health Script does not stop region server

2017-02-11 Thread Lars George
Will do, thanks


On Sun, Feb 5, 2017 at 3:57 PM, Ted Yu  wrote:
> Yesterday I tried snmpwalk on CentOS as well - same behavior.
>
> Lars:
> Can you file a JIRA to fix the bug ?
>
> Thanks
>
> On Sun, Feb 5, 2017 at 2:22 AM, Lars George  wrote:
>
>> Hi Ted,
>>
>> This does not work on Mac as provided. I tried on a CentOS 6 machine,
>> and had to install net-snmp and net-snmp-utils, plus start the snmpd
>> to make it time out quicker. But even there the snmpwalk returned
>> nothing, making the script fail.
>>
>> Anyhow, the snmpwalk failing after the retries is just an example of
>> what can happen if the health check script takes too long to fail. The
>> bottom line is that it does _not_ stop the server as expected, because our
>> check in the code is reset due to the chore's delay. That is a bug
>> methinks.
>>
>> Or, in other words, when I fixed the snmpwalk to come back quickly as
>> explained above, the error was caught in time and the server stopped
>> as expected.
>>
>> Makes sense?
>>
>> Lars
>>
>> On Sat, Feb 4, 2017 at 4:30 PM, Ted Yu  wrote:
>> > Running the command from the script locally (on Mac):
>> >
>> > $ /usr/bin/snmpwalk -t 5 -Oe  -Oq  -Os -v 1 -c public localhost if
>> > Timeout: No Response from localhost
>> > $ echo $?
>> > 1
>> >
>> > Looks like the script should parse the output from snmpwalk and provide
>> > some hint if an unexpected result is reported.
>> >
>> > Cheers
>> >
>> > On Sat, Feb 4, 2017 at 6:40 AM, Lars George 
>> wrote:
>> >
>> >> Hi,
>> >>
>> >> I tried the supplied `healthcheck.sh`, but did not have snmpd running.
>> >> That caused the script to take a long time to error out, which exceed
>> >> the 10 seconds the check was meant to run. That resets the check and
>> >> it keeps reporting the error, but never stops the servers:
>> >>
>> >> 2017-02-04 05:55:08,962 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthCheckChore: Health Check Chore runs every 10sec
>> >> 2017-02-04 05:55:08,975 INFO
>> >> [regionserver/slave-1.internal.larsgeorge.com/10.0.10.10:16020]
>> >> hbase.HealthChecker: HealthChecker initialized with script at
>> >> /opt/hbase/bin/healthcheck.sh, timeout=6
>> >>
>> >> ...
>> >>
>> >> 2017-02-04 05:55:50,435 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 55mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:55:50,436 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: CompactionChecker missed its start time
>> >> 2017-02-04 05:55:50,437 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore:
>> >> slave-1.internal.larsgeorge.com,16020,1486216506007-
>> MemstoreFlusherChore
>> >> missed its start time
>> >> 2017-02-04 05:55:50,438 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:20,522 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:20,523 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 56mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:56:50,600 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_2]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 20sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:20,681 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:57:50,763 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 57mins, 50sec :
>> >> ERROR check link, OK: disks ok,
>> >>
>> >> 2017-02-04 05:57:50,764 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.ScheduledChore: Chore: HealthChecker missed its start time
>> >> 2017-02-04 05:58:20,844 INFO
>> >> [slave-1.internal.larsgeorge.com,16020,1486216506007_ChoreService_1]
>> >> hbase.HealthCheckChore: Health status at 412837hrs, 58mins, 20sec :
>> >> ERROR check 

Re: Canary Test Tool and write sniffing

2017-02-11 Thread Lars George
Please keep in mind we are talking about two issues here:

1) The short default interval time, and
2) the issue that the canary table regions might not be on all servers.

Has anyone here tried write sniffing on a current cluster with the
SLB and seen it work?

Best,
Lars


On Mon, Feb 6, 2017 at 10:38 PM, Enis Söztutar  wrote:
> Open an issue?
> Enis
>
> On Mon, Feb 6, 2017 at 9:39 AM, Stack  wrote:
>
>> On Sun, Feb 5, 2017 at 2:25 AM, Lars George  wrote:
>>
>> > The next example is wrong too, claiming to show 60 secs, while it
>> > shows 600 secs (the default value as well).
>> >
>> > The question is still, what is a good value for intervals? Anyone here
>> > that uses the Canary that would like to chime in?
>> >
>> >
>> I was hanging out with a user whose mid-sized cluster was running Canary
>> with defaults; the regionserver carrying meta was at 100% CPU because
>> of all the requests from Canary doing repeated full-table scans.
>>
>> 6 seconds is too short. Seems like a typo that should be 60 seconds. It is
>> not as though the Canary is going to do anything about it if it finds
>> something wrong.
>>
>> S
>>
>>
>>
>>
>> > On Sat, Feb 4, 2017 at 5:40 PM, Ted Yu  wrote:
>> > > A brief search on HBASE-4393 didn't reveal why the interval was
>> > > shortened.
>> > >
>> > > If you read the first paragraph of:
>> > > http://hbase.apache.org/book.html#_run_canary_test_as_daemon_mode
>> > >
>> > > possibly the reasoning was that canary would exit upon seeing some
>> > > error (the first time).
>> > >
>> > > BTW There was a mismatch in the description for this command: (5
>> > > seconds vs. 5 milliseconds)
>> > >
>> > > ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
>> > >
>> > >
>> > > On Sat, Feb 4, 2017 at 8:21 AM, Lars George 
>> > wrote:
>> > >
>> > >> Oh right, Ted. An earlier patch attached to the JIRA had 60 secs, the
>> > >> last one has 6 secs. Am I reading this right? It hands 6000 into the
>> > >> Thread.sleep() call, which takes millisecs. So that makes 6 secs
>> > >> between checks, which seems super short, no? I might just be dull here.
>> > >>
>> > >> On Sat, Feb 4, 2017 at 5:00 PM, Ted Yu  wrote:
>> > >> > For the default interval , if you were looking at:
>> > >> >
>> > >> >   private static final long DEFAULT_INTERVAL = 6000;
>> > >> >
>> > >> > The above was from:
>> > >> >
>> > >> > HBASE-4393 Implement a canary monitoring program
>> > >> >
>> > >> > which was integrated on Tue Apr 24 07:20:16 2012
>> > >> >
>> > >> > FYI
>> > >> >
>> > >> > On Sat, Feb 4, 2017 at 4:06 AM, Lars George 
>> > >> wrote:
>> > >> >
>> > >> >> Also, the default interval used to be 60 secs, but is now 6 secs.
>> > >> >> Does that make sense? Seems awfully short for a default, assuming
>> > >> >> you have many regions or servers.
>> > >> >>
>> > >> >> On Sat, Feb 4, 2017 at 11:54 AM, Lars George <
>> lars.geo...@gmail.com>
>> > >> >> wrote:
>> > >> >> > Hi,
>> > >> >> >
>> > >> >> > Looking at the Canary tool, it tries to ensure that all canary
>> > >> >> > test table regions are spread across all region servers. If that
>> > >> >> > is not the case, it calls:
>> > >> >> >
>> > >> >> > if (numberOfCoveredServers < numberOfServers) {
>> > >> >> >   admin.balancer();
>> > >> >> > }
>> > >> >> >
>> > >> >> > I doubt this will help with the StochasticLoadBalancer, which is
>> > >> >> > known to consider per-table balancing as one of many factors. In
>> > >> >> > practice, the SLB will most likely _not_ distribute the canary
>> > >> >> > regions sufficiently, leaving a gap in the check. Switching on the
>> > >> >> > per-table option is discouraged, so as to let the SLB do its thing.
>> > >> >> >
>> > >> >> > Just pointing it out for vetting.
>> > >> >> >
>> > >> >> > Lars
>> > >> >>
>> > >>
>> >
>>