[jira] [Created] (HBASE-19943) Only allow removing sync replication peer which is in DA state

2018-02-05 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-19943:
-

 Summary: Only allow removing sync replication peer which is in DA 
state
 Key: HBASE-19943
 URL: https://issues.apache.org/jira/browse/HBASE-19943
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang


This is to simplify the logic of RemovePeerProcedure. Otherwise we may also need to 
reopen regions, which would mean RemovePeerProcedure could no longer fit both sync 
and normal replication peers.
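
For illustration only, a minimal self-contained sketch of the kind of pre-check this 
implies (all names below are assumptions, not the actual HBase API or patch):

{code:java}
// Sketch: only a sync replication peer already downgraded to DA may be removed,
// so RemovePeerProcedure never has to reopen regions itself.
public class RemoveSyncPeerCheck {
  enum SyncState { ACTIVE, DOWNGRADE_ACTIVE, STANDBY }   // illustrative, not the real enum

  static void checkRemovable(boolean isSyncPeer, SyncState state) {
    if (isSyncPeer && state != SyncState.DOWNGRADE_ACTIVE) {
      throw new IllegalStateException("Refusing to remove sync replication peer in state "
          + state + "; transition it to DOWNGRADE_ACTIVE (DA) first");
    }
  }

  public static void main(String[] args) {
    checkRemovable(true, SyncState.DOWNGRADE_ACTIVE); // allowed
    checkRemovable(true, SyncState.STANDBY);          // rejected with IllegalStateException
  }
}
{code}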



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


RE: Considering branching for 1.5 and other branch-1 release planning

2018-02-05 Thread ashish singhi
We at Huawei have been testing this for more than 8 months now and did not find 
any critical issues thus far. We have also launched a service on Huawei public 
cloud which is based on HBase 1.3.1.
With that, I am +1 on moving the stable pointer.

Regards,
Ashish

-Original Message-
From: Zach York [mailto:zyork.contribut...@gmail.com] 
Sent: Tuesday, February 06, 2018 3:28 AM
To: dev@hbase.apache.org
Subject: Re: Considering branching for 1.5 and other branch-1 release planning

> If someone else is using 1.3 your feedback would be very
valuable.

EMR has shipped the 1.3 line since EMR 5.4.0 (March 08, 2017). We have not run 
into any unresolved critical issues and it has been fairly stable overall.

I would be a +1 on moving the stable pointer.

Thanks,
Zach

On Mon, Feb 5, 2018 at 1:30 PM, Andrew Purtell 
wrote:

> Thanks so much for the feedback Francis.
>
> I think we are just about there to move the stable pointer.
>
>
> On Feb 5, 2018, at 9:48 AM, Francis Liu  wrote:
>
> >> If someone else is using 1.3 your feedback would be very
> > valuable.
> >
> > We are running 1.3 in production, full rollout ongoing. Ran into 
> > some issues but it's generally been stable. We'll prolly gonna be on 
> > 1.3 for a while.
> >
> > Cheers,
> > Francis
> >
> >
> >> On Sun, Feb 4, 2018 at 10:59 AM Andrew Purtell 
> >> 
> wrote:
> >>
> >> Hi Ted,
> >>
> >> If Hadoop 3 support is in place for an (eventual) 1.5.0 release, I 
> >> think that would be great.
> >>
> >>
> >>> On Sun, Feb 4, 2018 at 10:55 AM, Ted Yu  wrote:
> >>>
> >>> Andrew:
> >>> Do you think making 1.5 release support hadoop 3 is among the goals ?
> >>>
> >>> Cheers
> >>>
> >>> On Fri, Feb 2, 2018 at 3:28 PM, Andrew Purtell 
> >>> 
> >>> wrote:
> >>>
>  The backport of RSGroups to branch-1 triggered the opening of the 
>  1.4
> >>> code
>  line as branch-1.4 and releases 1.4.0 and 1.4.1.
> 
>  After the commit of HBASE-19858 (Backport HBASE-14061 (Support
> CF-level
>  Storage Policy) to branch-1), storage policy aware file placement
> might
> >>> be
>  useful enough to trigger a new minor release from branch-1. This 
>  would
> >> be
>  branch-1.5, and at least release 1.5.0. I am not sure about this yet.
> >> It
>  needs testing. I'd like to mock up a couple of use cases and 
>  determine
> >> if
>  what we have is sufficient on its own or more changes will be needed.
> I
>  want to get the idea of a 1.5 on your radar. though.
> 
>  Also, I would like to make one more release of branch-1.3 before 
>  we
> >>> retire
>  it. Mikhail passed the reins. We might have a volunteer to RM 1.3.2.
> If
>  not, I will do it. I'm expecting 1.4 will supersede 1.3 but this 
>  will
> >> be
>  decided organically depending on uptake.
> 
>  --
>  Best regards,
>  Andrew
> 
>  Words like orphans lost among the crosstalk, meaning torn from 
>  truth's decrepit hands
>    - A23, Crosstalk
> 
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from 
> >> truth's decrepit hands
> >>   - A23, Crosstalk
> >>
>


[jira] [Created] (HBASE-19942) Fix flaky TestSimpleRpcScheduler

2018-02-05 Thread Guanghao Zhang (JIRA)
Guanghao Zhang created HBASE-19942:
--

 Summary: Fix flaky TestSimpleRpcScheduler
 Key: HBASE-19942
 URL: https://issues.apache.org/jira/browse/HBASE-19942
 Project: HBase
  Issue Type: Sub-task
Reporter: Guanghao Zhang


[https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html]

 

https://builds.apache.org/job/HBASE-Flaky-Tests-branch2.0/1387/testReport/junit/org.apache.hadoop.hbase.ipc/TestSimpleRpcScheduler/testSoftAndHardQueueLimits/
 
h3. Stacktrace

java.lang.AssertionError at 
org.apache.hadoop.hbase.ipc.TestSimpleRpcScheduler.testSoftAndHardQueueLimits(TestSimpleRpcScheduler.java:451)
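
The failure is a bare boolean assertion evaluated at a single instant; if the intent 
is to wait for the scheduler to process the queued calls, a retry loop makes the 
check robust. A minimal generic sketch (not the actual TestSimpleRpcScheduler code), 
assuming JUnit 4:

{code:java}
import static org.junit.Assert.assertTrue;

// Sketch of a flaky-resistant assertion: poll the condition for up to a timeout
// instead of asserting it once, giving the scheduler threads time to drain the queue.
final class AssertEventually {
  static void assertEventually(long timeoutMs, java.util.function.BooleanSupplier condition)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return;
      }
      Thread.sleep(50);
    }
    assertTrue("condition not met within " + timeoutMs + " ms", condition.getAsBoolean());
  }
}
{code}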



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19941) Flaky TestCreateTableProcedure times out in nightly, needs to be moved to LargeTests

2018-02-05 Thread Umesh Agashe (JIRA)
Umesh Agashe created HBASE-19941:


 Summary: Flaky TestCreateTableProcedure times out in nightly, needs to 
be moved to LargeTests
 Key: HBASE-19941
 URL: https://issues.apache.org/jira/browse/HBASE-19941
 Project: HBase
  Issue Type: Bug
  Components: build
Affects Versions: 2.0.0-beta-1
Reporter: Umesh Agashe
Assignee: Umesh Agashe
 Fix For: 2.0.0-beta-2


Currently it's categorized as MediumTests, but sometimes running all tests in this 
class takes more than 180 seconds. Here is a comparison of runtimes between 
local runs (on my dev machine) and nightly runs:
||Test||Local (seconds)||Nightly (seconds)||
|testSimpleCreateWithSplits|~1.5|~12|
|testRollbackAndDoubleExecutionOnMobTable|~4.7|~21|
|testSimpleCreate|~1.7|~11|
|testRollbackAndDoubleExecution|~4.3|~18|
|testMRegions|~26.4|Timed out after 90 seconds|
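
A sketch of what the re-categorization typically looks like in HBase (the exact set 
of categories on TestCreateTableProcedure is an assumption; only the switch from 
MediumTests to LargeTests is the point):

{code:java}
import org.apache.hadoop.hbase.testclassification.LargeTests;
import org.apache.hadoop.hbase.testclassification.MasterTests;
import org.junit.experimental.categories.Category;

// Moving the test from MediumTests to LargeTests gives it the longer surefire
// timeout budget used for the "large" fork instead of the 180s medium budget.
@Category({ MasterTests.class, LargeTests.class })
public class TestCreateTableProcedure {
  // ... existing test methods unchanged ...
}
{code}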

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-19841) Tests against hadoop3 fail with StreamLacksCapabilityException

2018-02-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reopened HBASE-19841:
---

Reopening. Breaks launching of MR jobs on a cluster. Here is what a good launch 
looks like:
{code}
...
18/02/05 17:11:33 INFO impl.YarnClientImpl: Submitted application 
application_1517369646236_0009
18/02/05 17:11:33 INFO mapreduce.Job: The url to track the job: 
http://ve0524.halxg.cloudera.com:10134/proxy/application_1517369646236_0009/
18/02/05 17:11:33 INFO mapreduce.Job: Running job: job_1517369646236_0009
18/02/05 17:11:40 INFO mapreduce.Job: Job job_1517369646236_0009 running in 
uber mode : false
18/02/05 17:11:40 INFO mapreduce.Job:  map 0% reduce 0%
18/02/05 17:11:57 INFO mapreduce.Job:  map 14% reduce 0%
...
{code}

... but now it does this

{code}
18/02/05 17:17:54 INFO mapreduce.Job: The url to track the job: 
http://ve0524.halxg.cloudera.com:10134/proxy/application_1517369646236_0011/
18/02/05 17:17:54 INFO mapreduce.Job: Running job: job_1517369646236_0011
18/02/05 17:17:56 INFO mapreduce.Job: Job job_1517369646236_0011 running in 
uber mode : false
18/02/05 17:17:56 INFO mapreduce.Job:  map 0% reduce 0%
18/02/05 17:17:56 INFO mapreduce.Job: Job job_1517369646236_0011 failed with 
state FAILED due to: Application application_1517369646236_0011 failed 2 times 
due to AM Container for appattempt_1517369646236_0011_02 exited with  
exitCode: -1000
Failing this attempt.Diagnostics: File 
file:/tmp/stack/.staging/job_1517369646236_0011/job.splitmetainfo does not exist
java.io.FileNotFoundException: File 
file:/tmp/stack/.staging/job_1517369646236_0011/job.splitmetainfo does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

For more detailed output, check the application tracking page: 
http://ve0524.halxg.cloudera.com:8188/applicationhistory/app/application_1517369646236_0011
 Then click on links to logs of each attempt.
. Failing the application.
18/02/05 17:17:56 INFO mapreduce.Job: Counters: 0
{code}

If I revert this patch, the submit runs again.

I'd made the staging dir /tmp/stack and seemed to get further... The job 
staging was made in the local fs... but it seems like we are then looking for 
it in HDFS. My guess is that our stamping the fs as local until the mini HDFS 
cluster starts works for the unit test case, but it messes up the filesystem 
inference that allows the above submission to work.

I'd like to revert this if that's ok.
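
A rough illustration of the failure mode described above, using only generic Hadoop 
client APIs (the staging path is just the example from the log; nothing here is the 
actual patch):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: show which filesystem a job-submission conf resolves the staging dir against.
// If fs.defaultFS has been stamped to file:/// for the tests, the staging dir is
// written to the local fs even though YARN later resolves the same path in HDFS.
public class StagingFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    FileSystem fs = FileSystem.get(conf);
    Path staging = new Path("/tmp/stack/.staging");  // example path from the log above
    System.out.println("staging resolves to: " + fs.makeQualified(staging));
  }
}
{code}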

> Tests against hadoop3 fail with StreamLacksCapabilityException
> --
>
> Key: HBASE-19841
> URL: https://issues.apache.org/jira/browse/HBASE-19841
> Project: HBase
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Mike Drob
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: 19841.007.patch, 19841.06.patch, 19841.v0.txt, 
> 19841.v1.txt, HBASE-19841.v10.patch, HBASE-19841.v11.patch, 
> HBASE-19841.v11.patch, HBASE-19841.v2.patch, HBASE-19841.v3.patch, 
> HBASE-19841.v4.patch, HBASE-19841.v5.patch, HBASE-19841.v7.patch, 
> HBASE-19841.v8.patch, HBASE-19841.v8.patch, HBASE-19841.v8.patch, 
> HBASE-19841.v9.patch
>
>
> The following can be observed running against hadoop3:
> {code}
> java.io.IOException: cannot get log writer
>   at 
> org.apache.hadoop.hbase.regionserver.TestCompactingMemStore.compactingSetUp(TestCompactingMemStor

[jira] [Reopened] (HBASE-19927) TestFullLogReconstruction flakey

2018-02-05 Thread Duo Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang reopened HBASE-19927:
---

A bit strange:

{noformat}
2018-02-05 19:05:43,537 INFO  [Time-limited test] 
regionserver.HRegionServer(2116): * STOPPING region server 
'asf903.gq1.ygridcore.net,57911,1517857533524' *
2018-02-05 19:05:43,537 INFO  [Time-limited test] 
regionserver.HRegionServer(2130): STOPPED: Shutdown requested
2018-02-05 19:05:43,538 INFO  [Time-limited test] 
regionserver.HRegionServer(2116): * STOPPING region server 
'asf903.gq1.ygridcore.net,50054,1517857533606' *
2018-02-05 19:05:43,538 INFO  [RS:0;asf903:57911] 
regionserver.SplitLogWorker(160): Sending interrupt to stop the worker thread
2018-02-05 19:05:43,538 INFO  [Time-limited test] 
regionserver.HRegionServer(2130): STOPPED: Shutdown requested
2018-02-05 19:05:43,538 INFO  [Time-limited test] 
regionserver.HRegionServer(2116): * STOPPING region server 
'asf903.gq1.ygridcore.net,42069,1517857533678' *

2018-02-05 19:05:43,974 ERROR [regionserver/asf903:0.logRoller] 
helpers.MarkerIgnoringBase(159): * ABORTING region server 
asf903.gq1.ygridcore.net,57911,1517857533524: IOE in log roller *
{noformat}

The abort still happens after the stop during shutdown. Let me check.
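
One self-contained sketch of the guard this hints at (the interface below is 
illustrative, standing in for the region server handle the log roller holds): only 
escalate a log-roll IOException to an abort if the server has not already been asked 
to stop.

{code:java}
// Sketch: if shutdown is already in progress, an IOE from rolling the WAL should not
// turn a clean stop into the "ABORTING ... IOE in log roller" seen in the test output.
interface ServerHandle {
  boolean isStopping();
  void abort(String why, Throwable cause);
}

final class LogRollFailureHandler {
  static void onRollFailure(ServerHandle server, java.io.IOException ioe) {
    if (server.isStopping()) {
      return; // already stopping; let the shutdown finish instead of aborting
    }
    server.abort("IOE in log roller", ioe);
  }
}
{code}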

> TestFullLogReconstruction flakey
> 
>
> Key: HBASE-19927
> URL: https://issues.apache.org/jira/browse/HBASE-19927
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
>Reporter: stack
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19927.patch, js, out
>
>
> Fails pretty frequently in hadoopqa builds.
> There is a recent hang in 
> org.apache.hadoop.hbase.TestFullLogReconstruction.tearDownAfterClass(TestFullLogReconstruction.java:68)
> In here... 
> https://builds.apache.org/job/PreCommit-HBASE-Build/11363/testReport/org.apache.hadoop.hbase/TestFullLogReconstruction/org_apache_hadoop_hbase_TestFullLogReconstruction/
> ... see here.
> Thread 1250 (RS_CLOSE_META-edd281aedb18:59863-0):
>   State: TIMED_WAITING
>   Blocked count: 92
>   Waited count: 278
>   Stack:
> java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:133)
> 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
> 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:605)
> 
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullAppendTransaction(WALUtil.java:154)
> 
> org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeFlushMarker(WALUtil.java:81)
> 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2645)
> 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2356)
> 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2328)
> 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2319)
> org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1531)
> org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1437)
> 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:104)
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> We missed a signal? We need to do an interrupt? The log is not all there in 
> hadoopqa builds so hard to see all that is going on. This test is not in the 
> flakey set either



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19915) From split/ merge procedures daughter/ merged regions get created in OFFLINE state

2018-02-05 Thread Appy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Appy resolved HBASE-19915.
--
Resolution: Fixed

> From split/ merge procedures daughter/ merged regions get created in OFFLINE 
> state
> --
>
> Key: HBASE-19915
> URL: https://issues.apache.org/jira/browse/HBASE-19915
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0-beta-1
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: hbase-19915.addendum.patch, 
> hbase-19915.master.001.patch, hbase-19915.master.001.patch
>
>
> See HBASE-19530. When regions are created initial state should be CLOSED. Bug 
> was discovered while debugging flaky test 
> TestSplitTableRegionProcedure#testRollbackAndDoubleExecution with numOfSteps 
> set to 4. After updating daughter regions in meta when master is restarted, 
> startup sequence of master assigns all OFFLINE regions. As daughter regions 
> are stored with OFFLINE state, daughter regions are assigned. This is 
> followed by re-assignment of daughter regions from resumed 
> SplitTableRegionProcedure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-19915) From split/ merge procedures daughter/ merged regions get created in OFFLINE state

2018-02-05 Thread Umesh Agashe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe reopened HBASE-19915:
--

As [~appy] pointed out, the daughterA region gets stored with state CLOSED twice.
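
A self-contained sketch of the intended end state (the MetaWriter interface below is 
hypothetical, standing in for MetaTableAccessor): each daughter/merged region is 
written to hbase:meta exactly once, in CLOSED rather than OFFLINE state, so a master 
restart does not assign it behind the procedure's back.

{code:java}
// Illustrative only: record split daughters once each, with CLOSED state.
enum MetaRegionState { OFFLINE, CLOSED, OPEN }

interface MetaWriter {
  void putRegionState(String encodedRegionName, MetaRegionState state);
}

final class RecordSplitDaughters {
  static void record(MetaWriter meta, String daughterA, String daughterB) {
    meta.putRegionState(daughterA, MetaRegionState.CLOSED); // written once, not twice
    meta.putRegionState(daughterB, MetaRegionState.CLOSED);
  }
}
{code}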

> From split/ merge procedures daughter/ merged regions get created in OFFLINE 
> state
> --
>
> Key: HBASE-19915
> URL: https://issues.apache.org/jira/browse/HBASE-19915
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0-beta-1
>Reporter: Umesh Agashe
>Assignee: Umesh Agashe
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: hbase-19915.master.001.patch, 
> hbase-19915.master.001.patch
>
>
> See HBASE-19530. When regions are created initial state should be CLOSED. Bug 
> was discovered while debugging flaky test 
> TestSplitTableRegionProcedure#testRollbackAndDoubleExecution with numOfSteps 
> set to 4. After updating daughter regions in meta when master is restarted, 
> startup sequence of master assigns all OFFLINE regions. As daughter regions 
> are stored with OFFLINE state, daughter regions are assigned. This is 
> followed by re-assignment of daughter regions from resumed 
> SplitTableRegionProcedure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Considering branching for 1.5 and other branch-1 release planning

2018-02-05 Thread Zach York
> If someone else is using 1.3 your feedback would be very
valuable.

EMR has shipped the 1.3 line since EMR 5.4.0 (March 08, 2017). We have not
run into any unresolved critical issues and it has been fairly stable
overall.

I would be a +1 on moving the stable pointer.

Thanks,
Zach

On Mon, Feb 5, 2018 at 1:30 PM, Andrew Purtell 
wrote:

> Thanks so much for the feedback Francis.
>
> I think we are just about there to move the stable pointer.
>
>
> On Feb 5, 2018, at 9:48 AM, Francis Liu  wrote:
>
> >> If someone else is using 1.3 your feedback would be very
> > valuable.
> >
> > We are running 1.3 in production, full rollout ongoing. Ran into some
> > issues but it's generally been stable. We'll prolly gonna be on 1.3 for a
> > while.
> >
> > Cheers,
> > Francis
> >
> >
> >> On Sun, Feb 4, 2018 at 10:59 AM Andrew Purtell 
> wrote:
> >>
> >> Hi Ted,
> >>
> >> If Hadoop 3 support is in place for an (eventual) 1.5.0 release, I think
> >> that would be great.
> >>
> >>
> >>> On Sun, Feb 4, 2018 at 10:55 AM, Ted Yu  wrote:
> >>>
> >>> Andrew:
> >>> Do you think making 1.5 release support hadoop 3 is among the goals ?
> >>>
> >>> Cheers
> >>>
> >>> On Fri, Feb 2, 2018 at 3:28 PM, Andrew Purtell 
> >>> wrote:
> >>>
>  The backport of RSGroups to branch-1 triggered the opening of the 1.4
> >>> code
>  line as branch-1.4 and releases 1.4.0 and 1.4.1.
> 
>  After the commit of HBASE-19858 (Backport HBASE-14061 (Support
> CF-level
>  Storage Policy) to branch-1), storage policy aware file placement
> might
> >>> be
>  useful enough to trigger a new minor release from branch-1. This would
> >> be
>  branch-1.5, and at least release 1.5.0. I am not sure about this yet.
> >> It
>  needs testing. I'd like to mock up a couple of use cases and determine
> >> if
>  what we have is sufficient on its own or more changes will be needed.
> I
>  want to get the idea of a 1.5 on your radar. though.
> 
>  Also, I would like to make one more release of branch-1.3 before we
> >>> retire
>  it. Mikhail passed the reins. We might have a volunteer to RM 1.3.2.
> If
>  not, I will do it. I'm expecting 1.4 will supersede 1.3 but this will
> >> be
>  decided organically depending on uptake.
> 
>  --
>  Best regards,
>  Andrew
> 
>  Words like orphans lost among the crosstalk, meaning torn from truth's
>  decrepit hands
>    - A23, Crosstalk
> 
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from truth's
> >> decrepit hands
> >>   - A23, Crosstalk
> >>
>


[jira] [Resolved] (HBASE-19931) TestMetaWithReplicas failing 100% of the time in testHBaseFsckWithMetaReplicas

2018-02-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-19931.
---
Resolution: Fixed

Re-resolving. Let HBASE-19840 track the latest findings regarding this test 
failing.

> TestMetaWithReplicas failing 100% of the time in testHBaseFsckWithMetaReplicas
> --
>
> Key: HBASE-19931
> URL: https://issues.apache.org/jira/browse/HBASE-19931
> Project: HBase
>  Issue Type: Sub-task
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19931.branch-2.001.patch
>
>
> Somehow we missed a test that depends on a run of HBCK. It fails 100% of the 
> time now because of HBASE-19726 Failed to start HMaster due to infinite 
> retrying on meta assign where we no longer update hbase:meta with the state 
> of hbase:meta; rather, hbase:meta's always-ENABLED state is inferred. It 
> broke HBCK here.
> So, disable the test and, just in case, add meta as ENABLED to hbck, though 
> hbck as-is is not for hbase2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-19907) TestMetaWithReplicas still flakey

2018-02-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-19907.
---
Resolution: Fixed

Resolving. HBASE-19931 has taken up the baton on the new failure types for 
TestMetaWithReplicas. Pushed to branch-2 and master.

> TestMetaWithReplicas still flakey
> -
>
> Key: HBASE-19907
> URL: https://issues.apache.org/jira/browse/HBASE-19907
> Project: HBase
>  Issue Type: Bug
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19907.master.001.patch
>
>
> Still fails because all meta replicas arrive at the same server even though 
> protection against this was supposedly added by me in HBASE-19840.
> ---
> Test set: org.apache.hadoop.hbase.client.TestMetaWithReplicas
> ---
> Tests run: 5, Failures: 0, Errors: 2, Skipped: 1, Time elapsed: 600.251 s <<< 
> FAILURE! - in org.apache.hadoop.hbase.client.TestMetaWithReplicas
> org.apache.hadoop.hbase.client.TestMetaWithReplicas  Time elapsed: 563.656 s  
> <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 
> seconds
>   at 
> org.apache.hadoop.hbase.client.TestMetaWithReplicas.shutdownMetaAndDoValidations(TestMetaWithReplicas.java:255)
>   at 
> org.apache.hadoop.hbase.client.TestMetaWithReplicas.testShutdownHandling(TestMetaWithReplicas.java:181)
> org.apache.hadoop.hbase.client.TestMetaWithReplicas  Time elapsed: 563.656 s  
> <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread 
> NIOServerCxn.Factory:0.0.0.0/0.0.0.0:49912
> The move of hbase:meta actually moves it back to the same server, which is no good.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Considering branching for 1.5 and other branch-1 release planning

2018-02-05 Thread Andrew Purtell
Thanks so much for the feedback Francis. 

I think we are just about there to move the stable pointer. 


On Feb 5, 2018, at 9:48 AM, Francis Liu  wrote:

>> If someone else is using 1.3 your feedback would be very
> valuable.
> 
> We are running 1.3 in production, full rollout ongoing. Ran into some
> issues but it's generally been stable. We'll prolly gonna be on 1.3 for a
> while.
> 
> Cheers,
> Francis
> 
> 
>> On Sun, Feb 4, 2018 at 10:59 AM Andrew Purtell  wrote:
>> 
>> Hi Ted,
>> 
>> If Hadoop 3 support is in place for an (eventual) 1.5.0 release, I think
>> that would be great.
>> 
>> 
>>> On Sun, Feb 4, 2018 at 10:55 AM, Ted Yu  wrote:
>>> 
>>> Andrew:
>>> Do you think making 1.5 release support hadoop 3 is among the goals ?
>>> 
>>> Cheers
>>> 
>>> On Fri, Feb 2, 2018 at 3:28 PM, Andrew Purtell 
>>> wrote:
>>> 
 The backport of RSGroups to branch-1 triggered the opening of the 1.4
>>> code
 line as branch-1.4 and releases 1.4.0 and 1.4.1.
 
 After the commit of HBASE-19858 (Backport HBASE-14061 (Support CF-level
 Storage Policy) to branch-1), storage policy aware file placement might
>>> be
 useful enough to trigger a new minor release from branch-1. This would
>> be
 branch-1.5, and at least release 1.5.0. I am not sure about this yet.
>> It
 needs testing. I'd like to mock up a couple of use cases and determine
>> if
 what we have is sufficient on its own or more changes will be needed. I
 want to get the idea of a 1.5 on your radar. though.
 
 Also, I would like to make one more release of branch-1.3 before we
>>> retire
 it. Mikhail passed the reins. We might have a volunteer to RM 1.3.2. If
 not, I will do it. I'm expecting 1.4 will supersede 1.3 but this will
>> be
 decided organically depending on uptake.
 
 --
 Best regards,
 Andrew
 
 Words like orphans lost among the crosstalk, meaning torn from truth's
 decrepit hands
   - A23, Crosstalk
 
>>> 
>> 
>> 
>> 
>> --
>> Best regards,
>> Andrew
>> 
>> Words like orphans lost among the crosstalk, meaning torn from truth's
>> decrepit hands
>>   - A23, Crosstalk
>> 


[jira] [Created] (HBASE-19940) TestMetaShutdownHandler flakey

2018-02-05 Thread stack (JIRA)
stack created HBASE-19940:
-

 Summary: TestMetaShutdownHandler flakey
 Key: HBASE-19940
 URL: https://issues.apache.org/jira/browse/HBASE-19940
 Project: HBase
  Issue Type: Sub-task
Reporter: stack


Fails 13% of the time. One of the RSs won't go down; it has an errant thread 
running. Not sure what it is.
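
A quick way to see which thread keeps a region server JVM alive is to dump all live 
threads with their stacks; a generic JDK-only sketch (not tied to the HBase test 
harness):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dump every live thread with its stack; the errant thread still running after
// shutdown was requested is the one preventing the region server from exiting.
public class DumpLiveThreads {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
      System.out.println(info.getThreadName() + " state=" + info.getThreadState());
      for (StackTraceElement frame : info.getStackTrace()) {
        System.out.println("    at " + frame);
      }
    }
  }
}
{code}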



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19939) TestSplitTableRegion#testSplitWithoutPONR() and testRecoveryAndDoubleExecution() are failing with NPE

2018-02-05 Thread Umesh Agashe (JIRA)
Umesh Agashe created HBASE-19939:


 Summary: TestSplitTableRegion#testSplitWithoutPONR() and 
testRecoveryAndDoubleExecution() are failing with NPE
 Key: HBASE-19939
 URL: https://issues.apache.org/jira/browse/HBASE-19939
 Project: HBase
  Issue Type: Improvement
  Components: amv2
Affects Versions: 2.0.0-beta-1
Reporter: Umesh Agashe
Assignee: Umesh Agashe
 Fix For: 2.0.0-beta-2


Error is:
{code:java}
java.lang.AssertionError: found exception: java.lang.NullPointerException via 
CODE-BUG: Uncaught runtime exception: pid=154, 
state=RUNNABLE:SPLIT_TABLE_REGION_CREATE_DAUGHTER_REGIONS; 
SplitTableRegionProcedure table=testRecoveryAndDoubleExecution, 
parent=3d8d459ba395c2cf6b1e5c71aca92cfd, 
daughterA=c6531c10effa8e542159ab82a87bd75e, 
daughterB=ee34a9af88273b6c06e1a688fc50ed6e:java.lang.NullPointerException: 
at 
org.apache.hadoop.hbase.master.assignment.TestSplitTableRegionProcedure.testRecoveryAndDoubleExecution(TestSplitTableRegionProcedure.java:411){code}
Exception from the output file:
{code:java}
2018-02-05 18:00:48,205 ERROR [PEWorker-1] procedure2.ProcedureExecutor(1480): 
CODE-BUG: Uncaught runtime exception: pid=19, 
state=RUNNABLE:SPLIT_TABLE_REGION_CREATE_DAUGHTER_REGIONS; 
SplitTableRegionProcedure table=testSplitWithoutPONR, 
parent=57114194fb486a3988b232bcf10eb177, 
daughterA=749aa83c03b8f7c6b642cd73c5b51e43, 
daughterB=a53ec69e8dd2cfa6c0be2b9a7eb271bb
java.lang.NullPointerException
at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.splitStoreFiles(SplitTableRegionProcedure.java:617)
at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.createDaughterRegions(SplitTableRegionProcedure.java:541)
at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:241)
at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:89)
at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:180)
at 
org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1455)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1224)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1734){code}
The value of 'htd' is null: it is initialized in the constructor, but when the 
object is deserialized it is left null.
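
A self-contained sketch of the lazy-reload pattern that avoids relying on a 
constructor-only field surviving procedure (de)serialization (the DescriptorSource 
interface below is hypothetical, standing in for the master's table-descriptor cache):

{code:java}
// Illustrative only: re-resolve the descriptor on demand instead of trusting a field
// that is set in the constructor but lost when the procedure is replayed from its
// serialized state.
interface TableDescriptorLike { /* whatever splitStoreFiles needs from the descriptor */ }

interface DescriptorSource {
  TableDescriptorLike get(String tableName);
}

final class SplitProcedureSketch {
  private final String tableName;
  private transient TableDescriptorLike htd;  // null after deserialization, hence the NPE

  SplitProcedureSketch(String tableName, TableDescriptorLike htd) {
    this.tableName = tableName;
    this.htd = htd;
  }

  TableDescriptorLike descriptor(DescriptorSource source) {
    if (htd == null) {                        // true when replayed after a (simulated) crash
      htd = source.get(tableName);
    }
    return htd;
  }
}
{code}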

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-19840) Flakey TestMetaWithReplicas

2018-02-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack reopened HBASE-19840:
---

I see some flakiness still. There is something weird going on. Two ServerNames 
seem to hash the same, which doesn't make sense (I made a test to try it). Reopening 
to figure it out. Pushing a bit more debug in the meantime.

Reopening.

> Flakey TestMetaWithReplicas
> ---
>
> Key: HBASE-19840
> URL: https://issues.apache.org/jira/browse/HBASE-19840
> Project: HBase
>  Issue Type: Sub-task
>  Components: flakey, test
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19840.master.001.patch, 
> HBASE-19840.master.001.patch
>
>
> Failing about 15% of the time..  In testShutdownHandling.. 
> [https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html]
>  
> Adding some debug. Its hard to follow what is going on in this test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Considering branching for 1.5 and other branch-1 release planning

2018-02-05 Thread Francis Liu
> If someone else is using 1.3 your feedback would be very
valuable.

We are running 1.3 in production, full rollout ongoing. Ran into some
issues but it's generally been stable. We'll probably be on 1.3 for a
while.

Cheers,
Francis


On Sun, Feb 4, 2018 at 10:59 AM Andrew Purtell  wrote:

> Hi Ted,
>
> If Hadoop 3 support is in place for an (eventual) 1.5.0 release, I think
> that would be great.
>
>
> On Sun, Feb 4, 2018 at 10:55 AM, Ted Yu  wrote:
>
> > Andrew:
> > Do you think making 1.5 release support hadoop 3 is among the goals ?
> >
> > Cheers
> >
> > On Fri, Feb 2, 2018 at 3:28 PM, Andrew Purtell 
> > wrote:
> >
> > > The backport of RSGroups to branch-1 triggered the opening of the 1.4
> > code
> > > line as branch-1.4 and releases 1.4.0 and 1.4.1.
> > >
> > > After the commit of HBASE-19858 (Backport HBASE-14061 (Support CF-level
> > > Storage Policy) to branch-1), storage policy aware file placement might
> > be
> > > useful enough to trigger a new minor release from branch-1. This would
> be
> > > branch-1.5, and at least release 1.5.0. I am not sure about this yet.
> It
> > > needs testing. I'd like to mock up a couple of use cases and determine
> if
> > > what we have is sufficient on its own or more changes will be needed. I
> > > want to get the idea of a 1.5 on your radar. though.
> > >
> > > Also, I would like to make one more release of branch-1.3 before we
> > retire
> > > it. Mikhail passed the reins. We might have a volunteer to RM 1.3.2. If
> > > not, I will do it. I'm expecting 1.4 will supersede 1.3 but this will
> be
> > > decided organically depending on uptake.
> > >
> > > --
> > > Best regards,
> > > Andrew
> > >
> > > Words like orphans lost among the crosstalk, meaning torn from truth's
> > > decrepit hands
> > >- A23, Crosstalk
> > >
> >
>
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>- A23, Crosstalk
>


[jira] [Resolved] (HBASE-19837) Flakey TestRegionLoad

2018-02-05 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-19837.
---
Resolution: Fixed
  Assignee: stack

Resolving. Will open new issue if still flakey to refactor the test.

> Flakey TestRegionLoad
> -
>
> Key: HBASE-19837
> URL: https://issues.apache.org/jira/browse/HBASE-19837
> Project: HBase
>  Issue Type: Sub-task
>  Components: flakey, test
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: 
> 0001-HBASE-19837-Flakey-TestRegionLoad-ADDENDUM-Report-mo.patch, 
> 0001-HBASE-19837-Flakey-TestRegionLoad.patch, HBASE-19837.branch-2.001.patch
>
>
> This one fails the most in the flakey list.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19938) Allow write requests from replication but reject write requests from user clients when in S state

2018-02-05 Thread Zheng Hu (JIRA)
Zheng Hu created HBASE-19938:


 Summary: Allow write requests from replication but reject write 
requests from user clients when in S state.
 Key: HBASE-19938
 URL: https://issues.apache.org/jira/browse/HBASE-19938
 Project: HBase
  Issue Type: Sub-task
Reporter: Zheng Hu
Assignee: Zheng Hu


According to the doc, we should reject write requests when in S state; however, 
the replication data from the master cluster turns into a batch mutation request 
(which is a write request).

So, for a peer in S state, we should distinguish write requests coming from 
replication from those coming from user clients.
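
A self-contained sketch of the distinction being asked for (all names are 
illustrative, not the actual RSRpcServices code): a region whose peer is in STANDBY 
(S) state accepts a batch mutation only when it arrived through the replication sink.

{code:java}
// Illustrative only: reject user writes in S state, but let replicated writes through.
enum PeerSyncState { ACTIVE, DOWNGRADE_ACTIVE, STANDBY }

final class WriteAdmission {
  static void check(PeerSyncState state, boolean fromReplicationSink) {
    if (state == PeerSyncState.STANDBY && !fromReplicationSink) {
      throw new IllegalStateException(
          "Region is in STANDBY (S) sync replication state; user writes are rejected");
    }
  }
}
{code}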



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-19937) Enabling rsgroup can cause an NPE in CreateTableProcedure

2018-02-05 Thread Xiaolin Ha (JIRA)
Xiaolin Ha created HBASE-19937:
--

 Summary: Enabling rsgroup can cause an NPE in CreateTableProcedure
 Key: HBASE-19937
 URL: https://issues.apache.org/jira/browse/HBASE-19937
 Project: HBase
  Issue Type: Bug
  Components: rsgroup
Affects Versions: 2.0.0-beta-2
Reporter: Xiaolin Ha
Assignee: Xiaolin Ha


When rsgroup is enabled, it may throw an NPE as follows:

2018-02-02,16:12:45,688 ERROR 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception: pid=7, state=RUNNABLE:CREATE_TABLE_ASSIGN_REGIONS; 
CreateTableProcedure table=hbase:rsgroup
java.lang.NullPointerException
 at 
org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:254)
 at 
org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
 at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createRoundRobinAssignProcedures(AssignmentManager.java:603)
 at 
org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:108)
 at 
org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:51)
 at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:182)
 at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1227)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1738)

 

As a result of CreateTableProcedure.rollbackState, it may then log warnings with a 
TableExistsException as follows:

2018-02-02,16:12:55,503 WARN 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: 
Failed to perform check
java.io.IOException: Failed to create group table. 
org.apache.hadoop.hbase.TableExistsException: hbase:rsgroup
 at 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker.createRSGroupTable(RSGroupInfoManagerImpl.java:877)

 

After some auto-retries, the RSGroupStartupWorker thread keeps looping and prints 
logs as follows:

2018-02-02,16:23:17,626 INFO 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: 
RSGroup table=hbase:rsgroup isOnline=true, regionCount=0, assignCount=0, 
rootMetaFound=true
2018-02-02,16:23:17,730 INFO 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: 
RSGroup table=hbase:rsgroup isOnline=true, regionCount=0, assignCount=0, 
rootMetaFound=true
2018-02-02,16:23:17,834 INFO 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: 
RSGroup table=hbase:rsgroup isOnline=true, regionCount=0, assignCount=0, 
rootMetaFound=true
2018-02-02,16:23:17,937 INFO 
org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: 
RSGroup table=hbase:rsgroup isOnline=true, regionCount=0, assignCount=0, 
rootMetaFound=true

 

And when using the rsgroup shell commands, they report that rsgroup is currently in "offline mode".

 

The reason for this problem is that CreateTableProcedure uses 
RSGroupBasedLoadBalancer, whose member variables are initialized only after 
CreateTableProcedure itself completes.
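
A self-contained sketch of a guard that avoids this chicken-and-egg problem (names 
are illustrative, not the actual RSGroupBasedLoadBalancer code): while the rsgroup 
metadata, backed by the very hbase:rsgroup table this procedure is creating, is not 
yet online, fall back to the wrapped plain balancer instead of dereferencing an 
uninitialized manager.

{code:java}
// Illustrative only: skip group-aware assignment until the group info manager is online.
interface GroupInfoManagerLike { boolean isOnline(); }

interface BalancerLike {
  Object roundRobinAssignment(Object regions, Object servers);
}

final class GroupAwareBalancerSketch implements BalancerLike {
  private final GroupInfoManagerLike groupManager; // may be null/offline during startup
  private final BalancerLike fallback;

  GroupAwareBalancerSketch(GroupInfoManagerLike groupManager, BalancerLike fallback) {
    this.groupManager = groupManager;
    this.fallback = fallback;
  }

  @Override
  public Object roundRobinAssignment(Object regions, Object servers) {
    if (groupManager == null || !groupManager.isOnline()) {
      // Avoid the NPE: assign with the plain balancer until rsgroup metadata is available.
      return fallback.roundRobinAssignment(regions, servers);
    }
    // ... group-aware assignment would go here ...
    return fallback.roundRobinAssignment(regions, servers);
  }
}
{code}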

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)