date:20190122

Syeda Arshiya Tabreen created HBASE-21763:
-

 Summary: [HBCK2] hbck2 options does not work and throws exceptions
 Key: HBASE-21763
 URL: https://issues.apache.org/jira/browse/HBASE-21763
 Project: HBase
  Issue Type: Bug
  Components: hbck2
Affects Versions: hbck2-1.0.0
Reporter: Syeda Arshiya Tabreen
 Fix For: hbck2-1.0.0


HBCK2 options throws below exceptions when executed


1.* --version* option throws null pointer exception
2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
3. *--zookeeper.znode.parent* option throws IllegalArgumentException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21751) WAL creation fails during region open may cause region assign forever fail



[ 
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749607#comment-16749607
 ] 

Allan Yang commented on HBASE-21751:


{quote}
But if you do not use multi WAL, this will not cause a very big problem?
{quote}
We don not use multi WAL. Yes, no region on the RS before can cause this, but 
in our case, it's the meta wal, so the RS don't host the meta region before
{quote}
And we will retry a lot of times when rolling a WAL, so for your production, 
the first thing is that why we still fail after so many retries? The actual 
problem is on HDFS?
{quote}
Yes, it is HDFS causing this, it is because of disk full this time, but we have 
seen some other glitches in HDFS can cause roll log fail. Actually, the disk 
full problem is soon auto recovered after hfiles in archive dir deleted. But 
due to this issue, the meta region can not online forever.

> WAL creation fails during region open may cause region assign forever fail
> --
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> During the first region opens on the RS, WALFactory will create a WAL file, 
> but if the wal creation fails, in some cases, HDFS will leave a empty file in 
> the dir(e.g. disk full, file is created succesfully but block allocation 
> fails). We have a check in AbstractFSWAL that if WAL belong to the same 
> factory exists, then a error will be throw. Thus, the region can never be 
> open on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
> handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory 
> hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21763) [HBCK2] hbck2 options does not work and throws exceptions



[ 
https://issues.apache.org/jira/browse/HBASE-21763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749620#comment-16749620
 ] 

Syeda Arshiya Tabreen commented on HBASE-21763:
---

[~stack] ,I could not be able to assign this Jira to me, I would like to fix 
this.

> [HBCK2] hbck2 options does not work and throws exceptions
> -
>
> Key: HBASE-21763
> URL: https://issues.apache.org/jira/browse/HBASE-21763
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: hbck2-1.0.0
>Reporter: Syeda Arshiya Tabreen
>Priority: Minor
> Fix For: hbck2-1.0.0
>
>
> HBCK2 options throws below exceptions when executed
> 1. *--version* option throws NullPointerException
> 2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
> 3. *--zookeeper.znode.parent* option throws IllegalArgumentException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21763) [HBCK2] hbck2 options does not work and throws exceptions



 [ 
https://issues.apache.org/jira/browse/HBASE-21763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syeda Arshiya Tabreen updated HBASE-21763:
--
Description: 
HBCK2 options throws below exceptions when executed


1. *--version* option throws NullPointerException
2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
3. *--zookeeper.znode.parent* option throws IllegalArgumentException

  was:
HBCK2 options throws below exceptions when executed


1. *--version* option throws null pointer exception
2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
3. *--zookeeper.znode.parent* option throws IllegalArgumentException


> [HBCK2] hbck2 options does not work and throws exceptions
> -
>
> Key: HBASE-21763
> URL: https://issues.apache.org/jira/browse/HBASE-21763
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: hbck2-1.0.0
>Reporter: Syeda Arshiya Tabreen
>Priority: Minor
> Fix For: hbck2-1.0.0
>
>
> HBCK2 options throws below exceptions when executed
> 1. *--version* option throws NullPointerException
> 2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
> 3. *--zookeeper.znode.parent* option throws IllegalArgumentException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21763) [HBCK2] hbck2 options does not work and throws exceptions



 [ 
https://issues.apache.org/jira/browse/HBASE-21763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syeda Arshiya Tabreen updated HBASE-21763:
--
Description: 
HBCK2 options throws below exceptions when executed


1. *--version* option throws null pointer exception
2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
3. *--zookeeper.znode.parent* option throws IllegalArgumentException

  was:
HBCK2 options throws below exceptions when executed


1.* --version* option throws null pointer exception
2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
3. *--zookeeper.znode.parent* option throws IllegalArgumentException


> [HBCK2] hbck2 options does not work and throws exceptions
> -
>
> Key: HBASE-21763
> URL: https://issues.apache.org/jira/browse/HBASE-21763
> Project: HBase
>  Issue Type: Bug
>  Components: hbck2
>Affects Versions: hbck2-1.0.0
>Reporter: Syeda Arshiya Tabreen
>Priority: Minor
> Fix For: hbck2-1.0.0
>
>
> HBCK2 options throws below exceptions when executed
> 1. *--version* option throws null pointer exception
> 2. *--hbase.zookeeper.property.clientPort* option throws NumberFormatException
> 3. *--zookeeper.znode.parent* option throws IllegalArgumentException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's an quite time consuming.) to branch-1

2019-01-22 Thread Zheng Hu (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749591#comment-16749591
 ] 

Zheng Hu commented on HBASE-21748:
--

[~apurtell], Thanks for your patch, skimmed the patch.  Adding an atomic 
integer cellCount seems good.  there are some points need to be care: 
-  I think we should throw an UnsupportedOperationException in 
CellSkipListSet#size if the delegate is CSLM,   for avoiding others to call 
this time-consuming method in some read/write path without intention. 
-   for memstore#add, the  cellCount should increment only when putting CSLM 
successfully.   for memstore#upsert,  no need to increment the cellCount, but 
need to decrement the cell count if we ensure to remove useless cell version. 

Thanks.

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's an quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-21748-branch-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21715) Do not throw UnsupportedOperationException in ProcedureFuture.get



[ 
https://issues.apache.org/jira/browse/HBASE-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749601#comment-16749601
 ] 

Duo Zhang commented on HBASE-21715:
---

Please change your import order formatter? And also, better use a configured 
timeout instead of infinite wait? The timeout could be a bit long, i.e, 10 
minutes default?

> Do not throw UnsupportedOperationException in ProcedureFuture.get
> -
>
> Key: HBASE-21715
> URL: https://issues.apache.org/jira/browse/HBASE-21715
> Project: HBase
>  Issue Type: Task
>Reporter: Duo Zhang
>Assignee: Junhong Xu
>Priority: Major
> Attachments: HBase-21715.v01.patch
>
>
> This is really a bad practice, no one would expected that a Future does not 
> support get, and this can not be detected at compile time. Even though we do 
> not want user to wait for ever, we could set a long timeout, for example, 10 
> minutes,instead of throwing UnsuportedOperationException. I've already been 
> hurt many times...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21688) Address WAL filesystem issues



 [ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21688:
---
Labels: s3  (was: )

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, wal
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
>  Labels: s3
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.2.patch, HBASE-21688-amend.patch, 
> HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21751) WAL creation fails during region open may cause region assign forever fail



[ 
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749590#comment-16749590
 ] 

Duo Zhang commented on HBASE-21751:
---

But if you do not use multi WAL, this will not cause a very big problem? As 
there is no region on the RS? And we will retry a lot of times when rolling a 
WAL, so for your production, the first thing is that why we still fail after so 
many retries? The actual problem is on HDFS?

> WAL creation fails during region open may cause region assign forever fail
> --
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> During the first region opens on the RS, WALFactory will create a WAL file, 
> but if the wal creation fails, in some cases, HDFS will leave a empty file in 
> the dir(e.g. disk full, file is created succesfully but block allocation 
> fails). We have a check in AbstractFSWAL that if WAL belong to the same 
> factory exists, then a error will be throw. Thus, the region can never be 
> open on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
> handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory 
> hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21751) WAL creation fails during region open may cause region assign forever fail



[ 
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749585#comment-16749585
 ] 

Allan Yang commented on HBASE-21751:


[~Apache9], yes, this issue has already cause one online failure, we are making 
sure this can not happen again.

> WAL creation fails during region open may cause region assign forever fail
> --
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> During the first region opens on the RS, WALFactory will create a WAL file, 
> but if the wal creation fails, in some cases, HDFS will leave a empty file in 
> the dir(e.g. disk full, file is created succesfully but block allocation 
> fails). We have a check in AbstractFSWAL that if WAL belong to the same 
> factory exists, then a error will be throw. Thus, the region can never be 
> open on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
> handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory 
> hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HBASE-21762) Move some methods in ClusterConnection to Connection

Duo Zhang created HBASE-21762:
-

 Summary: Move some methods in ClusterConnection to Connection
 Key: HBASE-21762
 URL: https://issues.apache.org/jira/browse/HBASE-21762
 Project: HBase
  Issue Type: Task
Reporter: Duo Zhang


For example, clearRegionCache, getHbck, etc. The getHbck method will be marked 
as IA.LimitedPrivate to indicate that normal user should not use it. And I 
think this is OK, as it is not easy to use the Hbck interface directly, and the 
name 'hbck' also implies that it should not be used for normal case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21762) Move some methods in ClusterConnection to Connection



 [ 
https://issues.apache.org/jira/browse/HBASE-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21762:
--
Fix Version/s: 2.2.0
   3.0.0

> Move some methods in ClusterConnection to Connection
> 
>
> Key: HBASE-21762
> URL: https://issues.apache.org/jira/browse/HBASE-21762
> Project: HBase
>  Issue Type: Task
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
>
> For example, clearRegionCache, getHbck, etc. The getHbck method will be 
> marked as IA.LimitedPrivate to indicate that normal user should not use it. 
> And I think this is OK, as it is not easy to use the Hbck interface directly, 
> and the name 'hbck' also implies that it should not be used for normal case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21688) Address WAL filesystem issues



 [ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21688:
---
Component/s: wal
 Filesystem Integration

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, wal
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.2.patch, HBASE-21688-amend.patch, 
> HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21688) Address WAL filesystem issues



[ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749582#comment-16749582
 ] 

Nihal Jain commented on HBASE-21688:


Hi [~vrodionov] Do you plan to submit a patch for branch-2 and branch-1? If 
not, I can take it up.

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.2.patch, HBASE-21688-amend.patch, 
> HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21751) WAL creation fails during region open may cause region assign forever fail



[ 
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749574#comment-16749574
 ] 

Duo Zhang commented on HBASE-21751:
---

Is this very important to you [~allan163]? I need to revisit the code again.

IIRC, the assumption in the WAL implementation is that, if we fail to roll, 
then the RS will abort. So I'm not sure if there are other problems if we 
change the behavior.

> WAL creation fails during region open may cause region assign forever fail
> --
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> During the first region opens on the RS, WALFactory will create a WAL file, 
> but if the wal creation fails, in some cases, HDFS will leave a empty file in 
> the dir(e.g. disk full, file is created succesfully but block allocation 
> fails). We have a check in AbstractFSWAL that if WAL belong to the same 
> factory exists, then a error will be throw. Thus, the region can never be 
> open on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
> handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory 
> hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749514#comment-16749514
 ] 

Hadoop QA commented on HBASE-21735:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 26 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
1s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
10s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
42s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
34s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  8m 
22s{color} | {color:green} branch-1 passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  3m 
44s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
 9s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
45s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  4m 
31s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
31s{color} | {color:red} hbase-common: The patch generated 2 new + 13 unchanged 
- 0 fixed = 15 total (was 13) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
16s{color} | {color:red} hbase-procedure: The patch generated 2 new + 111 
unchanged - 12 fixed = 113 total (was 123) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
48s{color} | {color:red} hbase-server: The patch generated 1 new + 911 
unchanged - 43 fixed = 912 total (was 954) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  5m 
29s{color} | {color:red} root: The patch generated 5 new + 1040 unchanged - 55 
fixed = 1045 total (was 1095) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} xml {color} | {color:red}  0m  0s{color} | 
{color:red} The patch has 1 ill-formed XML file(s). {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  3m  
7s{color} | {color:blue} patch has no errors when building the reference guide. 
See footer for rendered docs, which you should manually inspect. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
59s{color} | {color:green} patch has no

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749500#comment-16749500
 ] 

Duo Zhang commented on HBASE-21754:
---

OK, +1.

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749496#comment-16749496
 ] 

Allan Yang commented on HBASE-21754:


{quote}
I haven’t seen the reply? Have you pushed the publish button?
{quote}
Strange, I have published the comments, try again?

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749490#comment-16749490
 ] 

Duo Zhang commented on HBASE-21754:
---

I haven’t seen the reply? Have you pushed the publish button?

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21750) Most of KeyValueUtil#length can be replaced by cell#getSerializedSize for better performance because the latter one has been optimized

2019-01-22 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749489#comment-16749489
 ] 

Hudson commented on HBASE-21750:


Results for branch master
[build #739 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/739/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/739//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/739//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/739//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Most of KeyValueUtil#length can be replaced by cell#getSerializedSize for 
> better performance because the latter one has been optimized
> --
>
> Key: HBASE-21750
> URL: https://issues.apache.org/jira/browse/HBASE-21750
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21750.v1.patch, HBASE-21750.v1.patch, 
> HBASE-21750.v2.patch, HBASE-21750.v3.patch, HBASE-21750.v3.patch
>
>
> After HBASE-21657, Most subclass of Cell has a cached serialized size (Except 
> those cells with tags), so I think most of the KeyValueUtil#length can be 
> replaced by cell#getSerializedSize. Such as: 
> - KeyValueUtil.length in StoreFlusher#performFlush;
> - KeyValueUtil.length in Compactor#performCompaction ; 
> and so on..
> Will prepare a patch for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749487#comment-16749487
 ] 

Allan Yang commented on HBASE-21754:


[~Apache9], thanks for the review, I replied the comments, I think we'd better 
leave it for now, if no objection, I will commit this to branch-2.0+ later 
today.

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21751) WAL creation fails during region open may cause region assign forever fail



[ 
https://issues.apache.org/jira/browse/HBASE-21751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749488#comment-16749488
 ] 

Allan Yang commented on HBASE-21751:


[~Apache9]

> WAL creation fails during region open may cause region assign forever fail
> --
>
> Key: HBASE-21751
> URL: https://issues.apache.org/jira/browse/HBASE-21751
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.2.0, 2.1.3, 2.0.5
>
> Attachments: HBASE-21751.patch, HBASE-21751v2.patch
>
>
> During the first region opens on the RS, WALFactory will create a WAL file, 
> but if the wal creation fails, in some cases, HDFS will leave a empty file in 
> the dir(e.g. disk full, file is created succesfully but block allocation 
> fails). We have a check in AbstractFSWAL that if WAL belong to the same 
> factory exists, then a error will be throw. Thus, the region can never be 
> open on this RS later.
> {code:java}
> 2019-01-17 02:15:53,320 ERROR [RS_OPEN_META-regionserver/server003:16020-0] 
> handler.OpenRegionHandler(301): Failed open of region=hbase:meta,,1.1588230740
> java.io.IOException: Target WAL already exists within directory 
> hdfs://cluster/hbase/WALs/server003.hbase.hostname.com,16020,1545269815888
> at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.(AbstractFSWAL.java:382)
> at 
> org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.(AsyncFSWAL.java:210)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:72)
> at 
> org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createWAL(AsyncFSWALProvider.java:47)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:138)
> at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:57)
> at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:264)
> at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getWAL(HRegionServer.java:2085)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:284)
> at 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
> at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21715) Do not throw UnsupportedOperationException in ProcedureFuture.get

2019-01-22 Thread Junhong Xu (JIRA)



 [ 
https://issues.apache.org/jira/browse/HBASE-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junhong Xu updated HBASE-21715:
---
Attachment: HBase-21715.v01.patch

> Do not throw UnsupportedOperationException in ProcedureFuture.get
> -
>
> Key: HBASE-21715
> URL: https://issues.apache.org/jira/browse/HBASE-21715
> Project: HBase
>  Issue Type: Task
>Reporter: Duo Zhang
>Assignee: Junhong Xu
>Priority: Major
> Attachments: HBase-21715.v01.patch
>
>
> This is really a bad practice, no one would expected that a Future does not 
> support get, and this can not be detected at compile time. Even though we do 
> not want user to wait for ever, we could set a long timeout, for example, 10 
> minutes,instead of throwing UnsuportedOperationException. I've already been 
> hurt many times...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749472#comment-16749472
 ] 

Hadoop QA commented on HBASE-21735:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 20m  
7s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 26 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  5m 
41s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
 1s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
17s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  7m 
10s{color} | {color:green} branch-1 passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  3m  
1s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
33s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
47s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
55s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
24s{color} | {color:red} hbase-common: The patch generated 2 new + 13 unchanged 
- 0 fixed = 15 total (was 13) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
15s{color} | {color:red} hbase-procedure: The patch generated 2 new + 111 
unchanged - 12 fixed = 113 total (was 123) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
27s{color} | {color:red} hbase-server: The patch generated 3 new + 911 
unchanged - 43 fixed = 914 total (was 954) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  4m 
19s{color} | {color:red} root: The patch generated 7 new + 1040 unchanged - 55 
fixed = 1047 total (was 1095) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} xml {color} | {color:red}  0m  0s{color} | 
{color:red} The patch has 1 ill-formed XML file(s). {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  2m 
37s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
26s{color} | {color:green} patch has no

[jira] [Commented] (HBASE-21744) timeout for server list refresh calls



[ 
https://issues.apache.org/jira/browse/HBASE-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749468#comment-16749468
 ] 

Hadoop QA commented on HBASE-21744:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
17s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
51s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
15s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
17s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
18s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
10m 55s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
6s{color} | {color:green} hbase-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}252m 26s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
52s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}309m 29s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestSplitTransactionOnCluster |
|   | hadoop.hbase.client.TestSnapshotTemporaryDirectoryWithRegionReplicas |
|   | hadoop.hbase.client.TestAdmin1 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21744 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12955866/HBASE-21744.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux dd36a4875067 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018

[jira] [Commented] (HBASE-21743) stateless assignment



[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749459#comment-16749459
 ] 

Duo Zhang commented on HBASE-21743:
---

You can see AMv1, this is what we did where we did not have proc-v2, and maybe 
it is something like your stateless assignment? But obviously, with years of 
stabilizing, it just becomes unmaintainable, and still have lots of bugs, like 
double assign, failed split or merge, etc...

That's why we choose proc-v2 to implement AMv2, full of blood and tears...

> stateless assignment
> 
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Running HBase for only a few weeks we found dozen(s?) of bugs with assignment 
> that all seem to have the same nature - split brain between 2 procedures; or 
> between procedure and master startup (meta replica bugs); or procedure and 
> master shutdown (HBASE-21742); or procedure and something else (when SCP had 
> incorrect region list persisted, don't recall the bug#). 
> To me, it starts to look like a pattern where, like in AMv1 where concurrent 
> interactions were unclear and hard to reason about, despite the cleaner 
> individual pieces in AMv2 the problem of unclear concurrent interactions has 
> been preserved and in fact increased because of the operation state 
> persistence and  isolation.
> Procedures are great for multi-step operations that need rollback and stuff 
> like that, e.g. creating a table or snapshot, or even region splitting. 
> However I'm not so sure about assignment. 
> We have the persisted information - region state in meta (incl transition 
> states like opening, or closing), server list as WAL directory list. 
> Procedure state is not any more reliable then those (we can argue that meta 
> update can fail, but so can procv2 WAL flush, so we have to handle cases of 
> out of date information regardless). So, we don't need any extra state to 
> decide on assignment, whether for recovery and balancing. In fact, as 
> mentioned in some bugs, deleting procv2 WAL is often the best way to recover 
> the cluster, because master can already figure out what to do without 
> additional state.
> I think there should be an option for stateless assignment that does that.
> It can either be as a separate pluggable assignment procedure; or an option 
> that will not recover SCP, RITs etc from WAL but always derive recovery 
> procedures from the existing cluster state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's an quite time consuming.) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749457#comment-16749457
 ] 

Andrew Purtell commented on HBASE-21748:


Work in progress patch. It might get all the way through tests now. Still 
running them locally

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's an quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-21748-branch-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's an quite time consuming.) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21748:
---
Attachment: HBASE-21748-branch-1.patch

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's an quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
> Attachments: HBASE-21748-branch-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HBASE-21761) Align the methods in RegionLocator and AsyncTableRegionLocator

Duo Zhang created HBASE-21761:
-

 Summary: Align the methods in RegionLocator and 
AsyncTableRegionLocator
 Key: HBASE-21761
 URL: https://issues.apache.org/jira/browse/HBASE-21761
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang


When implementing HBase-21753 I found that the getAllRegionLocations method 
will return all the replicas, not only the primary replicas. But for 
AsyncTableRegionLocator we just return the primary replicas. We should have the 
same behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21743) stateless assignment

[
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749406#comment-16749406
]

Sergey Shelukhin edited comment on HBASE-21743 at 1/23/19 2:44 AM:
---

We've been running a master snapshot. Indeed, we found that sometimes procv2
deletion can lead to additional issues, however sometimes it's also the only
way forward.
[~stack] there's no way to find out about old dead servers on restart other
than WAL directories (or inferring from stale region assignments stored in
meta), because the servers are not stored anywhere else (and ZK node is gone
for a dead server, as intended).

The basic idea is to look at list of regions (meta), look at live and dead
servers - both of which master already does - and schedule procedures from
scratch as required, instead of relying on procedure WAL for RIT/SCP/etc
(optionally i.e. with a config) .
Personally (as we've discussed years ago ;)) I would prefer to have something
like actor model where a central fast actor does this in a loop and fires off
idempotent slow actions asynchronously, but within the current paradigm I think
reducing state would provide some benefit.
Right now for every bug I file (and all those I don't file that result from
subtly-incorrect/too-aggressive manual interventions needed to address other
bugs) if master was looking at cluster state it would be trivial to resolve,
but because of the split brain problem every part of the system is waiting for
some other part with incorrect assumptions; so, the whole thing is very fragile
w.r.t. both bugs, and also manual intervention that as we know are often
necessary despite best intentions (hence hbck/offlinerepair/etc).

For example the above bug with incorrect SCP for meta server resulted because
master init is waiting for SCP to fix meta, but SCP doesn't know it needs to
fix meta because of some bug. OFC if persistent SCP didn't exist it wouldn't
have the bug in the first place, but abstractly if one actor was looking at
this he'd just see meta assigned to a dead server, and recover it just like
that. No state needed other than where meta is and the list of servers.

Then, to resolve this we had to nuke the proc WAL to get rid of the bad SCP.
Some more SCPs for some servers got lost in the nuke, and we had some regions
CLOSING on dead servers that have neither SCP nor WAL directory. Again, looking
from a unified perspective we can see - woops, region closing on the server,
server has no WALs to split - just count it as closed. Whereas now close region
procedure is not responsible for this, it just waits for SCP to deal with the
server. But there's no SCP because there's no WAL directory. So, nobody looks
at these two together... so after this manual intervention (or for example
imagine there was an HDFS issue, and the WAL write did not succeed) cluster is
broken and I have to go and fix those regions.

Now I go to meta and set regions to CLOSED (pretend I'm actually hbck2). If
assignment was stateless, master would see closed regions and assign them.
Whereas now confirm-close retry loop is well-isolated so it doesn't care about
anything in the world and just blindly resets them back to CLOSING, so I have
to additionally kill -9 the master to make sure that stupid RITs go away and on
restart master actually recovers the region.

Luckily when recovered RIT procedures in this case see CLOSED region with empty
server, they just silently go away (which might technically be a bug but it
works for me ;)); I've seen other cases where some procedure sees region in an
unexpected state (due to a race condition) it either fails master (as with meta
replicas) or updates it to some other state, resulting in a strange state.

This is just one example. And on all 3.5 steps the persistent procedure is 100%
unnecessary, because master has all the information to make correct decisions.
As long as it's done in a sane way like with hybrid actor model without its own
persistent state...

was (Author: sershe):
We've been running a master snapshot. Indeed, we found that sometimes procv2
deletion can lead to additional issues, however sometimes it's also the only
way forward.
[~stack] there's no way to find out about old dead servers on restart other
than WAL directories (or inferring from stale region assignments stored in
meta), because the servers are not stored anywhere else (and ZK node is gone
for a dead server, as intended).

The basic idea is to look at list of regions (meta), look at live and dead
servers - both of which master already does - and schedule procedures from
scratch as required, instead of relying on procedure WAL.
Personally (as we've discussed years ago ;)) I would prefer to have something
like actor model where a central fast actor does this in a loop and fires off
idempotent slow actions

[jira] [Commented] (HBASE-20952) Re-visit the WAL API

2019-01-22 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749405#comment-16749405
 ] 

Hudson commented on HBASE-20952:


Results for branch HBASE-20952
[build #66 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/66/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/66//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/66//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/66//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Re-visit the WAL API
> 
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
> Attachments: 20952.v1.txt
>
>
> Take a step back from the current WAL implementations and think about what an 
> HBase WAL API should look like. What are the primitive calls that we require 
> to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We 
> should also have a mind for what is happening in the Ratis LogService (but 
> the LogService should not dictate what HBase's WAL API looks like RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and 
> backup Replication has the use-case for "tail"'ing the WAL which we 
> should provide via our new API. B doesn't do anything fancy (IIRC). We 
> should make sure all consumers are generally going to be OK with the API we 
> create.
> The API may be "OK" (or OK in a part). We need to also consider other methods 
> which were "bolted" on such as {{AbstractFSWAL}} and 
> {{WALFileLengthProvider}}. Other corners of "WAL use" (like the 
> {{WALSplitter}} should also be looked at to use WAL-APIs only).
> We also need to make sure that adequate interface audience and stability 
> annotations are chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21743) stateless assignment



[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749406#comment-16749406
 ] 

Sergey Shelukhin commented on HBASE-21743:
--

We've been running a master snapshot. Indeed, we found that sometimes procv2 
deletion can lead to additional issues, however sometimes it's also the only 
way forward. 
[~stack] there's no way to find out about old dead servers on restart other 
than WAL directories (or inferring from stale region assignments stored in 
meta), because the servers are not stored anywhere else (and ZK node is gone 
for a dead server, as intended).

The basic idea is to look at list of regions (meta), look at live and dead 
servers - both of which master already does - and schedule procedures from 
scratch as required, instead of relying on procedure WAL. 
Personally (as we've discussed years ago ;)) I would prefer to have something 
like actor model where a central fast actor does this in a loop and fires off 
idempotent slow actions asynchronously, but within the current paradigm I think 
 reducing state (optionally i.e. with a config) would provide some benefit. 
Right now for every bug I file (and all those I don't file that result from 
subtly-incorrect/too-aggressive manual interventions needed to address other 
bugs)  if master was looking at cluster state it would be trivial to resolve, 
but because of the split brain problem every part of the system is waiting for 
some other part with incorrect assumptions; so, the whole thing is very fragile 
w.r.t. both bugs, and also manual intervention that as we know are often 
necessary despite best intentions (hence hbck/offlinerepair/etc).

For example the above bug with incorrect SCP for meta server resulted because 
master init is waiting for SCP to fix meta, but SCP doesn't know it needs to 
fix meta because of some bug. OFC if persistent SCP didn't exist it wouldn't 
have the bug in the first place, but abstractly if one actor was looking at 
this he'd just see meta assigned to a dead server, and recover it just like 
that. No state needed other than where meta is and the list of servers.

Then, to resolve this we had to nuke the proc WAL to get rid of the bad SCP. 
Some more SCPs for some servers got lost in the nuke, and we had some regions 
CLOSING on dead servers that have neither SCP nor WAL directory. Again, looking 
from a unified perspective we can see - woops, region closing on the server, 
server has no WALs to split - just count it as closed. Whereas now close region 
procedure is not responsible for this, it just waits for SCP to deal with the 
server. But there's no SCP because there's no WAL directory. So, nobody looks 
at these two together... so after this manual intervention (or for example 
imagine there was an HDFS issue, and the WAL write did not succeed) cluster is 
broken and I have to go and fix those regions.

Now I go to meta and set regions to CLOSED (pretend I'm actually hbck2). If 
assignment was stateless, master would see closed regions and assign them. 
Whereas now confirm-close retry loop is well-isolated so it doesn't care about 
anything in the world and just blindly resets them back to CLOSING, so I have 
to additionally kill -9 the master to make sure that stupid RITs go away and on 
restart master actually recovers the region.

Luckily when recovered RIT procedures in this case see CLOSED region with empty 
server, they just silently go away (which might technically be a bug but it 
works for me ;)); I've seen other cases where some procedure sees region in an 
unexpected state (due to a race condition) it either fails master (as with meta 
replicas) or updates it to some other state, resulting in a strange state.

This is just one example. And on all 3.5 steps the persistent procedure is 100% 
unnecessary, because master has all the information to make correct decisions. 
As long as it's done in a sane way like with hybrid actor model without its own 
persistent state...



> stateless assignment
> 
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Running HBase for only a few weeks we found dozen(s?) of bugs with assignment 
> that all seem to have the same nature - split brain between 2 procedures; or 
> between procedure and master startup (meta replica bugs); or procedure and 
> master shutdown (HBASE-21742); or procedure and something else (when SCP had 
> incorrect region list persisted, don't recall the bug#). 
> To me, it starts to look like a pattern where, like in AMv1 where concurrent 
> interactions were unclear and hard to reason about, despite the cleaner 
> individual pieces in AMv2 the problem of unclear concurrent interactions has 
> been preserved and in fact

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749404#comment-16749404
 ] 

Duo Zhang commented on HBASE-21754:
---

Overall LGTM. Left a few comments on RB, if they are not problems then +1.

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749403#comment-16749403
 ] 

Hadoop QA commented on HBASE-21742:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
31s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:orange}-0{color} | {color:orange} test4tests {color} | {color:orange}  
0m  0s{color} | {color:orange} The patch doesn't appear to include any new or 
modified tests. Please justify why no new tests are needed for this patch. Also 
please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
25s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  6m 
52s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
49s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
15m  2s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}153m 44s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}213m 50s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21742 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12955872/HBASE-21742.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 650223cad0d3 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 35ed5d6c39 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15687/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results |

[jira] [Commented] (HBASE-21720) metric to measure how actions are distributed to servers within a MultiAction



[ 
https://issues.apache.org/jira/browse/HBASE-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749402#comment-16749402
 ] 

Hadoop QA commented on HBASE-21720:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
18s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
43s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
44s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
31s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green} hbase-client: The patch generated 0 new + 8 
unchanged - 39 fixed = 8 total (was 47) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
13s{color} | {color:green} The patch passed checkstyle in hbase-server {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
33s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 38s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
11s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}131m 
24s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
49s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}182m 11s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21720 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12955874/HBASE-21720.master.006.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux cb4a3f46797d 4.4.0-131-generic #157~14.04.1-Ubuntu SMP Fri Jul 
13 08:53:17 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality |

[jira] [Commented] (HBASE-21561) Backport HBASE-21413 (Empty meta log doesn't get split when restart whole cluster) to branch-1

2019-01-22 Thread Xu Cang (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749382#comment-16749382
 ] 

Xu Cang commented on HBASE-21561:
-

thank you [~apurtell] for committing it.

[~allan163] I don't think it's not that "one API" straightforward to port this. 
Because ProcedureExecutor.java in branch-1 and branch-2 are quite different. 
For example, 'procedures' var in branch-2's type is "ConcurrentHashMap>" ,whereas in branch-1 it is "ConcurrentHashMap"

And "completed" var are also different, in branch-1 we only maintain 
"ProcedureInfo" , branch-2 maintains "Procedure" in the completed map. 
So, I decided to take it out and assess the best solution there. (from both 
correctness and performance aspects)

> Backport HBASE-21413 (Empty meta log doesn't get split when restart whole 
> cluster) to branch-1
> --
>
> Key: HBASE-21561
> URL: https://issues.apache.org/jira/browse/HBASE-21561
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Xu Cang
>Priority: Minor
> Fix For: 1.5.0, 1.4.10, 1.3.4
>
> Attachments: HBASE-21561.branch-1.001.patch, 
> HBASE-21561.branch-1.002.patch, HBASE-21561.branch1.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's an quite time consuming.) to branch-1

2019-01-22 Thread Zheng Hu (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749389#comment-16749389
 ] 

Zheng Hu commented on HBASE-21748:
--

[~apurtell], Thanks for you work. Mind share your patch ? I'll take a look, 
have not see  how to fix in the branch-1 yet. 

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's an quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Assigned] (HBASE-21760) Unify the upper/lower case for title



 [ 
https://issues.apache.org/jira/browse/HBASE-21760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang reassigned HBASE-21760:
-

Assignee: xubo245

> Unify the upper/lower case for title
> 
>
> Key: HBASE-21760
> URL: https://issues.apache.org/jira/browse/HBASE-21760
> Project: HBase
>  Issue Type: Bug
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: hbase.png
>
>
> Unify the upper/lower case for title



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749380#comment-16749380
 ] 

Duo Zhang commented on HBASE-21754:
---

+1 on the general approach. Let me take a look at the patch.

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21754) ReportRegionStateTransitionRequest should be executed in priority executor



[ 
https://issues.apache.org/jira/browse/HBASE-21754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749378#comment-16749378
 ] 

Allan Yang commented on HBASE-21754:


[~Apache9] any comments on the patch? QA passed

> ReportRegionStateTransitionRequest should be executed in priority executor
> --
>
> Key: HBASE-21754
> URL: https://issues.apache.org/jira/browse/HBASE-21754
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.2, 2.0.4
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21754.patch, HBASE-21754v2.patch
>
>
> Now, ReportRegionStateTransitionRequest is executed in default handler, only 
> region of system table is executed in priority handler. That is because we 
> have only two kinds of handler default and priority in master(replication 
> handler is for replication specifically), if the transition report for all 
> region is executed in priority handler, there is a dead lock situation that 
> other regions' transition report take all handler and need to update meta, 
> but meta region is not able to report online since all handler is 
> taken(addressed in the comments of MasterAnnotationReadingPriorityFunction).
> But there is another dead lock case that user's DDL requests (or other sync 
> op like moveregion) take over all default handlers, making region transition 
> report is not possible, thus those sync ops can't complete either. A simple 
> UT provided in the patch shows this case.
> To resolve this problem, I added a new metaTransitionExecutor to execute meta 
> region transition report only, and all the other region's report are executed 
> in priority handlers, separating them from user's requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749377#comment-16749377
 ] 

Sergey Shelukhin commented on HBASE-21742:
--

Well, there are other shutdown activities... rather than examine every one to 
make sure it doesn't affect state I wanted to put shutdown in the beginning.
As for a test for this one, I don't think master shutdown is very amenable to 
unit testing... we've seen this issue on the cluster.

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's an quite time consuming.) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749375#comment-16749375
 ] 

Andrew Purtell commented on HBASE-21748:


A patch for branch-1 is very simple, just a few lines of diff. I'm testing it, 
will attach once I can confirm tests pass. However I don't think branch-1 has 
the same exposure to the perf problem. I'm not sure how to demonstrate a 
benefit. I will try some simple benchmarking but may need to resort to JMH to 
quantify it to any degree of certainty. Fortunately the change is simple, easy 
to reason about, and so no harm to include regardless (IMHO).

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's an quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749372#comment-16749372
 ] 

Allan Yang commented on HBASE-21742:


{quote}
 as per above "Stopping assigment manager", AM state including region map got 
cleared.
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.
{quote}
Make sense to me, but can you provide a UT to reproduce this before providing a 
fix?
Other thing is that procedureStore.stop() is already called during master 
shutdown, but indeed it is after assignmentManger shutdown, so the region 
states may not right after regionsates cleared. Adjusting their order may fix it
in stopServiceThreads() of HMaster:
{code}
if (this.assignmentManager != null) this.assignmentManager.stop();
stopProcedureExecutor();
{code}


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21760) Unify the upper/lower case for title

2019-01-22 Thread xubo245 (JIRA)



 [ 
https://issues.apache.org/jira/browse/HBASE-21760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated HBASE-21760:

 Attachment: hbase.png
Description: 
Unify the upper/lower case for title



  was:Unify the upper/lower case for title


> Unify the upper/lower case for title
> 
>
> Key: HBASE-21760
> URL: https://issues.apache.org/jira/browse/HBASE-21760
> Project: HBase
>  Issue Type: Bug
>Reporter: xubo245
>Priority: Major
> Attachments: hbase.png
>
>
> Unify the upper/lower case for title



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (HBASE-21760) Unify the upper/lower case for title

2019-01-22 Thread xubo245 (JIRA)

xubo245 created HBASE-21760:
---

 Summary: Unify the upper/lower case for title
 Key: HBASE-21760
 URL: https://issues.apache.org/jira/browse/HBASE-21760
 Project: HBase
  Issue Type: Bug
Reporter: xubo245


Unify the upper/lower case for title



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21688) Address WAL filesystem issues



 [ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Rodionov updated HBASE-21688:
--
Attachment: HBASE-21688-amend.2.patch

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.2.patch, HBASE-21688-amend.patch, 
> HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21688) Address WAL filesystem issues



[ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749365#comment-16749365
 ] 

Vladimir Rodionov commented on HBASE-21688:
---

Yeah, my bad. Attached updated patch.

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.2.patch, HBASE-21688-amend.patch, 
> HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Issue Comment Deleted] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21735:
---
Comment: was deleted

(was: | (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 25 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
53s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
41s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
18s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  6m 
56s{color} | {color:green} branch-1 passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  2m 
55s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
29s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
49s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
57s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
24s{color} | {color:red} hbase-common: The patch generated 2 new + 13 unchanged 
- 0 fixed = 15 total (was 13) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
15s{color} | {color:red} hbase-procedure: The patch generated 2 new + 111 
unchanged - 12 fixed = 113 total (was 123) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
18s{color} | {color:red} hbase-server: The patch generated 3 new + 905 
unchanged - 41 fixed = 908 total (was 946) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  4m 
28s{color} | {color:red} root: The patch generated 7 new + 1034 unchanged - 53 
fixed = 1041 total (was 1087) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} xml {color} | {color:red}  0m  0s{color} | 
{color:red} The patch has 1 ill-formed XML file(s). {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  2m 
32s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
25s{color} | {color:green} patch has no

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749359#comment-16749359
 ] 

Andrew Purtell commented on HBASE-21735:


Fix three checkstyle nits

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21735:
---
Attachment: HBASE-21735-branch-1.patch

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21575) memstore above high watermark message is logged too much



[ 
https://issues.apache.org/jira/browse/HBASE-21575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749347#comment-16749347
 ] 

Andrew Purtell commented on HBASE-21575:


Thanks [~sershe]!

> memstore above high watermark message is logged too much
> 
>
> Key: HBASE-21575
> URL: https://issues.apache.org/jira/browse/HBASE-21575
> Project: HBase
>  Issue Type: Bug
>  Components: logging, regionserver
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.2.0
>
> Attachments: HBASE-21575.01.patch, HBASE-21575.patch
>
>
> 100s of Mb of logs like this, in a tight loop:
> {noformat}
> 2018-12-08 10:27:00,462 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3646ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,467 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3651ms
> 2018-12-08 10:27:00,469 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3653ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,473 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3657ms
> 2018-12-08 10:27:00,474 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3658ms
> 2018-12-08 10:27:00,475 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3659ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,477 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is

[jira] [Updated] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21735:
---
Status: Patch Available  (was: Open)

Resubmit for HadoopQA

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21735:
---
Status: Open  (was: Patch Available)

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749355#comment-16749355
 ] 

Hadoop QA commented on HBASE-21735:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 25 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
53s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
41s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
18s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  6m 
56s{color} | {color:green} branch-1 passed {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  2m 
55s{color} | {color:blue} branch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
29s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
49s{color} | {color:green} branch-1 passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m 
57s{color} | {color:green} branch-1 passed with JDK v1.7.0_201 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed with JDK v1.8.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed with JDK v1.7.0_201 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
19s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
24s{color} | {color:red} hbase-common: The patch generated 2 new + 13 unchanged 
- 0 fixed = 15 total (was 13) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
15s{color} | {color:red} hbase-procedure: The patch generated 2 new + 111 
unchanged - 12 fixed = 113 total (was 123) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
18s{color} | {color:red} hbase-server: The patch generated 3 new + 905 
unchanged - 41 fixed = 908 total (was 946) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  4m 
28s{color} | {color:red} root: The patch generated 7 new + 1034 unchanged - 53 
fixed = 1041 total (was 1087) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} xml {color} | {color:red}  0m  0s{color} | 
{color:red} The patch has 1 ill-formed XML file(s). {color} |
| {color:blue}0{color} | {color:blue} refguide {color} | {color:blue}  2m 
32s{color} | {color:blue} patch has no errors when building the reference 
guide. See footer for rendered docs, which you should manually inspect. {color} 
|
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
25s{color} | {color:green} patch has no

[jira] [Commented] (HBASE-21575) memstore above high watermark message is logged too much



[ 
https://issues.apache.org/jira/browse/HBASE-21575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749343#comment-16749343
 ] 

Sergey Shelukhin commented on HBASE-21575:
--

Committed to branch-2 and branch-1

> memstore above high watermark message is logged too much
> 
>
> Key: HBASE-21575
> URL: https://issues.apache.org/jira/browse/HBASE-21575
> Project: HBase
>  Issue Type: Bug
>  Components: logging, regionserver
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.2.0
>
> Attachments: HBASE-21575.01.patch, HBASE-21575.patch
>
>
> 100s of Mb of logs like this, in a tight loop:
> {noformat}
> 2018-12-08 10:27:00,462 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3646ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,467 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3651ms
> 2018-12-08 10:27:00,469 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3653ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,473 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3657ms
> 2018-12-08 10:27:00,474 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3658ms
> 2018-12-08 10:27:00,475 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3659ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,477 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
>

[jira] [Updated] (HBASE-21575) memstore above high watermark message is logged too much



 [ 
https://issues.apache.org/jira/browse/HBASE-21575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21575:
-
Fix Version/s: 2.2.0
   1.5.0

> memstore above high watermark message is logged too much
> 
>
> Key: HBASE-21575
> URL: https://issues.apache.org/jira/browse/HBASE-21575
> Project: HBase
>  Issue Type: Bug
>  Components: logging, regionserver
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.2.0
>
> Attachments: HBASE-21575.01.patch, HBASE-21575.patch
>
>
> 100s of Mb of logs like this, in a tight loop:
> {noformat}
> 2018-12-08 10:27:00,462 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3646ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,463 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3647ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,464 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3648ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,465 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3649ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,466 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3650ms
> 2018-12-08 10:27:00,467 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3651ms
> 2018-12-08 10:27:00,469 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3653ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,470 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3654ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,471 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3655ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,472 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3656ms
> 2018-12-08 10:27:00,473 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3657ms
> 2018-12-08 10:27:00,474 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3658ms
> 2018-12-08 10:27:00,475 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3659ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,476 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above high water mark and block 
> 3660ms
> 2018-12-08 10:27:00,477 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=2,queue=2,port=17020] 
> regionserver.MemStoreFlusher: Memstore is above

[jira] [Commented] (HBASE-21688) Address WAL filesystem issues



[ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749339#comment-16749339
 ] 

Nihal Jain commented on HBASE-21688:


{code}
final FileSystem fs = FSUtils.getCurrentFileSystem(conf);
{code}
Don't we need to change this to wal fs?

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.patch, HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21755) RS aborts while performing replication with wal dir on hdfs, root dir on s3



[ 
https://issues.apache.org/jira/browse/HBASE-21755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749336#comment-16749336
 ] 

Vladimir Rodionov commented on HBASE-21755:
---

HBASE-21688 amendment is ready for review. [~nihaljain.cs], [~busbey], [~zyork].

> RS aborts while performing replication with wal dir on hdfs, root dir on s3
> ---
>
> Key: HBASE-21755
> URL: https://issues.apache.org/jira/browse/HBASE-21755
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, Replication, wal
>Affects Versions: 1.5.0, 2.1.3
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Critical
>  Labels: s3
>
> *Environment/Configuration*
>  - _hbase.wal.dir_ : Configured to be on hdfs
>  - _hbase.rootdir_ : Configured to be on s3
> In replication scenario, while trying to get archived log dir (using method 
> [WALEntryStream.java#L314|https://github.com/apache/hbase/blob/da92b3e0061a7c67aa9a3e403d68f3b56bf59370/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L314])
>  we get the following exception:
> {code:java}
> 2019-01-21 17:43:55,440 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.ReplicationSource: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  
> currentPath=hdfs://dummy_path/hbase/WALs/host2,2,1548063439555/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1634)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:465)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1742)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.getArchivedLog(WALEntryStream.java:319)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.resetReader(WALEntryStream.java:404)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.reset(WALEntryStream.java:161)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:148)
> 2019-01-21 17:43:55,444 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.HRegionServer: * ABORTING region server 
> host2,2,1548063439555: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  *
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1634)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:465)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1742)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.getArchivedLog(WALEntryStream.java:319)
>   at 
>

[jira] [Updated] (HBASE-21688) Address WAL filesystem issues



 [ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Rodionov updated HBASE-21688:
--
Attachment: HBASE-21688-amend.patch

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-amend.patch, HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21755) RS aborts while performing replication with wal dir on hdfs, root dir on s3



[ 
https://issues.apache.org/jira/browse/HBASE-21755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749314#comment-16749314
 ] 

Vladimir Rodionov commented on HBASE-21755:
---

Yes, this happened because I did search on HConstants.HREGION_LOGDIR_NAME only. 
Let me amend HBASE-21688.

> RS aborts while performing replication with wal dir on hdfs, root dir on s3
> ---
>
> Key: HBASE-21755
> URL: https://issues.apache.org/jira/browse/HBASE-21755
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, Replication, wal
>Affects Versions: 1.5.0, 2.1.3
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Critical
>  Labels: s3
>
> *Environment/Configuration*
>  - _hbase.wal.dir_ : Configured to be on hdfs
>  - _hbase.rootdir_ : Configured to be on s3
> In replication scenario, while trying to get archived log dir (using method 
> [WALEntryStream.java#L314|https://github.com/apache/hbase/blob/da92b3e0061a7c67aa9a3e403d68f3b56bf59370/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L314])
>  we get the following exception:
> {code:java}
> 2019-01-21 17:43:55,440 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.ReplicationSource: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  
> currentPath=hdfs://dummy_path/hbase/WALs/host2,2,1548063439555/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1634)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:465)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1742)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.getArchivedLog(WALEntryStream.java:319)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.resetReader(WALEntryStream.java:404)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.reset(WALEntryStream.java:161)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:148)
> 2019-01-21 17:43:55,444 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.HRegionServer: * ABORTING region server 
> host2,2,1548063439555: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  *
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1634)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:465)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1742)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.getArchivedLog(WALEntryStream.java:319)
>   at 
>

[jira] [Reopened] (HBASE-21688) Address WAL filesystem issues



 [ 
https://issues.apache.org/jira/browse/HBASE-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Rodionov reopened HBASE-21688:
---

Opened for amendment.

> Address WAL filesystem issues
> -
>
> Key: HBASE-21688
> URL: https://issues.apache.org/jira/browse/HBASE-21688
> Project: HBase
>  Issue Type: Bug
>Reporter: Vladimir Rodionov
>Assignee: Vladimir Rodionov
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21688-v1.patch
>
>
> Scan and fix code base to use new way of instantiating WAL File System. 
> https://issues.apache.org/jira/browse/HBASE-21457?focusedCommentId=16734688=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16734688



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Resolved] (HBASE-21576) master should proactively reassign meta when killing a RS with it



 [ 
https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21576.
--
Resolution: Not A Problem

> master should proactively reassign meta when killing a RS with it
> -
>
> Key: HBASE-21576
> URL: https://issues.apache.org/jira/browse/HBASE-21576
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Master has killed an RS that was hosting meta due to some HDFS issue (most 
> likely; I've lost the RS logs due to HBASE-21575).
> RS took a very long time to die (again, might be a separate bug, I'll file if 
> I see repro), and a long time to restart; meanwhile master never tried to 
> reassign meta, and eventually killed itself not being able to update it.
> It seems like a RS on a bad machine would be especially prone to slow 
> abort/startup, as well as to issues causing master to kill it, so it would 
> make sense for master to immediately relocate meta once meta-hosting RS is 
> dead after a kill; or even when killing the RS. In the former case (if the RS 
> needs to die for meta to be reassigned safely), perhaps the RS hosting meta 
> in particular should try to die fast in such circumstances, and not do any 
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] 
> master.MasterRpcServices: ,17020,1544264858183 reported a fatal 
> error:
> * ABORTING region server ,17020,1544264858183: Replay of WAL 
> required. Forcing server shutdown *
>  [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call 
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> ,17020,1544264858183 aborting, details=row '...' on table 
> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=,17020,1544264858183, seqNum=-1
> ... [starting for ~5]
> 2018-12-08 04:59:58,574 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] 
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, 
> started=392702 ms ago, cancelled=false, msg=Call to  failed on 
> connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: 
> connection timed out: , details=row '...' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=,17020,1544264858183, 
> seqNum=-1
> ... [re-initializing for at least ~7]
> 2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] 
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, 
> started=41137 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
> ,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: 
> * ABORTING master ...,17000,1544230401860: FAILED persisting region=... 
> state=OPEN *^M
> {noformat}
> There are no signs of meta assignment activity at all in master logs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21576) master should proactively reassign meta when killing a RS with it



[ 
https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749312#comment-16749312
 ] 

Sergey Shelukhin commented on HBASE-21576:
--

I filed a separate bug somewhere, RS aborting due to DroppedSnapshot actually 
does try to close regions, that is the part that took a long time.
HBASE-21577 should address this at some point.
Also we've seen an issue once where for whatever reason master didn't detect 
that RS died via ZK node until some other RS also died (ZK notification was 
lost somehow?)... I filed HBASE-21744 to mitigate that.

One thing I found since then is that master's "aborting RS" message is actually 
purely informational, RS sends a message saying it's going to die and master 
logs it. So this issue is not really relevant, because indeed master would have 
to wait for SCP to do recovery (I was assuming master could delay the death of 
the RS and move meta first, then let RS proceed). 

> master should proactively reassign meta when killing a RS with it
> -
>
> Key: HBASE-21576
> URL: https://issues.apache.org/jira/browse/HBASE-21576
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Master has killed an RS that was hosting meta due to some HDFS issue (most 
> likely; I've lost the RS logs due to HBASE-21575).
> RS took a very long time to die (again, might be a separate bug, I'll file if 
> I see repro), and a long time to restart; meanwhile master never tried to 
> reassign meta, and eventually killed itself not being able to update it.
> It seems like a RS on a bad machine would be especially prone to slow 
> abort/startup, as well as to issues causing master to kill it, so it would 
> make sense for master to immediately relocate meta once meta-hosting RS is 
> dead after a kill; or even when killing the RS. In the former case (if the RS 
> needs to die for meta to be reassigned safely), perhaps the RS hosting meta 
> in particular should try to die fast in such circumstances, and not do any 
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] 
> master.MasterRpcServices: ,17020,1544264858183 reported a fatal 
> error:
> * ABORTING region server ,17020,1544264858183: Replay of WAL 
> required. Forcing server shutdown *
>  [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call 
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> ,17020,1544264858183 aborting, details=row '...' on table 
> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=,17020,1544264858183, seqNum=-1
> ... [starting for ~5]
> 2018-12-08 04:59:58,574 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] 
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, 
> started=392702 ms ago, cancelled=false, msg=Call to  failed on 
> connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: 
> connection timed out: , details=row '...' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=,17020,1544264858183, 
> seqNum=-1
> ... [re-initializing for at least ~7]
> 2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] 
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, 
> started=41137 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
> ,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: 
> * ABORTING master ...,17000,1544230401860: FAILED persisting region=... 
> state=OPEN *^M
> {noformat}
> There are no signs of meta assignment activity at all in master logs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21755) RS aborts while performing replication with wal dir on hdfs, root dir on s3



[ 
https://issues.apache.org/jira/browse/HBASE-21755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749307#comment-16749307
 ] 

Nihal Jain edited comment on HBASE-21755 at 1/22/19 11:57 PM:
--

All branches, including master still has the following (See 
[AbstractFSWALProvider.java#L434|https://github.com/apache/hbase/blob/fa3946fbeaaffd6acfbd8530e22f85e0bf3321eb/hbase-server/src/main/java/org/apache/hadoop/hbase/wal/AbstractFSWALProvider.java#L434]):
{code:java}
  public static Path getArchivedLogPath(Path path, Configuration conf) throws 
IOException {
Path rootDir = FSUtils.getRootDir(conf);  <= SHOULD BE WAL ROOT DIR
Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
  ServerName serverName = getServerNameFromWALDirectoryName(path);
  if (serverName == null) {
LOG.error("Couldn't locate log: " + path);
return path;
  }
  oldLogDir = new Path(oldLogDir, serverName.getServerName());
}
Path archivedLogLocation = new Path(oldLogDir, path.getName());
final FileSystem fs = FSUtils.getCurrentFileSystem(conf); <= SHOULD BE 
WAL FS

if (fs.exists(archivedLogLocation)) {
  LOG.info("Log " + path + " was moved to " + archivedLogLocation);
  return archivedLogLocation;
} else {
  LOG.error("Couldn't locate log: " + path);
  return path;
}
  }
{code}
HBASE-21688 somehow missed it.


was (Author: nihaljain.cs):
All branches, including master still has the following:
{code:java}
  public static Path getArchivedLogPath(Path path, Configuration conf) throws 
IOException {
Path rootDir = FSUtils.getRootDir(conf);  <= SHOULD BE WAL ROOT DIR
Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
  ServerName serverName = getServerNameFromWALDirectoryName(path);
  if (serverName == null) {
LOG.error("Couldn't locate log: " + path);
return path;
  }
  oldLogDir = new Path(oldLogDir, serverName.getServerName());
}
Path archivedLogLocation = new Path(oldLogDir, path.getName());
final FileSystem fs = FSUtils.getCurrentFileSystem(conf); <= SHOULD BE 
WAL FS

if (fs.exists(archivedLogLocation)) {
  LOG.info("Log " + path + " was moved to " + archivedLogLocation);
  return archivedLogLocation;
} else {
  LOG.error("Couldn't locate log: " + path);
  return path;
}
  }
{code}
HBASE-21688 somehow missed it.

> RS aborts while performing replication with wal dir on hdfs, root dir on s3
> ---
>
> Key: HBASE-21755
> URL: https://issues.apache.org/jira/browse/HBASE-21755
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, Replication, wal
>Affects Versions: 1.5.0, 2.1.3
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Critical
>  Labels: s3
>
> *Environment/Configuration*
>  - _hbase.wal.dir_ : Configured to be on hdfs
>  - _hbase.rootdir_ : Configured to be on s3
> In replication scenario, while trying to get archived log dir (using method 
> [WALEntryStream.java#L314|https://github.com/apache/hbase/blob/da92b3e0061a7c67aa9a3e403d68f3b56bf59370/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L314])
>  we get the following exception:
> {code:java}
> 2019-01-21 17:43:55,440 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.ReplicationSource: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  
> currentPath=hdfs://dummy_path/hbase/WALs/host2,2,1548063439555/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
>

[jira] [Commented] (HBASE-21744) timeout for server list refresh calls



[ 
https://issues.apache.org/jira/browse/HBASE-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749309#comment-16749309
 ] 

Sergey Shelukhin commented on HBASE-21744:
--

If the refresh is requested immediately (e.g. due to rapid changes in ZK) then 
it should be done immediately, and if it's not requested the wait() call will 
wait for the max timeout, so it shouldn't be necessary to sleep. Current code 
will queue refresh calls on every change with no delay, too.

Nanosec is used because it's the only reliable way to measure intervals, 
currentMillis is affected by ntpd and such and can produce artifacts, including 
unchecked exceptions from sleep/wait/etc. because time went backwards. Actually 
I should file a bug to fix it in other places, I can see many places in the 
code where currentMillis is used for intervals. Hopefully nothing critical is 
affected by this.

I'll look to see if a test can be added.

> timeout for server list refresh calls 
> --
>
> Key: HBASE-21744
> URL: https://issues.apache.org/jira/browse/HBASE-21744
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21744.patch
>
>
> Not sure why yet, but we are seeing the case when cluster is in overall a bad 
> state, where after RS dies and deletes its znode, the notification looks like 
> it's lost, so the master doesn't detect the failure. ZK itself appears to be 
> healthy and doesn't report anything special.
> After some other change is made to the server list, master rescans the list 
> and picks up the stale change. Might make sense to add a config that would 
> trigger the refresh if it hasn't happened for a while (e.g. 1 minute).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21755) RS aborts while performing replication with wal dir on hdfs, root dir on s3



[ 
https://issues.apache.org/jira/browse/HBASE-21755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749307#comment-16749307
 ] 

Nihal Jain commented on HBASE-21755:


All branches, including master still has the following:
{code:java}
  public static Path getArchivedLogPath(Path path, Configuration conf) throws 
IOException {
Path rootDir = FSUtils.getRootDir(conf);  <= SHOULD BE WAL ROOT DIR
Path oldLogDir = new Path(rootDir, HConstants.HREGION_OLDLOGDIR_NAME);
if (conf.getBoolean(SEPARATE_OLDLOGDIR, DEFAULT_SEPARATE_OLDLOGDIR)) {
  ServerName serverName = getServerNameFromWALDirectoryName(path);
  if (serverName == null) {
LOG.error("Couldn't locate log: " + path);
return path;
  }
  oldLogDir = new Path(oldLogDir, serverName.getServerName());
}
Path archivedLogLocation = new Path(oldLogDir, path.getName());
final FileSystem fs = FSUtils.getCurrentFileSystem(conf); <= SHOULD BE 
WAL FS

if (fs.exists(archivedLogLocation)) {
  LOG.info("Log " + path + " was moved to " + archivedLogLocation);
  return archivedLogLocation;
} else {
  LOG.error("Couldn't locate log: " + path);
  return path;
}
  }
{code}
HBASE-21688 somehow missed it.

> RS aborts while performing replication with wal dir on hdfs, root dir on s3
> ---
>
> Key: HBASE-21755
> URL: https://issues.apache.org/jira/browse/HBASE-21755
> Project: HBase
>  Issue Type: Bug
>  Components: Filesystem Integration, Replication, wal
>Affects Versions: 1.5.0, 2.1.3
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Critical
>  Labels: s3
>
> *Environment/Configuration*
>  - _hbase.wal.dir_ : Configured to be on hdfs
>  - _hbase.rootdir_ : Configured to be on s3
> In replication scenario, while trying to get archived log dir (using method 
> [WALEntryStream.java#L314|https://github.com/apache/hbase/blob/da92b3e0061a7c67aa9a3e403d68f3b56bf59370/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L314])
>  we get the following exception:
> {code:java}
> 2019-01-21 17:43:55,440 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.ReplicationSource: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  
> currentPath=hdfs://dummy_path/hbase/WALs/host2,2,1548063439555/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594
> java.lang.IllegalArgumentException: Wrong FS: 
> s3a://xx/hbase128/oldWALs/host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1.1548063492594,
>  expected: hdfs://dummy_path
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:246)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1622)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1619)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1634)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:465)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1742)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.getArchivedLog(WALEntryStream.java:319)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.resetReader(WALEntryStream.java:404)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.reset(WALEntryStream.java:161)
>   at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:148)
> 2019-01-21 17:43:55,444 ERROR 
> [RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2]
>  regionserver.HRegionServer: * ABORTING region server 
> host2,2,1548063439555: Unexpected exception in 
> RS_REFRESH_PEER-regionserver/host2:2-1.replicationSource,2.replicationSource.wal-reader.host2%2C2%2C1548063439555.host2%2C2%2C1548063439555.regiongroup-1,2
>  *
> java.lang.IllegalArgumentException: Wrong FS: 
>

[jira] [Commented] (HBASE-21750) Most of KeyValueUtil#length can be replaced by cell#getSerializedSize for better performance because the latter one has been optimized

2019-01-22 Thread Hudson (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749297#comment-16749297
 ] 

Hudson commented on HBASE-21750:


Results for branch branch-2
[build #1628 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1628/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1628//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1628//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1628//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Most of KeyValueUtil#length can be replaced by cell#getSerializedSize for 
> better performance because the latter one has been optimized
> --
>
> Key: HBASE-21750
> URL: https://issues.apache.org/jira/browse/HBASE-21750
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21750.v1.patch, HBASE-21750.v1.patch, 
> HBASE-21750.v2.patch, HBASE-21750.v3.patch, HBASE-21750.v3.patch
>
>
> After HBASE-21657, Most subclass of Cell has a cached serialized size (Except 
> those cells with tags), so I think most of the KeyValueUtil#length can be 
> replaced by cell#getSerializedSize. Such as: 
> - KeyValueUtil.length in StoreFlusher#performFlush;
> - KeyValueUtil.length in Compactor#performCompaction ; 
> and so on..
> Will prepare a patch for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749287#comment-16749287
 ] 

Andrew Purtell commented on HBASE-21735:


Latest patch fixes TestDefaultWALProvider

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21756) Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2



[ 
https://issues.apache.org/jira/browse/HBASE-21756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749281#comment-16749281
 ] 

Hadoop QA commented on HBASE-21756:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
54s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
47s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} branch-2 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} branch-2 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} rubocop {color} | {color:red}  0m  
7s{color} | {color:red} The patch generated 82 new + 221 unchanged - 81 fixed = 
303 total (was 302) {color} |
| {color:orange}-0{color} | {color:orange} ruby-lint {color} | {color:orange}  
0m 21s{color} | {color:orange} The patch generated 185 new + 418 unchanged - 
184 fixed = 603 total (was 602) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch 1 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
21s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
10m 49s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m 
26s{color} | {color:green} hbase-shell in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 44m 50s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:42ca976 |
| JIRA Issue | HBASE-21756 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12955870/HBASE-21756.branch-2.001.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  rubocop  ruby_lint  |
| uname | Linux e4bba0773afc 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2 / 4eba6b3656 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| rubocop | v0.62.0 |
| rubocop

[jira] [Updated] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



 [ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21735:
---
Attachment: HBASE-21735-branch-1.patch

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch, 
> HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21626) log the regions blocking WAL from being archived



[ 
https://issues.apache.org/jira/browse/HBASE-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749276#comment-16749276
 ] 

Sergey Shelukhin edited comment on HBASE-21626 at 1/22/19 11:23 PM:


Committed to master. Thanks for the review!
I confirmed it works on our cluster.


was (Author: sershe):
Committed to master. Thanks for the review!

> log the regions blocking WAL from being archived
> 
>
> Key: HBASE-21626
> URL: https://issues.apache.org/jira/browse/HBASE-21626
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21626.01.patch, HBASE-21626.02.patch, 
> HBASE-21626.ADDENDUM.patch, HBASE-21626.patch
>
>
> The WALs not being archived for a long time can result in a long recovery 
> later. It's useful to know what regions are blocking the WALs from being 
> archived, to be able to debug flush logic and tune configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21720) metric to measure how actions are distributed to servers within a MultiAction

2019-01-22 Thread Tommy Li (JIRA)



 [ 
https://issues.apache.org/jira/browse/HBASE-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommy Li updated HBASE-21720:
-
Attachment: HBASE-21720.master.006.patch

> metric to measure how actions are distributed to servers within a MultiAction
> -
>
> Key: HBASE-21720
> URL: https://issues.apache.org/jira/browse/HBASE-21720
> Project: HBase
>  Issue Type: Improvement
>  Components: Client, metrics, monitoring
>Reporter: Tommy Li
>Assignee: Tommy Li
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HBASE-21720.master.001.patch, 
> HBASE-21720.master.002.patch, HBASE-21720.master.003.patch, 
> HBASE-21720.master.004.patch, HBASE-21720.master.005.patch, 
> HBASE-21720.master.006.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Resolved] (HBASE-21626) log the regions blocking WAL from being archived



 [ 
https://issues.apache.org/jira/browse/HBASE-21626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21626.
--
Resolution: Fixed

Committed to master. Thanks for the review!

> log the regions blocking WAL from being archived
> 
>
> Key: HBASE-21626
> URL: https://issues.apache.org/jira/browse/HBASE-21626
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21626.01.patch, HBASE-21626.02.patch, 
> HBASE-21626.ADDENDUM.patch, HBASE-21626.patch
>
>
> The WALs not being archived for a long time can result in a long recovery 
> later. It's useful to know what regions are blocking the WALs from being 
> archived, to be able to debug flush logic and tune configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21720) metric to measure how actions are distributed to servers within a MultiAction



[ 
https://issues.apache.org/jira/browse/HBASE-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749271#comment-16749271
 ] 

Sergey Shelukhin commented on HBASE-21720:
--

+1

> metric to measure how actions are distributed to servers within a MultiAction
> -
>
> Key: HBASE-21720
> URL: https://issues.apache.org/jira/browse/HBASE-21720
> Project: HBase
>  Issue Type: Improvement
>  Components: Client, metrics, monitoring
>Reporter: Tommy Li
>Assignee: Tommy Li
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HBASE-21720.master.001.patch, 
> HBASE-21720.master.002.patch, HBASE-21720.master.003.patch, 
> HBASE-21720.master.004.patch, HBASE-21720.master.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21720) metric to measure how actions are distributed to servers within a MultiAction

2019-01-22 Thread Tommy Li (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749256#comment-16749256
 ] 

Tommy Li commented on HBASE-21720:
--

The checkstyle is complaining about some things not directly introduced by my 
patch. I've fixed the largest issue - the indentation of the large case 
statement in MetricsConnection#updateRpc, but I can undo that if anyone feels 
it's polluting the meat of the patch.

> metric to measure how actions are distributed to servers within a MultiAction
> -
>
> Key: HBASE-21720
> URL: https://issues.apache.org/jira/browse/HBASE-21720
> Project: HBase
>  Issue Type: Improvement
>  Components: Client, metrics, monitoring
>Reporter: Tommy Li
>Assignee: Tommy Li
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HBASE-21720.master.001.patch, 
> HBASE-21720.master.002.patch, HBASE-21720.master.003.patch, 
> HBASE-21720.master.004.patch, HBASE-21720.master.005.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21744) timeout for server list refresh calls

2019-01-22 Thread Xu Cang (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749253#comment-16749253
 ] 

Xu Cang commented on HBASE-21744:
-

in this while loop : 
{code:java}
while (!server.isStopped()) {
{code}

I think we need to have some delay in each iteration. Such as "Thread.sleep()" 
Why using nanoSec? It's a bit overkilled for me.
Also, is that possible to write a test case to verify this feature, such as 
counting how many times this list gets refreshed within a certain time range?

Thanks.


> timeout for server list refresh calls 
> --
>
> Key: HBASE-21744
> URL: https://issues.apache.org/jira/browse/HBASE-21744
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21744.patch
>
>
> Not sure why yet, but we are seeing the case when cluster is in overall a bad 
> state, where after RS dies and deletes its znode, the notification looks like 
> it's lost, so the master doesn't detect the failure. ZK itself appears to be 
> healthy and doesn't report anything special.
> After some other change is made to the server list, master rescans the list 
> and picks up the stale change. Might make sense to add a config that would 
> trigger the refresh if it hasn't happened for a while (e.g. 1 minute).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



 [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21742:
-
Status: Patch Available  (was: Open)

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749246#comment-16749246
 ] 

Sergey Shelukhin edited comment on HBASE-21742 at 1/22/19 11:00 PM:


Attempt at a simple fix... shutting down procedure store first, so that 
procedures couldn't be saved during shutdown. 
I'm not sure this is the best approach but I suspect that a proper fix would 
require massive refactoring - all the procedures are currently independent and 
they'd all have to check they are not relying on incorrect state in any class 
during shutdown. For now, it should be enough to  at least prevent master from 
saving any state that could be incorrect - it's still supposed to be able to 
recover if e.g. kill -9 is run against it, or a machine physically dies, so not 
saving state should be ok.
[~allan163] does this make sense to you?


was (Author: sershe):
Attempt at a simple fix... shutting down procedure store first, so that 
procedures couldn't be saved during shutdown. 
I'm not sure this is the best approach but I suspect that a proper fix would 
require massive refactoring - all the procedures are currently independent and 
they'd all have to check they are not relying on incorrect state in any class 
during shutdown. For now, it should be enough to  at least prevent master from 
saving any state that could be incorrect - it's still supposed to be able to 
recover if e.g. kill -9 is run against it, or a machine physically dies, so not 
saving state should be ok.

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749246#comment-16749246
 ] 

Sergey Shelukhin commented on HBASE-21742:
--

Attempt at a simple fix... shutting down procedure store first, so that 
procedures couldn't be saved during shutdown. 
I'm not sure this is the best approach but I suspect that a proper fix would 
require massive refactoring - all the procedures are currently independent and 
they'd all have to check they are not relying on incorrect state in any class 
during shutdown. For now, it should be enough to  at least prevent master from 
saving any state that could be incorrect - it's still supposed to be able to 
recover if e.g. kill -9 is run against it, or a machine physically dies, so not 
saving state should be ok.

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



 [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21742:
-
Attachment: HBASE-21742.patch

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Assigned] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



 [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned HBASE-21742:


Assignee: Sergey Shelukhin

> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21720) metric to measure how actions are distributed to servers within a MultiAction



[ 
https://issues.apache.org/jira/browse/HBASE-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749238#comment-16749238
 ] 

Hadoop QA commented on HBASE-21720:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
43s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
43s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
32s{color} | {color:red} hbase-client: The patch generated 1 new + 8 unchanged 
- 39 fixed = 9 total (was 47) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 39s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
16s{color} | {color:green} hbase-client in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}179m 
17s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  2m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}232m 28s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21720 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12955838/HBASE-21720.master.005.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux eb68459e6704 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 35ed5d6c39 |
| maven | version: Apache Maven 3.5.4

[jira] [Comment Edited] (HBASE-17370) Fix or provide shell scripts to drain and decommission region server



[ 
https://issues.apache.org/jira/browse/HBASE-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749227#comment-16749227
 ] 

Nihal Jain edited comment on HBASE-17370 at 1/22/19 10:43 PM:
--

Submitted patch for HBASE-21756. On top on 
[HBASE-21756.branch-2.001.patch|https://issues.apache.org/jira/secure/attachment/12955870/HBASE-21756.branch-2.001.patch],
 [^HBASE-17370.master.002.patch] applies to branch-2 as well. We can merge this 
after HBASE-21756 is resolved.


was (Author: nihaljain.cs):
Submitted patch for HBASE-21756. On top on HBASE-21756, 
[^HBASE-17370.master.002.patch] applies to branch-2 as well. We can merge this 
after HBASE-21756 is resolved.

> Fix or provide shell scripts to drain and decommission region server
> 
>
> Key: HBASE-17370
> URL: https://issues.apache.org/jira/browse/HBASE-17370
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jerry He
>Assignee: Nihal Jain
>Priority: Major
>  Labels: operability
> Attachments: HBASE-17370.branch-2.001.patch, 
> HBASE-17370.master.001.patch, HBASE-17370.master.002.patch
>
>
> 1. Update the existing shell scripts to use the new drain related API.
> 2  Or provide new shell scripts.
> 3. Provide a 'decommission' shell tool that puts the server in drain mode and 
> offload the server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-17370) Fix or provide shell scripts to drain and decommission region server



[ 
https://issues.apache.org/jira/browse/HBASE-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749227#comment-16749227
 ] 

Nihal Jain commented on HBASE-17370:


Submitted patch for HBASE-21756. On top on HBASE-21756, 
[^HBASE-17370.master.002.patch] applies to branch-2 as well. We can merge this 
after HBASE-21756 is resolved.

> Fix or provide shell scripts to drain and decommission region server
> 
>
> Key: HBASE-17370
> URL: https://issues.apache.org/jira/browse/HBASE-17370
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Jerry He
>Assignee: Nihal Jain
>Priority: Major
>  Labels: operability
> Attachments: HBASE-17370.branch-2.001.patch, 
> HBASE-17370.master.001.patch, HBASE-17370.master.002.patch
>
>
> 1. Update the existing shell scripts to use the new drain related API.
> 2  Or provide new shell scripts.
> 3. Provide a 'decommission' shell tool that puts the server in drain mode and 
> offload the server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21756) Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2



 [ 
https://issues.apache.org/jira/browse/HBASE-21756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21756:
---
Attachment: (was: HBASE-21756.master.001.patch)

> Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2
> --
>
> Key: HBASE-21756
> URL: https://issues.apache.org/jira/browse/HBASE-21756
> Project: HBase
>  Issue Type: Test
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21756.branch-2.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21756) Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2



 [ 
https://issues.apache.org/jira/browse/HBASE-21756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21756:
---
Status: Patch Available  (was: Open)

> Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2
> --
>
> Key: HBASE-21756
> URL: https://issues.apache.org/jira/browse/HBASE-21756
> Project: HBase
>  Issue Type: Test
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21756.branch-2.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21756) Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2



 [ 
https://issues.apache.org/jira/browse/HBASE-21756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21756:
---
Attachment: HBASE-21756.branch-2.001.patch

> Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2
> --
>
> Key: HBASE-21756
> URL: https://issues.apache.org/jira/browse/HBASE-21756
> Project: HBase
>  Issue Type: Test
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21756.branch-2.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21756) Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2



 [ 
https://issues.apache.org/jira/browse/HBASE-21756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nihal Jain updated HBASE-21756:
---
Attachment: HBASE-21756.master.001.patch

> Backport HBASE-21279 (Split TestAdminShell into several tests) to branch-2
> --
>
> Key: HBASE-21756
> URL: https://issues.apache.org/jira/browse/HBASE-21756
> Project: HBase
>  Issue Type: Test
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Attachments: HBASE-21756.master.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21758) Update hadoop-three.version on branch-1 to 3.0.3



[ 
https://issues.apache.org/jira/browse/HBASE-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749219#comment-16749219
 ] 

Andrew Purtell commented on HBASE-21758:


Command line: {{mvn clean install -Dhadoop.profile=hadoop-3.0}}

Maven version:
{noformat}
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 
2018-10-24T11:41:47-07:00)
Maven home: /usr/local/Cellar/maven/3.6.0/libexec
Java version: 1.8.0_172, vendor: Azul Systems, Inc., runtime: 
/Users/apurtell/blt/tools/Darwin/jdk/openjdk1.8.0_172_x64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.13.6", arch: "x86_64", family: "mac"
{noformat}

> Update hadoop-three.version on branch-1 to 3.0.3
> 
>
> Key: HBASE-21758
> URL: https://issues.apache.org/jira/browse/HBASE-21758
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 1.5.0
>
> Attachments: HBASE-21758-branch-1.patch
>
>
> Sync the branch-1 POM with master and branch-2 with respect to the default 
> version of {{hadoop-three.version}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21758) Update hadoop-three.version on branch-1 to 3.0.3

2019-01-22 Thread Sean Busbey (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749205#comment-16749205
 ] 

Sean Busbey commented on HBASE-21758:
-

I had only tried on branch-2.1 for Hadoop 3. Haven't tried any branch-1 builds 
for Hadoop 3. I can take a look tonight.

Can you post your entire mvn command line and version?

> Update hadoop-three.version on branch-1 to 3.0.3
> 
>
> Key: HBASE-21758
> URL: https://issues.apache.org/jira/browse/HBASE-21758
> Project: HBase
>  Issue Type: Task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Trivial
> Fix For: 1.5.0
>
> Attachments: HBASE-21758-branch-1.patch
>
>
> Sync the branch-1 POM with master and branch-2 with respect to the default 
> version of {{hadoop-three.version}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749204#comment-16749204
 ] 

Andrew Purtell commented on HBASE-21735:


Will try that

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-21735) Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / etc should query outputstream capabilities) to branch-1

2019-01-22 Thread Sean Busbey (JIRA)



[ 
https://issues.apache.org/jira/browse/HBASE-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749203#comment-16749203
 ] 

Sean Busbey commented on HBASE-21735:
-

Would building against a Hadoop 2.y with StreamCapabilities? Hadoop 2.9.2 has 
it.

> Port HBASE-18784 (Use of filesystem that requires hflush / hsync / append / 
> etc should query outputstream capabilities) to branch-1
> ---
>
> Key: HBASE-21735
> URL: https://issues.apache.org/jira/browse/HBASE-21735
> Project: HBase
>  Issue Type: Sub-task
>  Components: fs, wal
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
>  Labels: s3
> Fix For: 1.5.0
>
> Attachments: HBASE-21735-branch-1.patch, HBASE-21735-branch-1.patch
>
>
> HBASE-18784 has nice checks for fs capabilities and logged warnings, 
> especially useful on recent versions of hadoop. The refactors are minor and 
> are compatible with a minor release. Port to branch-1. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21744) timeout for server list refresh calls



 [ 
https://issues.apache.org/jira/browse/HBASE-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21744:
-
Attachment: HBASE-21744.patch

> timeout for server list refresh calls 
> --
>
> Key: HBASE-21744
> URL: https://issues.apache.org/jira/browse/HBASE-21744
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21744.patch
>
>
> Not sure why yet, but we are seeing the case when cluster is in overall a bad 
> state, where after RS dies and deletes its znode, the notification looks like 
> it's lost, so the master doesn't detect the failure. ZK itself appears to be 
> healthy and doesn't report anything special.
> After some other change is made to the server list, master rescans the list 
> and picks up the stale change. Might make sense to add a config that would 
> trigger the refresh if it hasn't happened for a while (e.g. 1 minute).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749199#comment-16749199
 ] 

Sergey Shelukhin edited comment on HBASE-21742 at 1/22/19 10:08 PM:


This is on master (mid-December).
The problem is not that SCP gave up... the problem is that SCP was created with 
incorrect state during master shutdown, and preserved in procedure WAL.
So, there was no SCP at all with meta flag set (the server holding meta had SCP 
with meta=false), and all SCPs were waiting for someone else to recover meta.
I've updated the description to pinpoint the problem.



was (Author: sershe):
This is on master (mid-December).
The problem is not that SCP gave up... the problem is that SCP was created with 
incorrect state during master shutdown, and preserved in procedure WAL.
So, there was no SCP at all with meta flag set (the server holding meta had SCP 
with meta=false), and all SCPs were waiting for someone else to recover meta.


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though it is holding 
> the meta. That is because, as per above "Stopping assigment manager", AM 
> state including region map got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



 [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21742:
-
Description: 
Some small HDFS hiccup causes master and meta RS to fail together. Master goes 
first:
{noformat}
2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
master master,17000,1547604554447: FAILED [blah] *
...
2019-01-18 10:01:17,087 INFO  [master/master:17000] 
assignment.AssignmentManager: Stopping assignment manager
{noformat}
Bunch of stuff keeps happening, including procedure retries, which is also 
suspect, but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
{noformat}


Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
Note that the SCP for this server has meta=false, even though it is holding the 
meta. That is because, as per above "Stopping assigment manager", AM state 
including region map got cleared.
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.

The only way around this is to delete the procv2 WAL - master has all the 
information here, as it often does in bugs I've found recently, but some split 
brain procedures cause it to get stuck one way or another.

I will file a separate bug about that.

  was:
Some small HDFS hiccup causes master and meta RS to fail together. Master goes 
first:
{noformat}
2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
master master,17000,1547604554447: FAILED [blah] *
...
2019-01-18 10:01:17,087 INFO  [master/master:17000] 
assignment.AssignmentManager: Stopping assignment manager
{noformat}
Bunch of stuff keeps happening, including procedure retries, which is also 
suspect, but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
{noformat}


Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.

The only way around this is to delete the procv2 WAL - master has all the 
information here, as it often does in bugs I've found recently, but some split 
brain procedures cause it to get stuck one way or another.

I will file a separate bug about that.


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031,

[jira] [Updated] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



 [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21742:
-
Description: 
Some small HDFS hiccup causes master and meta RS to fail together. Master goes 
first:
{noformat}
2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
master master,17000,1547604554447: FAILED [blah] *
...
2019-01-18 10:01:17,087 INFO  [master/master:17000] 
assignment.AssignmentManager: Stopping assignment manager
{noformat}
Bunch of stuff keeps happening, including procedure retries, which is also 
suspect, but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
{noformat}


Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.

The only way around this is to delete the procv2 WAL - master has all the 
information here, as it often does in bugs I've found recently, but some split 
brain procedures cause it to get stuck one way or another.

I will file a separate bug about that.

  was:
Some small HDFS hiccup causes master and meta RS to fail together. Master goes 
first:
{noformat}
2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
master master,17000,1547604554447: FAILED [blah] *
...
2019-01-18 10:01:17,087 INFO  [master/master:17000] 
assignment.AssignmentManager: Stopping assignment manager
{noformat}
Bunch of stuff keeps happening, including procedure retries, which is also 
suspect, but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
{noformat}
{noformat}

Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.

The only way around this is to delete the procv2 WAL - master has all the 
information here, as it often does in bugs I've found recently, but some split 
brain procedures cause it to get stuck one way or another.

I will file a separate bug about that.


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker:

[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749199#comment-16749199
 ] 

Sergey Shelukhin commented on HBASE-21742:
--

This is on master.
The problem is not that SCP gave up... the problem is that SCP was created with 
incorrect state during master shutdown, and preserved in procedure WAL.
So, there was no SCP at all with meta flag set (the server holding meta had SCP 
with meta=false), and all SCPs were waiting for someone else to recover meta.


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable



[ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749199#comment-16749199
 ] 

Sergey Shelukhin edited comment on HBASE-21742 at 1/22/19 10:06 PM:


This is on master (mid-December).
The problem is not that SCP gave up... the problem is that SCP was created with 
incorrect state during master shutdown, and preserved in procedure WAL.
So, there was no SCP at all with meta flag set (the server holding meta had SCP 
with meta=false), and all SCPs were waiting for someone else to recover meta.



was (Author: sershe):
This is on master.
The problem is not that SCP gave up... the problem is that SCP was created with 
incorrect state during master shutdown, and preserved in procedure WAL.
So, there was no SCP at all with meta flag set (the server holding meta had SCP 
with meta=false), and all SCPs were waiting for someone else to recover meta.


> master can create bad procedures during abort, making entire cluster unusable
> -
>
> Key: HBASE-21742
> URL: https://issues.apache.org/jira/browse/HBASE-21742
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: * ABORTING 
> master master,17000,1547604554447: FAILED [blah] *
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also 
> suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the 
> information here, as it often does in bugs I've found recently, but some 
> split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable