[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Peter Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764731#comment-16764731
 ] 

Peter Somogyi commented on HBASE-21854:
---

Yes, that's what I observed.

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0002.log]
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
>

[jira] [Updated] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21868:
--
Attachment: HBASE-21868-v1.patch

> Remove legacy bulk load support
> ---
>
> Key: HBASE-21868
> URL: https://issues.apache.org/jira/browse/HBASE-21868
> Project: HBase
>  Issue Type: Task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21868-v1.patch, HBASE-21868.patch
>
>
> Bulk load has already been integrated into HBase core and 
> SecureBulkLoadEndpoint has been marked as deprecated on 2.x. Let's remove the 
> related stuff on master. This is useful for implementing HBASE-21512 since 
> we can remove several references to ClientServiceCallable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764736#comment-16764736
 ] 

Duo Zhang commented on HBASE-21868:
---

The error prone errors are from generated code.

I think we should polish the configuration for error prone? For now it is not 
useful; it always reports unrelated errors...
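
For reference, one common way to keep Error Prone quiet about generated code is to 
exclude the generated-sources directories via its -XepExcludedPaths flag. The fragment 
below is only a sketch of that idea, not the project's actual build configuration:

{noformat}
<!-- hypothetical maven-compiler-plugin fragment: skip generated sources in error prone -->
<compilerArgs>
  <arg>-XDcompilePolicy=simple</arg>
  <arg>-Xplugin:ErrorProne -XepExcludedPaths:.*/generated-sources/.*</arg>
</compilerArgs>
{noformat}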

> Remove legacy bulk load support
> ---
>
> Key: HBASE-21868
> URL: https://issues.apache.org/jira/browse/HBASE-21868
> Project: HBase
>  Issue Type: Task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21868-v1.patch, HBASE-21868.patch
>
>
> Bulk load has already been integrated into HBase core and 
> SecureBulkLoadEndpoint has been marked as deprecated on 2.x. Let's remove the 
> related stuff on master. This is useful for implementing HBASE-21512 since 
> we can remove several references to ClientServiceCallable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-18484) VerifyRep by snapshot does not work when Yarn / SourceHBase / PeerHBase located in different HDFS clusters

2019-02-11 Thread Zheng Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Hu updated HBASE-18484:
-
Attachment: HBASE-18484.v3.patch

> VerifyRep by snapshot  does not work when Yarn / SourceHBase / PeerHBase 
> located in different HDFS clusters
> ---
>
> Key: HBASE-18484
> URL: https://issues.apache.org/jira/browse/HBASE-18484
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.0.0-alpha-1
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Attachments: HBASE-18484.v1.patch, HBASE-18484.v2.patch, 
> HBASE-18484.v3.patch
>
>
> As commented in HBASE-16466, it seems that when the source HBase cluster, the 
> peer HBase cluster, and the YARN cluster are located in three different HDFS 
> clusters, there is a problem.
> When restoring the snapshot into tmpdir, we need to create the region with the 
> following code (HRegion#createHRegion):
> {code}
> public static HRegion createHRegion(final HRegionInfo info, final Path rootDir,
>     final Configuration conf, final TableDescriptor hTableDescriptor,
>     final WAL wal, final boolean initialize) throws IOException {
>   LOG.info("creating HRegion " + info.getTable().getNameAsString()
>       + " HTD == " + hTableDescriptor + " RootDir = " + rootDir
>       + " Table name == " + info.getTable().getNameAsString());
>   // <--- Here our code uses the fs.defaultFS configuration to create the region.
>   FileSystem fs = FileSystem.get(conf);
>   Path tableDir = FSUtils.getTableDir(rootDir, info.getTable());
>   HRegionFileSystem.createRegionOnFileSystem(conf, fs, tableDir, info);
>   HRegion region = HRegion.newHRegion(tableDir, wal, fs, conf, info,
>       hTableDescriptor, null);
>   if (initialize) region.initialize(null);
>   return region;
> }
> {code}
> When the source cluster and the peer cluster are located on two different file 
> systems, their fs.defaultFS values should be different, so at least one cluster 
> will fail when restoring the snapshot into tmpdir. After I added the following 
> fix, it works fine for me.
> {code}
> -FileSystem fs = FileSystem.get(conf);  
> +FileSystem fs = rootDir.getFileSystem(conf);
> {code}
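
A minimal sketch of why that one-line change matters (the URIs below are hypothetical, 
not from the report): FileSystem.get(conf) always resolves against fs.defaultFS, while 
Path#getFileSystem resolves against the scheme and authority of the path itself.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsResolutionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster that runs the verification job.
    conf.set("fs.defaultFS", "hdfs://yarn-cluster:8020");
    // Restore directory that lives on a different cluster.
    Path rootDir = new Path("hdfs://source-cluster:8020/hbase-tmp");

    FileSystem defaultFs = FileSystem.get(conf);        // resolves to hdfs://yarn-cluster:8020
    FileSystem rootDirFs = rootDir.getFileSystem(conf); // resolves to hdfs://source-cluster:8020
    System.out.println(defaultFs.getUri() + " vs " + rootDirFs.getUri());
  }
}
{code}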



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21854:
--
Status: Patch Available  (was: Open)

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0002.log]
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-07

[jira] [Updated] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21854:
--
Attachment: HBASE-21854.patch

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0002.log]
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-07 15:48:

[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764766#comment-16764766
 ] 

Duo Zhang commented on HBASE-21854:
---

Check the active executor count before restarting the ProcedureExecutor. We 
update the procedure state in the executeProcedure method, and the decrement of 
activeExecutorCount is performed after this method, so I think this could solve 
the problem. Could you please check whether it works for you? [~psomogyi] Thanks.
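
A rough sketch of that ordering in the test, just for illustration (the 
getActiveExecutorCount() getter and the 30 second bound are assumptions here, not 
necessarily what the final patch does):

{code}
// Sketch only: wait until no worker is still executing a procedure, then restart.
long deadline = System.currentTimeMillis() + 30_000;
while (procExecutor.getActiveExecutorCount() > 0 && System.currentTimeMillis() < deadline) {
  Thread.sleep(50);
}
ProcedureTestingUtility.restart(procExecutor);
{code}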

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data

[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764793#comment-16764793
 ] 

Hadoop QA commented on HBASE-21854:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
55s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
34s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
12s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
35s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 57s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
27s{color} | {color:green} hbase-procedure in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 35m 33s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21854 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958219/HBASE-21854.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux b155a60c2102 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / c48438fcb0 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15923/testReport/ |
| Max. process+thread count | 286 (vs. ulimit of 1) |
| modules | C: hbase-procedure U: hbase-procedure |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15923/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



>

[jira] [Created] (HBASE-21869) Add canonical link to javadoc

2019-02-11 Thread Peter Somogyi (JIRA)
Peter Somogyi created HBASE-21869:
-

 Summary: Add canonical link to javadoc
 Key: HBASE-21869
 URL: https://issues.apache.org/jira/browse/HBASE-21869
 Project: HBase
  Issue Type: Improvement
  Components: website
Reporter: Peter Somogyi


SEO could be improved by adding rel=canonical links to the javadoc. By adding this 
to earlier releases, a search for HBaseConfiguration will bring up the 
latest release first.

What needs to be considered is how to identify javadoc pages that have changed 
between versions or moved to different packages.
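
For illustration, the canonical tag on an old release's HBaseConfiguration page could 
look like the line below (the target URL is only an example of pointing at the latest 
apidocs, not a decided location):

{noformat}
<link rel="canonical" href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration.html"/>
{noformat}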



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764864#comment-16764864
 ] 

Hadoop QA commented on HBASE-21868:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 8s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
44s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
42s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  2m 
21s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 31s{color} 
| {color:red} hbase-endpoint generated 11 new + 112 unchanged - 13 fixed = 123 
total (was 125) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 5s{color} | {color:green} hbase-server: The patch generated 0 new + 69 
unchanged - 5 fixed = 69 total (was 74) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
13s{color} | {color:green} The patch passed checkstyle in hbase-endpoint 
{color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 6s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 36s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green}  
0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}128m  
1s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m  
4s{color} | {color:green} hbase-endpoint in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
53s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}173m 50s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21868 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958212/HBASE-21868-v1.patch |
| Optiona

[jira] [Commented] (HBASE-18484) VerifyRep by snapshot does not work when Yarn / SourceHBase / PeerHBase located in different HDFS clusters

2019-02-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764887#comment-16764887
 ] 

Hadoop QA commented on HBASE-18484:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
37s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
36s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
36s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
18s{color} | {color:red} hbase-mapreduce: The patch generated 3 new + 12 
unchanged - 0 fixed = 15 total (was 12) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
32s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m 55s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}131m  
2s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 17m 
16s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
50s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}195m  4s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-18484 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12958215/HBASE-18484.v3.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux d1f6da2f55a5 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 
31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / c48438fcb0 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf0

[jira] [Commented] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764895#comment-16764895
 ] 

Duo Zhang commented on HBASE-21868:
---

Ping [~stack]. This is used to simplify the patch for HBASE-21585.

> Remove legacy bulk load support
> ---
>
> Key: HBASE-21868
> URL: https://issues.apache.org/jira/browse/HBASE-21868
> Project: HBase
>  Issue Type: Task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21868-v1.patch, HBASE-21868.patch
>
>
> Bulk load has already been integrated into HBase core and 
> SecureBulkLoadEndpoint has been marked as deprecated on 2.x. Let's remove the 
> related stuff on master. This is useful for implementing HBASE-21512 since 
> we can remove several references to ClientServiceCallable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21865) Put up 2.1.3RC1

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764898#comment-16764898
 ] 

Duo Zhang commented on HBASE-21865:
---

Tag 2.1.3RC1 at
{noformat}
commit da5ec9e4c06c537213883cca8f3cc9a7c19daf67
Author: zhangduo 
Date:   Sun Feb 10 17:28:33 2019 +0800

HBASE-21819 Addendum include resolved new issues since RC0
{noformat}

> Put up 2.1.3RC1
> ---
>
> Key: HBASE-21865
> URL: https://issues.apache.org/jira/browse/HBASE-21865
> Project: HBase
>  Issue Type: Sub-task
>  Components: release
>Reporter: Duo Zhang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764903#comment-16764903
 ] 

Duo Zhang commented on HBASE-21636:
---

2.1.3RC1 has been tagged. You can push to branch-2.1 now. Thanks guys.

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21870) Remove /0.94 content and add redirect rule

2019-02-11 Thread Peter Somogyi (JIRA)
Peter Somogyi created HBASE-21870:
-

 Summary: Remove /0.94 content and add redirect rule
 Key: HBASE-21870
 URL: https://issues.apache.org/jira/browse/HBASE-21870
 Project: HBase
  Issue Type: Sub-task
  Components: website
Affects Versions: 3.0.0
Reporter: Peter Somogyi
Assignee: Peter Somogyi
 Fix For: 3.0.0


The 0.94 release is almost 4 years old, so its content can be removed from 
hbase.apache.org. To fix broken links, add a redirect rule to the .htaccess file.
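
As an illustration, a rule along these lines could go into the .htaccess file (the 
redirect target here is only an example, not a decided destination):

{noformat}
# Hypothetical rule: send old /0.94 links to the current site root
RedirectMatch 301 ^/0\.94(/.*)?$ https://hbase.apache.org/
{noformat}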



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21779) Reimplement BulkLoadHFilesTool to use AsyncClusterConnection

2019-02-11 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21779:
--
Hadoop Flags: Incompatible change,Reviewed  (was: Reviewed)
Release Note: The old LoadIncrementalHFiles is removed; please use 
BulkLoadHFiles instead if you want to do a bulk load in your code. For doing a 
bulk load from the command line, do not reference LoadIncrementalHFiles 
directly any more; use './hbase completebulkload xxx' instead.
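
As a rough illustration of the replacement API mentioned above (the table name and 
directory are made up, and the exact BulkLoadHFiles signature should be checked 
against the javadoc):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Programmatic bulk load with the new tool interface instead of LoadIncrementalHFiles.
    BulkLoadHFiles loader = BulkLoadHFiles.create(conf);
    // The directory is expected to contain one sub-directory per column family with HFiles.
    loader.bulkLoad(TableName.valueOf("my_table"), new Path("/staging/hfiles"));
  }
}
{code}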

> Reimplement BulkLoadHFilesTool to use AsyncClusterConnection
> 
>
> Key: HBASE-21779
> URL: https://issues.apache.org/jira/browse/HBASE-21779
> Project: HBase
>  Issue Type: Sub-task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: HBASE-21512
>
> Attachments: HBASE-21779-HBASE-21512-v1.patch, 
> HBASE-21779-HBASE-21512-v2.patch, HBASE-21779-HBASE-21512-v3.patch, 
> HBASE-21779-HBASE-21512.patch
>
>
> So we will not rely on the RpcRetryingCaller and ServiceCallable any more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21857) Do not need to check clusterKey if replicationEndpoint is provided when adding a peer

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764915#comment-16764915
 ] 

Hudson commented on HBASE-21857:


Results for branch master
[build #787 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/787/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/787//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/787//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/787//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Do not need to check clusterKey if replicationEndpoint is provided when 
> adding a peer
> -
>
> Key: HBASE-21857
> URL: https://issues.apache.org/jira/browse/HBASE-21857
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21857-addendum.patch, HBASE-21857.patch
>
>
> The clusterKey check is done in HBASE-19630, which is part of the work for 
> HBASE-19397.
> In HBASE-19630 we claim that we always check clusterKey when adding a peer at 
> the RS side, but this is not true, as clusterKey could be null. And it would be 
> strange if we implemented a ReplicationEndpoint for Kafka and still needed 
> to provide a cluster key in the HBase format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21505) Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

2019-02-11 Thread Wellington Chevreuil (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil updated HBASE-21505:
-
Labels:   (was: Replication metrics shell)

> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> 
>
> Key: HBASE-21505
> URL: https://issues.apache.org/jira/browse/HBASE-21505
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Major
> Attachments: 
> 0001-HBASE-21505-initial-version-for-more-detailed-report.patch, 
> HBASE-21505-master.001.patch, HBASE-21505-master.002.patch, 
> HBASE-21505-master.003.patch, HBASE-21505-master.004.patch, 
> HBASE-21505-master.005.patch, HBASE-21505-master.006.patch
>
>
> While reviewing the hbase shell status 'replication' command, I noticed the 
> following issues related to the replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due to some connectivity issues or target 
> unavailability, if new edits are added in the source, the reported 
> AgeOfLastShippedOp wrongly shows the same value as "Replication Lag". This is 
> incorrect; AgeOfLastShippedOp should not change until there's indeed another 
> edit shipped to the target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some 
> time before it got finally shipped to target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restart target cluster RS and verified the new row appeared there. No 
> new edit added, but status 'replication' command reports AgeOfLastShippedOp 
> as 0, while it should be the diff between the time it concluded shipping at 
> target and the time it was added in source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due to some connectivity issues or target 
> unavailability, if the RS is restarted, once the recovered queue source is 
> started, TimeStampsOfLastShippedOp is set to the initial Java date (Thu Jan 01 
> 01:00:00 GMT 1970, for example), thus "Replication Lag" also gives a completely 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restart RS on source, waited a few seconds for 
> recovery queue source to startup, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status to all sources running, current output format 
> gives the impression there’s only one, even when there are recovery queues, 

[jira] [Updated] (HBASE-21505) Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

2019-02-11 Thread Wellington Chevreuil (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil updated HBASE-21505:
-
Affects Version/s: 2.2.0
   3.0.0
   1.4.6

> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> 
>
> Key: HBASE-21505
> URL: https://issues.apache.org/jira/browse/HBASE-21505
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 3.0.0, 1.4.6, 2.2.0
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Major
> Attachments: 
> 0001-HBASE-21505-initial-version-for-more-detailed-report.patch, 
> HBASE-21505-master.001.patch, HBASE-21505-master.002.patch, 
> HBASE-21505-master.003.patch, HBASE-21505-master.004.patch, 
> HBASE-21505-master.005.patch, HBASE-21505-master.006.patch
>
>
> While reviewing the hbase shell status 'replication' command, I noticed the 
> following issues related to the replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due to some connectivity issues or target 
> unavailability, if new edits are added in the source, the reported 
> AgeOfLastShippedOp wrongly shows the same value as "Replication Lag". This is 
> incorrect; AgeOfLastShippedOp should not change until there's indeed another 
> edit shipped to the target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some 
> time before it got finally shipped to target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restart target cluster RS and verified the new row appeared there. No 
> new edit added, but status 'replication' command reports AgeOfLastShippedOp 
> as 0, while it should be the diff between the time it concluded shipping at 
> target and the time it was added in source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due to some connectivity issues or target 
> unavailability, if the RS is restarted, once the recovered queue source is 
> started, TimeStampsOfLastShippedOp is set to the initial Java date (Thu Jan 01 
> 01:00:00 GMT 1970, for example), thus "Replication Lag" also gives a completely 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restart RS on source, waited a few seconds for 
> recovery queue source to startup, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status to all sources running, current output form

[jira] [Updated] (HBASE-21505) Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

2019-02-11 Thread Wellington Chevreuil (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil updated HBASE-21505:
-
Component/s: Replication

> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> 
>
> Key: HBASE-21505
> URL: https://issues.apache.org/jira/browse/HBASE-21505
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Major
>  Labels: Replication, metrics, shell
> Attachments: 
> 0001-HBASE-21505-initial-version-for-more-detailed-report.patch, 
> HBASE-21505-master.001.patch, HBASE-21505-master.002.patch, 
> HBASE-21505-master.003.patch, HBASE-21505-master.004.patch, 
> HBASE-21505-master.005.patch, HBASE-21505-master.006.patch
>
>
> While reviewing hbase shell status 'replication' command, noticed the 
> following issues related to replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due to connectivity issues or target 
> unavailability, if new edits are added in the source, the reported 
> AgeOfLastShippedOp wrongly shows the same value as "Replication Lag". This is 
> incorrect; AgeOfLastShippedOp should not change until another edit is indeed 
> shipped to the target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit took some time 
> before it was finally shipped to the target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restarted target cluster RS and verified the new row appeared there. No 
> new edit was added, but the status 'replication' command reports 
> AgeOfLastShippedOp as 0, while it should be the difference between the time 
> shipping concluded at the target and the time the edit was added in the source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due to connectivity issues or target 
> unavailability, if the RS is restarted, once the recovered queue source is 
> started, TimeStampsOfLastShippedOp is set to the initial Java epoch (Thu Jan 01 
> 01:00:00 GMT 1970, for example), thus "Replication Lag" also gives a completely 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restarted the RS on source, waited a few seconds 
> for the recovery queue source to start up, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status to all sources running, current output format 
> gives the impression there’s only one, even 
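
To make the semantics argued for above concrete, here is a rough, non-authoritative 
sketch of how the two values could be derived (plain Java, made-up names; this is 
not the actual HBase MetricsSource/ReplicationSource code). The point is only that 
AgeOfLastShippedOp changes when something actually ships, while Replication Lag is 
driven by the oldest edit still waiting:

{code:java}
// Illustrative only -- not HBase's implementation.
public class SourceLagSketch {
  private long lastShippedAge = 0;          // ms between edit write time and ship time
  private Long oldestPendingEditTs = null;  // write time of the oldest unshipped edit

  // Called when an edit is appended on the source but not yet shipped.
  public synchronized void onEditQueued(long editWriteTimeMs) {
    if (oldestPendingEditTs == null) {
      oldestPendingEditTs = editWriteTimeMs;
    }
  }

  // Called only when a batch is confirmed shipped to the peer.
  public synchronized void onBatchShipped(long newestEditWriteTimeMs, long shipTimeMs) {
    lastShippedAge = shipTimeMs - newestEditWriteTimeMs;
    oldestPendingEditTs = null; // nothing pending any more (single-batch simplification)
  }

  // Stays fixed while replication is stuck; never silently reset to 0.
  public synchronized long ageOfLastShippedOp() {
    return lastShippedAge;
  }

  // Keeps growing while an edit is waiting, independent of the last shipped age.
  public synchronized long replicationLag(long nowMs) {
    return oldestPendingEditTs == null ? 0L : nowMs - oldestPendingEditTs;
  }
}
{code}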

[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Peter Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764977#comment-16764977
 ] 

Peter Somogyi commented on HBASE-21854:
---

I could not reproduce the race condition on my local machine, so I added a 
sleep before _throw new ProcedureSuspendedException()_. With that delay the 
original test failed with the same error I reported in the description, and 
with your patch it succeeded 100/100 times.
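
As a self-contained illustration of that delay-injection trick (plain Java with 
made-up names, not the HBase test code), the sketch below shows how an injected 
sleep makes the "stop before the worker persists" ordering deterministic, so a 
"store must be running" style failure reproduces every time:

{code:java}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class RaceRepro {
  public static void main(String[] args) throws Exception {
    AtomicBoolean storeRunning = new AtomicBoolean(true);
    ExecutorService pool = Executors.newSingleThreadExecutor();

    Future<?> worker = pool.submit(() -> {
      try {
        Thread.sleep(500); // injected delay: guarantees the stop happens first
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      if (!storeRunning.get()) {
        throw new IllegalStateException("the store must be running before inserting data");
      }
    });

    storeRunning.set(false); // the "RESTART - Stop" happens before the worker persists
    try {
      worker.get();
    } catch (ExecutionException e) {
      System.out.println("Reproduced: " + e.getCause().getMessage());
    }
    pool.shutdown();
  }
}
{code}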

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Major
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7d

[jira] [Updated] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Peter Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi updated HBASE-21854:
--
Priority: Minor  (was: Major)

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Minor
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0002.log]
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-07 15:48:08,734 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-0

[jira] [Updated] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Peter Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Somogyi updated HBASE-21854:
--
   Resolution: Fixed
Fix Version/s: 2.1.4
   2.3.0
   2.2.0
   3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the patch [~Apache9]. Pushed to branch-2.1+

> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Minor
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-07 15:48:08,732 INFO  [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1217): Remove all state logs with ID less than 1, since 
> all the active procedures are in the latest log
> 2019-02-07 15:48:08,733 DEBUG [WALProcedureStoreSyncThread] 
> wal.WALProcedureStore(1239): Removed 
> log=file:/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0001.log,
>  
> activeLogs=[/Users/peter.somogyi/Cloudera/hbase/hbase-procedure/target/test-data/b9a1969a-85a4-15e8-7da5-6198f5acf2de/proc-logs/pv2-0002.log]

[jira] [Commented] (HBASE-21857) Do not need to check clusterKey if replicationEndpoint is provided when adding a peer

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765008#comment-16765008
 ] 

Hudson commented on HBASE-21857:


Results for branch branch-2.1
[build #854 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Do not need to check clusterKey if replicationEndpoint is provided when 
> adding a peer
> -
>
> Key: HBASE-21857
> URL: https://issues.apache.org/jira/browse/HBASE-21857
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21857-addendum.patch, HBASE-21857.patch
>
>
> The clusterKey check is done in HBASE-19630, which is part of the work for 
> HBASE-19397.
> In HBASE-19630 we claim that we always check clusterKey when adding a peer on 
> the RS side, but this is not true, as clusterKey could be null. And it would be 
> strange if we implemented a ReplicationEndpoint for Kafka and still needed to 
> provide a cluster key in the HBase format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21819) Generate CHANGES.md and RELEASENOTES.md for 2.1.3

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765009#comment-16765009
 ] 

Hudson commented on HBASE-21819:


Results for branch branch-2.1
[build #854 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/854//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Generate CHANGES.md and RELEASENOTES.md for 2.1.3
> -
>
> Key: HBASE-21819
> URL: https://issues.apache.org/jira/browse/HBASE-21819
> Project: HBase
>  Issue Type: Sub-task
>  Components: documentation, release
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.1.3
>
> Attachments: HBASE-21819-branch-2.1-addendum-v1.patch, 
> HBASE-21819-branch-2.1-addendum-v2.patch, 
> HBASE-21819-branch-2.1-addendum-v3.patch, 
> HBASE-21819-branch-2.1-addendum-v4.patch, 
> HBASE-21819-branch-2.1-addendum.patch, HBASE-21819-branch-2.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21857) Do not need to check clusterKey if replicationEndpoint is provided when adding a peer

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765042#comment-16765042
 ] 

Hudson commented on HBASE-21857:


Results for branch branch-2
[build #1676 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1676/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1676//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1676//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1676//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Do not need to check clusterKey if replicationEndpoint is provided when 
> adding a peer
> -
>
> Key: HBASE-21857
> URL: https://issues.apache.org/jira/browse/HBASE-21857
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21857-addendum.patch, HBASE-21857.patch
>
>
> The clusterKey check is done in HBASE-19630, which is part of the work for 
> HBASE-19397.
> In HBASE-19630 we claim that we always check clusterKey when adding a peer on 
> the RS side, but this is not true, as clusterKey could be null. And it would be 
> strange if we implemented a ReplicationEndpoint for Kafka and still needed to 
> provide a cluster key in the HBase format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21512) Introduce an AsyncClusterConnection and replace the usage of ClusterConnection

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765040#comment-16765040
 ] 

Hudson commented on HBASE-21512:


Results for branch HBASE-21512
[build #95 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/95/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/95//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/95//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-21512/95//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Introduce an AsyncClusterConnection and replace the usage of ClusterConnection
> --
>
> Key: HBASE-21512
> URL: https://issues.apache.org/jira/browse/HBASE-21512
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
>
> At least for the RSProcedureDispatcher, with CompletableFuture we do not need 
> to set a delay and use a thread pool any more, which could reduce the 
> resource usage and also the latency.
> Once this is done, I think we can remove the ClusterConnection completely, 
> and start to rewrite the old sync client based on the async client, which 
> could reduce the code base a lot for our client.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21871) Support to specify a peer table name in VerifyReplication tool

2019-02-11 Thread Toshihiro Suzuki (JIRA)
Toshihiro Suzuki created HBASE-21871:


 Summary: Support to specify a peer table name in VerifyReplication 
tool
 Key: HBASE-21871
 URL: https://issues.apache.org/jira/browse/HBASE-21871
 Project: HBase
  Issue Type: Improvement
Reporter: Toshihiro Suzuki
Assignee: Toshihiro Suzuki


After HBASE-21201, we can specify peerQuorumAddress instead of peerId in the 
VerifyReplication tool, so a peerId no longer has to be set up in order to use 
it. However, we don't have a way to specify a peer table name in 
VerifyReplication for now.

So I would like to propose updating the tool to accept a peer table name as an 
argument (ex. --peerTableName=).

After resolving this Jira, we will be able to compare any 2 tables across any 
remote clusters.
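
A hypothetical invocation with the proposed option might look like the following. 
The flag and the table names are made up for illustration (the option does not 
exist until this change lands); only the tool's class name is real, and the 
existing peer argument is shown as a placeholder:

{noformat}
hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication \
  --peerTableName=tableB <peerId-or-peerQuorumAddress> tableA
{noformat}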




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Peter Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765054#comment-16765054
 ] 

Peter Somogyi commented on HBASE-21861:
---

Because the contents of the patches are significantly different, I think it 
would be easier to track them with separate JIRAs.

For example, _hbase-21861.master.001.patch_ modifies check-website-links.sh 
while _hbase-21861.branch-1.2.001.patch_ mostly changes javadoc configuration.

Are there any branches where we need a similar javadoc configuration change? We 
can have those under this issue.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21857) Do not need to check clusterKey if replicationEndpoint is provided when adding a peer

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765068#comment-16765068
 ] 

Hudson commented on HBASE-21857:


Results for branch branch-2.2
[build #33 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/33/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/33//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/33//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/33//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Do not need to check clusterKey if replicationEndpoint is provided when 
> adding a peer
> -
>
> Key: HBASE-21857
> URL: https://issues.apache.org/jira/browse/HBASE-21857
> Project: HBase
>  Issue Type: Improvement
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21857-addendum.patch, HBASE-21857.patch
>
>
> The clusterKey check is done in HBASE-19630, which is part of the work for 
> HBASE-19397.
> In HBASE-19630 we claim that we always check clusterKey when adding a peer on 
> the RS side, but this is not true, as clusterKey could be null. And it would be 
> strange if we implemented a ReplicationEndpoint for Kafka and still needed to 
> provide a cluster key in the HBase format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)
Josh Elser created HBASE-21872:
--

 Summary: Clean up getBytes() calls without charsets provided
 Key: HBASE-21872
 URL: https://issues.apache.org/jira/browse/HBASE-21872
 Project: HBase
  Issue Type: Task
Reporter: Josh Elser
Assignee: Josh Elser
 Fix For: 3.0.0


As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
Charset can result in some compiler warnings. Let's just get rid of these 
calls; there are only a handful left in master.
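
The change itself is mechanical. A minimal before/after illustration (not tied to 
any particular call site in the codebase) is below; within HBase code, 
{{Bytes.toBytes(String)}}, which encodes as UTF-8, is usually the preferred idiom 
anyway:

{code:java}
import java.nio.charset.StandardCharsets;

public class CharsetExample {
  public static void main(String[] args) {
    String row = "row-1";
    // Platform-default charset: behaviour depends on the JVM/locale,
    // and this is what triggers the compiler warning.
    byte[] implicitCharset = row.getBytes();
    // Explicit charset: deterministic across environments.
    byte[] explicitCharset = row.getBytes(StandardCharsets.UTF_8);
    System.out.println(implicitCharset.length + " " + explicitCharset.length);
  }
}
{code}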



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21201) Support to run VerifyReplication MR tool without peerid

2019-02-11 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765100#comment-16765100
 ] 

Josh Elser commented on HBASE-21201:


{quote}It seems like they are unrelated to the batch. I will commit the latest 
patch.
{quote}
Yup, you're good! Thanks for looking carefully at the QA reports!

Let me spin out another issue to fix up those calls.

> Support to run VerifyReplication MR tool without peerid
> ---
>
> Key: HBASE-21201
> URL: https://issues.apache.org/jira/browse/HBASE-21201
> Project: HBase
>  Issue Type: Improvement
>  Components: hbase-operator-tools
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sujit P
>Assignee: Toshihiro Suzuki
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.3.0
>
> Attachments: HBASE-21201.master.001.patch, 
> HBASE-21201.master.002.patch, HBASE-21201.master.003.patch, 
> HBASE-21201.master.003.patch, HBASE-21201.master.004.patch
>
>
> In some use cases, HBase clients write to tables in separate clusters (probably 
> different datacenters) for redundancy. As an administrator/application 
> architect, I would like to find out if both clusters' tables are in the same 
> state (cell by cell). One of the tools readily available for this is 
> VerifyRep, which is part of replication.
> However, it requires a peerId to be set up on at least one of the involved 
> clusters. A peerId is unnecessary in this use-case scenario and can possibly 
> cause unintended consequences, as the clusters aren't really replication peers, 
> nor do we prefer them to be.
> Looking at the code:
> The tool attempts to get only the clusterKey, which is essentially the 
> ZooKeeper quorum URL:
>  
> {code:java}
> //VerifyReplication.java
> private static Pair<ReplicationPeerConfig, Configuration> 
> getPeerQuorumConfig(final Configuration conf, String peerId)
> .
> .
> return Pair.newPair(peerConfig,
>         ReplicationUtils.getPeerClusterConfiguration(peerConfig, conf));
> //ReplicationUtils.java
> public static Configuration getPeerClusterConfiguration(ReplicationPeerConfig 
> peerConfig, Configuration baseConf) throws ReplicationException {
> Configuration otherConf;
> try {
> otherConf = HBaseConfiguration.createClusterConf(baseConf, 
> peerConfig.getClusterKey());{code}
>  
>  
> So I would like to propose updating the tool to pass the remote cluster 
> ZK quorum as an argument (ex. --peerQuorumAddress 
> clusterBzk1,clusterBzk2,clusterBzk3:2181/hbase-secure) and use it 
> without depending on a replication peerId, similar to 
> peerFSAddress. There are certain advantages in doing so:
>  * Reduce the development/maintenance of a separate tool for the above scenario
>  * Allow the tool to be more useful for other scenarios as well, such as 
>  ** validating backups in a remote cluster (HBASE-19106)
>  ** comparing a cloned tableA and the original tableA in the same/remote 
> cluster, in case of user error, before restoring a snapshot to the original 
> table, to find the records that are missing/invalid/need to be added, etc.
>  ** allowing backup operators who are non-HBase admins (who shouldn't be adding 
> the peerId) to run the tool, since currently only the HBase superuser can add a 
> peerId, for reasons discussed in HBASE-21163.
> Please post your comments
> Thanks
> cc: [~clayb], [~brfrn169] , [~vrodionov] , [~rashidaligee]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765120#comment-16765120
 ] 

Kevin Risden commented on HBASE-21872:
--

Just a plug for forbiddenapis (1) here: it would prevent future usage of the 
default charset/locale. Lucene/Solr use it, as does Calcite (CALCITE-1667), 
since we found some charset-related issues. 

1. https://github.com/policeman-tools/forbidden-apis

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls; there are only a handful left in master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-11 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21636:
--
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.1.4
 Release Note: 
Allows shell to set Scan options previously not exposed. See the additions as 
part of the scan help by typing the following in the hbase shell:

hbase> help 'scan'
   Status: Resolved  (was: Patch Available)

Pushed to branch-2.1. Resolving. Thanks for the nice patch [~nihaljain.cs]

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0, 2.1.4
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-11 Thread Nihal Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765140#comment-16765140
 ] 

Nihal Jain commented on HBASE-21636:


Thanks for review and commit [~stack], [~Apache9].

Sir this didn't go into master. Was it intentional?

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0, 2.1.4
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21748) Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because it's quite time consuming.) to branch-1

2019-02-11 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765138#comment-16765138
 ] 

Sean Busbey commented on HBASE-21748:
-

aren't those measurements on HBase 2* though?

I was referring to [~apurtell]'s comment here:

bq. However I don't think branch-1 has the same exposure to the perf problem. 
I'm not sure how to demonstrate a benefit. I will try some simple benchmarking 
but may need to resort to JMH to quantify it to any degree of certainty.

> Port HBASE-21738 (Remove all the CLSM#size operation in our memstore because 
> it's quite time consuming.) to branch-1
> ---
>
> Key: HBASE-21748
> URL: https://issues.apache.org/jira/browse/HBASE-21748
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0, 1.4.10, 1.3.4
>
> Attachments: HBASE-21748-branch-1.patch, HBASE-21748-branch-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21868:
--
Release Note: Remove SecureBulkLoadEndpoint and related classes and tests 
(Bulk load has been integrated into HBase core and SecureBulkLoadEndpoint has 
been marked as deprecated since 2.x). Remove the support for non secure bulk 
load. Notice that the 'secure' here does not mean you need to enable kerberos. 
For bulk load we will always obtain a 'bulkToken' when calling prepareBulkLoad, 
even if you do not enable kerberos.  (was: Remove SecureBulkLoadEndpoint and 
related classes and tests.
Remove the support for non secure bulk load. Notice that the 'secure' here does 
not mean you need to enable kerberos. For bulk load we will always obtain a 
'bulkToken' when calling prepareBulkLoad, even if you do not enable kerberos.)

> Remove legacy bulk load support
> ---
>
> Key: HBASE-21868
> URL: https://issues.apache.org/jira/browse/HBASE-21868
> Project: HBase
>  Issue Type: Task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21868-v1.patch, HBASE-21868.patch
>
>
> Bulk load has already been integrated into HBase core and 
> SecureBulkLoadEndpoint has been marked as deprecated in 2.x. Let's remove the 
> related stuff on master. This is useful for implementing HBASE-21512 since 
> we can remove several references to ClientServiceCallable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21868) Remove legacy bulk load support

2019-02-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765150#comment-16765150
 ] 

stack commented on HBASE-21868:
---

+1

> Remove legacy bulk load support
> ---
>
> Key: HBASE-21868
> URL: https://issues.apache.org/jira/browse/HBASE-21868
> Project: HBase
>  Issue Type: Task
>  Components: mapreduce
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21868-v1.patch, HBASE-21868.patch
>
>
> Bulk load has already been integrated into HBase core and 
> SecureBulkLoadEndpoint has been marked as deprecated in 2.x. Let's remove the 
> related stuff on master. This is useful for implementing HBASE-21512 since 
> we can remove several references to ClientServiceCallable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-20053) Remove .cmake file extension from .gitignore

2019-02-11 Thread Sean Busbey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-20053:

   Resolution: Cannot Reproduce
 Assignee: (was: Norbert Kalmar)
Fix Version/s: (was: HBASE-14850)
   Status: Resolved  (was: Patch Available)

Taking the silence as affirmation. Nightly tests don't look any worse than 
master. Pushed a rebased version of the HBASE-14850 branch and removed the 
staged version named for this jira.

> Remove .cmake file extension from .gitignore
> 
>
> Key: HBASE-20053
> URL: https://issues.apache.org/jira/browse/HBASE-20053
> Project: HBase
>  Issue Type: Sub-task
>  Components: build, community
>Affects Versions: HBASE-14850
>Reporter: Ted Yu
>Priority: Minor
>  Labels: build
> Attachments: HBASE-20053-HBASE-14850.v001.patch
>
>
> There are .cmake files under hbase-native-client/cmake/ which are under 
> source control.
> The .cmake extension should be taken out of hbase-native-client/.gitignore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21860) WALFactory should switch to default provider if multiwal provider is defined for meta wal (Per suggestions on HBASE-21843)

2019-02-11 Thread Sean Busbey (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765179#comment-16765179
 ] 

Sean Busbey commented on HBASE-21860:
-

Do we have a unit test that covers the expected behavior for "you configured a 
nonsense WAL provider", as a placeholder for configuring a WAL provider that has 
some runtime requirement we can't meet?

The change to the try/catch structure has me thinking about it.

> WALFactory should switch to default provider if multiwal provider is defined 
> for meta wal (Per suggestions on HBASE-21843) 
> ---
>
> Key: HBASE-21860
> URL: https://issues.apache.org/jira/browse/HBASE-21860
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Critical
> Attachments: HBASE-21860.branch-1.4.001.patch, 
> HBASE-21860.branch-2.0.001.patch, HBASE-21860.branch-2.2.001.patch, 
> HBASE-21860.master.001.patch, HBASE-21860.master.002.patch
>
>
> Following discussions on HBASE-21843, one of the suggestions was to make the 
> wal provider for the meta wal switch to the default provider if multiwal is 
> defined as the target provider. Quoting [~busbey]: 
> {quote}
> I don't think it's a good idea to revert HBASE-20856. the principles of that 
> issue are still sound.
> We already have logic somewhere for the AsyncDFS based WAL that falls back to 
> the default if something goes wrong with the needed HDFS hooks. Can we do 
> something similar for the region grouping provider and make the check 
> something like "did you ask for a provider for meta"?
> {quote}
> Am uploading a patch that switches to default whenever WALFactory finds 
> multiwal provider is defined for meta wal (either explicitly by defining 
> "hbase.wal.meta_provider", or indirectly, by loading what is in 
> "hbase.wal.provider")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21636) Enhance the shell scan command to support missing scanner specifications like ReadType, IsolationLevel etc.

2019-02-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765197#comment-16765197
 ] 

stack commented on HBASE-21636:
---

bq. Sir this didn't go into master. Was it intentional?

Not intentional. Thanks for noticing. Fixed.

> Enhance the shell scan command to support missing scanner specifications like 
> ReadType, IsolationLevel etc.
> ---
>
> Key: HBASE-21636
> URL: https://issues.apache.org/jira/browse/HBASE-21636
> Project: HBase
>  Issue Type: Improvement
>  Components: shell
>Affects Versions: 3.0.0, 2.0.0, 2.1.2
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.5, 2.3.0, 2.1.4
>
> Attachments: HBASE-21636.branch-2.0.001.patch, 
> HBASE-21636.master.001.patch, HBASE-21636.master.002.patch
>
>
> Enhance the shell scan command to support scanner specifications:
>  - ReadType
>  - IsolationLevel
>  - Region replica id
>  - Allow partial results
>  - Batch
>  - Max result size
> Also, make use of \{{limit}} and set it in the scan object to limit the 
> number of rows returned by the scanner.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765320#comment-16765320
 ] 

Sergey Shelukhin commented on HBASE-21844:
--

Anyway, I think we can split this issue. We should definitely add logging to 
ORP to ascertain what it's doing, esp. after recovery. The sooner we see it, the 
sooner we'll get to the root cause, or at least determine whether this is merely 
an ORP issue or a proc WAL issue (or something else).

We can do a more complex procWAL bug search separately.

> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after meta server dies, there is a slight chance 
> of master getting into a state where the ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case the waitForMetaOnline never 
> returns.
>  
> We've seen this happening a few times when there had been a temporary HDFS 
> outage. Following log lines shows this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21864) add region state version and reinstate YouAreDead exception in region report

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765323#comment-16765323
 ] 

Sergey Shelukhin commented on HBASE-21864:
--

[~stack] it's just the regular heartbeat. 
When RS reported incorrect state, master used to kill it (YouAreDeadException), 
but that was removed because of these races.

I was thinking of storing a version per region (not sure yet whether it can be 
in memory only, or whether we'd have to store it in meta too). It would be 
incremented by the master on every change. The master would just store the last 
version the RS acked for this region, and discard all messages before that.
One additional possible benefit is for the current crop of races with double 
assignment. If RS reports something like "I opened this region you never 
expected me to open", it would be easier to look and see that it's acting on a 
stale message and kill it conditionally to avoid data loss.
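
A very rough sketch of the bookkeeping this would need, just to make the 
discarding rule concrete (made-up names, in-memory only, ignoring persistence to 
meta and all the real RPC plumbing):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a per-region state version kept by the master, used to drop
// region reports that were generated before the last state change the RS acked.
public class RegionStateVersions {
  private final Map<String, Long> currentVersion = new ConcurrentHashMap<>();
  private final Map<String, Long> lastAckedByRs = new ConcurrentHashMap<>();

  // The master bumps the version on every state change it drives for the region.
  public long onMasterStateChange(String regionName) {
    return currentVersion.merge(regionName, 1L, Long::sum);
  }

  // The RS acks the version it has applied (e.g. in the response to an open/close).
  public void onRsAck(String regionName, long ackedVersion) {
    lastAckedByRs.merge(regionName, ackedVersion, Math::max);
  }

  // A region report carries the version the RS believes it is at; anything older
  // than the last acked version is stale and can be ignored instead of acted on.
  public boolean isStaleReport(String regionName, long reportedVersion) {
    return reportedVersion < lastAckedByRs.getOrDefault(regionName, 0L);
  }
}
{code}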

> add region state version and reinstate YouAreDead exception in region report
> 
>
> Key: HBASE-21864
> URL: https://issues.apache.org/jira/browse/HBASE-21864
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> The state version will ensure we don't have network-related races  (e.g. the 
> one I reported in some other bug -
> {code}
> RS: send report {R1} ...
> M: close R1
> RS: I closed R1
> M ... receive report {R1}
> M: you shouldn't have R1, die
> {code}).
> Then we can revert the change that removed YouAreDead exception... RS in 
> incorrect state should be either brought into correct state or killed because 
> it means there's some bug; right now if double assignment happens (I found 2 
> different cases just this week ;)) master lets RS with incorrect assignment 
> keep it forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765300#comment-16765300
 ] 

Sergey Shelukhin commented on HBASE-21844:
--

We definitely hit a lot of corrupted proc WAL issues, esp. in [~bahramch]'s 
tests. We should investigate what's going on.

[~Apache9] [~stack] actually, on a side note, I wonder why we have a custom WAL 
implementation for procedures. Couldn't procedures be stored in a multi-version 
HBase table for everything except meta and that table itself (like we already 
have custom recovery for meta)?
It's a little extra complexity for the special case, but it would allow us to 
avoid a bunch of extra complex code in procWAL, and also make it much easier to 
modify/debug, even for hbck.

> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after meta server dies, there is a slight chance 
> of master getting into a state where the ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case the waitForMetaOnline never 
> returns.
>  
> We've seen this happening a few times when there had been a temporary HDFS 
> outage. Following log lines shows this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21873) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-11 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created HBASE-21873:


 Summary: IPCUtil.wrapException should keep the original exception 
types for all the connection exceptions
 Key: HBASE-21873
 URL: https://issues.apache.org/jira/browse/HBASE-21873
 Project: HBase
  Issue Type: Bug
Affects Versions: 3.0.0, 2.2.0
Reporter: Sergey Shelukhin
Assignee: Duo Zhang
 Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
 Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
HBASE-21862-v2.patch, HBASE-21862.patch

It's a classic bug, sort of... the call to open the region times out, but the RS 
actually processes it alright. It could also happen if the response didn't make 
it back due to a network issue.
As a result the region is opened on two servers.
There are some mitigations possible to narrow down the race window.
1) Don't process expired open calls, fail them. Won't help for network issues.
2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
will require fixing other network races where master kills RS, which would 
require adding state versioning to the protocol.

The fundamental fix, though, would require one of:
1) On an unknown failure from open, ascertaining the state of the region from the 
server. Again, this would probably require protocol changes to make sure we 
ascertain the region is not opened, and also that the already-failed-on-master 
open is NOT going to be processed if it's in some queue or even in transit on the 
network (via a nonce-like mechanism).
2) Some form of a distributed lock per region, e.g. in ZK.
3) Some form of 2PC? But the participant list cannot be determined in a manner 
that's both scalable and guaranteed correct; theoretically it could be all RSes.


{noformat}
2019-02-08 03:21:31,715 INFO  [PEWorker-7] procedure.MasterProcedureScheduler: 
Took xlock for pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN
2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
location=server1,17020,1549567999303; forceNewPlan=false, retain=true
2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
regionState=OPENING, regionLocation=server1,17020,1549623714617
2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
ppid=260626, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... to 
server server1,17020,1549623714617 failed
java.io.IOException: Call to server1/...:17020 failed on local exception: 
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
waitTime=60145, rpcTimeout=6^M
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
...
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
waitTime=60145, rpcTimeout=6^M
at 
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
... 4 more^M
{noformat}
RS:
{noformat}
hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
[RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: Open 
...d0214809147e43dc6870005742d5d204.
...
hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
[RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
Opened ...d0214809147e43dc6870005742d5d204.
{noformat}
Retry:
{noformat}
2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=260626, 
ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
hasLock=true; TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
location=server1,17020,1549623714617
2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; 
forceNewPlan=true, retain=false
2019-02-08 03:22:33,238 INFO  [PEWorker-7] assignment.RegionStateStore: 
pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
regionState=OPENING, regionLocation=server2,17020,1549569075319
{noformat}
The ignore-message
{noformat}
2019-02-08 03:25:44,754 WARN  
[RpcServe

[jira] [Updated] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21862:
-
Summary: region can be assigned to 2 servers due to a timed-out call or an 
unknown exception  (was: IPCUtil.wrapException should keep the original 
exception types for all the connection exceptions)

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_

[jira] [Updated] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21862:
-
Attachment: (was: HBASE-21862-v1.patch)

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=nu

[jira] [Updated] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21862:
-
Attachment: (was: HBASE-21862.patch)

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null;

[jira] [Updated] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21862:
-
Attachment: (was: HBASE-21862-v2.patch)

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=nu

[jira] [Updated] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21862:
-
Attachment: (was: HBASE-21862-forUT.patch)

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location

[jira] [Commented] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765333#comment-16765333
 ] 

Josh Elser commented on HBASE-21872:


{quote}Just a plug for forbiddenapis (1) here - It would prevent future usage 
of default charset/locale. Lucene/Solr uses it as well as Calcite 
(CALCITE-1667) since we found some charset related issues. 
{quote}
Thanks, Kevin! That'd be another good one for me to follow up on.
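
For anyone skimming, the change itself is mechanical. Here is a minimal sketch 
of the before/after (the class and method names are made up for illustration; 
they are not from the attached patch):

{code}
import java.nio.charset.StandardCharsets;

public class CharsetCleanupExample {

  // Before: relies on the JVM's default charset, which varies by platform and
  // locale, and is what triggers the compiler warnings mentioned in the issue.
  static byte[] encodeWithDefaultCharset(String value) {
    return value.getBytes();
  }

  // After: the charset is explicit, so the bytes are the same on every JVM.
  static byte[] encodeWithExplicitCharset(String value) {
    return value.getBytes(StandardCharsets.UTF_8);
  }
}
{code}

The explicit-charset form (or the existing Bytes utility) is the usual 
replacement, and a forbiddenapis-style check as suggested above would keep new 
default-charset calls from creeping back in.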

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HBASE-21782.001.patch
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls. There are only a handful left in master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21864) add region state version and reinstate YouAreDead exception in region report

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765323#comment-16765323
 ] 

Sergey Shelukhin edited comment on HBASE-21864 at 2/11/19 7:27 PM:
---

[~stack] it's just the regular heartbeat. 
When an RS reported an incorrect state, the master used to kill it 
(YouAreDeadException), but that was removed because of these races.

I was thinking of storing a version per region (not sure yet if it can be in 
memory only, or if we'd have to store it in meta too). It would be incremented by 
the master on every change. The master would just store the last version the RS 
acked for this region, and discard all messages before that.
One additional possible benefit is for the current crop of races with double 
assignment. If an RS reports something like "I opened this region you never 
expected me to open", it would be easier to see that it's acting on a 
stale message and doesn't know the current state, and kill it conditionally to 
avoid data loss.
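
To make the versioning idea above concrete, a rough sketch of the bookkeeping 
it implies (all class and method names here are hypothetical, not from a 
patch): the master bumps a per-region version on every state change, remembers 
the last version the RS acked, and drops any report carrying an older version.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-region state versioning on the master side.
public class RegionStateVersions {

  // Current version per region, incremented by the master on every state change.
  private final Map<String, Long> currentVersion = new ConcurrentHashMap<>();

  // Last version the hosting RS acknowledged for each region.
  private final Map<String, Long> lastAckedVersion = new ConcurrentHashMap<>();

  // Master side: bump the version whenever it changes the region's state.
  public long onMasterStateChange(String regionName) {
    return currentVersion.merge(regionName, 1L, Long::sum);
  }

  // Called when the RS acknowledges acting on a given version.
  public void onRsAck(String regionName, long ackedVersion) {
    lastAckedVersion.merge(regionName, ackedVersion, Math::max);
  }

  // A report is stale (and can be discarded) if it carries a version older
  // than the last one the RS already acknowledged for this region.
  public boolean isStaleReport(String regionName, long reportedVersion) {
    return reportedVersion < lastAckedVersion.getOrDefault(regionName, 0L);
  }
}
{code}

Whether the versions live only in master memory or also in meta (as mentioned 
above) only changes where these maps are persisted, not the check itself.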


was (Author: sershe):
[~stack] it's just the regular heartbeat. 
When RS reported incorrect state, master used to kill it (YouAreDeadException), 
but that was removed because of these races.

I was thinking storing a version per region (not sure yet if it can be in 
memory only, or if we'd have to store in meta too). It would be incremented by 
master on every change. It would just store the last version RS acked  for this 
region, and discard all messages before that.
One additional possible benefit is for the current crop of races with double 
assignment. If RS reports something like "I opened this region you never 
expected me to open", it would be easier to look and see that it's acting on a 
stale message and kill it conditionally to avoid data loss.

> add region state version and reinstate YouAreDead exception in region report
> 
>
> Key: HBASE-21864
> URL: https://issues.apache.org/jira/browse/HBASE-21864
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> The state version will ensure we don't have network-related races  (e.g. the 
> one I reported in some other bug -
> {code}
> RS: send report {R1} ...
> M: close R1
> RS: I closed R1
> M ... receive report {R1}
> M: you shouldn't have R1, die
> {code}).
> Then we can revert the change that removed the YouAreDead exception... An RS in an 
> incorrect state should be either brought into the correct state or killed, because 
> it means there's some bug; right now if double assignment happens (I found 2 
> different cases just this week ;)) the master lets the RS with the incorrect 
> assignment keep it forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-21872:
---
Status: Patch Available  (was: Open)

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HBASE-21782.001.patch
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls. There are only a handful left in master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-11 Thread Bahram Chehrazy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bahram Chehrazy updated HBASE-21788:

Attachment: WAL-Orphan.log

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
> Attachments: WAL-Orphan.log
>
>
> Not much for this one yet.
> I repeatedly see cases where the region is stuck in OPENING, and after 
> master restart the RIT is recovered and stays WAITING; its OpenRegionProcedure 
> (also recovered) is stuck in Runnable and never does anything for hours. I 
> cannot find logs on the target server indicating that it ever tried to do 
> anything after master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-21872:
---
Attachment: HBASE-21782.001.patch

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HBASE-21782.001.patch
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls. There are only a handful left in master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21862) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reopened HBASE-21862:
--
  Assignee: Sergey Shelukhin  (was: Duo Zhang)

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN

[jira] [Commented] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765338#comment-16765338
 ] 

Sergey Shelukhin commented on HBASE-21862:
--

Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on whitelist, not blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying to a bad RS instead 
of just reassigning to another one. Network timeouts can be very long and 
there's no way for the current model to terminate one. If we were to base it on a 
whitelist per (1), we'd also retry for some exception that we didn't account 
for that would never go away (e.g. some bug in the RS specific to this one region).
(3) If the open works, the retry on the RS will result in the RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so the master will get stuck. This is a fundamental issue: assignment 
is based on actions and not target states; it may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. the same situation 
happens as here, but on retry we get a non-network error from the RS, like a 
BUSY/call queue full - looks like we'd count a message as failed, even though 
it didn't fail).
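
To illustrate what the whitelist from point (1) could look like, here is a 
purely hypothetical sketch (the exception set and names are invented, not from 
any patch): only failures positively known to mean "the RS never received the 
request" allow an immediate reassign, and everything else, including errors we 
have never seen before, is treated as unknown outcome.

{code}
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;

// Hypothetical whitelist-style classification of open-region failures.
public class OpenFailureClassifier {

  public enum Outcome { SAFE_TO_REASSIGN, OUTCOME_UNKNOWN }

  public static Outcome classify(Throwable error) {
    // Whitelist: connection-level failures that happen before the call can be
    // delivered, so the region cannot have been opened on the target RS.
    if (error instanceof ConnectException
        || error instanceof NoRouteToHostException
        || error instanceof UnknownHostException) {
      return Outcome.SAFE_TO_REASSIGN;
    }
    // Default for anything not whitelisted (timeouts, lost responses, errors
    // we never accounted for): assume the RS may have processed the open and
    // confirm the region state before doing anything else.
    return Outcome.OUTCOME_UNKNOWN;
  }
}
{code}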



> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server ser

[jira] [Comment Edited] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765338#comment-16765338
 ] 

Sergey Shelukhin edited comment on HBASE-21862 at 2/11/19 7:50 PM:
---

Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on whitelist, not blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying to a bad RS instead 
of just reassigning to another one. Network timeouts can be very long and 
there's no way for the current model to terminate one. If we were to base it on a 
whitelist per (1), we'd also retry for some exception that we didn't account 
for that would never go away (e.g. some bug in the RS specific to this one region).
(3) If the open works, the retry on the RS will result in the RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so the master will get stuck. This is a fundamental issue: assignment 
is based on actions and not target states; that may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. the same situation 
happens as here, but on retry we get a non-network error from the RS, like a 
BUSY/call queue full - looks like we'd count a message as failed, even though 
it didn't fail).




was (Author: sershe):
Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on whitelist, not blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying to a bad RS instead 
of just reassigning to another one. Network timeouts can be very long and 
there's no way for current model to terminate one. If we were to base on 
whitelist per (1), we'd also retry for some exception that we didn't account 
for that would never go away (e.g. some bug in RS specific to this one region).
(3) If the open works, the retry on RS will result in RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so master will get stuck. This is a fundamental issue that assignment 
is based on actions and not target states; it may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. - the same situation 
happens as here, but on retry we get a non-network error from RS,like a 
BUSY/call queue full - looks like we'd count a message as failed, even though 
it didn't).



> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The

[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765342#comment-16765342
 ] 

stack commented on HBASE-21844:
---

bq. Duo Zhang stack actually, on a side note, I wonder why we have a custom WAL 
implementation for procedures? 

It's an unfortunate situation. We should have one WAL impl only.

bq. Couldn't procedures be stored in a multi-version HBase table for everything 
except meta and that table (like we already have custom recovery for meta?).

You'd have to say more. As I read it, we'd have to start up too much 
hbase to support a table. As is, it is a WAL and an in-memory data structure.


> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after meta server dies, there is a slight chance 
> of master getting into a state where the ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case the waitForMetaOnline never 
> returns.
>  
> We've seen this happening a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765338#comment-16765338
 ] 

Sergey Shelukhin edited comment on HBASE-21862 at 2/11/19 8:01 PM:
---

Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on whitelist, not blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying to a bad RS instead 
of just reassigning to another one. Network timeouts can be very long and 
there's no way for the current model to terminate one. If we were to base it on a 
whitelist per (1), we'd also retry for some exception that we didn't account 
for that would never go away (e.g. some bug in the RS specific to this one region).
(3) If the open works, the retry on the RS will result in the RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so the master will get stuck. This is a fundamental issue: assignment 
is based on actions and not target states; that may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. the same situation 
happens as here, but on retry we get a non-network error from the RS, like a 
BUSY/call queue full - looks like we'd count a message as failed, even though 
it didn't fail).

The fundamental problem is that after some errors we don't know the state of 
the region. Retrying the same message as is doesn't let us learn the state of 
the region... it's just slightly better blind guessing. We have to ensure we 
know the state of the region. That is why in other places we have proc states 
like confirm_closed, and not just send the message and hope for the best...



was (Author: sershe):
Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on whitelist, not blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying to a bad RS instead 
of just reassigning to another one. Network timeouts can be very long and 
there's no way for current model to terminate one. If we were to base on 
whitelist per (1), we'd also retry for some exception that we didn't account 
for that would never go away (e.g. some bug in RS specific to this one region).
(3) If the open works, the retry on RS will result in RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so master will get stuck. This is a fundamental issue that assignment 
is based on actions and not target states; that may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. - the same situation 
happens as here, but on retry we get a non-network error from RS,like a 
BUSY/call queue full - looks like we'd count a message as failed, even though 
it didn't).



> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result regio

[jira] [Commented] (HBASE-21873) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-11 Thread Xu Cang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765347#comment-16765347
 ] 

Xu Cang commented on HBASE-21873:
-

It seems the branch-1 code might have the same issue. Any objection or concern 
about back-porting it? [~Apache9]
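
For readers skimming the digest, a rough illustration of what "keep the original 
exception types" means; this is a generic editorial sketch with made-up names, 
not the committed HBASE-21873 patch or the real IPCUtil.wrapException code:

{noformat}
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public final class ExceptionWrapperSketch {

  // Decorate a connection exception with the remote address, but keep the
  // original class so callers can still tell a connect/timeout failure apart
  // from a generic IOException and make retry decisions accordingly.
  static IOException wrap(String remoteAddress, IOException error) {
    String msg = "Call to " + remoteAddress + " failed: " + error.getMessage();
    if (error instanceof ConnectException) {
      return (IOException) new ConnectException(msg).initCause(error);
    }
    if (error instanceof SocketTimeoutException) {
      return (IOException) new SocketTimeoutException(msg).initCause(error);
    }
    // Unknown types still get wrapped; only the known connection exceptions
    // above keep their concrete class.
    return new IOException(msg, error);
  }
}
{noformat}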

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21873
> URL: https://issues.apache.org/jira/browse/HBASE-21873
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) on an unknown failure from open, ascertaining the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's in some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Start

[jira] [Comment Edited] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765338#comment-16765338
 ] 

Sergey Shelukhin edited comment on HBASE-21862 at 2/11/19 8:01 PM:
---

Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on a whitelist, not a blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying against a bad RS 
instead of just reassigning to another one. Network timeouts can be very long 
and there's no way for the current model to terminate one. If we were to base it 
on a whitelist per (1), we'd also retry for some exception that we didn't 
account for and that would never go away (e.g. some bug in the RS specific to 
this one region).
(3) If the open works, the retry on RS will result in RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so master will get stuck. This is a fundamental issue that assignment 
is based on actions and not target states; that may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. the same situation 
happens as here, but on retry we get a non-network error from the RS, like 
BUSY/call queue full - looks like we'd count the message as failed, even though 
it didn't fail).

The core problem is that after some errors we don't know the state of the 
region. Retrying the same message as is doesn't let us learn the state of the 
region... it's just slightly better blind guessing. We have to ensure we know 
the state of the region. That is why in other places we have proc states like 
confirm_closed, and not just send the message and hope for the best...
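
To make point (1) above concrete, a minimal sketch of what a whitelist-based 
classification could look like (hypothetical names, an editorial illustration 
rather than the patch under discussion): the safe-to-reassign outcomes are 
enumerated explicitly, and everything else defaults to the conservative 
"outcome unknown, confirm first" path.

{noformat}
public final class OpenFailureClassifierSketch {

  enum Action { REASSIGN_ELSEWHERE, CONFIRM_STATE_ON_SAME_RS }

  // Whitelist: only failures where we positively know the RS never executed
  // the open (here, a refused connection is taken to mean the request was
  // never delivered) are safe to hand to another server. Everything not
  // explicitly accounted for defaults to the conservative path: the outcome
  // is unknown, so confirm the region state before assigning it elsewhere.
  static Action classify(Throwable error) {
    if (error instanceof java.net.ConnectException) {
      return Action.REASSIGN_ELSEWHERE;
    }
    return Action.CONFIRM_STATE_ON_SAME_RS;
  }
}
{noformat}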



was (Author: sershe):
Can you please not take over JIRAs like that and change the root cause 
analysis? If you think there's a different bug, please file a different JIRA 
unless there's some discussion... I cloned this for the new fix. I disagree it 
actually properly addresses the issue. 
(1) At the very least, it should be based on a whitelist, not a blacklist - because 
as the comment mentions we don't know how such errors can manifest and what 
exceptions are network exceptions. If there's some other error we didn't 
account for in the network error list, double assignment can happen again.
(2) Even now, with a network error we'd waste time retrying against a bad RS 
instead of just reassigning to another one. Network timeouts can be very long 
and there's no way for the current model to terminate one. If we were to base it 
on a whitelist per (1), we'd also retry for some exception that we didn't 
account for and that would never go away (e.g. some bug in the RS specific to 
this one region).
(3) If the open works, the retry on RS will result in RS not calling 
"regionOpened" again (as per the WARN it logs that the region is already 
opened), so master will get stuck. This is a fundamental issue that assignment 
is based on actions and not target states; that may not be necessary to resolve 
here, but to adopt the infinite-retry fix at least this should be addressed. 
(4) I am not even convinced that infinite retry solves the straightforward case 
as is, before all the other complications. It seems like it should work, but 
I'm not sure there can't be a combination of errors (e.g. the same situation 
happens as here, but on retry we get a non-network error from the RS, like 
BUSY/call queue full - looks like we'd count the message as failed, even though 
it didn't fail).

The fundamental problem is that after some errors we don't know the state of 
the region. Retrying the same message as is doesn't let us learn the state of 
the region... it's just slightly better blind guessing. We have to ensure we 
know the state of the region. That is why in other places we have proc states 
like confirm_closed, and not just send the message and hope for the best...


> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>

[jira] [Updated] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-11 Thread Bahram Chehrazy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bahram Chehrazy updated HBASE-21788:

Attachment: (was: WAL-Orphan.log)

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where the region is stuck in OPENING, and after 
> master restart RIT is recovered, and stays WAITING; its OpenRegionProcedure 
> (also recovered) is stuck in Runnable and never does anything for hours. I 
> cannot find logs on the target server indicating that it ever tried to do 
> anything after master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-11 Thread Bahram Chehrazy (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bahram Chehrazy updated HBASE-21788:

Attachment: WAL-Orphan.log

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
> Attachments: WAL-Orphan.log
>
>
> Not much for this one yet.
> I repeatedly see cases where the region is stuck in OPENING, and after 
> master restart RIT is recovered, and stays WAITING; its OpenRegionProcedure 
> (also recovered) is stuck in Runnable and never does anything for hours. I 
> cannot find logs on the target server indicating that it ever tried to do 
> anything after master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

2019-02-11 Thread Bahram Chehrazy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765354#comment-16765354
 ] 

Bahram Chehrazy commented on HBASE-21844:
-

[~Apache9], I've attached some more logs and details in HBASE-21788. Since 
these two seem to have the same root cause, let's keep the discussion there.

> Master could get stuck in initializing state while waiting for meta
> ---
>
> Key: HBASE-21844
> URL: https://issues.apache.org/jira/browse/HBASE-21844
> Project: HBase
>  Issue Type: Bug
>  Components: master, meta
>Affects Versions: 3.0.0
>Reporter: Bahram Chehrazy
>Assignee: Bahram Chehrazy
>Priority: Major
> Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight chance 
> of the master getting into a state where ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case the waitForMetaOnline never 
> returns.
>  
> We've seen this happening a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta,,1.1588230740 is NOT online; state=
> {1588230740 *state=*OPEN**, ts=1547780128227, 
> server=*,16020,1547776821322}
> ; *ServerCrashProcedures=false*. Master startup cannot progress, in 
> holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta.
>  
>  
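
A minimal sketch of the recovery idea described above (purely illustrative, 
with made-up helper names rather than the attached patch): on startup, if meta 
is recorded as OPEN on a server that is not live and no SCP covers that server, 
schedule one instead of waiting in waitForMetaOnline forever.

{noformat}
public final class MetaStartupCheckSketch {

  interface ClusterView {
    String metaLocation();                        // server currently recorded for meta
    boolean isServerOnline(String serverName);
    boolean hasCrashProcedureFor(String serverName);
    void scheduleCrashProcedure(String serverName);
  }

  static boolean ensureMetaRecoverable(ClusterView cluster) {
    String server = cluster.metaLocation();
    if (server == null || cluster.isServerOnline(server)) {
      return true;   // nothing to fix; the normal wait-for-meta path applies
    }
    if (!cluster.hasCrashProcedureFor(server)) {
      // Dead server, no SCP (e.g. corrupted proc WALs): create one so meta
      // assignment can make progress instead of holding the master in init.
      cluster.scheduleCrashProcedure(server);
    }
    return false;
  }
}
{noformat}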



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

2019-02-11 Thread Bahram Chehrazy (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765348#comment-16765348
 ] 

Bahram Chehrazy commented on HBASE-21788:
-

I have attached partial log files going back to the day before, when the 
master got restarted a few times. The log shows that one of the orphan TRSPs 
was originally initialized at 07:40, got stuck for about 18 min, and then 
failed while trying to update meta at 07:58. Perhaps meta also crashed at the 
same time, because I see a lot of similar errors for other procedures. Shortly 
after, the master crashed and became backup. When it became the active master 
again about an hour later, it couldn't read the procWAL logs because some of 
them were corrupted. Unfortunately, the other master in the gap in between was 
re-imaged, so there is no visibility into that window. But I think it's clear 
now that this problem happens when the procWALs get corrupted during a master 
transition.

> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> 
>
> Key: HBASE-21788
> URL: https://issues.apache.org/jira/browse/HBASE-21788
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Assignee: stack
>Priority: Critical
> Attachments: WAL-Orphan.log
>
>
> Not much for this one yet.
> I repeatedly see cases where the region is stuck in OPENING, and after 
> master restart RIT is recovered, and stays WAITING; its OpenRegionProcedure 
> (also recovered) is stuck in Runnable and never does anything for hours. I 
> cannot find logs on the target server indicating that it ever tried to do 
> anything after master restart.
> This procedure needs at the very least logging of what it's trying to do, and 
> maybe a timeout so it unconditionally fails after a configurable period (1 
> hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I 
> wonder if it's somehow related to the region status check, but this is just a 
> hunch.
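
As a rough illustration of the "configurable timeout" suggestion above 
(hypothetical names, not an actual patch), the procedure could record when the 
open was dispatched and fail itself once no confirmation arrives within the 
configured period:

{noformat}
public final class OpenWithDeadlineSketch {

  static String pollState(long dispatchedAtMillis, long nowMillis,
                          long timeoutMillis, boolean regionReportedOpen) {
    if (regionReportedOpen) {
      return "SUCCEED";                      // RS called back, normal path
    }
    if (nowMillis - dispatchedAtMillis > timeoutMillis) {
      // Log what we were doing and fail so the parent TRSP can re-evaluate,
      // instead of sitting RUNNABLE forever after a master restart.
      return "FAIL_AND_RETRY_ASSIGNMENT";
    }
    return "WAIT";
  }
}
{noformat}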



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765358#comment-16765358
 ] 

Sergey Shelukhin commented on HBASE-21862:
--

Btw, one simpler way to fix it correctly (but still slowly) is to actually 
count the open as successful on a persistent failure, then mark the RS for death.
If the RS ever comes back it will be killed and the region processed again.
If the RS is actually dead, the region will be processed correctly by an SCP eventually.
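
A minimal sketch of that idea (hypothetical interfaces, not the real 
AssignmentManager/ServerManager API): treat the open as having succeeded and 
make sure the RS cannot keep serving the region unnoticed, so both outcomes 
converge safely.

{noformat}
public final class PersistentOpenFailureHandlerSketch {

  interface RegionStates { void markRegionAsOpened(String regionEncodedName, String serverName); }
  interface ServerManager { void expireServerWhenSeen(String serverName); }

  private final RegionStates regionStates;
  private final ServerManager serverManager;

  PersistentOpenFailureHandlerSketch(RegionStates rs, ServerManager sm) {
    this.regionStates = rs;
    this.serverManager = sm;
  }

  void onPersistentOpenFailure(String regionEncodedName, String serverName) {
    // Assume the RS may have opened the region even though the call failed.
    regionStates.markRegionAsOpened(regionEncodedName, serverName);
    // If the RS ever reports in again it gets killed and the region is
    // re-processed; if it is really dead, an SCP will handle the region.
    serverManager.expireServerWhenSeen(serverName);
  }
}
{noformat}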

> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> ---
>
> Key: HBASE-21862
> URL: https://issues.apache.org/jira/browse/HBASE-21862
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) on an unknown failure from open, ascertaining the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's in some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-0

[jira] [Commented] (HBASE-20053) Remove .cmake file extension from .gitignore

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765364#comment-16765364
 ] 

Hudson commented on HBASE-20053:


Results for branch HBASE-20053
[build #9 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/9/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/9//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/9//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20053/9//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Remove .cmake file extension from .gitignore
> 
>
> Key: HBASE-20053
> URL: https://issues.apache.org/jira/browse/HBASE-20053
> Project: HBase
>  Issue Type: Sub-task
>  Components: build, community
>Affects Versions: HBASE-14850
>Reporter: Ted Yu
>Priority: Minor
>  Labels: build
> Attachments: HBASE-20053-HBASE-14850.v001.patch
>
>
> There are .cmake files under hbase-native-client/cmake/ which are under 
> source control.
> The .cmake extension should be taken out of hbase-native-client/.gitignore



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21625) a runnable procedure v2 does not run

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21625.
--
Resolution: Cannot Reproduce

Probably a dup of OpenRegionProcedure issues

> a runnable procedure v2 does not run
> 
>
> Key: HBASE-21625
> URL: https://issues.apache.org/jira/browse/HBASE-21625
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Affects Versions: 3.0.0
>Reporter: Sergey Shelukhin
>Priority: Critical
>
> This is on master snapshot as of a few weeks ago.
> Haven't looked at the code much yet, but it seems rather fundamental. The 
> procedure comes from meta replica assignment (HBASE-21624), in case it 
> matters w.r.t. the engine initialization; however, the master is functional 
> and other procedures run fine. I can also see lots of other open region 
> procedures with a similar patterns that were initialized before this one and 
> have run fine.
> Currently, there are no other runnable procedures on master - a lot of 
> succeeded procedures since then, the parent blocked on this procedure, and 
> one unrelated RIT procedure waiting with timeout and being updated 
> periodically.
> The procedure itself is 
> {noformat}
> 157156157155  RUNNABLEhadoop  
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure   Wed Dec 19 
> 17:20:27 PST 2018Wed Dec 19 17:20:28 PST 2018[ { region => { 
> regionId => '1', tableName => { ... }, startKey => '', endKey => '', offline 
> => 'false', split => 'false', replicaId => '1' }, targetServer => { hostName 
> => 'server1', port => '17020', startCode => '1545266805778' } }, {} ]
> {noformat}
> This is in PST so it's been like that for ~19 hours.
> The only line involving this PID in the log is {noformat}
> 2018-12-19 17:20:27,974 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=157156, ppid=157155, state=RUNNABLE, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> There are no other useful logs for either this PID, parent PID, or region in 
> question since. This PEWorker (4) is also alive and did some work since then, 
> so it's not like the thread errored out somewhere.
> All the PEWorker-s are waiting for work:
> {noformat}
> Thread 158 (PEWorker-16):
>   State: TIMED_WAITING
>   Blocked count: 1340
>   Waited count: 5064
>   Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> 
> org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:171)
> 
> org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:153)
> 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1949)
> {noformat}
> The main assignment procedure for this region is blocked on it:
> {noformat}
> 157155WAITING hadoop  TransitRegionStateProcedure 
> table=hbase:meta, region=534574363, ASSIGN  Wed Dec 19 17:20:27 PST 2018
> Wed Dec 19 17:20:27 PST 2018[ { state => [ '1', '2', '3' ] }, { 
> regionId => '1', tableName => { ... }, startKey => '', endKey => '', offline 
> => 'false', split => 'false', replicaId => '1' }, { initialState => 
> 'REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE', lastState => 
> 'REGION_STATE_TRANSITION_CONFIRM_OPENED', assignCandidate => { hostName => 
> 'server1', port => '17020', startCode => '1545266805778' }, forceNewPlan => 
> 'false' } ]
> 2018-12-19 17:20:27,673 INFO  [PEWorker-9] 
> procedure.MasterProcedureScheduler: Took xlock for pid=157155, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=hbase:meta, region=..., ASSIGN
> 2018-12-19 17:20:27,809 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=157155, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=hbase:meta, region=..., ASSIGN; 
> rit=OFFLINE, location=server1,17020,1545266805778; forceNewPlan=false, 
> retain=false
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HBASE-21873) IPCUtil.wrapException should keep the original exception types for all the connection exceptions

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21873.
--
Resolution: Fixed

> IPCUtil.wrapException should keep the original exception types for all the 
> connection exceptions
> 
>
> Key: HBASE-21873
> URL: https://issues.apache.org/jira/browse/HBASE-21873
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
> Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) on an unknown failure from open, ascertaining the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's in some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
> at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=6^M
> at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
> ... 4 more^M
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> Transit

[jira] [Updated] (HBASE-21743) declarative assignment

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21743:
-
Summary: declarative assignment  (was: stateless assignment)

> declarative assignment
> --
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Running HBase for only a few weeks we found dozen(s?) of bugs with assignment 
> that all seem to have the same nature - split brain between 2 procedures; or 
> between procedure and master startup (meta replica bugs); or procedure and 
> master shutdown (HBASE-21742); or procedure and something else (when SCP had 
> incorrect region list persisted, don't recall the bug#). 
> To me, it starts to look like a pattern where, like in AMv1 where concurrent 
> interactions were unclear and hard to reason about, despite the cleaner 
> individual pieces in AMv2 the problem of unclear concurrent interactions has 
> been preserved and in fact increased because of the operation state 
> persistence and  isolation.
> Procedures are great for multi-step operations that need rollback and stuff 
> like that, e.g. creating a table or snapshot, or even region splitting. 
> However I'm not so sure about assignment. 
> We have the persisted information - region state in meta (incl transition 
> states like opening, or closing), server list as WAL directory list. 
> Procedure state is not any more reliable than those (we can argue that meta 
> update can fail, but so can a procv2 WAL flush, so we have to handle cases of 
> out-of-date information regardless). So, we don't need any extra state to 
> decide on assignment, whether for recovery or for balancing. In fact, as 
> mentioned in some bugs, deleting procv2 WAL is often the best way to recover 
> the cluster, because master can already figure out what to do without 
> additional state.
> I think there should be an option for stateless assignment that does that.
> It can either be as a separate pluggable assignment procedure; or an option 
> that will not recover SCP, RITs etc from WAL but always derive recovery 
> procedures from the existing cluster state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765477#comment-16765477
 ] 

Sakthi commented on HBASE-21861:


Will look at other branches and see if we need similar javadoc config changes.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21743) declarative assignment

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765480#comment-16765480
 ] 

Sergey Shelukhin commented on HBASE-21743:
--

Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, etc., so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on. It will also not have as much racing potential 
with procedures because procedures will operate on the same notifications on 
the same thread, and can override default processing.
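
As an editorial illustration of the reconcile idea above (made-up types, not a 
proposal-level design): the per-region handler derives the next action purely 
from the desired state and the last state reported by the cluster, so it does 
not matter which messages were lost or which procedure died along the way.

{noformat}
public final class RegionReconcilerSketch {

  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED }

  static String nextAction(State desired, State reported) {
    if (desired == reported) {
      return "none";                       // already converged
    }
    if (desired == State.OPEN) {
      // Whatever happened before (lost RPC, master restart, stale procedure),
      // the observable state alone tells us to (re)drive an open.
      return reported == State.OPENING ? "confirm open on current target"
                                       : "send open to a candidate server";
    }
    if (desired == State.CLOSED) {
      return reported == State.CLOSING ? "confirm close on current target"
                                       : "send close to the hosting server";
    }
    return "none";
  }

  public static void main(String[] args) {
    System.out.println(nextAction(State.OPEN, State.OFFLINE)); // send open ...
  }
}
{noformat}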





> declarative assignment
> --
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Running HBase for only a few weeks we found dozen(s?) of bugs with assignment 
> that all seem to have the same nature - split brain between 2 procedures; or 
> between procedure and master startup (meta replica bugs); or procedure and 
> master shutdown (HBASE-21742); or procedure and something else (when SCP had 
> incorrect region list persisted, don't recall the bug#). 
> To me, it starts to look like a pattern where, like in AMv1 where concurrent 
> interactions were unclear and hard to reason about, despite the cleaner 
> individual pieces in AMv2 the problem of unclear concurrent interactions has 
> been preserved and in fact increased because of the operation state 
> persistence and  isolation.
> Procedures are great for multi-step operations that need rollback and stuff 
> like that, e.g. creating a table or snapshot, or even region splitting. 
> However I'm not so sure about assignment. 
> We have the persisted information - region state in meta (incl transition 
> states like opening, or closing), server list as WAL directory list. 
> Procedure state is not any more reliable than those (we can argue that meta 
> update can fail, but so can a procv2 WAL flush, so we have to handle cases of 
> out-of-date information regardless). So, we don't need any extra state to 
> decide on assignment, whether for recovery or for balancing. In fact, as 
> mentioned in some bugs, deleting procv2 WAL is often the best way to recover 
> the cluster, because master can already figure out what to do without 
> additional state.
> I think there should be an option for stateless assignment that does that.
> It can either be as a separate pluggable assignment procedure; or an option 
> that will not recover SCP, RITs etc from WAL but always derive recovery 
> procedures from the existing cluster state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21743) declarative assignment

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765480#comment-16765480
 ] 

Sergey Shelukhin edited comment on HBASE-21743 at 2/11/19 10:52 PM:


Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, incorrect hbck or manual interventions messing things up or even just 
racing with master itself;  so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on. 
In general, if it's trivial to save state declaratively (like in case of 
assignment, where looking at the cluster it's always possible to determine what 
to do), it should be stored as such; if not, then procedures should be used.
It will also not have as much racing potential with procedures because 
procedures will operate on the same notifications on the same thread, and can 
override default processing.






was (Author: sershe):
Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, etc., so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on. It will also not have as much racing potential 
with procedures because procedures will operate on the same notifications on 
the same thread, and can override default processing.





> declarative assignment
> --
>
> Key: HBASE-21743
> URL: https://issues.apache.org/jira/browse/HBASE-21743
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Majo

[jira] [Comment Edited] (HBASE-21743) declarative assignment

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765480#comment-16765480
 ] 

Sergey Shelukhin edited comment on HBASE-21743 at 2/11/19 10:54 PM:


Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, incorrect hbck or manual interventions messing things up or even just 
racing with master itself;  so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on.  It will also not have as much racing potential 
with procedures because procedures will operate on the same notifications on 
the same thread, and can override default processing.

In general, in this approach, if it's trivial to save state declaratively (like 
in case of assignment, where looking at the cluster it's always possible to 
determine what to do), it should be stored as such; if not, then procedures 
should be used. I frankly think splits/merges can also be declarative and that 
procedures are mostly needed for table-wide operations, but I can see how, 
being multi-region operations, split and merge can also benefit from an 
imperative approach.






was (Author: sershe):
Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, incorrect hbck or manual interventions messing things up or even just 
racing with master itself;  so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on. 
In general, if it's trivial to save state declaratively (like in case of 
assignment, where looking at the cluster it's always p

[jira] [Comment Edited] (HBASE-21743) declarative assignment

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765480#comment-16765480
 ] 

Sergey Shelukhin edited comment on HBASE-21743 at 2/11/19 10:55 PM:


Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, incorrect hbck or manual interventions messing things up or even just 
racing with master itself;  so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on.  It will also not have as much racing potential 
with procedures because procedures will operate on the same notifications on 
the same thread, and can override default processing.

In general, in this approach, if it's trivial to save state declaratively (like 
in case of assignment, where looking at the cluster it's always possible to 
determine what to do), it should be stored as such; if not, then procedures 
should be used. I frankly think splits/merges can also be declarative and that 
procedures are mostly needed for table-wide and other high-level operations, 
but I can see how, being multi-region operations, split and merge can also 
benefit from an imperative approach.






was (Author: sershe):
Ok, after reading on this a little bit I think the better term I'm looking for 
is declarative assignment.
The approach to assignment that is much less error prone (IMO) is to always 
operate from "this is the current state" vs "this is the desired state" (which 
HBase already has e.g. in heartbeat, but doesn't use like that), as opposed to 
imperative approach "do this", "I did this", "ok now do that", given the 
distributed nature of the system. It is also more resilient because AM can 
always see the state and doesn't depend on sequence of operations, lost 
messages, incorrect hbck or manual interventions messing things up or even just 
racing with master itself;  so it can resolve the situation in most error cases.

It can work with procedures that require multi-step operations that can still 
be imperative. Assuming only one high-level procedure at a time (e.g. region 
cannot be splitting and also merging), existence of an attached procedure to do 
something for a region is just a piece of state that declarative assignment can 
consider. Alternatively, in per-region processing, master can both move 
procedures forward and process state in a single thread (per region; there can 
be multiple threads each handling one region at a time if desired). The latter 
approach can simplify things because no sync is needed and all the interactions 
are visible. Other components like master startup, or load balancer, can issue 
commands (e.g. to move a region). The procedures can also issue desired-state 
changes (e.g. unassign for split), and also optionally process current state 
changes. If there's no procedure, or procedure refuses to react to state 
changes, the standard handler can compare desired and actual state and drive 
assignment. As long as the state (e.g. OPENING) is set correctly, which is already a 
requirement, this will always get region into correct state eventually 
regardless of what's going on.  It will also not have as much racing potential 
with procedures because procedures will operate on the same

[jira] [Commented] (HBASE-21577) do not close regions when RS is dying due to a broken WAL

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765494#comment-16765494
 ] 

Sergey Shelukhin commented on HBASE-21577:
--

[~busbey] does this patch make sense to you? It's a small patch. We see the RS taking a 
very long time to shut down when HDFS produces a lot of failures.

> do not close regions when RS is dying due to a broken WAL
> -
>
> Key: HBASE-21577
> URL: https://issues.apache.org/jira/browse/HBASE-21577
> Project: HBase
>  Issue Type: Improvement
>  Components: regionserver, wal
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Critical
> Attachments: HBASE-21577.master.001.patch, 
> HBASE-21577.master.002.patch
>
>
> See HBASE-21576. DroppedSnapshot can be an FS failure; also, when WAL is 
> broken, some regions whose flushes are already in flight keep retrying, 
> resulting in minutes-long shutdown times. Since the WAL will be replayed anyway, 
> flushing regions doesn't provide much benefit.
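
A tiny sketch of the behavior change described above (illustrative only, not 
the attached patch): when the RS is aborting because its WAL is broken, the 
close path skips the flush, since edits will be recovered from WAL replay 
anyway.

{noformat}
public final class CloseOnAbortPolicySketch {

  static boolean shouldFlushOnClose(boolean aborting, boolean walBroken) {
    if (aborting && walBroken) {
      // Data is recovered from WAL replay; retrying flushes against a failing
      // filesystem only drags out the shutdown.
      return false;
    }
    return true;
  }
}
{noformat}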



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765515#comment-16765515
 ] 

Hudson commented on HBASE-21854:


Results for branch branch-2
[build #1677 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1677/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1677//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1677//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1677//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Minor
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, 

[jira] [Commented] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765522#comment-16765522
 ] 

Hadoop QA commented on HBASE-21872:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 50 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
27s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
46s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
53s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
41s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
55s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
19s{color} | {color:green} hbase-zookeeper generated 0 new + 29 unchanged - 6 
fixed = 29 total (was 35) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green} hbase-http generated 0 new + 10 unchanged - 7 fixed 
= 10 total (was 17) {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  2m  1s{color} 
| {color:red} hbase-server generated 19 new + 169 unchanged - 19 fixed = 188 
total (was 188) {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 32s{color} 
| {color:red} hbase-mapreduce generated 30 new + 80 unchanged - 78 fixed = 110 
total (was 158) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
27s{color} | {color:green} hbase-backup generated 0 new + 38 unchanged - 21 
fixed = 38 total (was 59) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
30s{color} | {color:green} hbase-it generated 0 new + 50 unchanged - 1 fixed = 
50 total (was 51) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} | {color:green} hbase-examples in the patch passed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
21s{color} | {color:red} hbase-server: The patch generated 52 new + 406 
unchanged - 49 fixed = 458 total (was 455) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
20s{color} | {color:red} hbase-mapreduce: The patch generated 1 new + 133 
unchanged - 3 fixed = 134 total (was 136) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
13s{color} | {color:red} hbase-backup: The patch generated 1 new + 1 unchanged 
- 0 fixed = 2 total (was 1) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
13s{color} | {color:red} hbase-examples: The patch generated 1 new + 3 
unchanged - 0 fixed = 4 total (was 3) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
32s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:

[jira] [Commented] (HBASE-21854) Race condition in TestProcedureSkipPersistence

2019-02-11 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765533#comment-16765533
 ] 

Hudson commented on HBASE-21854:


Results for branch branch-2.2
[build #34 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/34/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/34//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/34//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/34//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Race condition in TestProcedureSkipPersistence 
> ---
>
> Key: HBASE-21854
> URL: https://issues.apache.org/jira/browse/HBASE-21854
> Project: HBase
>  Issue Type: Bug
>  Components: proc-v2
>Affects Versions: 2.1.3
>Reporter: Peter Somogyi
>Assignee: Peter Somogyi
>Priority: Minor
> Fix For: 3.0.0, 2.2.0, 2.3.0, 2.1.4
>
> Attachments: HBASE-21854.patch
>
>
> There is a race condition in TestProcedureSkipPersistence. After the 
> procedure is added, the test stops ProcedureExecutor. In some cases the 
> procedure is not added to the queue in time.
> Failing execution:
> {noformat}
> 2019-02-06 14:18:11,133 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549491521133
> 2019-02-06 14:18:11,135 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549491493135
> 2019-02-06 14:18:11,137 INFO  [Time-limited test] hbase.Waiter(189): Waiting 
> up to [30,000] milli-secs(wait.for.ratio=[1])
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(125): RESTART - Stop
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] 
> procedure2.ProcedureExecutor(702): Stopping
> 2019-02-06 14:18:11,139 INFO  [Time-limited test] wal.WALProcedureStore(331): 
> Stopping the WAL Procedure Store, isAbort=false
> 2019-02-06 14:18:11,140 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure 
> as the 0th rollback step
> 2019-02-06 14:18:11,141 WARN  [PEWorker-1] 
> procedure2.ProcedureExecutor$WorkerThread(2074): Worker terminating 
> UNNATURALLY null
> java.lang.RuntimeException: the store must be running before inserting data
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:710)
>at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:603)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.updateStoreOnExec(ProcedureExecutor.java:1943)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1809)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1481)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:78)
>at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2058)
> 2019-02-06 14:18:11,145 INFO  [Time-limited test] 
> procedure2.ProcedureTestingUtility(137): RESTART - Start{noformat}
> In a successful run the ProcExecutor is stopped AFTER the procedure is 
> actually in the queue.
> Successful:
> {noformat}
> 2019-02-07 15:48:08,731 INFO  [Time-limited test] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=-1, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.procedure2.CompletedProcedureCleaner; timeout=3, 
> timestamp=1549550918731
> 2019-02-07 15:48:08,731 INFO  [PEWorker-1] 
> procedure2.TimeoutExecutorThread(82): ADDED pid=1, state=WAITING_TIMEOUT, 
> locked=true; 
> org.apache.hadoop.hbase.procedure2.TestProcedureSkipPersistence$TestProcedure;
>  timeout=2000, timestamp=1549550890731
> 2019-02-07 15:48:08,732 DEBUG [PEWorker-1] 
> procedure2.RootProcedureState(153): Add procedure pid=1, 
> state=WAITING_TIMEOUT, lo

[jira] [Comment Edited] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765553#comment-16765553
 ] 

Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 12:05 AM:


[~stack] can you elaborate on extra states from deadline? If the message did 
expire (master is no longer waiting), we avoid doing something master doesn't 
expect. If it doesn't expire and we respond with error, it happens before any 
work, so the master will just handle it like a regular error. It's not ideal 
but should be rare and doesn't add new states.

I'd like to add it to the region report; however, it causes some issues:
HBASE-21522 and especially HBASE-21531 (which was resolved as a dup of
HBASE-21421 without fixing the actual race), which is a race that happens a lot.
So it was removed in HBASE-21421.
I filed a separate JIRA to add it back. I think, given that TRSP is one place
that sort of knows what's going on, it's a good place to have it for now :)
Ignoring an RS reporting a region open doesn't seem correct.
I can replace it with a more specific exception.

There's discussion in the other bug about the root cause...
However, for production use it's better to prevent double assignment due to 
unknown bugs, to avoid data loss...


was (Author: sershe):
[~stack] can you elaborate on extra states from deadline? If the message did 
expire (master is no longer waiting), we avoid doing something master doesn't 
expect. If it doesn't expire and we respond with error, it happens before any 
work, so the master will just handle it like a regular error. It's not ideal 
but should be rare and doesn't add new states.

I'd like to add it to region report, however it causes some issues: HBASE-21522 
and especially HBASE-21531 that is a race that happens a lot.
So it was removed in HBASE-21421.
I filed a separate JIRA to add it back. I think given that TRSP is one place 
that sort of knows what's going on, it's a good place to have it for now :) 
Ignoring some RS reporting region open doesn't seem to be correct. 
I can replace with a more specific exception.

There's discussion in the other bug about the root cause...
However, for production use it's better to prevent double assignment due to 
unknown bugs, to avoid data loss...

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765553#comment-16765553
 ] 

Sergey Shelukhin commented on HBASE-21863:
--

[~stack] can you elaborate on extra states from deadline? If the message did 
expire (master is no longer waiting), we avoid doing something master doesn't 
expect. If it doesn't expire and we respond with error, it happens before any 
work, so the master will just handle it like a regular error. It's not ideal 
but should be rare and doesn't add new states.

I'd like to add it to the region report; however, it causes some issues:
HBASE-21522 and especially HBASE-21531, which is a race that happens a lot.
So it was removed in HBASE-21421.
I filed a separate JIRA to add it back. I think, given that TRSP is one place
that sort of knows what's going on, it's a good place to have it for now :)
Ignoring an RS reporting a region open doesn't seem correct.
I can replace it with a more specific exception.

There's discussion in the other bug about the root cause...
However, for production use it's better to prevent double assignment due to 
unknown bugs, to avoid data loss...

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765563#comment-16765563
 ] 

Josh Elser commented on HBASE-21872:


.002 fixes the checkstyle issues (some masked, some newly introduced ;))

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HBASE-21782.001.patch, HBASE-21782.002.patch
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls. There are only a handful left in master.
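For reference, the kind of change this cleanup implies is roughly the
following (a sketch, not lifted from the patch): pass an explicit Charset
instead of relying on the platform default.

{code:java}
import java.nio.charset.StandardCharsets;

public class GetBytesExample {
  public static void main(String[] args) {
    String rowKey = "row-1";

    byte[] platformDependent = rowKey.getBytes();               // relies on the default charset
    byte[] explicit = rowKey.getBytes(StandardCharsets.UTF_8);  // charset is explicit

    System.out.println(platformDependent.length + " vs " + explicit.length);
  }
}
{code}

In HBase code the idiomatic replacement is usually Bytes.toBytes(String), which
encodes as UTF-8.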



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21872) Clean up getBytes() calls without charsets provided

2019-02-11 Thread Josh Elser (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser updated HBASE-21872:
---
Attachment: HBASE-21782.002.patch

> Clean up getBytes() calls without charsets provided
> ---
>
> Key: HBASE-21872
> URL: https://issues.apache.org/jira/browse/HBASE-21872
> Project: HBase
>  Issue Type: Task
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HBASE-21782.001.patch, HBASE-21782.002.patch
>
>
> As we saw over in HBASE-21201, the use of {{String.getBytes()}} without a 
> Charset can result in some compiler warnings. Let's just get rid of these 
> calls. There are only a handful left in master.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765570#comment-16765570
 ] 

stack commented on HBASE-21863:
---

bq. stack can you elaborate on extra states from deadline?

If you are asking about why we do not time out AMv2 commands, it's because the
spec, as [~Apache9] has noted a few times, is that either the command is ack'd,
error'd, or we get an SCP. There is merit in our spec being this basic, at
least in a first version of AMv2. If all calls can now also time out, then
every command needs timeout handling and cancellation messaging; more possible
states, more moving parts.

bq. If the message did expire (master is no longer waiting), we avoid doing 
something master doesn't expect. If it doesn't expire and we respond with 
error, it happens before any work, so the master will just handle it like a 
regular error. It's not ideal but should be rare and doesn't add new states.

... what's the thing the master doesn't expect? Pardon me, I'm having trouble
understanding the above paragraph (bad context-switch). Say more please.

On killing the RS if it reports it has a Region it shouldn't have, yeah, that
makes sense. I like your idea of versioning commands/rpcs so we know when we
can safely ignore reports.

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765568#comment-16765568
 ] 

Sergey Shelukhin commented on HBASE-21863:
--

Added a new explicit exception type, added a comment to the region report
explaining why we don't do it there, fixed checkstyle.
[~stack] let me know if you want me to remove the deadline; I can do that, I
don't like that fix that much either.

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
 ] 

Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 12:51 AM:


[~stack] the deadline-based exception here, if the master sees it thrown by the 
RS, looks to master just like the call failing with some random error. So, it's 
just another case of the call errored, no change to the spec - we will just, 
very infrequently, error out the call by mistake that could have succeeded.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).


was (Author: sershe):
[~stack] the timeout here, if the master sees it, looks to master just like the 
call failing with some random error. So, it's just another case of the call 
errored, no change to the spec - we will just, very infrequently, error out the 
call by mistake that could have succeeded.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
 ] 

Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 12:51 AM:


[~stack] the deadline-based exception here, if the master sees it thrown by the 
RS, looks to master just like the call failing with some random error. So, it's 
just another case of the call errored, no change to the spec - we will just, 
very infrequently, error out the call by mistake that could have succeeded and 
retry it elsewhere.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).


was (Author: sershe):
[~stack] the deadline-based exception here, if the master sees it thrown by the 
RS, looks to master just like the call failing with some random error. So, it's 
just another case of the call errored, no change to the spec - we will just, 
very infrequently, error out the call by mistake that could have succeeded.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
 ] 

Sergey Shelukhin commented on HBASE-21863:
--

[~stack] the timeout here, if the master sees it, looks to master just like the 
call failing with some random error. So, it's just another case of the call 
errored, no change to the spec - we will just, very infrequently, error out the 
call by mistake that could have succeeded.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).
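As an illustration of the mitigation being discussed (hypothetical names, not
the actual patch): the RS-side check compares the current time against the
deadline the master attached to the call and fails fast, before doing any work,
if that deadline has already passed.

{code:java}
// Hypothetical sketch of failing a region request whose master deadline passed.
final class DeadlineCheckSketch {
  static class CallDeadlineExpiredException extends Exception {
    CallDeadlineExpiredException(String msg) { super(msg); }
  }

  void executeRegionRequest(String encodedRegionName, long masterDeadlineMillis)
      throws CallDeadlineExpiredException {
    long now = System.currentTimeMillis();
    if (now > masterDeadlineMillis) {
      // Thrown before any work, so to the master this is just another errored call.
      throw new CallDeadlineExpiredException("Request for " + encodedRegionName
          + " arrived " + (now - masterDeadlineMillis) + "ms past the master deadline");
    }
    // ... proceed with the open/close work here ...
  }
}
{code}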

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
 ] 

Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 1:17 AM:
---

[~stack] the deadline-based exception here, if the master sees it thrown by the 
RS, looks to master just like the call failing with some random error. So, it's 
just another case of the call errored, no change to the spec - we will just, 
very infrequently, error out the call by mistake that could have succeeded and 
retry it elsewhere.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call (or rather, 
not execute it and throw this exception) but on master the call will have 
already timed out - due to a network issue, call queue issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).


was (Author: sershe):
[~stack] the deadline-based exception here, if the master sees it thrown by the 
RS, looks to master just like the call failing with some random error. So, it's 
just another case of the call errored, no change to the spec - we will just, 
very infrequently, error out the call by mistake that could have succeeded and 
retry it elsewhere.

The issue this is trying to mitigate is that spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to figure 
out if it's happening (i.e. that we are running the call past master's RPC 
timeout). So, in most cases if we hit this, we will fail the call but on master 
the call will have already timed out - due to a network issue, call queue 
issue, or smth else. 
The fact that calls can timeout, and the region state after that is unknown, is 
a spec issue for which the fix will be more involved (I mentioned a couple of 
options in that JIRA, but the cleanest way is basically to perform a variation 
of CONFIRM_CLOSED for that region, to ensure the RS will not open it; that will 
also interact in an already-existing way with SCP if the server dies).

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21863:
-
Attachment: HBASE-21863.01.patch

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765584#comment-16765584
 ] 

Sakthi commented on HBASE-21861:


Currently working on branch-2.0. I stumbled upon this: 
https://hbase.apache.org/2.0/book.html is missing the 
https://github.com/apache/hbase/blob/branch-2.0/src/main/asciidoc/_chapters/developer.adoc#becoming-a-committer
 section. So the adoc has that section, but the book.html doesn't. Weird.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765589#comment-16765589
 ] 

Sakthi commented on HBASE-21861:


In branch-2.0, it looks like the user doc was being generated for
"InterfaceAudience.Private" annotations as well. It should have been only for
public annotations. It looks like that was commented out due to HBASE-19663. I
tried uncommenting it again and built the website. It was a successful build.

It also eliminates the case where a private-annotated public class's javadoc
refers to a package-private class's members. As the user doc is set to show
public & protected classes only, this results in a "missing file" issue.

After this change, these private-annotated javadocs wouldn't be visible from
the user doc but would still be accessible from the dev doc.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765589#comment-16765589
 ] 

Sakthi edited comment on HBASE-21861 at 2/12/19 1:46 AM:
-

In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. As user doc is set to 
show public & protected classes only, this results in a "missing file" issue. 

After this change, now these private-annotated javadoc wouldn't be visible from 
user doc but would still be accessible from dev doc.


was (Author: jatsakthi):
In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a package-private class members. As user doc is set to show public & 
protected classes only, this results in a "missing file" issue. 

After this change, now these private-annotated javadoc wouldn't be visible from 
user doc but would still be accessible from dev doc.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sakthi updated HBASE-21861:
---
Attachment: hbase-21861.branch-2.0.001.patch

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.branch-2.0.001.patch, hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765589#comment-16765589
 ] 

Sakthi edited comment on HBASE-21861 at 2/12/19 1:47 AM:
-

In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. But, as user doc is set 
to show public & protected classes only and not package-private classes, this 
results in a "missing file" issue. 

After this change, now these private-annotated javadoc wouldn't be visible from 
user doc but would still be accessible from dev doc.


was (Author: jatsakthi):
In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. As user doc is set to 
show public & protected classes only, this results in a "missing file" issue. 

After this change, now these private-annotated javadoc wouldn't be visible from 
user doc but would still be accessible from dev doc.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.branch-2.0.001.patch, hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21863) narrow down the double-assignment race window

2019-02-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765592#comment-16765592
 ] 

Duo Zhang commented on HBASE-21863:
---

For other procedures, we will include the procedure id in the request and
response messages so it could be the 'nonce', but for assign/unassign we have
to support the old protocol, so this hasn't been done yet.

On checking for inconsistency, I think the only safe way is to kill the region
server. A possible way is to detect the inconsistency in regionServerReport and
schedule a background task which checks which region server actually hosts the
region. If the inconsistency stays there for a while (1 minute maybe? Can be
configured, I think), then we kill the bad regionserver.
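A rough sketch of what that background check could look like (hypothetical
names, and the 1-minute grace period is the configurable assumption mentioned
above):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track how long a region has been reported by the wrong
// server; only recommend killing the server once the inconsistency persists.
final class InconsistencyWatcher {
  private final long graceMillis;                       // e.g. 60_000, configurable
  private final Map<String, Long> firstSeen = new ConcurrentHashMap<>();

  InconsistencyWatcher(long graceMillis) { this.graceMillis = graceMillis; }

  /** Called from the regionServerReport handler for each (region, server) pair. */
  boolean shouldKill(String region, String reportingServer, String expectedServer) {
    if (reportingServer.equals(expectedServer)) {
      firstSeen.remove(region);                         // resolved; forget it
      return false;
    }
    long start = firstSeen.computeIfAbsent(region, r -> System.currentTimeMillis());
    return System.currentTimeMillis() - start >= graceMillis;
  }
}
{code}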

> narrow down the double-assignment race window
> -
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765589#comment-16765589
 ] 

Sakthi edited comment on HBASE-21861 at 2/12/19 1:50 AM:
-

In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. But, as user doc is set 
to show public & protected classes only and not package-private classes, this 
results in a "missing file" issue. 

After this change, now these private-annotated classes'  wouldn't be visible 
from user doc but would still be accessible from dev doc.


was (Author: jatsakthi):
In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. But, as user doc is set 
to show public & protected classes only and not package-private classes, this 
results in a "missing file" issue. 

After this change, now these private-annotated javadoc wouldn't be visible from 
user doc but would still be accessible from dev doc.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.branch-2.0.001.patch, hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21861) Handle the missing file issues from the Linkchecker job

2019-02-11 Thread Sakthi (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765589#comment-16765589
 ] 

Sakthi edited comment on HBASE-21861 at 2/12/19 1:53 AM:
-

In branch-2.0, it looks like the user doc was being generated for
"InterfaceAudience.Private" annotations as well. It should have been only for
public annotations. It looks like that was commented out due to HBASE-19663. I
tried uncommenting it again and built the website. It was a successful build.

It also eliminates the case where a private-annotated public class's javadoc
refers to an inherited package-private class's members. But, as the user doc is
set to show public & protected classes only and not package-private classes,
this results in a "missing file" issue.

After this change, these private-annotated classes' javadoc wouldn't be visible
from the user doc but would still be accessible from the dev doc.

I would suggest you try building branch-2.0 with this patch, [~psomogyi], and
let's see if you face any issues building the site.


was (Author: jatsakthi):
In branch-2.0, looks like user doc was being generated for 
"InterfaceAudience.Private" annotations as well. It should have been only for 
public annotations. Looks like it was commented out due to HBASE-19663. I tried 
uncommenting it again and built the website. It was a successful build.

Also it eliminates the case where private-annotated public class's javadoc 
refers to a inherited package-private class's members. But, as user doc is set 
to show public & protected classes only and not package-private classes, this 
results in a "missing file" issue. 

After this change, now these private-annotated classes'  wouldn't be visible 
from user doc but would still be accessible from dev doc.

> Handle the missing file issues from the Linkchecker job
> ---
>
> Key: HBASE-21861
> URL: https://issues.apache.org/jira/browse/HBASE-21861
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Sakthi
>Assignee: Sakthi
>Priority: Major
> Fix For: 1.2.11
>
> Attachments: hbase-21861.branch-1.2.001.patch, 
> hbase-21861.branch-2.0.001.patch, hbase-21861.master.001.patch
>
>
> The parent jira contains the numbers for the missing files. This jira is to 
> track specifically the fixes in that aspect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21785) master reports open regions as RITs and also messes up rit age metric

2019-02-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21785:
-
Attachment: HBASE-21785.01.patch

> master reports open regions as RITs and also messes up rit age metric
> -
>
> Key: HBASE-21785
> URL: https://issues.apache.org/jira/browse/HBASE-21785
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: HBASE-21785.01.patch, HBASE-21785.patch
>
>
> {noformat}
> RegionState   RIT time (ms)   Retries
> dba183f0dadfcc9dc8ae0a6dd59c84e6  dba183f0dadfcc9dc8ae0a6dd59c84e6. 
> state=OPEN, ts=Wed Dec 31 16:00:00 PST 1969 (1548453918s ago), 
> server=server,17020,1548452922054  1548453918735   0
> {noformat}
> RIT age metric also gets set to a bogus value.
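The "Wed Dec 31 16:00:00 PST 1969" above suggests the timestamp is simply
unset (epoch 0); assuming that is the cause, the age computation would look
like this, which explains the huge value:

{code:java}
public class RitAgeExample {
  public static void main(String[] args) {
    long unsetTs = 0L;                                  // epoch, i.e. Dec 31 1969 PST
    long ageMs = System.currentTimeMillis() - unsetTs;  // ~1.5e12 ms: a bogus RIT age
    System.out.println("Reported RIT age (ms): " + ageMs);
  }
}
{code}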



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

