[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Component/s: consensus

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Recently we found that a few tablets in one of our clusters are unhealthy; the ksck 
> output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 5            | -1           | Yes
>  B             | A   B   C        | 5            | -1           | Yes
>  C             | A   B   C*  D~   | 5            | 54649        | No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B*  C        |              |              | Yes
>  A             | A   B*  C        | 5            | 5            | Yes
>  B             | A   B*  C   D~   | 5            | 96176        | No
>  C             | A   B*  C        | 5            | 5            | Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 1            | -1           | Yes
>  B             | A   B   C*       | 1            | -1           | Yes
>  C             | A   B   C*  D~   | 1            | 2            | No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A*  B   C        |              |              | Yes
>  A             | A*  B   C   D~   | 1            | 1991         | No
>  B             | A*  B   C        | 1            | 4            | Yes
>  C             | A*  B   C        | 1            | 4            | Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEADER]: attempt to pr

[jira] [Created] (KUDU-3084) Multiple time sources with fallback behavior between them

2020-03-19 Thread Alexey Serbin (Jira)
Alexey Serbin created KUDU-3084:
---

 Summary: Multiple time sources with fallback behavior between them
 Key: KUDU-3084
 URL: https://issues.apache.org/jira/browse/KUDU-3084
 Project: Kudu
  Issue Type: Improvement
  Components: master, tserver
Reporter: Alexey Serbin


[~tlipcon] suggested an alternative approach to configure and select 
HybridClock's time source.

Kudu servers could maintain multiple time sources and switch between them with 
a fallback behavior.  The default or preferred time source might be any of the 
existing ones (e.g., the built-in client), but when it's not available, another 
available time source is selected (e.g., {{system}} -- the NTP-synchronized 
local clock).  Switching between time sources can be done:
* only upon startup/initialization
* upon startup/initialization and later during normal run time

The advantages are:
* easier deployment and configuration of Kudu clusters
* a simplified upgrade path from older releases using the {{system}} time source to 
newer releases using the {{builtin}} time source by default

There are downsides, though.  Since the new way of maintaining the time source is 
more dynamic, it can:
* mask various configuration or network issues
* result in different time sources being used within the same Kudu cluster due to 
transient issues
* introduce extra startup delay
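The fallback behavior described above can be sketched as follows. This is a hypothetical Python illustration only: the probe functions and the "builtin"/"system" names stand in for whatever interfaces Kudu's HybridClock would actually use.

```python
# Hypothetical sketch of time-source selection with fallback; not Kudu's
# actual C++ implementation.
class TimeSourceUnavailable(Exception):
    """Raised by a probe when its time source cannot be used."""

def select_time_source(sources):
    """Walk the sources in preference order and return the first usable one."""
    errors = []
    for name, probe in sources:
        try:
            probe()  # e.g. reach the NTP servers, or check local clock sync
            return name
        except TimeSourceUnavailable as e:
            errors.append((name, e))  # remember why we fell back
    raise RuntimeError("no usable time source: %r" % errors)

# Example: the preferred builtin NTP client is unreachable, so the server
# falls back to the NTP-synchronized local clock.
def builtin_probe():
    raise TimeSourceUnavailable("cannot reach any NTP server")

def system_probe():
    pass  # local clock reports as synchronized

chosen = select_time_source([("builtin", builtin_probe),
                             ("system", system_probe)])
print(chosen)  # prints: system
```

The "only upon startup" variant would run this selection once during initialization; the dynamic variant would re-run it when the current source fails at run time, which is exactly where the masking-of-issues downside comes from.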



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063083#comment-17063083
 ] 

Todd Lipcon commented on KUDU-2432:
---

I pushed a fix for this: 
https://github.com/cloudera/dist_test/pull/new/kudu-2432

Testing in prod :)

> isolate race creating directory via dist_test.py
> 
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: Mike Percy
>Priority: Major
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the 
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding 
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
>   File "/swarming.client/isolateserver.py", line 2211, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "/swarming.client/isolateserver.py", line 2204, in main
>     return dispatcher.execute(OptionParserIsolateServer(), args)
>   File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in execute
>     return command(parser, args[1:])
>   File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
>     require_command=False)
>   File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
>     create_directories(outdir, bundle.files)
>   File "/swarming.client/isolateserver.py", line 212, in create_directories
>     os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}





[jira] [Commented] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063080#comment-17063080
 ] 

Todd Lipcon commented on KUDU-2432:
---

I looked into this a bit tonight since it's happening a lot lately. I sshed 
into one of the slaves that had had a failure and ran 'docker logs' on the 
dist-test slave container to get the full logs, and then grabbed the portion 
corresponding to a failed job. It looks like the issue is that a first attempt 
to download the files for the task failed with a "connection reset by peer" 
error. The retries seem to fail because the directory already exists from the 
first attempt. In other words, it's not a race, just broken retry logic. Will 
look at the code next.
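The traceback in the issue dies in create_directories() because plain os.mkdir() raises EEXIST when the directory survives a failed first attempt. A retry-safe variant would tolerate the leftovers; this is a hypothetical sketch, not the actual swarming client code (which predates this style):

```python
import os
import tempfile

def create_directories_idempotent(base_directory, dirs):
    """Create the task's directories, tolerating ones left behind by a
    failed earlier download attempt (the broken-retry failure mode above)."""
    for d in sorted(dirs):
        # exist_ok=True makes the call idempotent: a retry after a partial
        # first attempt no longer dies with OSError(EEXIST).
        os.makedirs(os.path.join(base_directory, d), exist_ok=True)

# Simulate a retry: the second call succeeds even though 'build' exists.
base = tempfile.mkdtemp(prefix="dist-test-task_")
create_directories_idempotent(base, ["build"])
create_directories_idempotent(base, ["build"])  # no EEXIST on retry
```

Equivalently, the retry path could remove the partially populated output directory before re-downloading, which avoids mixing files from two attempts.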






[jira] [Updated] (KUDU-2432) isolate race creating directory via dist_test.py

2020-03-19 Thread Todd Lipcon (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2432:
--
Attachment: logs.txt






[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063045#comment-17063045
 ] 

YifanZhang commented on KUDU-3082:
--

Sorry, I forgot to mention: the cluster version is 1.10.1.


[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Affects Version/s: 1.10.1


[jira] [Resolved] (KUDU-2928) built-in NTP client: tests to evaluate the behavior of the client

2020-03-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin resolved KUDU-2928.
-
Fix Version/s: 1.12.0
   Resolution: Fixed

Implemented with {{4aa0c7c0bc7d91af8be9a837b64f2a53fe31dd44}}

> built-in NTP client: tests to evaluate the behavior of the client
> -
>
> Key: KUDU-2928
> URL: https://issues.apache.org/jira/browse/KUDU-2928
> Project: Kudu
>  Issue Type: Sub-task
>  Components: clock, test
>Affects Versions: 1.11.0
>Reporter: Alexey Serbin
>Assignee: Alexey Serbin
>Priority: Major
>  Labels: clock
> Fix For: 1.12.0
>
>
> It's necessary to implement tests covering the behavior of the built-in NTP 
> client in various corner cases:
> * A set of NTP servers which doesn't agree on time
> * non-synchronized NTP server
> * NTP server that loses track of its reference and becomes a false ticker
> * etc.
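For context, the "servers that disagree" and "false ticker" cases come down to interval-intersection clock selection. A toy Python sketch of the behavior such tests exercise (Kudu's built-in client is C++; this is purely illustrative): each server reports an offset and error bound, and selection keeps the largest subset whose correctness intervals overlap, discarding the false ticker.

```python
def select_truechimers(samples):
    """Marzullo-style selection sketch: given (offset, max_error) pairs
    reported by NTP servers, return the largest subset whose correctness
    intervals [offset - err, offset + err] share a common point."""
    edges = []
    for i, (off, err) in enumerate(samples):
        edges.append((off - err, 0, i))  # interval opens (0 sorts before 1 on ties)
        edges.append((off + err, 1, i))  # interval closes
    edges.sort()
    best_count, count, active, best_set = 0, 0, set(), set()
    for _, kind, i in edges:
        if kind == 0:
            active.add(i)
            count += 1
            if count > best_count:
                best_count, best_set = count, set(active)
        else:
            active.discard(i)
            count -= 1
    return [samples[i] for i in sorted(best_set)]

# Two servers agree near zero offset; the third is a false ticker at +5s.
print(select_truechimers([(0.0, 0.1), (0.05, 0.1), (5.0, 0.1)]))
# prints: [(0.0, 0.1), (0.05, 0.1)]
```

A non-synchronized server would simply report no usable sample and drop out before selection even runs.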





[jira] [Updated] (KUDU-3083) Kudu installation fails on ubuntu 19.10 & gcc

2020-03-19 Thread Syed Taha Nemat (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Syed Taha Nemat updated KUDU-3083:
--
Description: 
Followed steps mentioned in 
[https://kudu.apache.org/docs/installation.html#ubuntu_from_source]. 
Build-if-neccesary.sh script fails on linux ubuntu 19.10 with a gcc compiler.

Possible work arounds:
 * have users install clang in environment and run cmake commands using clang
 * make all C code compliant with gcc.

  was:
Followed steps mentioned in 
[https://kudu.apache.org/docs/installation.html#ubuntu_from_source]. 
Build-if-neccesary.sh script fails on linux ubuntu 19.10 with a gcc compiler.

Possible work arounds:
 * have users install clang in environment
 * make all C code compliant with gcc.


> Kudu installation fails on ubuntu 19.10 & gcc
> -
>
> Key: KUDU-3083
> URL: https://issues.apache.org/jira/browse/KUDU-3083
> Project: Kudu
>  Issue Type: Bug
>  Components: build
> Environment: ubuntu 19.10, gcc
>Reporter: Syed Taha Nemat
>Priority: Major
>  Labels: installguide
> Attachments: Screenshot from 2020-03-20 04-36-35.png
>
>
> Followed the steps in 
> [https://kudu.apache.org/docs/installation.html#ubuntu_from_source]. The 
> build-if-necessary.sh script fails on Ubuntu 19.10 with the gcc compiler.
> Possible workarounds:
>  * have users install clang in the environment and run the cmake commands using clang
>  * make all C code compliant with gcc.





[jira] [Created] (KUDU-3083) Kudu installation fails on ubuntu 19.10 & gcc

2020-03-19 Thread Syed Taha Nemat (Jira)
Syed Taha Nemat created KUDU-3083:
-

 Summary: Kudu installation fails on ubuntu 19.10 & gcc
 Key: KUDU-3083
 URL: https://issues.apache.org/jira/browse/KUDU-3083
 Project: Kudu
  Issue Type: Bug
  Components: build
 Environment: ubuntu 19.10, gcc
Reporter: Syed Taha Nemat
 Attachments: Screenshot from 2020-03-20 04-36-35.png

Followed the steps in 
[https://kudu.apache.org/docs/installation.html#ubuntu_from_source]. The 
build-if-necessary.sh script fails on Ubuntu 19.10 with the gcc compiler.

Possible workarounds:
 * have users install clang in the environment
 * make all C code compliant with gcc.





[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread Alexey Serbin (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062763#comment-17062763
 ] 

Alexey Serbin commented on KUDU-3082:
-

[~zhangyifan27], what Kudu version is that?


[jira] [Updated] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-19 Thread Alexey Serbin (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3067:

Fix Version/s: 1.12.0
   Resolution: Fixed
   Status: Resolved  (was: In Review)

> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
> Fix For: 1.12.0
>
>
> The cloud detector checks which cloud provider the instance is running on (see 
> [here|#L59-L93]).  For AWS it requests the URL 
> http://169.254.169.254/latest/meta-data/instance-id to check the instance 
> metadata and decide whether the instance is an AWS one. That works for AWS, 
> but OpenStack-based clouds expose the same EC2-compatible metadata, so this 
> URL is reachable there as well and cannot distinguish AWS from 
> OpenStack-based clouds. This caused a failure in the 
> "HybridClockTest.TimeSourceAutoSelection" test case: the test uses the URL 
> above to detect which cloud the instance is running on and then tries to 
> reach that cloud's dedicated NTP service, which for AWS is "169.254.169.123". 
> An OpenStack-based cloud has no such dedicated NTP service, so the test fails 
> when run on an OpenStack instance: the cloud detector assumes it is an AWS 
> instance and tries to access "169.254.169.123".
>  





[jira] [Commented] (KUDU-3067) Inexplict cloud detection for AWS and OpenStack based cloud by querying metadata

2020-03-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062706#comment-17062706
 ] 

ASF subversion and git services commented on KUDU-3067:
---

Commit a80d7472110ae2349a82c2150aad61079969b337 in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=a80d747 ]

[util] KUDU-3067 add OpenStack metadata detector

This patch adds OpenStack metadata detector that works with OpenStack
Nova metadata server (see [1] for details).  In addition, this patch
fixes the existing AWS detector to tell apart a true EC2 instance
from a masquerading OpenStack one [2].

I couldn't get access to an OpenStack instance, but I asked the reporter
of KUDU-3067 to test how it works and report back.

1. https://docs.openstack.org/nova/latest/user/metadata.html#metadata-service
2. https://docs.openstack.org/nova/latest/user/metadata.html#metadata-ec2-format

Change-Id: I84cc6d155ab1fbd7b401f5349d292f46fcac3a34
Reviewed-on: http://gerrit.cloudera.org:8080/15488
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo 
Reviewed-by: liusheng 
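The detection order the commit message describes can be sketched like this. This is a hedged Python model, not the actual C++ patch; `detect_cloud` and the injected probe callable are invented names:

```python
# Sketch of the detection order from the patch: probe the OpenStack-specific
# Nova endpoint first, because an OpenStack instance masquerading as EC2 also
# answers the EC2-format URL, while a true EC2 instance does not serve the
# /openstack/ path.
OPENSTACK_URL = "http://169.254.169.254/openstack/latest/meta_data.json"
EC2_URL = "http://169.254.169.254/latest/meta-data/instance-id"

def detect_cloud(is_reachable):
    """is_reachable: callable(url) -> bool, injected so no real probing happens here."""
    if is_reachable(OPENSTACK_URL):
        return "openstack"
    if is_reachable(EC2_URL):
        return "aws"
    return "unknown"
```

Injecting the probe keeps the ordering logic testable without touching the link-local metadata address.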


> Inexplict cloud detection for AWS and OpenStack based cloud by querying 
> metadata
> 
>
> Key: KUDU-3067
> URL: https://issues.apache.org/jira/browse/KUDU-3067
> Project: Kudu
>  Issue Type: Bug
>Reporter: liusheng
>Assignee: Alexey Serbin
>Priority: Major
>
> The cloud detector checks which cloud provider the instance runs on (see 
> [here|#L59-L93]). For AWS it fetches the URL 
> [http://169.254.169.254/latest/meta-data/instance-id] and, if that metadata 
> is served, concludes the instance is an EC2 instance. That works for AWS, 
> but OpenStack-based clouds serve the same EC2-format metadata, so the URL 
> is reachable there as well and cannot distinguish AWS from OpenStack-based 
> clouds. This caused a failure in the 
> "HybridClockTest.TimeSourceAutoSelection" test case: the test uses the URL 
> above to detect the cloud the instance is currently running on and then 
> tries to reach the cloud's dedicated NTP service. On AWS that service is 
> "169.254.169.123", but OpenStack-based clouds have no such dedicated NTP 
> service, so the test fails when run on an OpenStack-based instance: the 
> cloud detector assumes it is an AWS instance and tries to reach 
> "169.254.169.123".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)



[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Description: 
Lately we found that a few tablets in one of our clusters are unhealthy; the 
ksck output looks like:

 
{code:java}
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 5| -1   | Yes
 B | A   B   C| 5| -1   | Yes
 C | A   B   C*  D~   | 5| 54649| No
Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' 
active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B*  C|  |  | Yes
 A | A   B*  C| 5| 5| Yes
 B | A   B*  C   D~   | 5| 96176| No
 C | A   B*  C| 5| 5| Yes
Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' 
active configs disagree with the leader master's
  a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae84954d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 1| -1   | Yes
 B | A   B   C*   | 1| -1   | Yes
 C | A   B   C*  D~   | 1| 2| No
Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A*  B   C|  |  | Yes
 A | A*  B   C   D~   | 1| 1991 | No
 B | A*  B   C| 1| 4| Yes
 C | A*  B   C| 1| 4| Yes{code}
These tablets couldn't recover for a couple of days until we restarted kudu-ts27.

I found many duplicated log messages on kudu-ts27 like:
{code:java}
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 
6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 
LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.

{code}
There seem to be some Raft config change operations that somehow cannot 
complete.
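The repeated log line suggests the gating modeled below. This is a hypothetical Python sketch of the behavior (Kudu's real logic lives in raft_consensus.cc); the class and method names are invented:

```python
# Model of the gating seen in the logs: the leader allows only one pending
# config change at a time, and promotion attempts made while one is pending
# are dropped ("Doing nothing"). If the pending change never commits, every
# retry is dropped and the tablet stays in CONSENSUS_MISMATCH indefinitely.
class LeaderState:
    def __init__(self):
        self.pending_config_change = None  # at most one change in flight

    def try_promote(self, peer_uuid):
        if self.pending_config_change is not None:
            # Mirrors the log: "there is already a config change operation
            # in progress ... Doing nothing."
            return "config change in progress; doing nothing"
        self.pending_config_change = ("PROMOTE", peer_uuid)
        return "promotion started"
```

In this model, a pending change that never commits makes every later promotion attempt a no-op, matching the days-long stall the report describes.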

[jira] [Created] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)
YifanZhang created KUDU-3082:


 Summary: tablets in "CONSENSUS_MISMATCH" state for a long time
 Key: KUDU-3082
 URL: https://issues.apache.org/jira/browse/KUDU-3082
 Project: Kudu
  Issue Type: Bug
Reporter: YifanZhang


Lately we found that a few tablets in one of our clusters are unhealthy; the 
ksck output looks like:

 
{code:java}
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 5| -1   | Yes
 B | A   B   C| 5| -1   | Yes
 C | A   B   C*  D~   | 5| 54649| No
Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' 
active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B*  C|  |  | Yes
 A | A   B*  C| 5| 5| Yes
 B | A   B*  C   D~   | 5| 96176| No
 C | A   B*  C| 5| 5| Yes
Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' 
active configs disagree with the leader master's
  a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae84954d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 1| -1   | Yes
 B | A   B   C*   | 1| -1   | Yes
 C | A   B   C*  D~   | 1| 2| No
Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A*  B   C|  |  | Yes
 A | A*  B   C   D~   | 1| 1991 | No
 B | A*  B   C| 1| 4| Yes
 C | A*  B   C| 1| 4| Yes{code}
These tablets couldn't recover for a couple of days until we restarted kudu-ts27.

I found many duplicated log messages on kudu-ts27 like:
{code:java}
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 
6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 
LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.
{code}