[jira] [Updated] (KUDU-2379) Spark generates a broken authentication credentials PB
[ https://issues.apache.org/jira/browse/KUDU-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-2379:
------------------------------
    Code Review: https://gerrit.cloudera.org/#/c/9814/

> Spark generates a broken authentication credentials PB
> ------------------------------------------------------
>
>                 Key: KUDU-2379
>                 URL: https://issues.apache.org/jira/browse/KUDU-2379
>             Project: Kudu
>          Issue Type: Bug
>          Components: java, spark
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>
> KUDU-2259 introduced a regression which causes Spark to not work properly on
> secure clusters. The issue is the following:
> - the driver calls exportAuthenticationCredentials()
> -- the client hasn't yet talked to the master, so it doesn't have any
> credentials yet, despite having a keytab available
> -- the code is as follows:
> {code}
> byte[] authnData = securityContext.exportAuthenticationCredentials();
> if (authnData != null) {
>   return Deferred.fromResult(authnData);
> }
> {code}
> -- previously, authnData would be null in this case, and it would fall
> through to connect to the cluster and then export a proper token.
> -- with the new implementation, an authnData is returned which is devoid of
> real credentials but contains a realUser. So, it's non-null, and it gets
> returned immediately
> - the tasks then get credentials with no tokens and can't connect

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (KUDU-2379) Spark generates a broken authentication credentials PB
[ https://issues.apache.org/jira/browse/KUDU-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-2379:
------------------------------
    Status: In Review  (was: Open)
[jira] [Created] (KUDU-2379) Spark generates a broken authentication credentials PB
Todd Lipcon created KUDU-2379:
---------------------------------

             Summary: Spark generates a broken authentication credentials PB
                 Key: KUDU-2379
                 URL: https://issues.apache.org/jira/browse/KUDU-2379
             Project: Kudu
          Issue Type: Bug
          Components: java, spark
    Affects Versions: 1.7.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


KUDU-2259 introduced a regression which causes Spark to not work properly on secure clusters. The issue is the following:
- the driver calls exportAuthenticationCredentials()
-- the client hasn't yet talked to the master, so it doesn't have any credentials yet, despite having a keytab available
-- the code is as follows:
{code}
byte[] authnData = securityContext.exportAuthenticationCredentials();
if (authnData != null) {
  return Deferred.fromResult(authnData);
}
{code}
-- previously, authnData would be null in this case, and it would fall through to connect to the cluster and then export a proper token.
-- with the new implementation, an authnData is returned which is devoid of real credentials but contains a realUser. So, it's non-null, and it gets returned immediately
- the tasks then get credentials with no tokens and can't connect
[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414874#comment-16414874 ]

Grant Henke commented on KUDU-2375:
-----------------------------------

[~tlipcon] The error message is coming from 1.6.0 when trying to read metadata from 1.7.0. I don't think there is a great way to change the behavior/messages for the old versions. I could create a patch to improve the message or handling for future versions so that changes similar to the decimal change produce clearer messages on downgrade.

Note: If you don't create decimal tables, or you delete them before downgrading, there should be no issues.

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2375
>                 URL: https://issues.apache.org/jira/browse/KUDU-2375
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Michael Brown
>            Priority: Major
>
> When tables with decimals are added in 1.7.0, a downgrade from 1.7.0 to 1.6
> results in a dcheck when 1.6 starts and Kudu isn't usable in its downgraded
> version.
> {noformat}
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet metadata into memory failed: Corruption: Failed while visiting tables in sys catalog: unable to parse metadata field for row 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
[jira] [Comment Edited] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed
[ https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414832#comment-16414832 ]

Alexey Serbin edited comment on KUDU-2354 at 3/27/18 12:53 AM:
---------------------------------------------------------------

And another issue to look at: do follower masters continue retrying those tasks once they have switched from the leader to the follower role?


was (Author: aserbin):
And another issue to look at: do follower masters continue to retry those tasks once they have switched from the leader to the follower role?

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2354
>                 URL: https://issues.apache.org/jira/browse/KUDU-2354
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>         Environment: 3 tservers in the cluster, single master (?)
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported
> UNAVAILABLE tablets for 5-10 minutes after that. Most likely, due to the
> spike of IO activity, tablet leaders didn't receive heartbeats from some
> replicas and tried to replace those. After some time, the cluster
> stabilized (no problems reported by ksck), but in the master's log the
> following messages continued to appear:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, not a single attempt to
> add a replacement non-voter replica would succeed, but it would make sense to
> stop retrying those operations when a tablet's OpId index is far ahead of the
> cas_config_opid_index of the operation being retried.
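A minimal sketch of the stop condition suggested in the description, not the actual catalog manager code. It assumes the master can see the tablet's latest committed config index. The function name, the tablet_config_opid_index parameter and the max_index_lag threshold are hypothetical; only cas_config_opid_index comes from the log above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Kudu: decides whether a retried
// ChangeConfig task is still worth sending.
//   cas_config_opid_index:    config index the task was created against
//                             (-1 in the log above).
//   tablet_config_opid_index: latest committed config index known for the
//                             tablet (assumed to be available to the master).
//   max_index_lag:            assumed tunable defining "far ahead".
inline bool ShouldRetryChangeConfig(int64_t cas_config_opid_index,
                                    int64_t tablet_config_opid_index,
                                    int64_t max_index_lag) {
  assert(max_index_lag >= 0);
  // Keep retrying only while the tablet has not moved too far past the
  // config the task was originally built for.
  return tablet_config_opid_index - cas_config_opid_index <= max_index_lag;
}
```

With a check like this, a task created against index -1 would be dropped as soon as the tablet's config index exceeds the allowed lag, instead of being rescheduled every 60 seconds.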
[jira] [Created] (KUDU-2378) Crash due to unaligned loads when building with clang 6.0
Todd Lipcon created KUDU-2378:
---------------------------------

             Summary: Crash due to unaligned loads when building with clang 6.0
                 Key: KUDU-2378
                 URL: https://issues.apache.org/jira/browse/KUDU-2378
             Project: Kudu
          Issue Type: Improvement
    Affects Versions: 1.7.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon


When I built the whole tree with clang 6.0, all_types-itest crashed due to an illegal instruction. Looking at the assembly, it appeared that clang had generated a 'movaps' (aligned load) instruction for a *reinterpret_cast() call loading into an xmm register. We aren't careful about alignment when loading other integer types, because unaligned loads of int64s don't carry a high penalty, but an unaligned load of an int128 causes a crash.

This is likely to crash on other compilers too -- surprised we haven't seen it yet.
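One common way to avoid this class of crash, shown here as a sketch and not as the fix that went into Kudu: copy the bytes with memcpy instead of dereferencing a cast pointer, so the compiler may not assume 16-byte alignment. The helper names are hypothetical, and unsigned __int128 is a GCC/Clang extension assumed to be available.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// unsigned __int128 is a GCC/Clang extension, assumed here as the 128-bit type.
using uint128 = unsigned __int128;

// Reads 16 bytes from an address with no alignment guarantee. memcpy carries
// no alignment assumption, so the compiler cannot legally turn this into an
// aligned-only load such as movaps.
inline uint128 LoadUnaligned128(const void* src) {
  assert(src != nullptr);
  uint128 value;
  std::memcpy(&value, src, sizeof(value));
  return value;
}

// Writes 16 bytes to an address with no alignment guarantee.
inline void StoreUnaligned128(void* dst, uint128 value) {
  assert(dst != nullptr);
  std::memcpy(dst, &value, sizeof(value));
}
```

Optimizing compilers typically lower a fixed-size memcpy like this to a single unaligned move, so the safe version usually costs little.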
[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)
[ https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414801#comment-16414801 ]

Mike Percy commented on KUDU-2323:
----------------------------------

While the speed of this cycle was likely fixed by the patch for KUDU-2320, it appears there is no code path to remove a TrackedPeer when it gets evicted. While this could cause a minor resource leak until a leader was evicted in a 3-2-3 world, in a 3-4-3 world it affects last_communication_time and can therefore cause a downed NON_VOTER to be considered FAILED as soon as it is added to the config.

Maybe this is also interacting with KUDU-2354, in which there are certain cases that can cause a catalog manager task to endlessly retry adding a new replica.

> NON_VOTER replica flapping (repeatedly added and evicted)
> ---------------------------------------------------------
>
>                 Key: KUDU-2323
>                 URL: https://issues.apache.org/jira/browse/KUDU-2323
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Assignee: Alexey Serbin
>            Priority: Major
>
> While running a YCSB stress workload, I saw a tablet get into a state where
> the master flapped back and forth, adding and then removing a replica as a
> NON_VOTER:
> {code}
> I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.725723 28045 catalog_manager.cc:3274] Sending ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 1)
> I0221 21:54:35.752959 28052
catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:35.767974 28047 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:35.772202 28045 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.291569 28046 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.296468 28046 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.328945 28045 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.339675 28045 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.387465 28045 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.394716 28047 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.398644 28047 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.405082 28047 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.409888 28048 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.414216 28046 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.417915 28048 catalog_manager.cc:3274] Sending > 
ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.423548 28048 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:54:36.453407 28045 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:54:36.552772 28048 catalog_manager.cc:3162] Sending > ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 > (attempt 1) > I0221 21:58:01.300199 28053 catalog_manager.cc:3274] Sending > ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt > 1) > I0221 21:58:01.426921 28046 catalog_manager.cc:3162] Sending >
[jira] [Reopened] (KUDU-2356) Idle WALs can consume significant memory
[ https://issues.apache.org/jira/browse/KUDU-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reopened KUDU-2356:
-------------------------------

Seems the original commit made some tests flaky. Reverting until I have time to look at it.

> Idle WALs can consume significant memory
> ----------------------------------------
>
>                 Key: KUDU-2356
>                 URL: https://issues.apache.org/jira/browse/KUDU-2356
>             Project: Kudu
>          Issue Type: Improvement
>          Components: log, tserver
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: heap.svg
>
>
> I grabbed a heap sample of a tserver which has been running a write workload
> for a little while and found that 750MB of memory is used by faststring
> allocations inside WritableLogSegment::WriteEntryBatch. It seems like this is
> the 'compress_buf_' member. This buffer always resizes up during a log write
> but never shrinks back down, even when the WAL is idle. We should consider
> clearing the buffer after each append, or perhaps after a short timeout like
> 100ms after a WAL becomes idle.
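The "clear the buffer after each append" idea can be sketched as follows. This is an illustration under stated assumptions, not the reverted commit: it uses std::string in place of Kudu's faststring, and the class name, method names and retention threshold are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical scratch buffer that gives memory back once it has grown past
// a retention threshold. std::string stands in for Kudu's faststring.
class ScratchBuffer {
 public:
  explicit ScratchBuffer(std::size_t max_retained_bytes)
      : max_retained_bytes_(max_retained_bytes) {}

  // Grows the buffer to hold n bytes and returns it for writing.
  std::string& Prepare(std::size_t n) {
    buf_.resize(n);
    return buf_;
  }

  // Intended to be called after each append, or from an idle timer.
  void Release() {
    if (buf_.capacity() > max_retained_bytes_) {
      // Swapping with an empty string frees the large heap allocation.
      std::string().swap(buf_);
    } else {
      // Small buffers are kept so the next write avoids a reallocation.
      buf_.clear();
    }
  }

  std::size_t capacity() const { return buf_.capacity(); }

 private:
  const std::size_t max_retained_bytes_;
  std::string buf_;
};
```

The threshold trades reallocation cost on busy WALs against retained memory on idle ones; releasing only after an idle timeout, as the description also suggests, would shift that trade-off further toward fewer reallocations.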
[jira] [Created] (KUDU-2377) Server fails to start up when RLIMIT_NPROC is -1
Adar Dembo created KUDU-2377:
--------------------------------

             Summary: Server fails to start up when RLIMIT_NPROC is -1
                 Key: KUDU-2377
                 URL: https://issues.apache.org/jira/browse/KUDU-2377
             Project: Kudu
          Issue Type: Bug
          Components: server
    Affects Versions: 1.7.0
            Reporter: Adar Dembo
            Assignee: Adar Dembo
             Fix For: 1.7.1


Unlike RLIMIT_NOFILE, it would appear that RLIMIT_NPROC can be set to the special value RLIM_INFINITY. This special value is represented as the integer value -1, which means it's not safe for callers of Env::GetResourceLimit to simply treat the returned value as a non-zero integer.

Currently GetThreadPoolThreadLimit (kserver.cc) has a perfect example of such misbehavior: if I open a root shell, run `ulimit -u unlimited`, then try to start a server, I get the following check failure:
{noformat}
I0326 13:00:33.053771 19813 env_posix.cc:1629] Not raising this process' running threads per effective uid limit of 18446744073709551615; it is already as high as it can go
F0326 13:00:33.053802 19813 threadpool.cc:106] Check failed: max_threads > 0 (0 vs. 0)
*** Check failure stack trace: ***
*** Aborted at 1522094433 (unix time) try "date -d @1522094433" if you are using GNU date ***
PC: @     0x7fe5de4bd428 gsignal
*** SIGABRT (@0x4d65) received by PID 19813 (TID 0x7fe5d9421840) from PID 19813; stack trace: ***
    @     0x7fe5e0207390 (unknown)
    @     0x7fe5de4bd428 gsignal
    @     0x7fe5de4bf02a abort
    @     0x7fe5df49a1d9 google::logging_fail()
    @     0x7fe5df49bb1d google::LogMessage::Fail()
    @     0x7fe5df49da03 google::LogMessage::SendToLog()
    @     0x7fe5df49b67a google::LogMessage::Flush()
    @     0x7fe5df49e3cf google::LogMessageFatal::~LogMessageFatal()
    @     0x7fe5df942bf2 kudu::ThreadPoolBuilder::set_max_threads()
    @     0x7fe5e0738fad kudu::kserver::KuduServer::Init()
    @     0x7fe5e0650a45 kudu::master::Master::Init()
    @     0x7fe5e067559d kudu::master::MiniMaster::Start()
    @           0x4b3bbb kudu::master::MasterTest::SetUp()
    @     0x7fe5e08d2477 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fe5e08c77f6 testing::Test::Run()
    @     0x7fe5e08c79a8 testing::TestInfo::Run()
    @     0x7fe5e08c7a85 testing::TestCase::Run()
    @     0x7fe5e08c8758 testing::internal::UnitTestImpl::RunAllTests()
    @     0x7fe5e08d2987 testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x7fe5e08c7b5a testing::UnitTest::Run()
    @     0x7fe5e092c09a RUN_ALL_TESTS()
    @     0x7fe5e0929d88 main
    @     0x7fe5de4a8830 __libc_start_main
    @           0x47a429 _start
{noformat}
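A sketch of how a caller could treat RLIM_INFINITY explicitly before deriving a thread count. This is not the patch that went into Kudu; the function name, the default_max parameter and the clamping policy are hypothetical, and the raw rlim_t is assumed to be what a call like getrlimit(RLIMIT_NPROC, ...) reports.

```cpp
#include <sys/resource.h>

#include <algorithm>
#include <cassert>

// Hypothetical helper, not Kudu's GetThreadPoolThreadLimit: converts a raw
// soft limit into a usable max-thread count.
//   soft_limit:  value reported for RLIMIT_NPROC (may be RLIM_INFINITY).
//   default_max: assumed cap to use when the limit is unlimited or larger.
inline int ThreadLimitFromRlimit(rlim_t soft_limit, int default_max) {
  assert(default_max > 0);
  // "Unlimited" is a sentinel, not a count: handle it before any arithmetic
  // or narrowing conversion can turn it into zero or a negative number.
  if (soft_limit == RLIM_INFINITY) {
    return default_max;
  }
  // Clamp in the wide unsigned type first, then narrow to int.
  const rlim_t capped =
      std::min<rlim_t>(soft_limit, static_cast<rlim_t>(default_max));
  // Never return zero, so a check such as max_threads > 0 cannot fail.
  return std::max(1, static_cast<int>(capped));
}
```

The ordering is the point of the sketch: the sentinel check and the clamp both happen while the value is still an rlim_t, so no out-of-range value is ever narrowed to int.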
[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Percy updated KUDU-2342:
-----------------------------
    Fix Version/s:     (was: 1.8.0)

> Non-voter replicas can be promoted and get stuck
> ------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Alexey Serbin
>            Priority: Blocker
>              Labels: scalability
>             Fix For: 1.7.0
>
>         Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after
> 180.000s (SENT)
[jira] [Commented] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions
[ https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414441#comment-16414441 ]

Mike Percy commented on KUDU-2335:
----------------------------------

This issue now affects 1.7.0 only in a very minor way (it occasionally prints a warning message). That can happen when a leader replica is starting up or shutting down.

> Leader can report unknown health for itself during lifecycle transitions
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2335
>                 URL: https://issues.apache.org/jira/browse/KUDU-2335
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, master
>    Affects Versions: 1.7.0
>            Reporter: Alexey Serbin
>            Assignee: Mike Percy
>            Priority: Major
>
> The following DCHECK triggered in one of the pre-commit builds with TSAN
> configuration while running the
> {{DeleteTabletITest::TestLeaderElectionDuringDeleteTablet}} scenario:
> {noformat}
> quorum_util.cc:509] Check failed: peer_uuid != leader_uuid || healthy 839fda3822054564af4a3dd547beaca1: leader reported as not healthy; config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "bb39501070104f769870f437f991970e" member_type: VOTER last_known_addr { host: "127.8.86.130" port: 52021 } } peers { permanent_uuid: "839fda3822054564af4a3dd547beaca1" member_type: VOTER last_known_addr { host: "127.8.86.129" port: 37815 } } peers { permanent_uuid: "150b2dd2788f407e8537d28a21d83a80" member_type: VOTER last_known_addr { host: "127.8.86.131" port: 39431 } }
> {noformat}
[jira] [Updated] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions
[ https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Percy updated KUDU-2335:
-----------------------------
    Component/s: consensus
[jira] [Assigned] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions
[ https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Percy reassigned KUDU-2335:
--------------------------------

    Assignee: Mike Percy
[jira] [Updated] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions
[ https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Percy updated KUDU-2335:
-----------------------------
    Summary: Leader can report unknown health for itself during lifecycle transitions  (was: Debug assert in quorum_util.cc)
[jira] [Updated] (KUDU-2153) Servers delete tmp files before obtaining directory lock
[ https://issues.apache.org/jira/browse/KUDU-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2153:
------------------------------
    Fix Version/s:     (was: 1.7.x)
                   1.7.0

> Servers delete tmp files before obtaining directory lock
> --------------------------------------------------------
>
>                 Key: KUDU-2153
>                 URL: https://issues.apache.org/jira/browse/KUDU-2153
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 1.2.0, 1.3.1, 1.4.0, 1.5.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>             Fix For: 1.7.0, 1.8.0
>
>
> In FsManager::Open() we currently call DeleteTmpFiles very early, before
> starting the block manager. This means that, if you accidentally start a
> tserver while another is running, it's possible for it to delete temporary
> files that are in-use by the running tserver, causing it to exhibit strange
> behavior, crash, etc (as in KUDU-2152).
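The ordering the description calls for (lock first, clean up second) can be sketched as below. This is an illustration only and makes several assumptions: it uses flock(2) where Kudu has its own Env file-locking wrapper, and the function names and the dir.lock file name are hypothetical.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Hypothetical helper: returns a file descriptor holding an exclusive lock on
// the directory's lock file, or -1 if the file cannot be opened or another
// holder already has the lock.
inline int TryLockDir(const std::string& dir) {
  const std::string lock_path = dir + "/dir.lock";  // hypothetical file name
  const int fd = ::open(lock_path.c_str(), O_CREAT | O_RDWR, 0644);
  if (fd < 0) return -1;
  if (::flock(fd, LOCK_EX | LOCK_NB) != 0) {
    ::close(fd);
    return -1;
  }
  return fd;
}

// Runs the destructive cleanup only after the lock is held. The caller keeps
// the returned descriptor open for as long as it owns the directory.
template <typename CleanupFn>
int OpenDataDirExclusively(const std::string& dir, CleanupFn delete_tmp_files) {
  const int lock_fd = TryLockDir(dir);
  if (lock_fd < 0) {
    return -1;  // Someone else owns the directory: touch nothing.
  }
  delete_tmp_files(dir);  // Safe now: no other lock holder can be running.
  return lock_fd;
}
```

A second server started by accident against the same directory would fail at TryLockDir and return before any temporary file is removed.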
[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414249#comment-16414249 ]

Andrew Wong commented on KUDU-2359:
-----------------------------------

This should be doable by extending the architecture in place for the `kudu fs update_dirs` tool. The caveat here, and with the update tool, is that any tablets that are/were on the missing data directory are/should be started up in a failed state so they can be evicted and re-replicated elsewhere. For the update tool, we have operators confront this tradeoff by requiring them to specify the `--force` flag.

Ideally a similar flag could be used here, so at least the mean time to recovery is gated by the time it takes to update a flag, rather than the time it takes to run `kudu fs update_dirs`.

It also begs the question: would operators even care about those failed tablets? If our re-replication story is robust enough to handle everything on its own, it could be seen as a pointless configuration. I suppose exposing it as a flag initially would give us that sort of info.

> tserver should allow starting with a small number of missing data dirs
> ----------------------------------------------------------------------
>
>                 Key: KUDU-2359
>                 URL: https://issues.apache.org/jira/browse/KUDU-2359
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the
> server is restarted. Currently, Kudu will respond to this by failing to
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Already present: FS layout already exists; not overwriting existing layout. See https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: unable to create file system roots: FSManager roots already exist: /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk
> failure" work. One could use the update_data_dirs tool to remove the missing
> disk, but you'd also need to persistently change the configuration of the
> daemon, which is hard to do with consistent configuration management.
[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414119#comment-16414119 ]

Michael Brown commented on KUDU-2375:
-------------------------------------

Taking this out of P1. Apparently on this long-lived, shared cluster, someone recently added some tables with Decimal. Surely it's one of these that's causing this problem. Sorry for misreading that message before.

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is
> missing required fields: schema.columns[5].type
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2375
>                 URL: https://issues.apache.org/jira/browse/KUDU-2375
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Michael Brown
>            Priority: Major
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the
> backward incompatibility was not intended.
> {noformat}
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet
> metadata into memory failed: Corruption: Failed while visiting tables in sys
> catalog: unable to parse metadata field for row
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of
> type "kudu.master.SysTablesEntryPB" because it is missing required fields:
> schema.columns[5].type
> {noformat}
> {noformat}
> #0 0x003355e32625 in raise () from /lib64/libc.so.6
> #1 0x003355e33e05 in abort () from /lib64/libc.so.6
> #2 0x01cea129 in ?? ()
> #3 0x009268cd in google::LogMessage::Fail() ()
> #4 0x0092878d in google::LogMessage::SendToLog() ()
> #5 0x00926409 in google::LogMessage::Flush() ()
> #6 0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7 0x008f05de in ?? ()
> #8 0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9 0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
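One plausible reading of this failure, consistent with the Decimal explanation above: in proto2, an enum value that the reader does not recognize is treated as an unknown field, so a required enum field carrying a newer value looks unset to an older binary. The Python sketch below is only a toy model of that behavior, not Kudu or protobuf code; the enum numbers, the `KNOWN_TYPES_*` tables, and the `parse_*` helpers are hypothetical names made up for illustration.

```python
# Toy model of proto2 parsing: a required enum field whose value the reader
# does not know is dropped, which surfaces as "missing required fields".
# All names and enum numbers are hypothetical, not Kudu's actual schema.

KNOWN_TYPES_OLD = {1: "INT32", 2: "INT64", 3: "STRING"}  # older reader
KNOWN_TYPES_NEW = {**KNOWN_TYPES_OLD, 4: "DECIMAL"}      # newer reader


class MissingRequiredField(Exception):
    pass


def parse_column(wire, known_types, path):
    """Parse one column; 'type' is required and must be a known enum value."""
    raw = wire.get("type")
    if raw not in known_types:
        # Unknown enum value: the field counts as unset, so the required
        # check fails even though the writer did set it.
        raise MissingRequiredField(f"missing required fields: {path}.type")
    return {"name": wire["name"], "type": known_types[raw]}


def parse_schema(columns, known_types):
    return [
        parse_column(col, known_types, f"schema.columns[{i}]")
        for i, col in enumerate(columns)
    ]


def try_parse(columns, known_types):
    """Return 'ok' or the error text, so both readers can be compared."""
    try:
        parse_schema(columns, known_types)
        return "ok"
    except MissingRequiredField as err:
        return str(err)


# Six columns; the one at index 5 uses the value only the newer reader knows.
columns = [{"name": f"c{i}", "type": 1} for i in range(5)]
columns.append({"name": "price", "type": 4})

new_result = try_parse(columns, KNOWN_TYPES_NEW)
old_result = try_parse(columns, KNOWN_TYPES_OLD)
print(new_result)
print(old_result)
```

In this model the newer reader parses the schema, while the older reader reports the column at index 5 as missing its type, which matches the shape of the error in the log.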
[jira] [Updated] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Brown updated KUDU-2375:
--------------------------------
    Priority: Major  (was: Blocker)
[jira] [Updated] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Brown updated KUDU-2375:
--------------------------------
    Description:
A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the problem go away.

Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the backward incompatibility was not intended.
{noformat}
F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet metadata into memory failed: Corruption: Failed while visiting tables in sys catalog: unable to parse metadata field for row 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
{noformat}
{noformat}
#0 0x003355e32625 in raise () from /lib64/libc.so.6
#1 0x003355e33e05 in abort () from /lib64/libc.so.6
#2 0x01cea129 in ?? ()
#3 0x009268cd in google::LogMessage::Fail() ()
#4 0x0092878d in google::LogMessage::SendToLog() ()
#5 0x00926409 in google::LogMessage::Flush() ()
#6 0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
#7 0x008f05de in ?? ()
#8 0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
#9 0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
#10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
#11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x003355ee893d in clone () from /lib64/libc.so.6
{noformat}

  was:
A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the problem go away.

Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the backward incompatibility was not intended.
{noformat}
I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for table impala::tpcds_1000_kudu.web_returns [id=40c35b333fa84bb8ad331fab02e03fdf]
F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet metadata into memory failed: Corruption: Failed while visiting tables in sys catalog: unable to parse metadata field for row 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
{noformat}
{noformat}
#0 0x003355e32625 in raise () from /lib64/libc.so.6
#1 0x003355e33e05 in abort () from /lib64/libc.so.6
#2 0x01cea129 in ?? ()
#3 0x009268cd in google::LogMessage::Fail() ()
#4 0x0092878d in google::LogMessage::SendToLog() ()
#5 0x00926409 in google::LogMessage::Flush() ()
#6 0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
#7 0x008f05de in ?? ()
#8 0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
#9 0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
#10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
#11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x003355ee893d in clone () from /lib64/libc.so.6
{noformat}
This is on a long-lived cluster that has had Impala and Kudu slowly upgrading with mostly dev releases over time for a few months. Here's the Impala {{SHOW CREATE TABLE}}:
{noformat}
| CREATE TABLE tpcds_1000_kudu.web_returns ( |
|   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414104#comment-16414104 ]

Michael Brown commented on KUDU-2375:
-------------------------------------

Good point, [~tlipcon]. I misread the Kudu master error message (It says "Loaded" in the first of the two messages, and I read it the other day as "Loading"). Let me at least clean that out from the Description.

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is
> missing required fields: schema.columns[5].type
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2375
>                 URL: https://issues.apache.org/jira/browse/KUDU-2375
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Michael Brown
>            Priority: Blocker
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the
> backward incompatibility was not intended.
> {noformat}
> I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for
> table impala::tpcds_1000_kudu.web_returns
> [id=40c35b333fa84bb8ad331fab02e03fdf]
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet
> metadata into memory failed: Corruption: Failed while visiting tables in sys
> catalog: unable to parse metadata field for row
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of
> type "kudu.master.SysTablesEntryPB" because it is missing required fields:
> schema.columns[5].type
> {noformat}
> {noformat}
> #0 0x003355e32625 in raise () from /lib64/libc.so.6
> #1 0x003355e33e05 in abort () from /lib64/libc.so.6
> #2 0x01cea129 in ?? ()
> #3 0x009268cd in google::LogMessage::Fail() ()
> #4 0x0092878d in google::LogMessage::SendToLog() ()
> #5 0x00926409 in google::LogMessage::Flush() ()
> #6 0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7 0x008f05de in ?? ()
> #8 0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9 0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
> This is on a long-lived cluster that has had Impala and Kudu slowly upgrading
> with mostly dev releases over time for a few months. Here's the Impala {{SHOW
> CREATE TABLE}}:
> {noformat}
> | CREATE TABLE tpcds_1000_kudu.web_returns ( |
> |   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_web_page_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_reason_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |
[jira] [Commented] (KUDU-2372) Don't let kudu start up if any disks are mounted read-only
[ https://issues.apache.org/jira/browse/KUDU-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414100#comment-16414100 ]

Todd Lipcon commented on KUDU-2372:
-----------------------------------

Per KUDU-2359 I think it may make sense to allow starting up with a bad disk so that we don't need manual intervention after a single disk failure (e.g. on a 12-disk host).

> Don't let kudu start up if any disks are mounted read-only
> ----------------------------------------------------------
>
>                 Key: KUDU-2372
>                 URL: https://issues.apache.org/jira/browse/KUDU-2372
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Andrew Wong
>            Priority: Major
>
> Today, if a Kudu tserver runs into EROFS (read-only mount error), it treats
> the error as it would a complete disk failure (EIO), allowing successful
> startup of the server, but failing the tablets that are configured to use the
> "failed" disk.
> If something is wrong with the mounting of a disk, it might be helpful to
> bring immediate attention to it, and have operators deal with it, rather than
> handling it automatically. As such, it might be helpful to prevent Kudu from
> starting up if errors are detected with the mount configurations.
> There are tradeoffs here to be considered:
> * The current behavior, as it is today, will evict and delete the data from
> the failed tablets, as it is treated as an unrecoverable failure. The user
> can ignore such failures and handle it at their leisure, since Kudu will
> re-replicate the tablets lost in this way
> * If we were to instead crash, this gives operators some immediate feedback
> and a time limit to use `kudu fs update_dirs` to remove the read only drive,
> or maybe fix the mountpoint itself
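The two startup policies discussed in KUDU-2372 can be contrasted with a small Python sketch. It is only an illustration under stated assumptions, not how Kudu (which is C++) detects EROFS: it uses the POSIX `os.statvfs` mount flags, and `is_read_only_mount`, `check_startup`, and the `fail_fast` switch are hypothetical names introduced here.

```python
import os
import tempfile


def is_read_only_mount(path):
    """Return True if the filesystem holding `path` is mounted read-only."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)


def check_startup(data_dirs, fail_fast, is_read_only=is_read_only_mount):
    """Return the read-only directories, or refuse to start if fail_fast is set.

    fail_fast=False models the current behavior described in the issue:
    start anyway and treat those directories as failed.
    fail_fast=True models the proposal: stop so an operator notices.
    """
    bad = [d for d in data_dirs if is_read_only(d)]
    if bad and fail_fast:
        raise RuntimeError("read-only data directories: " + ", ".join(bad))
    return bad


# A fresh temporary directory is expected to sit on a writable filesystem.
with tempfile.TemporaryDirectory() as tmp:
    writable_result = check_startup([tmp], fail_fast=True)
print(writable_result)
```

The `is_read_only` parameter exists only so the policy can be exercised without a real read-only mount, for example by passing a stub that flags one directory.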
[jira] [Resolved] (KUDU-2374) Expose an interface in RpcContext to report the time the InboundCall is received
[ https://issues.apache.org/jira/browse/KUDU-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved KUDU-2374.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.8.0

> Expose an interface in RpcContext to report the time the InboundCall is
> received
> -----------------------------------------------------------------------
>
>                 Key: KUDU-2374
>                 URL: https://issues.apache.org/jira/browse/KUDU-2374
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>    Affects Versions: 1.7.0
>            Reporter: Michael Ho
>            Assignee: Michael Ho
>            Priority: Minor
>             Fix For: 1.8.0
>
>
> {{InboundCall::GetTimeReceived()}} returns the time at which the inbound call
> was received. While the dispatch and processing times of RPCs are already
> reported in a histogram in the service queue, it's helpful to make this
> accessible to the RPC handler for its own bookkeeping purposes (e.g.
> reporting the average dispatch latency as part of query profile in Impala).
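The bookkeeping use case mentioned in KUDU-2374 can be sketched as follows. This is a Python model, not the Kudu C++ RPC API: only the idea of a context exposing the call's receive time comes from the issue, and the `InboundCall` and `RpcContext` classes here, along with `LatencyTracker`, are hypothetical stand-ins.

```python
import time
from dataclasses import dataclass, field


@dataclass
class InboundCall:
    """Hypothetical stand-in: records when the call arrived."""
    method: str
    time_received: float = field(default_factory=time.monotonic)


class RpcContext:
    """Hypothetical stand-in: what the handler sees for one call."""

    def __init__(self, call):
        self._call = call

    def get_time_received(self):
        # Forwards the receive time so the handler can do its own accounting.
        return self._call.time_received


class LatencyTracker:
    """Handler-side bookkeeping of average dispatch latency."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def record(self, context, now=None):
        # Dispatch latency = time the handler starts minus time received.
        now = time.monotonic() if now is None else now
        self.total += now - context.get_time_received()
        self.count += 1

    def average(self):
        return self.total / self.count if self.count else 0.0


tracker = LatencyTracker()
tracker.record(RpcContext(InboundCall("Write", time_received=10.0)), now=10.5)
tracker.record(RpcContext(InboundCall("Scan", time_received=20.0)), now=20.25)
print(tracker.average())
```

The fixed timestamps make the example deterministic; in a real handler `now` would be left unset so the monotonic clock is read at dispatch time.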
[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type
[ https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414097#comment-16414097 ]

Todd Lipcon commented on KUDU-2375:
-----------------------------------

You sure that table 467d365fffbe4485a3249079c48f42a9 is the one you pasted? My guess is that it's one that has a DECIMAL type column in its 5th (0-indexed) position, and when you downgrade to 1.6.0 it doesn't know what to make of the DECIMAL.

Agreed the error message and behavior could be a lot better. [~granthenke] what do you think?

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is
> missing required fields: schema.columns[5].type
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2375
>                 URL: https://issues.apache.org/jira/browse/KUDU-2375
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Michael Brown
>            Priority: Blocker
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the
> backward incompatibility was not intended.
> {noformat}
> I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for
> table impala::tpcds_1000_kudu.web_returns
> [id=40c35b333fa84bb8ad331fab02e03fdf]
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet
> metadata into memory failed: Corruption: Failed while visiting tables in sys
> catalog: unable to parse metadata field for row
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of
> type "kudu.master.SysTablesEntryPB" because it is missing required fields:
> schema.columns[5].type
> {noformat}
> {noformat}
> #0 0x003355e32625 in raise () from /lib64/libc.so.6
> #1 0x003355e33e05 in abort () from /lib64/libc.so.6
> #2 0x01cea129 in ?? ()
> #3 0x009268cd in google::LogMessage::Fail() ()
> #4 0x0092878d in google::LogMessage::SendToLog() ()
> #5 0x00926409 in google::LogMessage::Flush() ()
> #6 0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7 0x008f05de in ?? ()
> #8 0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9 0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
> This is on a long-lived cluster that has had Impala and Kudu slowly upgrading
> with mostly dev releases over time for a few months. Here's the Impala {{SHOW
> CREATE TABLE}}:
> {noformat}
> | CREATE TABLE tpcds_1000_kudu.web_returns ( |
> |   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_web_page_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_reason_sk INT NULL