[jira] [Updated] (KUDU-2379) Spark generates a broken authentication credentials PB

2018-03-26 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2379:
--
Code Review: https://gerrit.cloudera.org/#/c/9814/

> Spark generates a broken authentication credentials PB
> --
>
> Key: KUDU-2379
> URL: https://issues.apache.org/jira/browse/KUDU-2379
> Project: Kudu
>  Issue Type: Bug
>  Components: java, spark
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Blocker
>
> KUDU-2259 introduced a regression which causes Spark to not work properly on 
> secure clusters. The issue is the following:
> - the driver calls exportAuthenticationCredentials()
> -- the client hasn't yet talked to the master, so it doesn't have any 
> credentials yet, despite having a keytab available
> -- the code is as follows:
> {code}
> byte[] authnData = securityContext.exportAuthenticationCredentials();
> if (authnData != null) {
>   return Deferred.fromResult(authnData);
> }
> {code}
> -- previously, authnData would be null in this case, and it would fall 
> through to connect to the cluster and then export a proper token.
> -- with the new implementation, an authnData is returned which is devoid of 
> real credentials but contains a realUser. So, it's non-null, and it gets 
> returned immediately
> - the tasks then get credentials with no tokens and can't connect



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KUDU-2379) Spark generates a broken authentication credentials PB

2018-03-26 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-2379:
--
Status: In Review  (was: Open)






[jira] [Created] (KUDU-2379) Spark generates a broken authentication credentials PB

2018-03-26 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2379:
-

 Summary: Spark generates a broken authentication credentials PB
 Key: KUDU-2379
 URL: https://issues.apache.org/jira/browse/KUDU-2379
 Project: Kudu
  Issue Type: Bug
  Components: java, spark
Affects Versions: 1.7.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon


KUDU-2259 introduced a regression which causes Spark to not work properly on 
secure clusters. The issue is the following:
- the driver calls exportAuthenticationCredentials()
-- the client hasn't yet talked to the master, so it doesn't have any 
credentials yet, despite having a keytab available
-- the code is as follows:
{code}
byte[] authnData = securityContext.exportAuthenticationCredentials();
if (authnData != null) {
  return Deferred.fromResult(authnData);
}
{code}
-- previously, authnData would be null in this case, and it would fall through 
to connect to the cluster and then export a proper token.
-- with the new implementation, an authnData is returned which is devoid of 
real credentials but contains a realUser. So, it's non-null, and it gets 
returned immediately
- the tasks then get credentials with no tokens and can't connect
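
The failure mode described above can be illustrated with a small, self-contained Java sketch. The `Credentials` class, its `hasToken` check, and the `connectThenExport` supplier are hypothetical stand-ins rather than the Kudu client's real types, and the guard shown is only one plausible way to restore the old fall-through; it is not necessarily how the actual fix is implemented.

```java
import java.util.function.Supplier;

final class AuthnExportSketch {
  /** Hypothetical stand-in for the exported authentication credentials. */
  static final class Credentials {
    final String realUser;    // set even before the client has talked to the master
    final byte[] authnToken;  // null until a token has been obtained

    Credentials(String realUser, byte[] authnToken) {
      this.realUser = realUser;
      this.authnToken = authnToken;
    }

    boolean hasToken() {
      return authnToken != null && authnToken.length > 0;
    }
  }

  /** Regressed shape: any non-null export is returned, token or not. */
  static Credentials exportRegressed(Credentials local, Supplier<Credentials> connectThenExport) {
    if (local != null) {
      return local;
    }
    return connectThenExport.get();
  }

  /** Guarded shape: only short-circuit when a real token is present. */
  static Credentials exportGuarded(Credentials local, Supplier<Credentials> connectThenExport) {
    if (local != null && local.hasToken()) {
      return local;
    }
    return connectThenExport.get();
  }

  public static void main(String[] args) {
    // The driver has a realUser but no token yet.
    Credentials tokenless = new Credentials("alice", null);
    // Stands in for "connect to the cluster, then export a proper token".
    Supplier<Credentials> connect = () -> new Credentials("alice", new byte[] {1, 2, 3});
    System.out.println("regressed has token: " + exportRegressed(tokenless, connect).hasToken());
    System.out.println("guarded has token:   " + exportGuarded(tokenless, connect).hasToken());
  }
}
```

In the regressed shape the tasks receive credentials that carry a user name but no token, which matches the symptom reported here.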






[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Grant Henke (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414874#comment-16414874
 ] 

Grant Henke commented on KUDU-2375:
---

[~tlipcon] The error message is coming from 1.6.0 when trying to read metadata 
from 1.7.0. I don't think there is a great way to change the behavior/messages 
for the old versions. I could create a patch to improve the message or handling 
for future versions so that changes similar to the decimal change produce 
clearer messages on downgrade. 

Note: If you don't create decimal tables or you delete them before downgrade 
there should be no issues. 

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is 
> missing required fields: schema.columns[5].type
> 
>
> Key: KUDU-2375
> URL: https://issues.apache.org/jira/browse/KUDU-2375
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Michael Brown
>Priority: Major
>
> When tables with decimals are added in 1.7.0, a downgrade from 1.7.0 to 1.6 
> results in a dcheck when 1.6 starts and Kudu isn't usable in its downgraded 
> version.
> {noformat}
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
> metadata into memory failed: Corruption: Failed while visiting tables in sys 
> catalog: unable to parse metadata field for row 
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
> type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
> schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in 
> kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}





[jira] [Comment Edited] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

2018-03-26 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414832#comment-16414832
 ] 

Alexey Serbin edited comment on KUDU-2354 at 3/27/18 12:53 AM:
---

And another issue to look at: do follower masters continue retrying those tasks 
once they have switched from the leader to the follower role?


was (Author: aserbin):
And another issue to look at: do follower masters continue to retry those tasks 
once then switched from the leader to the follower role?

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly 
> retries operations to add a replacement replica even if replacement is no 
> longer needed
> ---
>
> Key: KUDU-2354
> URL: https://issues.apache.org/jira/browse/KUDU-2354
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
> Environment: 3 tservers in the cluster, single master (?)
>Reporter: Alexey Serbin
>Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command 
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 
> --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported 
> UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to the 
> spike of IO activity, tablet leaders didn't receive heartbeats from some 
> replicas and tried to replace those.  After some time, the cluster 
> stabilized (no problems reported by ksck), but in the master's log the 
> following messages continued to appear:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
> (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
> ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 
> 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay 
> of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, not a single attempt to 
> add a replacement non-voter replica would succeed, but it would make sense to 
> stop retrying those operations when a tablet's OpId index is far ahead of the 
> cas_config_opid_index of the operation being retried.
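
The stopping condition suggested in the last paragraph could look roughly like the following Java sketch. The class, the method, and the `maxLag` threshold are hypothetical names introduced for illustration; the catalog manager itself is C++, and what counts as "far ahead" is left here as a tunable assumption.

```java
final class RetryGateSketch {
  /**
   * Returns true if a ChangeConfig retry is still worth sending.
   * All three parameters are illustrative names, not Kudu APIs.
   */
  static boolean shouldRetry(long casConfigOpidIndex, long tabletOpIdIndex, long maxLag) {
    // Give up once the tablet has moved more than maxLag operations past
    // the config index that the pending operation was based on.
    return tabletOpIdIndex - casConfigOpidIndex <= maxLag;
  }

  public static void main(String[] args) {
    System.out.println(shouldRetry(-1, 5, 10));    // true: the tablet is only 6 ahead
    System.out.println(shouldRetry(-1, 500, 10));  // false: the tablet is 501 ahead
  }
}
```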





[jira] [Commented] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

2018-03-26 Thread Alexey Serbin (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414832#comment-16414832
 ] 

Alexey Serbin commented on KUDU-2354:
-

And another issue to look at: do follower masters continue to retry those tasks 
once they have switched from the leader to the follower role?






[jira] [Created] (KUDU-2378) Crash due to unaligned loads when building with clang 6.0

2018-03-26 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-2378:
-

 Summary: Crash due to unaligned loads when building with clang 6.0
 Key: KUDU-2378
 URL: https://issues.apache.org/jira/browse/KUDU-2378
 Project: Kudu
  Issue Type: Improvement
Affects Versions: 1.7.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon


When I built the whole tree with clang 6.0, all_types-itest crashed due to an 
illegal instruction. Looking at the assembly, it appeared that clang had 
generated a 'movaps' (aligned load) instruction for a 
*reinterpret_cast() call loading into an xmm register. We aren't 
careful about alignment when loading other integer types because unaligned 
loads of int64s don't have a high penalty, but an unaligned load of an int128 
causes a crash.

This is likely to crash on other compilers too -- surprised we haven't seen it 
yet.





[jira] [Commented] (KUDU-2323) NON_VOTER replica flapping (repeatedly added and evicted)

2018-03-26 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414801#comment-16414801
 ] 

Mike Percy commented on KUDU-2323:
--

While the speed of this cycle was likely fixed by the patch to fix KUDU-2320, 
it appears there is no code path to remove a TrackedPeer when it gets evicted. 
While this could cause a minor resource leak until a leader was evicted in a 
3-2-3 world, in a 3-4-3 world it affects last_communication_time and can 
therefore cause a downed NON_VOTER to be considered FAILED as soon as it is 
added to the config.

Maybe this is also interacting with KUDU-2354, in which there are certain cases 
that can cause a catalog manager task to endlessly retry adding a new replica.

> NON_VOTER replica flapping (repeatedly added and evicted)
> -
>
> Key: KUDU-2323
> URL: https://issues.apache.org/jira/browse/KUDU-2323
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Alexey Serbin
>Priority: Major
>
> While running a YCSB stress workload I saw a tablet get into a state where 
> the master flapped back and forth adding and then removing a replica as a 
> NON_VOTER:
> {code}
> I0221 21:54:35.341892 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.360297 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.612417 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.713057 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.725723 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.752959 28052 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:35.767974 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:35.772202 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.291569 28046 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.296468 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.328945 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.339675 28045 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.387465 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.394716 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.398644 28047 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.405082 28047 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.409888 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.414216 28046 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.417915 28048 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.423548 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:54:36.453407 28045 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:54:36.552772 28048 catalog_manager.cc:3162] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 4ffc930e9f2148c38e441f19aa230872 
> (attempt 1)
> I0221 21:58:01.300199 28053 catalog_manager.cc:3274] Sending 
> ChangeConfig:REMOVE_PEER on tablet 4ffc930e9f2148c38e441f19aa230872 (attempt 
> 1)
> I0221 21:58:01.426921 28046 catalog_manager.cc:3162] Sending 
> 

[jira] [Reopened] (KUDU-2356) Idle WALs can consume significant memory

2018-03-26 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon reopened KUDU-2356:
---

Seems the original commit made some tests flaky. Reverting until I have time to 
look at it.

> Idle WALs can consume significant memory
> 
>
> Key: KUDU-2356
> URL: https://issues.apache.org/jira/browse/KUDU-2356
> Project: Kudu
>  Issue Type: Improvement
>  Components: log, tserver
>Affects Versions: 1.7.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Major
> Fix For: 1.8.0
>
> Attachments: heap.svg
>
>
> I grabbed a heap sample of a tserver which has been running a write workload 
> for a little while and found that 750MB of memory is used by faststring 
> allocations inside WritableLogSegment::WriteEntryBatch. It seems like this is 
> the 'compress_buf_' member. This buffer always resizes up during a log write 
> but never shrinks back down, even when the WAL is idle. We should consider 
> clearing the buffer after each append, or perhaps after a short timeout like 
> 100ms after a WAL becomes idle.
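
A minimal Java sketch of the first option, releasing the scratch buffer after an append once it has grown past a threshold. The class and the `retainLimit` parameter are hypothetical; the real `compress_buf_` is a C++ faststring, and the threshold is an added assumption so that small buffers are not reallocated on every write.

```java
final class ScratchBufferSketch {
  private final int retainLimit;         // hypothetical: largest capacity kept between appends
  private byte[] scratch = new byte[0];  // plays the role of compress_buf_

  ScratchBufferSketch(int retainLimit) {
    this.retainLimit = retainLimit;
  }

  /** Stages an entry in the scratch buffer and returns the number of bytes staged. */
  int append(byte[] entry) {
    if (scratch.length < entry.length) {
      scratch = new byte[entry.length];  // grow on demand
    }
    System.arraycopy(entry, 0, scratch, 0, entry.length);
    int staged = entry.length;           // a real log would write the staged bytes out here
    if (scratch.length > retainLimit) {
      scratch = new byte[0];             // drop the oversized allocation after the append
    }
    return staged;
  }

  /** Current capacity of the scratch buffer, for inspection. */
  int capacity() {
    return scratch.length;
  }

  public static void main(String[] args) {
    ScratchBufferSketch log = new ScratchBufferSketch(1024);
    log.append(new byte[4096]);
    System.out.println("capacity after large append: " + log.capacity());
    log.append(new byte[100]);
    System.out.println("capacity after small append: " + log.capacity());
  }
}
```

The timeout-based alternative would keep the buffer across bursts of writes and release it only once the WAL has been idle for a while, at the cost of needing a timer.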





[jira] [Created] (KUDU-2377) Server fails to start up when RLIMIT_NPROC is -1

2018-03-26 Thread Adar Dembo (JIRA)
Adar Dembo created KUDU-2377:


 Summary: Server fails to start up when RLIMIT_NPROC is -1
 Key: KUDU-2377
 URL: https://issues.apache.org/jira/browse/KUDU-2377
 Project: Kudu
  Issue Type: Bug
  Components: server
Affects Versions: 1.7.0
Reporter: Adar Dembo
Assignee: Adar Dembo
 Fix For: 1.7.1


Unlike RLIMIT_NOFILE, it would appear that RLIMIT_NPROC can be set to the 
special value RLIM_INFINITY. This special value is represented as the integer 
value -1, which means it's not safe for callers of Env::GetResourceLimit to 
simply treat the returned value as a non-zero integer.

Currently GetThreadPoolThreadLimit (kserver.cc) has a perfect example of such 
misbehavior: if I open a root shell, run `ulimit -u unlimited`, then try to 
start a server, I get the following check failure:
{noformat}
I0326 13:00:33.053771 19813 env_posix.cc:1629] Not raising this process' 
running threads per effective uid limit of 18446744073709551615; it is already 
as high as it can go
F0326 13:00:33.053802 19813 threadpool.cc:106] Check failed: max_threads > 0 (0 
vs. 0) 
*** Check failure stack trace: ***
*** Aborted at 1522094433 (unix time) try "date -d @1522094433" if you are 
using GNU date ***
PC: @ 0x7fe5de4bd428 gsignal
*** SIGABRT (@0x4d65) received by PID 19813 (TID 0x7fe5d9421840) from PID 
19813; stack trace: ***
@ 0x7fe5e0207390 (unknown)
@ 0x7fe5de4bd428 gsignal
@ 0x7fe5de4bf02a abort
@ 0x7fe5df49a1d9 google::logging_fail()
@ 0x7fe5df49bb1d google::LogMessage::Fail()
@ 0x7fe5df49da03 google::LogMessage::SendToLog()
@ 0x7fe5df49b67a google::LogMessage::Flush()
@ 0x7fe5df49e3cf google::LogMessageFatal::~LogMessageFatal()
@ 0x7fe5df942bf2 kudu::ThreadPoolBuilder::set_max_threads()
@ 0x7fe5e0738fad kudu::kserver::KuduServer::Init()
@ 0x7fe5e0650a45 kudu::master::Master::Init()
@ 0x7fe5e067559d kudu::master::MiniMaster::Start()
@ 0x4b3bbb kudu::master::MasterTest::SetUp()
@ 0x7fe5e08d2477 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x7fe5e08c77f6 testing::Test::Run()
@ 0x7fe5e08c79a8 testing::TestInfo::Run()
@ 0x7fe5e08c7a85 testing::TestCase::Run()
@ 0x7fe5e08c8758 testing::internal::UnitTestImpl::RunAllTests()
@ 0x7fe5e08d2987 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x7fe5e08c7b5a testing::UnitTest::Run()
@ 0x7fe5e092c09a RUN_ALL_TESTS()
@ 0x7fe5e0929d88 main
@ 0x7fe5de4a8830 __libc_start_main
@ 0x47a429 _start
{noformat}
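
The hazard can be sketched in a few lines of Java, although the affected code is C++. The snippet models the raw limit as a 64-bit value whose all-bits-set pattern means "unlimited", and shows one defensive way to derive a thread cap from it. `threadPoolCap` and `fallbackMax` are hypothetical names (with `fallbackMax` assumed positive), not part of Kudu's Env API.

```java
final class ThreadLimitSketch {
  /** RLIM_INFINITY as described above: -1 when read as signed, all bits set. */
  static final long RLIM_INFINITY = -1L;

  /**
   * Derives a thread-pool cap from a raw resource limit.
   * fallbackMax is a hypothetical ceiling used when the limit is unlimited or very large.
   */
  static int threadPoolCap(long rawLimit, int fallbackMax) {
    if (rawLimit == RLIM_INFINITY) {
      return fallbackMax;                 // "unlimited": never do arithmetic on the sentinel
    }
    if (Long.compareUnsigned(rawLimit, fallbackMax) > 0) {
      return fallbackMax;                 // compare as unsigned so huge limits are not seen as negative
    }
    return (int) Math.max(1L, rawLimit);  // never hand a pool a cap of zero
  }

  public static void main(String[] args) {
    // The same bit pattern that the log line above prints as an unsigned number.
    System.out.println(Long.toUnsignedString(RLIM_INFINITY));
    System.out.println(threadPoolCap(RLIM_INFINITY, 4096));
    System.out.println(threadPoolCap(100L, 4096));
  }
}
```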
 





[jira] [Updated] (KUDU-2342) Non-voter replicas can be promoted and get stuck

2018-03-26 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2342:
-
Fix Version/s: (was: 1.8.0)

> Non-voter replicas can be promoted and get stuck
> 
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Fix For: 1.7.0
>
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation 
> failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions

2018-03-26 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414441#comment-16414441
 ] 

Mike Percy commented on KUDU-2335:
--

This issue affects 1.7.0 in a very minor way now (occasionally prints a warning 
message). That can happen when a leader replica is starting up or shutting down.

> Leader can report unknown health for itself during lifecycle transitions
> 
>
> Key: KUDU-2335
> URL: https://issues.apache.org/jira/browse/KUDU-2335
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master
>Affects Versions: 1.7.0
>Reporter: Alexey Serbin
>Assignee: Mike Percy
>Priority: Major
>
> The following DCHECK triggered in one of pre-commit builds with TSAN 
> configuration while running 
> {{DeleteTabletITest::TestLeaderElectionDuringDeleteTablet}} scenario:
> {noformat}
> quorum_util.cc:509] Check failed: peer_uuid != leader_uuid || healthy 
> 839fda3822054564af4a3dd547beaca1: leader reported as not healthy; config: 
> opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
> "bb39501070104f769870f437f991970e" member_type: VOTER last_known_addr { host: 
> "127.8.86.130" port: 52021 } } peers { permanent_uuid: 
> "839fda3822054564af4a3dd547beaca1" member_type: VOTER last_known_addr { host: 
> "127.8.86.129" port: 37815 } } peers { permanent_uuid: 
> "150b2dd2788f407e8537d28a21d83a80" member_type: VOTER last_known_addr { host: 
> "127.8.86.131" port: 39431 } }{noformat}
>  





[jira] [Updated] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions

2018-03-26 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2335:
-
Component/s: consensus

> Leader can report unknown health for itself during lifecycle transitions
> 
>
> Key: KUDU-2335
> URL: https://issues.apache.org/jira/browse/KUDU-2335
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master
>Affects Versions: 1.7.0
>Reporter: Alexey Serbin
>Priority: Major
>
> The following DCHECK triggered in one of pre-commit builds with TSAN 
> configuration while running 
> {{DeleteTabletITest::TestLeaderElectionDuringDeleteTablet}} scenario:
> {noformat}
> quorum_util.cc:509] Check failed: peer_uuid != leader_uuid || healthy 
> 839fda3822054564af4a3dd547beaca1: leader reported as not healthy; config: 
> opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
> "bb39501070104f769870f437f991970e" member_type: VOTER last_known_addr { host: 
> "127.8.86.130" port: 52021 } } peers { permanent_uuid: 
> "839fda3822054564af4a3dd547beaca1" member_type: VOTER last_known_addr { host: 
> "127.8.86.129" port: 37815 } } peers { permanent_uuid: 
> "150b2dd2788f407e8537d28a21d83a80" member_type: VOTER last_known_addr { host: 
> "127.8.86.131" port: 39431 } }{noformat}
>  





[jira] [Assigned] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions

2018-03-26 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy reassigned KUDU-2335:


Assignee: Mike Percy






[jira] [Updated] (KUDU-2335) Leader can report unknown health for itself during lifecycle transitions

2018-03-26 Thread Mike Percy (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Percy updated KUDU-2335:
-
Summary: Leader can report unknown health for itself during lifecycle 
transitions  (was: Debug assert in quorum_util.cc)

> Leader can report unknown health for itself during lifecycle transitions
> 
>
> Key: KUDU-2335
> URL: https://issues.apache.org/jira/browse/KUDU-2335
> Project: Kudu
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexey Serbin
>Priority: Major
>
> The following DCHECK triggered in one of pre-commit builds with TSAN 
> configuration while running 
> {{DeleteTabletITest::TestLeaderElectionDuringDeleteTablet}} scenario:
> {noformat}
> quorum_util.cc:509] Check failed: peer_uuid != leader_uuid || healthy 
> 839fda3822054564af4a3dd547beaca1: leader reported as not healthy; config: 
> opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: 
> "bb39501070104f769870f437f991970e" member_type: VOTER last_known_addr { host: 
> "127.8.86.130" port: 52021 } } peers { permanent_uuid: 
> "839fda3822054564af4a3dd547beaca1" member_type: VOTER last_known_addr { host: 
> "127.8.86.129" port: 37815 } } peers { permanent_uuid: 
> "150b2dd2788f407e8537d28a21d83a80" member_type: VOTER last_known_addr { host: 
> "127.8.86.131" port: 39431 } }{noformat}
>  





[jira] [Updated] (KUDU-2153) Servers delete tmp files before obtaining directory lock

2018-03-26 Thread Grant Henke (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-2153:
--
Fix Version/s: (was: 1.7.x)
   1.7.0

> Servers delete tmp files before obtaining directory lock
> 
>
> Key: KUDU-2153
> URL: https://issues.apache.org/jira/browse/KUDU-2153
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 1.2.0, 1.3.1, 1.4.0, 1.5.0
>Reporter: Todd Lipcon
>Assignee: Todd Lipcon
>Priority: Critical
> Fix For: 1.7.0, 1.8.0
>
>
> In FsManager::Open() we currently call DeleteTmpFiles very early, before 
> starting the block manager. This means that, if you accidentally start a 
> tserver while another is running, it's possible for it to delete temporary 
> files that are in-use by the running tserver, causing it to exhibit strange 
> behavior, crash, etc (as in KUDU-2152).





[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs

2018-03-26 Thread Andrew Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414249#comment-16414249
 ] 

Andrew Wong commented on KUDU-2359:
---

This should be doable by extending the architecture in place for the `kudu fs 
update_dirs` tool. The caveat here, and with the update tool, is that any 
tablets that are/were on the missing data directory are/should be started up in 
a failed state so they can be evicted and re-replicated elsewhere. For the 
update tool, we have operators confront this tradeoff by requiring them to 
specify the `--force` flag. Ideally a similar flag could be used here, so at 
least the mean time to recovery is gated by the time it takes to update a flag, 
rather than the time it takes to run `kudu fs update_dirs`.

It also raises the question: would operators even care about those failed 
tablets? If our re-replication story is robust enough to handle everything on 
its own, it could be seen as a pointless configuration. I suppose exposing it 
as a flag initially would give us that sort of info.

> tserver should allow starting with a small number of missing data dirs
> --
>
> Key: KUDU-2359
> URL: https://issues.apache.org/jira/browse/KUDU-2359
> Project: Kudu
>  Issue Type: Improvement
>  Components: fs, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the 
> server is restarted. Currently, Kudu will respond to this by failing to 
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() 
> Bad status: Already present: FS layout already exists; not overwriting 
> existing layout. See 
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: 
> unable to create file system roots: FSManager roots already exist: 
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk 
> failure" work. One could use the update_data_dirs tool to remove the missing 
> disk, but you'd also need to persistently change the configuration of the 
> daemon, which is hard to do with consistent configuration management.





[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Michael Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414119#comment-16414119
 ] 

Michael Brown commented on KUDU-2375:
-

Taking this out of P1. Apparently on this long-lived, shared cluster, someone 
recently added some tables with Decimal. Surely it's one of these that's 
causing this problem. Sorry for mis-reading that message before.

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is 
> missing required fields: schema.columns[5].type
> 
>
> Key: KUDU-2375
> URL: https://issues.apache.org/jira/browse/KUDU-2375
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Reporter: Michael Brown
> Priority: Major
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the 
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
> backward incompatibility was not intended.
> {noformat}
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
> metadata into memory failed: Corruption: Failed while visiting tables in sys 
> catalog: unable to parse metadata field for row 
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
> type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
> schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}





[jira] [Updated] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Michael Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Brown updated KUDU-2375:

Priority: Major  (was: Blocker)

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is 
> missing required fields: schema.columns[5].type
> 
>
> Key: KUDU-2375
> URL: https://issues.apache.org/jira/browse/KUDU-2375
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Reporter: Michael Brown
> Priority: Major
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the 
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
> backward incompatibility was not intended.
> {noformat}
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
> metadata into memory failed: Corruption: Failed while visiting tables in sys 
> catalog: unable to parse metadata field for row 
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
> type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
> schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}





[jira] [Updated] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Michael Brown (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Brown updated KUDU-2375:

Description: 
A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the problem 
go away.

Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
backward incompatibility was not intended.
{noformat}
F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
metadata into memory failed: Corruption: Failed while visiting tables in sys 
catalog: unable to parse metadata field for row 
467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
schema.columns[5].type
{noformat}
{noformat}
#0  0x003355e32625 in raise () from /lib64/libc.so.6
#1  0x003355e33e05 in abort () from /lib64/libc.so.6
#2  0x01cea129 in ?? ()
#3  0x009268cd in google::LogMessage::Fail() ()
#4  0x0092878d in google::LogMessage::SendToLog() ()
#5  0x00926409 in google::LogMessage::Flush() ()
#6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
#7  0x008f05de in ?? ()
#8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
#9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
#10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
#11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x003355ee893d in clone () from /lib64/libc.so.6
{noformat}


  was:
A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the problem 
go away.

Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
backward incompatibility was not intended.
{noformat}
I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for table 
impala::tpcds_1000_kudu.web_returns [id=40c35b333fa84bb8ad331fab02e03fdf]
F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
metadata into memory failed: Corruption: Failed while visiting tables in sys 
catalog: unable to parse metadata field for row 
467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
schema.columns[5].type
{noformat}
{noformat}
#0  0x003355e32625 in raise () from /lib64/libc.so.6
#1  0x003355e33e05 in abort () from /lib64/libc.so.6
#2  0x01cea129 in ?? ()
#3  0x009268cd in google::LogMessage::Fail() ()
#4  0x0092878d in google::LogMessage::SendToLog() ()
#5  0x00926409 in google::LogMessage::Flush() ()
#6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
#7  0x008f05de in ?? ()
#8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
#9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
#10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
#11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x003355ee893d in clone () from /lib64/libc.so.6
{noformat}
This is on a long-lived cluster that has had Impala and Kudu slowly upgrading 
with mostly dev releases over time for a few months. Here's the Impala {{SHOW 
CREATE TABLE}}:
{noformat}
| CREATE TABLE tpcds_1000_kudu.web_returns ( |
|   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,

[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Michael Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414104#comment-16414104
 ] 

Michael Brown commented on KUDU-2375:
-

Good point, [~tlipcon]. I misread the Kudu master error message (It says 
"Loaded" in the first of the two messages, and I read it the other day as 
"Loading"). Let me at least clean that out from the Description.

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is 
> missing required fields: schema.columns[5].type
> 
>
> Key: KUDU-2375
> URL: https://issues.apache.org/jira/browse/KUDU-2375
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Reporter: Michael Brown
> Priority: Blocker
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the 
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
> backward incompatibility was not intended.
> {noformat}
> I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for 
> table impala::tpcds_1000_kudu.web_returns 
> [id=40c35b333fa84bb8ad331fab02e03fdf]
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
> metadata into memory failed: Corruption: Failed while visiting tables in sys 
> catalog: unable to parse metadata field for row 
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
> type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
> schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
> This is on a long-lived cluster that has had Impala and Kudu slowly upgrading 
> with mostly dev releases over time for a few months. Here's the Impala {{SHOW 
> CREATE TABLE}}:
> {noformat}
> | CREATE TABLE tpcds_1000_kudu.web_returns ( |
> |   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_web_page_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_reason_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   

[jira] [Commented] (KUDU-2372) Don't let kudu start up if any disks are mounted read-only

2018-03-26 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414100#comment-16414100
 ] 

Todd Lipcon commented on KUDU-2372:
---

Per KUDU-2359, I think it may make sense to allow starting up with a bad disk so 
that we don't need manual intervention after a single disk failure (e.g. on a 
12-disk host).

> Don't let kudu start up if any disks are mounted read-only
> --
>
> Key: KUDU-2372
> URL: https://issues.apache.org/jira/browse/KUDU-2372
> Project: Kudu
> Issue Type: Improvement
> Components: fs
> Reporter: Andrew Wong
> Priority: Major
>
> Today, if a Kudu tserver runs into EROFS (read-only mount error), it treats 
> the error as it would a complete disk failure (EIO), allowing successful 
> startup of the server, but failing the tablets that are configured to use the 
> "failed" disk.
> If something is wrong with the mounting of a disk, it might be helpful to 
> bring immediate attention to it, and have operators deal with it, rather than 
> handling it automatically. As such, it might be helpful to prevent Kudu from 
> starting up if errors are detected with the mount configurations.
> There are tradeoffs here to be considered:
>  * The current behavior, as it is today, will evict and delete the data from 
> the failed tablets, as it is treated as an unrecoverable failure. The user 
> can ignore such failures and handle it at their leisure, since Kudu will 
> re-replicate the tablets lost in this way
>  * If we were to instead crash, this gives operators some immediate feedback 
> and a time limit to use `kudu fs update_dirs` to remove the read only drive, 
> or maybe fix the mountpoint itself
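
The two behaviors weighed in the description above can be sketched as a small policy function. This is an illustration only, not Kudu code: `classify_dir_error` and `fail_fast_on_read_only` are hypothetical names, and treating every other errno as "other" is an assumption made to keep the sketch short.

```python
import errno


def classify_dir_error(err_no, fail_fast_on_read_only=False):
    """Map an errno hit on a data dir to an action.

    Returns "fail_dir" (mark the disk failed and fail its tablets),
    "crash" (refuse to start), or "other" (outside this sketch).
    """
    if err_no == errno.EROFS:
        # Today's behavior treats a read-only mount like a disk failure;
        # the proposal is to stop instead so an operator notices.
        return "crash" if fail_fast_on_read_only else "fail_dir"
    if err_no == errno.EIO:
        # A real disk failure keeps the existing handling either way.
        return "fail_dir"
    return "other"
```

Keeping the choice behind a single switch makes it clear that only the EROFS case changes; EIO handling is the same under both options.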





[jira] [Resolved] (KUDU-2374) Expose an interface in RpcContext to report the time the InboundCall is received

2018-03-26 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon resolved KUDU-2374.
---
   Resolution: Fixed
Fix Version/s: 1.8.0

> Expose an interface in RpcContext to report the time the InboundCall is 
> received
> 
>
> Key: KUDU-2374
> URL: https://issues.apache.org/jira/browse/KUDU-2374
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Affects Versions: 1.7.0
> Reporter: Michael Ho
> Assignee: Michael Ho
> Priority: Minor
> Fix For: 1.8.0
>
>
> {{InboundCall::GetTimeReceived()}} returns the time in which the inbound call 
> was received. While the dispatch and processing time of RPCs are already 
> reported in histogram in the service queue, it's helpful to make this 
> accessible to the RPC handler for its own book-keeping purpose (e.g. 
> reporting the average dispatch latency as part of query profile in Impala).
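
As a rough illustration of the book-keeping mentioned above, the sketch below shows a handler-side context exposing the receive time of its call and deriving a dispatch latency from it. The classes are toy stand-ins written in Python, not the Kudu C++ classes; `time_received` and `dispatch_latency` are hypothetical method names, and the injectable `clock` exists only to make the example deterministic.

```python
import time


class InboundCall:
    """Toy stand-in for an inbound RPC; records when it was received."""

    def __init__(self, clock=time.monotonic):
        self._time_received = clock()

    def time_received(self):
        return self._time_received


class RpcContext:
    """Toy stand-in for the context handed to an RPC handler."""

    def __init__(self, call, clock=time.monotonic):
        self._call = call
        self._clock = clock

    def time_received(self):
        # Expose the call's receive time to the handler.
        return self._call.time_received()

    def dispatch_latency(self):
        # Time elapsed between receipt of the call and "now" in the handler.
        return self._clock() - self._call.time_received()
```

A handler could call `dispatch_latency()` on entry and fold the value into its own per-query statistics.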





[jira] [Commented] (KUDU-2375) Can't parse message of type "kudu.master.SysTablesEntryPB" because it is missing required fields: schema.columns[5].type

2018-03-26 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414097#comment-16414097
 ] 

Todd Lipcon commented on KUDU-2375:
---

You sure that table 467d365fffbe4485a3249079c48f42a9 is the one you pasted? My 
guess is that it's one that has a DECIMAL type column in its 5th (0-indexed) 
position, and when you downgrade to 1.6.0 it doesn't know what to make of the 
DECIMAL. Agreed the error message and behavior could be a lot better. 
[~granthenke] what do you think?
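
For illustration, the failure mode guessed at above can be mimicked with a small sketch. This is neither Kudu nor protobuf code. It assumes that a reader treats a required enum field holding a value it does not recognize as unset, which is my understanding of proto2 semantics. The type names and `missing_required_fields` are made up for the example.

```python
# Types a hypothetical "old" reader knows about; DECIMAL is absent.
KNOWN_TYPES_OLD = {"INT32", "INT64", "STRING"}


def missing_required_fields(columns, known_types):
    """Return the paths of 'type' fields that the reader would see as unset.

    columns is a list of dicts such as {"type": "INT32"}. A type the reader
    does not know is reported the same way as a missing required field.
    """
    missing = []
    for i, col in enumerate(columns):
        if col.get("type") not in known_types:
            missing.append("schema.columns[%d].type" % i)
    return missing
```

A schema whose column at index 5 uses a type added in a newer release would be reported by the older reader as `schema.columns[5].type`, matching the shape of the error in the log.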

> Can't parse message of type "kudu.master.SysTablesEntryPB" because it is 
> missing required fields: schema.columns[5].type
> 
>
> Key: KUDU-2375
> URL: https://issues.apache.org/jira/browse/KUDU-2375
> Project: Kudu
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Reporter: Michael Brown
> Priority: Blocker
>
> A downgrade from 1.7.0 to 1.6 results in a dcheck when 1.6 starts. I see the 
> same dcheck on both 1.6.0 and 1.6.x. A "revert" back to 1.7.0 makes the 
> problem go away.
> Although the symptom is in 1.6, I've filed this against 1.7.0 assuming the 
> backward incompatibility was not intended.
> {noformat}
> I0324 17:45:10.681015 105716 catalog_manager.cc:306] Loaded metadata for 
> table impala::tpcds_1000_kudu.web_returns 
> [id=40c35b333fa84bb8ad331fab02e03fdf]
> F0324 17:45:10.681808 105716 catalog_manager.cc:935] Loading table and tablet 
> metadata into memory failed: Corruption: Failed while visiting tables in sys 
> catalog: unable to parse metadata field for row 
> 467d365fffbe4485a3249079c48f42a9: Error parsing msg: Can't parse message of 
> type "kudu.master.SysTablesEntryPB" because it is missing required fields: 
> schema.columns[5].type
> {noformat}
> {noformat}
> #0  0x003355e32625 in raise () from /lib64/libc.so.6
> #1  0x003355e33e05 in abort () from /lib64/libc.so.6
> #2  0x01cea129 in ?? ()
> #3  0x009268cd in google::LogMessage::Fail() ()
> #4  0x0092878d in google::LogMessage::SendToLog() ()
> #5  0x00926409 in google::LogMessage::Flush() ()
> #6  0x0092922f in google::LogMessageFatal::~LogMessageFatal() ()
> #7  0x008f05de in ?? ()
> #8  0x008f6039 in kudu::master::CatalogManager::PrepareForLeadershipTask() ()
> #9  0x01d297d7 in kudu::ThreadPool::DispatchThread() ()
> #10 0x01d20151 in kudu::Thread::SuperviseThread(void*) ()
> #11 0x003356207aa1 in start_thread () from /lib64/libpthread.so.0
> #12 0x003355ee893d in clone () from /lib64/libc.so.6
> {noformat}
> This is on a long-lived cluster that has had Impala and Kudu slowly upgrading 
> with mostly dev releases over time for a few months. Here's the Impala {{SHOW 
> CREATE TABLE}}:
> {noformat}
> | CREATE TABLE tpcds_1000_kudu.web_returns ( |
> |   wr_returned_date_sk INT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_order_number BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_item_sk BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returned_time_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_refunded_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_customer_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_cdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_hdemo_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_returning_addr_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_web_page_sk INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
> |   wr_reason_sk INT NULL