[jira] [Commented] (KUDU-1930) Improve performance of dictionary builder
[ https://issues.apache.org/jira/browse/KUDU-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905984#comment-15905984 ] Todd Lipcon commented on KUDU-1930: ---
Tested with commands like:
{code}
./build/latest/bin/tpch_real_world -tpch_path_to_ts_flags_file ./tsflags -tpch_scaling_factor 100 -tpch_num_inserters 8 -notpch_run_queries -tpch_path_to_dbgen_dir /data/2/mpercy/tpch_2_17_0/dbgen -tpch_partition_strategy hash
{code}
(the new 'hash' partition strategy is from a simple local patch)

Results:
{code}
with 1 MM thread:
I0310 13:53:38.925390  1568 tpch_real_world.cc:278] Time spent by thread 2 to load generated data into the database: real 1140.187s user 398.411s sys 5.315s

with 4 MM threads:
I0310 14:06:35.509299  8524 tpch_real_world.cc:278] Time spent by thread 4 to load generated data into the database: real 618.348s user 413.431s sys 5.437s

with 4 MM threads, hash partition:
I0310 15:32:52.386118 27623 tpch_real_world.cc:289] Time spent by thread 2 to load generated data into the database: real 1233.386s user 462.671s sys 6.084s

with 4 MM threads, using dense_hash_map instead of std::unordered_map for dictionary builder:
I0310 17:26:00.682138 32076 tpch_real_world.cc:289] Time spent by thread 0 to load generated data into the database: real 1147.478s user 464.147s sys 6.200s
{code}
The "user" times here are from the client side, so not that relevant, whereas "real" is the total wall time taken. It seems like dense_hash_map is an easy 7% speedup relative to the STL map. As we've long known, inserting in sorted order (range partitioned) is 2x faster than inserting in non-sorted order (and the longer the benchmark runs, the more the difference is magnified).
> Improve performance of dictionary builder > - > > Key: KUDU-1930 > URL: https://issues.apache.org/jira/browse/KUDU-1930 > Project: Kudu > Issue Type: Bug > Components: cfile, perf >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > > I locally tweaked tpch_real_world to use hash partitioning instead of range > partitioning, so that the different threads overlapped on the same tablets, > simulating a more realistic parallel load scenario. I noticed that the MM > threads were CPU bound, with a high percentage of CPU in AddCodeWords(). > Initial prototypes indicate that optimizing the hashmap used here would be an > easy win. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
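For illustration, a minimal sketch of the kind of hash-map swap benchmarked above — not the actual Kudu cfile dictionary builder; the class and member names are hypothetical, and it assumes the sparsehash library is available:
{code}
#include <cstdint>
#include <string>
#include <vector>
#include <sparsehash/dense_hash_map>

// Toy dictionary builder: maps each distinct word to a dense codeword.
// The real builder keys on Slice values backed by an arena.
class ToyDictBuilder {
 public:
  ToyDictBuilder() {
    // dense_hash_map requires reserving an "empty" key up front; using ""
    // here means the empty string can't itself be a dictionary entry.
    codes_.set_empty_key(std::string());
  }

  // Returns the codeword for 'word', assigning the next free one on first use.
  uint32_t AddCodeWord(const std::string& word) {
    auto it = codes_.find(word);
    if (it != codes_.end()) {
      return it->second;
    }
    uint32_t code = static_cast<uint32_t>(dict_.size());
    dict_.push_back(word);
    codes_[word] = code;  // open-addressed table: cheaper inserts than std::unordered_map
    return code;
  }

 private:
  google::dense_hash_map<std::string, uint32_t> codes_;
  std::vector<std::string> dict_;  // codeword -> word, in insertion order
};
{code}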
[jira] [Commented] (KUDU-1930) Improve performance of dictionary builder
[ https://issues.apache.org/jira/browse/KUDU-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905985#comment-15905985 ] Todd Lipcon commented on KUDU-1930: ---
oops, forgot to paste the flags file I used:
{code}
--memory_limit_hard_bytes=4294967296
-maintenance_manager_num_threads=4
{code}
> Improve performance of dictionary builder > - > > Key: KUDU-1930 > URL: https://issues.apache.org/jira/browse/KUDU-1930 > Project: Kudu > Issue Type: Bug > Components: cfile, perf >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > > I locally tweaked tpch_real_world to use hash partitioning instead of range > partitioning, so that the different threads overlapped on the same tablets, > simulating a more realistic parallel load scenario. I noticed that the MM > threads were CPU bound, with a high percentage of CPU in AddCodeWords(). > Initial prototypes indicate that optimizing the hashmap used here would be an > easy win. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (KUDU-1930) Improve performance of dictionary builder
Todd Lipcon created KUDU-1930: - Summary: Improve performance of dictionary builder Key: KUDU-1930 URL: https://issues.apache.org/jira/browse/KUDU-1930 Project: Kudu Issue Type: Bug Components: cfile, perf Affects Versions: 1.3.0 Reporter: Todd Lipcon Assignee: Todd Lipcon I locally tweaked tpch_real_world to use hash partitioning instead of range partitioning, so that the different threads overlapped on the same tablets, simulating a more realistic parallel load scenario. I noticed that the MM threads were CPU bound, with a high percentage of CPU in AddCodeWords(). Initial prototypes indicate that optimizing the hashmap used here would be an easy win. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (KUDU-1929) [rpc] Allow using encrypted private keys for TLS
Sailesh Mukil created KUDU-1929: --- Summary: [rpc] Allow using encrypted private keys for TLS Key: KUDU-1929 URL: https://issues.apache.org/jira/browse/KUDU-1929 Project: Kudu Issue Type: Improvement Components: rpc Reporter: Sailesh Mukil Assignee: Sailesh Mukil Currently, for internal RPC communication, we aren't able to handle encrypted private keys. This can be done by using the OpenSSL APIs:
SSL_CTX_set_default_passwd_cb()
SSL_CTX_set_default_passwd_cb_userdata()
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
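For reference, a sketch of how those two calls fit together when loading an encrypted PEM key — a minimal example, not the eventual Kudu patch; the function names here are hypothetical:
{code}
#include <cstring>
#include <openssl/ssl.h>

// Supplies the key passphrase to OpenSSL without an interactive prompt.
// 'userdata' is whatever was registered via ..._cb_userdata() below.
static int PassphraseCallback(char* buf, int size, int /*rwflag*/, void* userdata) {
  const char* passphrase = static_cast<const char*>(userdata);
  int len = static_cast<int>(strlen(passphrase));
  if (len > size) {
    len = size;
  }
  memcpy(buf, passphrase, len);
  return len;  // OpenSSL interprets the return value as the passphrase length.
}

// Registers the callback, then loads the encrypted private key.
// Returns true on success.
bool LoadEncryptedPrivateKey(SSL_CTX* ctx, const char* key_path,
                             const char* passphrase) {
  SSL_CTX_set_default_passwd_cb(ctx, &PassphraseCallback);
  SSL_CTX_set_default_passwd_cb_userdata(ctx, const_cast<char*>(passphrase));
  return SSL_CTX_use_PrivateKey_file(ctx, key_path, SSL_FILETYPE_PEM) == 1;
}
{code}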
[jira] [Comment Edited] (KUDU-1736) kudu crash in debug build: unordered undo delta
[ https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905847#comment-15905847 ] Andrew Wong edited comment on KUDU-1736 at 3/10/17 11:09 PM: -
Also saw this on Jenkins:
{code}
delta_tracker.cc:208] Check failed: first_copy->delta_stats().min_timestamp() >= second_copy->delta_stats().min_timestamp() (5678 vs. 5681) Found out-of-order deltas: [{11440216355432745334 (ts range=[5678, 5775], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:3))}, {11440216355432745315 (ts range=[5681, 5761], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:18))}]: type = 1
    @     0x7f4f0ef11c37  gsignal at ??:0
    @     0x7f4f0ef15028  abort at ??:0
    @     0x7f4f1bc02ae4  kudu::tablet::DeltaTracker::ValidateDeltaOrder() at ??:0
    @     0x7f4f1bc031c5  kudu::tablet::DeltaTracker::AtomicUpdateStores() at ??:0
    @     0x7f4f1bbe8304  kudu::tablet::MajorDeltaCompaction::UpdateDeltaTracker() at ??:0
    @     0x7f4f1baf7e35  kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds() at ??:0
    @     0x7f4f1baf7811  kudu::tablet::DiskRowSet::MajorCompactDeltaStores() at ??:0
    @     0x7f4f1b9d7687  kudu::tablet::Tablet::CompactWorstDeltas() at ??:0
    @           0x5e5cbc  kudu::tablet::MultiThreadedTabletTest<>::CompactDeltas() at /home/jenkins-slave/workspace/kudu-master/2/src/kudu/tablet/mt-tablet-test.cc:307 (discriminator 17)
    @           0x5e48e8  boost::_bi::bind_t<>::operator()() at /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/common/include/boost/bind/bind.hpp:1223
    @     0x7f4f143a5d3f  boost::function0<>::operator()() at ??:0
    @     0x7f4f121565e4  kudu::Thread::SuperviseThread() at ??:0
    @     0x7f4f17da4184  start_thread at ??:0
    @     0x7f4f0efd537d  clone at ??:0
{code}
Failure log can be found here: http://104.196.14.100/job/kudu-gerrit/6963/BUILD_TYPE=ASAN/

was (Author: andrew.wong):
Also saw this on Jenkins:
```
delta_tracker.cc:208] Check failed: first_copy->delta_stats().min_timestamp() >= second_copy->delta_stats().min_timestamp() (5678 vs. 5681) Found out-of-order deltas: [{11440216355432745334 (ts range=[5678, 5775], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:3))}, {11440216355432745315 (ts range=[5681, 5761], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:18))}]: type = 1
    @     0x7f4f0ef11c37  gsignal at ??:0
    @     0x7f4f0ef15028  abort at ??:0
    @     0x7f4f1bc02ae4  kudu::tablet::DeltaTracker::ValidateDeltaOrder() at ??:0
    @     0x7f4f1bc031c5  kudu::tablet::DeltaTracker::AtomicUpdateStores() at ??:0
    @     0x7f4f1bbe8304  kudu::tablet::MajorDeltaCompaction::UpdateDeltaTracker() at ??:0
    @     0x7f4f1baf7e35  kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds() at ??:0
    @     0x7f4f1baf7811  kudu::tablet::DiskRowSet::MajorCompactDeltaStores() at ??:0
    @     0x7f4f1b9d7687  kudu::tablet::Tablet::CompactWorstDeltas() at ??:0
    @           0x5e5cbc  kudu::tablet::MultiThreadedTabletTest<>::CompactDeltas() at /home/jenkins-slave/workspace/kudu-master/2/src/kudu/tablet/mt-tablet-test.cc:307 (discriminator 17)
    @           0x5e48e8  boost::_bi::bind_t<>::operator()() at /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/common/include/boost/bind/bind.hpp:1223
    @     0x7f4f143a5d3f  boost::function0<>::operator()() at ??:0
    @     0x7f4f121565e4  kudu::Thread::SuperviseThread() at ??:0
    @     0x7f4f17da4184  start_thread at ??:0
    @     0x7f4f0efd537d  clone at ??:0
```
Failure log can be found here: http://104.196.14.100/job/kudu-gerrit/6963/BUILD_TYPE=ASAN/
> kudu crash in debug build: unordered undo delta > --- > > Key: KUDU-1736 > URL: https://issues.apache.org/jira/browse/KUDU-1736 > Project: Kudu > Issue Type: Bug > Components: tablet >Reporter: zhangsong >Priority: Critical > Attachments: mt-tablet-test.txt.gz > > > in jd cluster we met a kudu-tserver crash with fatal messages described as > follows: > Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in > sorted order (ascending key, then descending ts): got key (row > 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072) > This is a DCHECK which should not fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1736) kudu crash in debug build: unordered undo delta
[ https://issues.apache.org/jira/browse/KUDU-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905847#comment-15905847 ] Andrew Wong commented on KUDU-1736: ---
Also saw this on Jenkins:
```
delta_tracker.cc:208] Check failed: first_copy->delta_stats().min_timestamp() >= second_copy->delta_stats().min_timestamp() (5678 vs. 5681) Found out-of-order deltas: [{11440216355432745334 (ts range=[5678, 5775], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:3))}, {11440216355432745315 (ts range=[5681, 5761], delete_count=[0], reinsert_count=[0], update_counts_by_col_id=[12:18))}]: type = 1
    @     0x7f4f0ef11c37  gsignal at ??:0
    @     0x7f4f0ef15028  abort at ??:0
    @     0x7f4f1bc02ae4  kudu::tablet::DeltaTracker::ValidateDeltaOrder() at ??:0
    @     0x7f4f1bc031c5  kudu::tablet::DeltaTracker::AtomicUpdateStores() at ??:0
    @     0x7f4f1bbe8304  kudu::tablet::MajorDeltaCompaction::UpdateDeltaTracker() at ??:0
    @     0x7f4f1baf7e35  kudu::tablet::DiskRowSet::MajorCompactDeltaStoresWithColumnIds() at ??:0
    @     0x7f4f1baf7811  kudu::tablet::DiskRowSet::MajorCompactDeltaStores() at ??:0
    @     0x7f4f1b9d7687  kudu::tablet::Tablet::CompactWorstDeltas() at ??:0
    @           0x5e5cbc  kudu::tablet::MultiThreadedTabletTest<>::CompactDeltas() at /home/jenkins-slave/workspace/kudu-master/2/src/kudu/tablet/mt-tablet-test.cc:307 (discriminator 17)
    @           0x5e48e8  boost::_bi::bind_t<>::operator()() at /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/common/include/boost/bind/bind.hpp:1223
    @     0x7f4f143a5d3f  boost::function0<>::operator()() at ??:0
    @     0x7f4f121565e4  kudu::Thread::SuperviseThread() at ??:0
    @     0x7f4f17da4184  start_thread at ??:0
    @     0x7f4f0efd537d  clone at ??:0
```
Failure log can be found here: http://104.196.14.100/job/kudu-gerrit/6963/BUILD_TYPE=ASAN/
> kudu crash in debug build: unordered undo delta > --- > > Key: KUDU-1736 > URL: https://issues.apache.org/jira/browse/KUDU-1736 > Project: Kudu > Issue Type: Bug > Components: tablet >Reporter: zhangsong >Priority: Critical > Attachments: mt-tablet-test.txt.gz > > > in jd cluster we met a kudu-tserver crash with fatal messages described as > follows: > Check failed: last_key_.CompareTo(key) <= 0 must insert undo deltas in > sorted order (ascending key, then descending ts): got key (row > 1422@tx6052042821982183424) after (row 1422@tx6052042821953155072) > This is a DCHECK which should not fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905836#comment-15905836 ] Adar Dembo commented on KUDU-1927: -- bq. At the very least we should change it to CHECK, if you agree. I'm fine with that. > Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905818#comment-15905818 ] David Alves commented on KUDU-1927: --- Well if it's a rare race it might go unnoticed and happen in prod, at which time a DCHECK wouldn't help. In the specific case I was looking at such a race was possible (though I don't think it actually happened). At the very least we should change it to CHECK, if you agree. > Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905799#comment-15905799 ] Adar Dembo commented on KUDU-1927: -- bq. Curious: Why not just initialized leader_status_ (maybe even both) to non-OK until we check and make sure it's actually fine? Ignoring for a minute that there's no canonical "non-OK" status (i.e. we'd have to pick something arbitrarily, and whatever we choose may introduce side effects), it shouldn't be necessary in the first place, because leader_status() DCHECKs if catalog_status() is not OK. So, if you forgot to check catalog_status() before checking leader_status() and catalog_status() was actually not OK, the new test (there is a new test, right?) will crash. > Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905783#comment-15905783 ] David Alves commented on KUDU-1927: --- [~adar] thanks for pointing that out, I missed it. Tbh even if it was by design, I still think it's problematic because it's unexpected. I've seen code (in-flight patch) that missed that subtlety. Curious: Why not just initialize leader_status_ (maybe even both) to non-OK until we check and make sure it's actually fine? > Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905773#comment-15905773 ] Adar Dembo commented on KUDU-1927: --
bq. I think there is a subtle bug in ScopedLeaderSharedLock in which whatever status is returned by "leader_status()" should be checked after checking "catalog_status()".
It's not a bug, it's a "feature" :). Look at the comments on those methods:
{noformat}
  // General status of the catalog manager. If not OK (e.g. the catalog
  // manager is still being initialized), all operations are illegal and
  // leader_status() should not be trusted.
  const Status& catalog_status() const { return catalog_status_; }

  // Leadership status of the catalog manager. If not OK, the catalog
  // manager is not the leader, but some operations may still be legal.
  const Status& leader_status() const {
    DCHECK(catalog_status_.ok());
    return leader_status_;
  }

  // First non-OK status of the catalog manager, adhering to the checking
  // order specified above.
  const Status& first_failed_status() const {
    if (!catalog_status_.ok()) {
      return catalog_status_;
    }
    return leader_status_;
  }
{noformat}
As far as ConnectToMaster() is concerned, the CheckIsInitializedOrRespond() call will lead to an early return if catalog_status() is not OK, which makes the subsequent leader_status() check safe.
> Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905764#comment-15905764 ] David Alves commented on KUDU-1927: --- I think there is a subtle bug in ScopedLeaderSharedLock in which whatever status is returned by "leader_status()" should be checked after checking "catalog_status()". This is because statuses are, by default, initialized to OK, and in its ctor ScopedLeaderSharedLock doesn't explicitly initialize leader_status_ or catalog_status_. Specifically, if catalog_->state_ != kRunning the lock would report a non-OK status for catalog_status_ but an OK status for leader_status_. > Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
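A self-contained toy that reproduces the pitfall being described here (hypothetical, condensed stand-ins for Status and the lock, not the real classes):
{code}
#include <iostream>
#include <string>
#include <utility>

// Stand-in for kudu::Status: default-constructed means OK.
struct Status {
  std::string msg;  // empty == OK
  bool ok() const { return msg.empty(); }
  static Status IllegalState(std::string m) { return Status{std::move(m)}; }
};

// Condensed version of the constructor logic under discussion.
struct ScopedLeaderSharedLock {
  Status catalog_status_;  // both members default to OK...
  Status leader_status_;

  ScopedLeaderSharedLock(bool catalog_running, bool is_leader) {
    if (!catalog_running) {
      catalog_status_ = Status::IllegalState("catalog manager is not running");
      return;  // leader_status_ is never assigned and still reads as OK!
    }
    if (!is_leader) {
      leader_status_ = Status::IllegalState("not the leader");
    }
  }
};

int main() {
  ScopedLeaderSharedLock l(/*catalog_running=*/false, /*is_leader=*/false);
  // A caller that checks only leader_status_ is fooled:
  std::cout << std::boolalpha
            << "leader_status ok?  " << l.leader_status_.ok() << "\n"    // true (!)
            << "catalog_status ok? " << l.catalog_status_.ok() << "\n";  // false
}
{code}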
[jira] [Commented] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905758#comment-15905758 ] Dan Burkert commented on KUDU-1843: --- I continue to be worried about requiring client IDs to be unguessable for security reasons. We haven't treated client IDs as sensitive information, and we would need to begin doing that. > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905732#comment-15905732 ] Dan Burkert commented on KUDU-1843: --- Isn't the result cache keyed on the username, since it's part of RequestIdPB? > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1843: -- Code Review: https://gerrit.cloudera.org/#/c/6347/ > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1918) Prevent hijacking of scanners by other users
[ https://issues.apache.org/jira/browse/KUDU-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1918: -- Status: In Review (was: Open) > Prevent hijacking of scanners by other users > > > Key: KUDU-1918 > URL: https://issues.apache.org/jira/browse/KUDU-1918 > Project: Kudu > Issue Type: Improvement > Components: security, tserver >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > > Currently the UUIDs used for scanner IDs are using boost::uuid, which doesn't > necessarily use a secure random source. If these turn out to be predictable, > some attack around scanner hijacking might be possible. We should use an > unpredictable source for scanner IDs, or save the original authenticated user > in the Scanner and ensure that the authentication does not switch mid-scan. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1843: -- Status: In Review (was: Open) > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1918) Prevent hijacking of scanners by other users
[ https://issues.apache.org/jira/browse/KUDU-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1918: -- Code Review: http://gerrit.cloudera.org:8080/6348 > Prevent hijacking of scanners by other users > > > Key: KUDU-1918 > URL: https://issues.apache.org/jira/browse/KUDU-1918 > Project: Kudu > Issue Type: Improvement > Components: security, tserver >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > > Currently the UUIDs used for scanner IDs are using boost::uuid, which doesn't > necessarily use a secure random source. If these turn out to be predictable, > some attack around scanner hijacking might be possible. We should use an > unpredictable source for scanner IDs, or save the original authenticated user > in the Scanner and ensure that the authentication does not switch mid-scan. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (KUDU-1918) Prevent hijacking of scanners by other users
[ https://issues.apache.org/jira/browse/KUDU-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned KUDU-1918: - Assignee: Todd Lipcon > Prevent hijacking of scanners by other users > > > Key: KUDU-1918 > URL: https://issues.apache.org/jira/browse/KUDU-1918 > Project: Kudu > Issue Type: Improvement > Components: security, tserver >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon > > Currently the UUIDs used for scanner IDs are using boost::uuid, which doesn't > necessarily use a secure random source. If these turn out to be predictable, > some attack around scanner hijacking might be possible. We should use an > unpredictable source for scanner IDs, or save the original authenticated user > in the Scanner and ensure that the authentication does not switch mid-scan. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1927) Potential race handling ConnectToMaster RPCs during leader transition
[ https://issues.apache.org/jira/browse/KUDU-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905653#comment-15905653 ] Adar Dembo commented on KUDU-1927: --
{noformat}
void MasterServiceImpl::ConnectToMaster(const ConnectToMasterRequestPB* /*req*/,
                                        ConnectToMasterResponsePB* resp,
                                        rpc::RpcContext* rpc) {
  CatalogManager::ScopedLeaderSharedLock l(server_->catalog_manager());
  if (!l.CheckIsInitializedOrRespond(resp, rpc)) {
    return;
  }
  auto role = server_->catalog_manager()->Role();
  resp->set_role(role);
  if (l.leader_status().ok()) {
    // TODO(PKI): it seems there is some window when 'role' is LEADER but
    // in fact we aren't done initializing (and we don't have a CA cert).
    // In that case, if we respond with the 'LEADER' role to a client, but
    // don't pass back the CA cert, then the client won't be able to trust
    // anyone... seems like a potential race bug for clients who connect
    // exactly as the leader is changing.
    resp->add_ca_cert_der(server_->cert_authority()->ca_cert_der());

    // Issue an authentication token for the caller, unless they are
    // already using a token to authenticate.
    if (rpc->remote_user().authenticated_by() != rpc::RemoteUser::AUTHN_TOKEN) {
      SignedTokenPB authn_token;
      Status s = server_->token_signer()->GenerateAuthnToken(
          rpc->remote_user().username(), &authn_token);
      if (!s.ok()) {
        KLOG_EVERY_N_SECS(WARNING, 1) << "Unable to generate signed token for "
                                      << rpc->requestor_string()
                                      << ": " << s.ToString();
      } else {
        // TODO(todd): this might be a good spot for some auditing code?
        resp->mutable_authn_token()->Swap(&authn_token);
      }
    }
  }
  rpc->RespondSuccess();
}
{noformat}
If !leader_status.ok() but Role() is LEADER, it's indeed a sign that the consensus state machine has elected this master as leader, but the leader is still becoming active and setting up in-memory state. In that case, we should enforce that resp->role() is not set to LEADER (i.e. just overwrite it with UNKNOWN_ROLE, FOLLOWER, or whatever). In retrospect, we should have used an "is leader" boolean in ConnectToMasterResponse instead of exposing the raw Raft role.
> Potential race handling ConnectToMaster RPCs during leader transition > - > > Key: KUDU-1927 > URL: https://issues.apache.org/jira/browse/KUDU-1927 > Project: Kudu > Issue Type: Bug > Components: master, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon > > MasterServiceImpl::ConnectToMaster currently has a TODO that there might be a > case where a client issues the RPC exactly as a leader is becoming active. > The worry is that it may return a response indicating LEADER status, but > without the ability to issue a key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
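One way the suggested guard could look, expressed against the snippet above (a hypothetical sketch, not the change that was actually committed):
{code}
  auto role = server_->catalog_manager()->Role();
  if (role == consensus::RaftPeerPB::LEADER && !l.leader_status().ok()) {
    // Elected leader, but still setting up in-memory state (no CA cert yet):
    // don't advertise LEADER to clients.
    role = consensus::RaftPeerPB::UNKNOWN_ROLE;
  }
  resp->set_role(role);
{code}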
[jira] [Commented] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905649#comment-15905649 ] Todd Lipcon commented on KUDU-1843: --- Caching the original username turns out to be a little tricky, since the WAL doesn't record the original username, and thus when reconstructing the request cache during tablet bootstrap we don't have enough information to do so. I think making the UUIDs unpredictable is probably a better approach. > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
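As a rough illustration of the "unpredictable UUIDs" direction, a minimal sketch that draws a 128-bit client ID from a CSPRNG via OpenSSL — an assumption for illustration only, not the actual patch (which is in the gerrit review linked above):
{code}
#include <stdexcept>
#include <string>
#include <openssl/rand.h>

// Generates a 32-hex-char client ID from a cryptographically secure RNG,
// unlike boost::uuids::random_generator, which was historically backed
// by mt19937 (predictable given enough output).
std::string GenerateSecureClientId() {
  unsigned char bytes[16];
  if (RAND_bytes(bytes, sizeof(bytes)) != 1) {
    throw std::runtime_error("RAND_bytes failed");
  }
  static const char kHex[] = "0123456789abcdef";
  std::string id;
  id.reserve(32);
  for (unsigned char b : bytes) {
    id.push_back(kHex[b >> 4]);
    id.push_back(kHex[b & 0xf]);
  }
  return id;
}
{code}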
[jira] [Assigned] (KUDU-1843) Client UUIDs should be cryptographically random
[ https://issues.apache.org/jira/browse/KUDU-1843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon reassigned KUDU-1843: - Assignee: Todd Lipcon > Client UUIDs should be cryptographically random > --- > > Key: KUDU-1843 > URL: https://issues.apache.org/jira/browse/KUDU-1843 > Project: Kudu > Issue Type: Improvement > Components: security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Todd Lipcon >Priority: Critical > > Currently we use boost::uuid's default random generator, which is not > cryptographically random. This may increase the ease with which an attacker > could guess another client's client ID, which would potentially allow them to > perform DoS or try to steal the results of RPCs from the result cache. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1839) DNS failure during tablet creation lead to undeletable tablet
[ https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905642#comment-15905642 ] Adar Dembo commented on KUDU-1839: -- bq. I'm wondering if this should be demoted to "Major" priority considering it doesn't cause downtime or data loss, right? I can't remember the specifics, but the report says that there's a good chance that the replication group doesn't evict the broken replica, which would mean actual downtime upon the next failure. But, I don't feel strongly, so if this is the criteria we're now consistently applying to all Kudu bug reports, feel free to downgrade this one. > DNS failure during tablet creation lead to undeletable tablet > - > > Key: KUDU-1839 > URL: https://issues.apache.org/jira/browse/KUDU-1839 > Project: Kudu > Issue Type: Bug > Components: master, tablet >Affects Versions: 1.2.0 >Reporter: Adar Dembo >Priority: Critical > > During a YCSB workload, two tservers died due to DNS resolution timeouts. For > example: > {noformat} > F0117 09:21:14.952937 8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad > status: Network error: Could not obtain a remote proxy to the peer.: Unable > to resolve address 've0130.halxg.cloudera.com': Name or service not known > {noformat} > It's not clear why this happened; perhaps table creation places an inordinate > strain on DNS due to concurrent resolution load from all the bootstrapping > peers. > In any case, when these tservers were restarted, two tablets failed to > bootstrap, both for the same reason. I'll focus on just one tablet from here > on out to simplify troubleshooting: > {noformat} > E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T > 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet > failed to bootstrap: Not found: Unable to load Consensus metadata: > /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or > directory (error 2) > {noformat} > Eventually, the master decided to delete this tablet: > {noformat} > I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > {noformat} > As can be seen by the presence of multiple deletion requests, each one > failed. It's annoying that the tserver didn't log why. 
But the master did: > {noformat} > I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 > (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not > found in new config with opid_index 29) > W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS > 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete > failed for tablet 8c167c441a7d44b8add737d13797e694 with error code > TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting > down > I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of > 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for > TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)... > {noformat} > This isn't a fatal error as far as the master is concerned, so it retries the > deletion forever. > Meanwhile, the broken replica of this tablet still appears to be part of the > replication group. At least, that's true as far as both the master web UI and > the tserver web UI are concerned. The leader tserver is logging this error > repeatedly: > {noformat} > W0117 16:38:04.797828 81809 consensus_peers.cc:329] T > 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer > 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.clouder
[jira] [Updated] (KUDU-1807) GetTableSchema() is O(n) in the number of tablets
[ https://issues.apache.org/jira/browse/KUDU-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1807: -- Labels: data-scalability (was: ) > GetTableSchema() is O(n) in the number of tablets > - > > Key: KUDU-1807 > URL: https://issues.apache.org/jira/browse/KUDU-1807 > Project: Kudu > Issue Type: Bug > Components: master, perf >Affects Versions: 1.2.0 >Reporter: Todd Lipcon >Priority: Critical > Labels: data-scalability > > GetTableSchema calls TableInfo::IsCreateTableDone. This method checks each > tablet for whether it is in the correct state, which requires acquiring the > RWC lock for every tablet. This is somewhat slow for large tables with > thousands of tablets, and this is actually a relatively hot path because > every task in an Impala query ends up calling GetTableSchema() when it opens > its scanner. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1194) consensus: Allow abort of uncommittable config change ops
[ https://issues.apache.org/jira/browse/KUDU-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1194: -- Component/s: consensus > consensus: Allow abort of uncommittable config change ops > - > > Key: KUDU-1194 > URL: https://issues.apache.org/jira/browse/KUDU-1194 > Project: Kudu > Issue Type: Improvement > Components: consensus >Reporter: Mike Percy >Assignee: Mike Percy >Priority: Critical > > Wanted to capture a few thoughts about manually fixing broken configs or > automatically rolling back bad config changes. This isn't a fully baked > design, just wanted to jot down some initial thoughts. > A general way to (attempt to) abort uncommitted ops is to truncate the Raft > log on the leader (and replace the op with a NO_OP or something similar). > Some thoughts on recovering from "bad" configs: > * We may hit a situation where there is an in-progress config change > operation that will be impossible to commit due to a majority of the nodes in > the "target" config being permanently dead. If the leader is still alive, we > can provide a timeout on these ops or a way to explicitly (via RPC) abort > them by truncating the log. > * If no leader is alive, and it's impossible to elect one, then we could > write an "unsafe" tool only for emergency use that could do something evil > like make the follower think that the tool is the new leader and append an > unsafe change-config op to the follower's log. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1266) Figure out per-version docs publishing
[ https://issues.apache.org/jira/browse/KUDU-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905638#comment-15905638 ] Todd Lipcon commented on KUDU-1266: --- [~mpercy] still planning on working on this? I think what we have right now, though not ideal, is workable. Perhaps if you want to keep this open we can drop the priority to minor? > Figure out per-version docs publishing > -- > > Key: KUDU-1266 > URL: https://issues.apache.org/jira/browse/KUDU-1266 > Project: Kudu > Issue Type: Task > Components: documentation >Reporter: Jean-Daniel Cryans >Assignee: Mike Percy >Priority: Critical > > Right now we just push the documentation in master to the website, but > ideally we'd want to have documentation available for each version. What's > the best way to do this? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KUDU-422) Replicated master process (no master SPOF)
[ https://issues.apache.org/jira/browse/KUDU-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adar Dembo resolved KUDU-422. - Resolution: Fixed Fix Version/s: 1.0.0 bq. Mind if we close this one and just use the couple remaining subtasks to track the remaining work? Works for me. > Replicated master process (no master SPOF) > -- > > Key: KUDU-422 > URL: https://issues.apache.org/jira/browse/KUDU-422 > Project: Kudu > Issue Type: New Feature > Components: consensus, master >Affects Versions: M5 >Reporter: Todd Lipcon >Assignee: Adar Dembo >Priority: Critical > Labels: kudu-roadmap > Fix For: 1.0.0 > > > Support for multiple masters with consensus-replicated data. Goal is to have > to be able to do manual failovers if the leader of the master quorum dies, > stretch goal for this milestone is to have that failover happen unattended. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1271) Column ordering constraint in Kudu
[ https://issues.apache.org/jira/browse/KUDU-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1271: -- Priority: Major (was: Critical) > Column ordering constraint in Kudu > -- > > Key: KUDU-1271 > URL: https://issues.apache.org/jira/browse/KUDU-1271 > Project: Kudu > Issue Type: Improvement > Components: master >Affects Versions: 0.5.0 > Environment: Cloudera CDH 5.4.x >Reporter: Abhi Basu > > I get this error when I am attempting to create a Kudu table as a select from > an existing Impala/Hive table. The last column of the table is rowid (int) > that is going to be used as primary key for this table. > Error: IllegalArgumentException: Got out-of-order primary key column: Column > name: rowid, type: > SQL example: > CREATE TABLE newtable_kudu > TBLPROPERTIES( > 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', > 'kudu.table_name' = 'newtable_kudu', > 'kudu.master_addresses' = 'hostname:7051', > 'kudu.key_columns' = 'rowid' > ) AS SELECT * FROM oldtable_impala; -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1662) Kudu fs command reports permission denied
[ https://issues.apache.org/jira/browse/KUDU-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905633#comment-15905633 ] Adar Dembo commented on KUDU-1662: -- bq. I think our current behavior makes sense and am inclined to resolve this as wontfix Agreed. bq. but maybe we need the tool to give better output in the case of an error? Eh. "Permission denied (error 13)" already sounds like a UNIX permission issue to me; Suzanne thought so too, and instinctively reran with sudo. What additional information would have helped? > Kudu fs command reports permission denied > - > > Key: KUDU-1662 > URL: https://issues.apache.org/jira/browse/KUDU-1662 > Project: Kudu > Issue Type: Bug > Components: ops-tooling >Affects Versions: 1.0.0 > Environment: 1. Kudu demo Quickstart VM (VirtualBox) running CDH 5.8 > 2. New Kudu (VMWare) VM running CDH 5.8 in Centos 6.6. with Kudu installed > from command line (no CM) >Reporter: Suzanne McIntosh >Priority: Minor > > This is a minor issue because the uuid is ultimately displayed. > Issuing 'kudu fs dump uuid ...' gives a permission denied error as shown > below. Testing against the Kudu demo VM yielded the same result. Using sudo > gives a different error. > My session below shows results of using plain kudu command, followed by an > attempt to clear the permission denied message by using sudo: > > {code} > [training@localhost ~]$ kudu fs dump uuid -fs_wal_dir /var/lib/kudu/master > -fs_data_dirs /var/lib/kudu/master > I0929 14:57:30.706712 6436 mem_tracker.cc:140] MemTracker: hard memory limit > is 4.673157 GB > I0929 14:57:30.706782 6436 mem_tracker.cc:142] MemTracker: soft memory limit > is 2.803894 GB > IO error: /var/lib/kudu/master/instance: Permission denied (error 13) > [training@localhost ~]$ sudo kudu fs dump uuid -fs_wal_dir > /var/lib/kudu/master -fs_data_dirs /var/lib/kudu/master > I0929 14:57:44.155582 6448 mem_tracker.cc:140] MemTracker: hard memory limit > is 4.673157 GB > I0929 14:57:44.155685 6448 mem_tracker.cc:142] MemTracker: soft memory limit > is 2.803894 GB > W0929 14:57:44.157841 6450 log_block_manager.cc:1616] IO error: Could not > lock /var/lib/kudu/master/data/block_manager_instance: Could not lock > /var/lib/kudu/master/data/block_manager_instance: lock > /var/lib/kudu/master/data/block_manager_instance: Resource temporarily > unavailable (error 11) > W0929 14:57:44.157857 6450 log_block_manager.cc:1617] Proceeding without lock > I0929 14:57:44.158095 6448 fs_manager.cc:243] Opened local filesystem: > /var/lib/kudu/master > uuid: "e9e534d6c178470f895279a8eca77a5b" > format_stamp: "Formatted at 2016-09-28 05:41:04 on localhost.localdomain" > e9e534d6c178470f895279a8eca77a5b > > Here are the directory permissions I see in my VMWare VM: > [training@localhost ~]$ ls -l /var/lib/kudu/master/data/block_manager_instance > -rw--- 1 kudu kudu 654 Sep 28 01:41 > /var/lib/kudu/master/data/block_manager_instance > [training@localhost ~]$ ls -l /var/lib/kudu/master/data > total 4 > -rw--- 1 kudu kudu 654 Sep 28 01:41 block_manager_instance > [training@localhost ~]$ ls -l /var/lib/kudu/master > total 20 > drwxr-xr-x 2 kudu kudu 4096 Sep 28 15:11 consensus-meta > drwxr-xr-x 2 kudu kudu 4096 Sep 28 01:41 data > -rw--- 1 kudu kudu 665 Sep 28 01:41 instance > drwxr-xr-x 2 kudu kudu 4096 Sep 28 01:41 tablet-meta > drwxr-xr-x 3 kudu kudu 4096 Sep 28 15:11 wals > [training@localhost ~]$ ls -l /var/lib/kudu/master/data > total 4 > -rw--- 1 kudu kudu 654 Sep 28 01:41 block_manager_instance > [training@localhost ~]$ 
ls -l /var/lib/kudu/master/data/block_manager_instance > -rw--- 1 kudu kudu 654 Sep 28 01:41 > /var/lib/kudu/master/data/block_manager_instance > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (KUDU-1860) ksck doesn't identify tablets that are evicted but still in config
[ https://issues.apache.org/jira/browse/KUDU-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1860: -- Priority: Major (was: Critical) > ksck doesn't identify tablets that are evicted but still in config > -- > > Key: KUDU-1860 > URL: https://issues.apache.org/jira/browse/KUDU-1860 > Project: Kudu > Issue Type: Bug > Components: ksck, ops-tooling >Affects Versions: 1.2.0 >Reporter: Jean-Daniel Cryans > > As reported by a user on Slack, ksck can give you a wrong output such as: > {noformat} > ca199fafca544df2a1b2a01be9d5266d (server1:7250): RUNNING [LEADER] > a077957f627c4758ab5a989aca8a1ca8 (server2:7250): RUNNING > 5c09a555c205482b8131f15b2c249ec6 (server3:7250): bad state > State: NOT_STARTED > Data state: TABLET_DATA_TOMBSTONED > Last status: Tablet initializing... > {noformat} > The problem is that server2 was already evicted out of the configuration > (based on reading the logs) but it wasn't committed in the config (which > contains server 1 and 3) since there's really only 1 server left out of 3. > Ideally ksck should try to see what each server thinks the configuration is > and see if there's a difference from what's in the master. As it is, it looks > like we're missing 1 replica but in reality this is a broken tablet. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-1839) DNS failure during tablet creation lead to undeletable tablet
[ https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905631#comment-15905631 ] Todd Lipcon commented on KUDU-1839: --- I'm wondering if this should be demoted to "Major" priority considering it doesn't cause downtime or data loss, right? > DNS failure during tablet creation lead to undeletable tablet > - > > Key: KUDU-1839 > URL: https://issues.apache.org/jira/browse/KUDU-1839 > Project: Kudu > Issue Type: Bug > Components: master, tablet >Affects Versions: 1.2.0 >Reporter: Adar Dembo >Priority: Critical > > During a YCSB workload, two tservers died due to DNS resolution timeouts. For > example: > {noformat} > F0117 09:21:14.952937 8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad > status: Network error: Could not obtain a remote proxy to the peer.: Unable > to resolve address 've0130.halxg.cloudera.com': Name or service not known > {noformat} > It's not clear why this happened; perhaps table creation places an inordinate > strain on DNS due to concurrent resolution load from all the bootstrapping > peers. > In any case, when these tservers were restarted, two tablets failed to > bootstrap, both for the same reason. I'll focus on just one tablet from here > on out to simplify troubleshooting: > {noformat} > E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T > 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet > failed to bootstrap: Not found: Unable to load Consensus metadata: > /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or > directory (error 2) > {noformat} > Eventually, the master decided to delete this tablet: > {noformat} > I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet > for tablet 8c167c441a7d44b8add737d13797e694 with delete_type > TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new > config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153 > {noformat} > As can be seen by the presence of multiple deletion requests, each one > failed. It's annoying that the tserver didn't log why. 
But the master did: > {noformat} > I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending > DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet > 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 > (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not > found in new config with opid_index 29) > W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS > 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete > failed for tablet 8c167c441a7d44b8add737d13797e694 with error code > TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting > down > I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of > 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for > TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)... > {noformat} > This isn't a fatal error as far as the master is concerned, so it retries the > deletion forever. > Meanwhile, the broken replica of this tablet still appears to be part of the > replication group. At least, that's true as far as both the master web UI and > the tserver web UI are concerned. The leader tserver is logging this error > repeatedly: > {noformat} > W0117 16:38:04.797828 81809 consensus_peers.cc:329] T > 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer > 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't > send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet > 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). > Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load > Consensus metadata: > /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or
[jira] [Updated] (KUDU-1860) ksck doesn't identify tablets that are evicted but still in config
[ https://issues.apache.org/jira/browse/KUDU-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon updated KUDU-1860: -- Component/s: (was: util) ops-tooling ksck > ksck doesn't identify tablets that are evicted but still in config > -- > > Key: KUDU-1860 > URL: https://issues.apache.org/jira/browse/KUDU-1860 > Project: Kudu > Issue Type: Bug > Components: ksck, ops-tooling >Affects Versions: 1.2.0 >Reporter: Jean-Daniel Cryans >Priority: Critical > > As reported by a user on Slack, ksck can give you a wrong output such as: > {noformat} > ca199fafca544df2a1b2a01be9d5266d (server1:7250): RUNNING [LEADER] > a077957f627c4758ab5a989aca8a1ca8 (server2:7250): RUNNING > 5c09a555c205482b8131f15b2c249ec6 (server3:7250): bad state > State: NOT_STARTED > Data state: TABLET_DATA_TOMBSTONED > Last status: Tablet initializing... > {noformat} > The problem is that server2 was already evicted out of the configuration > (based on reading the logs) but it wasn't committed in the config (which > contains server 1 and 3) since there's really only 1 server left out of 3. > Ideally ksck should try to see what each server thinks the configuration is > and see if there's a difference from what's in the master. As it is, it looks > like we're missing 1 replica but in reality this is a broken tablet. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (KUDU-422) Replicated master process (no master SPOF)
[ https://issues.apache.org/jira/browse/KUDU-422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905626#comment-15905626 ] Todd Lipcon commented on KUDU-422: -- [~adar]: I'm doing a triage of open "critical" JIRAs. Mind if we close this one and just use the couple remaining subtasks to track the remaining work? Given they've been open for 6+ months they must not be that critical, and I think the replicated master can be considered supported/finished enough to close this umbrella JIRA > Replicated master process (no master SPOF) > -- > > Key: KUDU-422 > URL: https://issues.apache.org/jira/browse/KUDU-422 > Project: Kudu > Issue Type: New Feature > Components: consensus, master >Affects Versions: M5 >Reporter: Todd Lipcon >Assignee: Adar Dembo >Priority: Critical > Labels: kudu-roadmap > > Support for multiple masters with consensus-replicated data. Goal is to have > to be able to do manual failovers if the leader of the master quorum dies, > stretch goal for this milestone is to have that failover happen unattended. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (KUDU-1923) --rpc_encryption=required option is not enforced on servers
[ https://issues.apache.org/jira/browse/KUDU-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Todd Lipcon resolved KUDU-1923. --- Resolution: Fixed Fix Version/s: 1.3.0 > --rpc_encryption=required option is not enforced on servers > --- > > Key: KUDU-1923 > URL: https://issues.apache.org/jira/browse/KUDU-1923 > Project: Kudu > Issue Type: Bug > Components: rpc, security >Affects Versions: 1.3.0 >Reporter: Todd Lipcon >Assignee: Dan Burkert >Priority: Blocker > Fix For: 1.3.0 > > > When setting "rpc_encryption" to "REQUIRED" it doesn't appear that it > actually enforces this on the server side. -- This message was sent by Atlassian JIRA (v6.3.15#6346)