[jira] [Updated] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle
[ https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3195:
    Code Review: https://gerrit.cloudera.org/#/c/16581/

> Make DMS flush policy more robust when maintenance threads are idle
> ---
>
>                 Key: KUDU-3195
>                 URL: https://issues.apache.org/jira/browse/KUDU-3195
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>    Affects Versions: 1.13.0
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In one scenario I observed very long bootstrap times for tablet servers
> (between 45 and 60 minutes) even though the tablet servers had a relatively
> small amount of data under management (~80 GByte). It turned out the time
> was spent replaying WAL segments, with {{kudu cluster ksck}} reporting
> something like the following throughout the bootstrap:
> {noformat}
> b0a20b117a1242ae9fc15620a6f7a524 (tserver-6.local.site:7050): not running
>   State:       BOOTSTRAPPING
>   Data state:  TABLET_DATA_READY
>   Last status: Bootstrap replaying log segment 21/37 (2.28M/7.85M this segment, stats: ops{read=27374 overwritten=0 applied=25016 ignored=657} inserts{seen=5949247 ignored=0} mutations{seen=0 ignored=0} orphaned_commits=7)
> {noformat}
> The workload I ran before shutting down the tablet servers consisted of many
> small UPSERT operations, but the cluster had been idle for a long time (a
> few hours or so) after the workload was terminated.
> The workload was generated by:
> {noformat}
> kudu perf loadgen \
>   --table_name=$TABLE_NAME \
>   --num_rows_per_thread=8 \
>   --num_threads=4 \
>   --use_upsert \
>   --use_random_pk \
>   $MASTER_ADDR
> {noformat}
> The table that the UPSERT workload ran against had been pre-populated by the
> following:
> {noformat}
> kudu perf loadgen --table_num_replicas=3 --keep-auto-table \
>   --table_num_hash_partitions=5 --table_num_range_partitions=5 \
>   --num_rows_per_thread=8 --num_threads=4 $MASTER_ADDR
> {noformat}
> As it turned out, the tablet servers had accumulated a huge number of DMSs
> that required flushing/compaction, but after the memory pressure subsided,
> the policy was scheduling just one such operation per tablet every 120
> seconds (the interval is controlled by {{\-\-flush_threshold_secs}}). In
> fact, the tablet servers could have flushed those rowsets non-stop, since
> the maintenance threads were otherwise completely idle and no active
> workload was running against the cluster. Those DMSs had been around for a
> long time (much longer than 120 seconds) and were anchoring a lot of WAL
> segments, so the operations from the WAL had to be replayed once I
> restarted the tablet servers.
> It would be great to update the flushing/compaction policy to allow tablet
> servers to run {{FlushDeltaMemStoresOp}} as soon as a DMS becomes older
> than {{\-\-flush_threshold_secs}} while the maintenance threads are not
> otherwise busy.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
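The scheduling gap described above can be modeled in a few lines. This is an illustrative sketch, not Kudu's actual maintenance-manager code: the function, the `threads_idle` parameter, and the per-tick framing are all invented for the example; only the 120-second `--flush_threshold_secs` default comes from the report.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Default of --flush_threshold_secs, per the report above.
constexpr int64_t kFlushThresholdSecs = 120;

// How many DMS flushes get scheduled in one maintenance "tick".
// Original policy: at most one time-based flush per tablet per tick,
// so a backlog of old DMSs drains at one op per 120 s.
// Proposed policy: when the maintenance threads are otherwise idle,
// every DMS older than the threshold becomes schedulable at once.
int FlushesThisTick(const std::vector<int64_t>& dms_age_secs, bool threads_idle) {
  int overdue = static_cast<int>(std::count_if(
      dms_age_secs.begin(), dms_age_secs.end(),
      [](int64_t age) { return age >= kFlushThresholdSecs; }));
  if (overdue == 0) {
    return 0;  // nothing has aged past the threshold yet
  }
  return threads_idle ? overdue : 1;
}
```

With many overdue DMSs and idle threads, the proposed policy drains the whole backlog instead of leaking one flush per interval, which is what keeps the WAL anchored.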
[jira] [Updated] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle
[ https://issues.apache.org/jira/browse/KUDU-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3195:
    Status: In Review (was: Open)
[jira] [Updated] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats
[ https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3149:
    Code Review: https://gerrit.cloudera.org/#/c/16580/

> Lock contention between registering ops and computing maintenance op stats
> --
>
>                 Key: KUDU-3149
>                 URL: https://issues.apache.org/jira/browse/KUDU-3149
>             Project: Kudu
>          Issue Type: Bug
>          Components: perf, tserver
>            Reporter: Andrew Wong
>            Priority: Critical
>
> We saw a bunch of tablets bootstrapping extremely slowly, and many seemingly
> stuck bootstrapping but not showing up on the {{/tablets}} page, i.e. we
> could see only INITIALIZED and RUNNING tablets, no BOOTSTRAPPING ones.
> Upon digging into the stacks, we saw a bunch of threads waiting to acquire
> the maintenance manager (MM) lock:
> {code:java}
> TID 46577(tablet-open [wo):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0x7f1dd5713332 (unknown)
>   @ 0x7f1dd570e5d8 (unknown)
>   @ 0x7f1dd570e4a7 (unknown)
>   @ 0x23b4058 kudu::Mutex::Acquire()
>   @ 0x23980ff kudu::MaintenanceManager::RegisterOp()
>   @ 0xb59b99 kudu::tablet::Tablet::RegisterMaintenanceOps()
>   @ 0xb855a1 kudu::tablet::TabletReplica::RegisterMaintenanceOps()
>   @ 0xa0055b kudu::tserver::TSTabletManager::OpenTablet()
>   @ 0x23f994c kudu::ThreadPool::DispatchThread()
>   @ 0x23f3f8b kudu::Thread::SuperviseThread()
>   @ 0x7f1dd570caa1 (unknown)
>   @ 0x7f1dd3b18bcd (unknown)
> TID 46574(tablet-open [wo):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0x7f1dd5713332 (unknown)
>   @ 0x7f1dd570e5d8 (unknown)
>   @ 0x7f1dd570e4a7 (unknown)
>   @ 0x23b4058 kudu::Mutex::Acquire()
>   @ 0x23980ff kudu::MaintenanceManager::RegisterOp()
>   @ 0xb59c74 kudu::tablet::Tablet::RegisterMaintenanceOps()
>   @ 0xb855a1 kudu::tablet::TabletReplica::RegisterMaintenanceOps()
>   @ 0xa0055b kudu::tserver::TSTabletManager::OpenTablet()
>   @ 0x23f994c kudu::ThreadPool::DispatchThread()
>   @ 0x23f3f8b kudu::Thread::SuperviseThread()
>   @ 0x7f1dd570caa1 (unknown)
>   @ 0x7f1dd3b18bcd (unknown)
> 7 threads with same stack:
> TID 46575(tablet-open [wo):
> TID 46576(tablet-open [wo):
> TID 46578(tablet-open [wo):
> TID 46580(tablet-open [wo):
> TID 46581(tablet-open [wo):
> TID 46582(tablet-open [wo):
> TID 46583(tablet-open [wo):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0x7f1dd5713332 (unknown)
>   @ 0x7f1dd570e5d8 (unknown)
>   @ 0x7f1dd570e4a7 (unknown)
>   @ 0x23b4058 kudu::Mutex::Acquire()
>   @ 0x23980ff kudu::MaintenanceManager::RegisterOp()
>   @ 0xb85374 kudu::tablet::TabletReplica::RegisterMaintenanceOps()
>   @ 0xa0055b kudu::tserver::TSTabletManager::OpenTablet()
>   @ 0x23f994c kudu::ThreadPool::DispatchThread()
>   @ 0x23f3f8b kudu::Thread::SuperviseThread()
>   @ 0x7f1dd570caa1 (unknown)
>   @ 0x7f1dd3b18bcd (unknown)
> TID 46573(tablet-open [wo):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0x7f1dd5713332 (unknown)
>   @ 0x7f1dd570e5d8 (unknown)
>   @ 0x7f1dd570e4a7 (unknown)
>   @ 0x23b4058 kudu::Mutex::Acquire()
>   @ 0x23980ff kudu::MaintenanceManager::RegisterOp()
>   @ 0xb854c7 kudu::tablet::TabletReplica::RegisterMaintenanceOps()
>   @ 0xa0055b kudu::tserver::TSTabletManager::OpenTablet()
>   @ 0x23f994c kudu::ThreadPool::DispatchThread()
>   @ 0x23f3f8b kudu::Thread::SuperviseThread()
>   @ 0x7f1dd570caa1 (unknown)
>   @ 0x7f1dd3b18bcd (unknown)
> 2 threads with same stack:
> TID 43795(MaintenanceMgr ):
> TID 43796(MaintenanceMgr ):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0x7f1dd5713332 (unknown)
>   @ 0x7f1dd570e5d8 (unknown)
>   @ 0x7f1dd570e4a7 (unknown)
>   @ 0x23b4058 kudu::Mutex::Acquire()
>   @ 0x239a064 kudu::MaintenanceManager::LaunchOp()
>   @ 0x23f994c kudu::ThreadPool::DispatchThread()
>   @ 0x23f3f8b kudu::Thread::SuperviseThread()
>   @ 0x7f1dd570caa1 (unknown)
>   @ 0x7f1dd3b18bcd (unknown)
> {code}
> A couple more stacks show some work being done by the maintenance manager:
> {code:java}
> TID 43794(MaintenanceMgr ):
>   @ 0x7f1dd57147e0 (unknown)
>   @ 0xba7b41 kudu::tablet::BudgetedCompactionPolicy::RunApproximation()
>   @ 0xba8f5d kudu::tablet::BudgetedCompactionPolicy::PickRowSets()
>   @ 0xb5b1a1 kudu::tablet::Tablet::PickRowSetsToCompact()
>   @ 0xb64e93 kudu::tablet::Tablet::Compact()
> {code}
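The stacks above show op registration serialized behind expensive stat computation on a single mutex. A common remedy for this pattern is to snapshot the registered-op list under the lock and do the slow work unlocked; the sketch below illustrates that shape with invented names, and is not Kudu's actual `MaintenanceManager` implementation.

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Minimal stand-in for a maintenance op whose stat computation may be
// slow (e.g. running a compaction policy over many rowsets).
struct MaintenanceOp {
  std::function<int()> update_stats;  // potentially expensive
  int last_score = 0;
};

class Scheduler {
 public:
  // RegisterOp() only needs the lock briefly, so tablet-open threads
  // are not blocked behind stat computation.
  void RegisterOp(MaintenanceOp* op) {
    std::lock_guard<std::mutex> l(lock_);
    ops_.push_back(op);
  }

  void UpdateAllStats() {
    std::vector<MaintenanceOp*> snapshot;
    {
      std::lock_guard<std::mutex> l(lock_);
      snapshot = ops_;  // cheap pointer copy while holding the lock
    }
    // The expensive part runs with the lock released.
    for (MaintenanceOp* op : snapshot) {
      op->last_score = op->update_stats();
    }
  }

 private:
  std::mutex lock_;
  std::vector<MaintenanceOp*> ops_;
};
```

The key property is that the lock is held only for the container snapshot, never across `update_stats()`, so registration and stat computation no longer contend.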
[jira] [Updated] (KUDU-3149) Lock contention between registering ops and computing maintenance op stats
[ https://issues.apache.org/jira/browse/KUDU-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3149:
    Status: In Review (was: Open)
[jira] [Resolved] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin resolved KUDU-3198.
    Resolution: Fixed

Fixed with the patch mentioned in the prior comment. Thank you [~zhangyifan27] for debugging and fixing the issue!

> Unable to delete a full row from a table with 64 columns when using java client
> ---
>
>                 Key: KUDU-3198
>                 URL: https://issues.apache.org/jira/browse/KUDU-3198
>             Project: Kudu
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1, 1.13.0
>            Reporter: YifanZhang
>            Priority: Major
>             Fix For: 1.14.0
>
> We recently got an error when deleting full rows from a table with 64
> columns using Spark SQL; however, if we drop a column from the table, the
> error does not appear. The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: Unknown row operation type (error 0)
> {code}
> I tested this by deleting a full row from a table with 64 columns using the
> java client (1.12.0/1.13.0). If some columns in the row are set to NULL, I
> got this error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If values are set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested tables with different numbers of columns. The weird thing is
> that I could delete full rows from a table with 8/16/32/63/65 columns, but
> not if the table has 64/128 columns.
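The "Not enough data for column" message above is a server-side decoding error: the tablet server runs out of bytes in the wire buffer before the last column's value, which is consistent with the client under-serializing the row when the column count hits a word boundary (64, 128). The fragment below is a hypothetical illustration of that style of length check, with invented names; it is not Kudu's actual row-decoder code.

```cpp
#include <cstddef>
#include <string>

// Outcome of validating one column read from a serialized row.
struct DecodeResult {
  bool ok;
  std::string error;
};

// Before copying a column's value out of the wire buffer, verify that
// enough bytes remain; otherwise report a corruption-style error like
// the one quoted in the issue. Names here are illustrative only.
DecodeResult CheckColumnData(size_t bytes_remaining, size_t value_size,
                             const std::string& col_name) {
  if (bytes_remaining < value_size) {
    return {false, "Corruption: Not enough data for column: " + col_name};
  }
  return {true, ""};
}
```

A short buffer trips the check for the highest-indexed column first (here, `c63` in a 64-column table), matching the reported error.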
[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3198:
    Fix Version/s: 1.14.0
[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-3198:
    Affects Version/s: 1.10.0
                       1.10.1
                       1.11.0
                       1.11.1
[jira] [Updated] (KUDU-2987) Intra location rebalance will crash in special case
[ https://issues.apache.org/jira/browse/KUDU-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2987:
    Affects Version/s: 1.9.0
                       1.10.0
                       1.10.1

> Intra location rebalance will crash in special case
> ---
>
>                 Key: KUDU-2987
>                 URL: https://issues.apache.org/jira/browse/KUDU-2987
>             Project: Kudu
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 1.9.0, 1.10.0, 1.10.1, 1.11.0
>            Reporter: ZhangYao
>            Assignee: ZhangYao
>            Priority: Major
>             Fix For: 1.12.0, 1.11.1
>
> Recently, while doing a rebalancing POC, I got a crash (core dump) when
> running intra-location rebalancing. Here is the log:
> {code:java}
> I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer within location '/location/2044'
> F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
> *** Check failure stack trace: ***
>   @ 0x111a75d google::LogMessage::Fail()
>   @ 0x111c6d3 google::LogMessage::SendToLog()
>   @ 0x111a2b9 google::LogMessage::Flush()
>   @ 0x111d0ef google::LogMessageFatal::~LogMessageFatal()
>   @ 0xe26da7 FindOrDie<>()
>   @ 0xe1f204 kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
>   @ 0xe162e0 kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
>   @ 0xe15bf5 kudu::tools::RebalancerTool::RunWith()
>   @ 0xe1db0e kudu::tools::RebalancerTool::Run()
>   @ 0xb6fea1 kudu::tools::(anonymous namespace)::RunRebalance()
>   @ 0xb70e14 std::_Function_handler<>::_M_invoke()
>   @ 0x11714a2 kudu::tools::Action::Run()
>   @ 0xc00587 kudu::tools::DispatchCommand()
>   @ 0xc00f4b kudu::tools::RunTool()
>   @ 0xb0fd6d main
>   @ 0x7f37086a4b15 __libc_start_main
>   @ 0xb6b399 (unknown)
> {code}
> The problem seems to be in
> {{RebalancerTool::AlgoBasedRunner::GetNextMovesImpl}}: when building
> extra_info_by_tablet_id, it expects the table ID of every tablet to be
> present in the table info. But when building the ClusterRawInfo in
> {{RebalancerTool::KsckResultsToClusterRawInfo}}, we collect only the tables
> that occur in the location, yet all tablets in the cluster. The crash
> therefore occurs when a location holds no replicas of some table, which is
> likely when there are many more locations than a table has replicas.
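The fatal `Check failed: it != collection.end()` comes from `FindOrDie`-style lookup semantics: the helper CHECK-fails the process when the key is absent. A generic sketch of the contrast with a defensive lookup that lets the caller skip tablets whose table is missing from the location-scoped info (this is illustrative C++, not the actual `map-util.h` implementation):

```cpp
#include <map>
#include <optional>
#include <string>

// Defensive counterpart to a FindOrDie-style lookup: instead of
// crashing on a missing key, return an empty optional so the caller
// can decide to skip the entry (e.g. a tablet whose table has no
// replicas in the current location).
template <typename K, typename V>
std::optional<V> FindOrNull(const std::map<K, V>& m, const K& key) {
  auto it = m.find(key);
  if (it == m.end()) {
    return std::nullopt;  // key absent: caller handles it gracefully
  }
  return it->second;
}
```

Either fix direction works: make the lookup tolerant as above, or (as the issue suggests) make the two builders consistent so every tablet's table ID is guaranteed to be present.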
[jira] [Created] (KUDU-3199) kudu cluster rebalancer: add a flag to ignore all defined locations
Alexey Serbin created KUDU-3199:
---

             Summary: kudu cluster rebalancer: add a flag to ignore all defined locations
                 Key: KUDU-3199
                 URL: https://issues.apache.org/jira/browse/KUDU-3199
             Project: Kudu
          Issue Type: Improvement
          Components: CLI
            Reporter: Alexey Serbin

When the location-aware rebalancer was designed, it was assumed that the tool should always honor the partitioning of the cluster defined by its locations, whatever that partitioning was. The only options considered were to run the process with particular rebalancing phases excluded:
* correction of placement policy violations ({{\-\-disable_policy_fixer}})
* inter-location rebalancing, i.e. rebalancing across different locations ({{\-\-disable_cross_location_rebalancing}})
* intra-location rebalancing, i.e. rebalancing within a location ({{\-\-disable_intra_location_rebalancing}})

As it turns out, there are use cases where people want to run the rebalancer on a location-aware cluster while ignoring the location-awareness specifics:
# The locations are defined by some higher-level cluster orchestration software, and people are reluctant to disable location-awareness for Kudu specifically (i.e. by providing an alternative script for {{\-\-location_mapping_cmd}}), but still want to even out the distribution of replicas.
# Only two locations are defined for some time. Even if that is a transitional phase (e.g., waiting for a new zone/rack/datacenter to be added 'soon'), it can take a while.

For both cases, there is a workaround if every location has the same number of tablet servers: run the rebalancer tool with the {{\-\-disable_policy_fixer}} flag. However, this workaround isn't applicable if there is a difference in the number of tablet replicas per location, and no combination of flags can make the location-aware rebalancer run as if there were no locations defined.

Let's add a new flag for the {{kudu cluster rebalance}} CLI tool to make it run on a location-aware cluster as if no locations were defined. Of course, the flag should be {{off}} by default.
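The proposed semantics can be sketched in one function: with the new flag on, every tablet server is treated as if it had no location assigned, so the rebalancer sees one flat location. The flag/function names below are hypothetical; the issue does not yet name the flag.

```cpp
#include <string>

// Hypothetical sketch of the proposed behavior: map each tablet
// server's assigned location to the location the rebalancer should
// actually use. With the (hypothetical) ignore-locations flag on,
// every server collapses into the same "no location" bucket, which
// disables policy fixing and cross/intra-location distinctions.
std::string EffectiveLocation(const std::string& assigned_location,
                              bool ignore_locations_flag) {
  if (ignore_locations_flag) {
    return "";  // empty string models "no location defined" here
  }
  return assigned_location;  // default (flag off): honor locations
}
```

With the flag off, the tool behaves exactly as today, which matches the requirement that the flag default to off.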
[jira] [Created] (KUDU-3195) Make DMS flush policy more robust when maintenance threads are idle
Alexey Serbin created KUDU-3195:
---

             Summary: Make DMS flush policy more robust when maintenance threads are idle
                 Key: KUDU-3195
                 URL: https://issues.apache.org/jira/browse/KUDU-3195
             Project: Kudu
          Issue Type: Improvement
          Components: tserver
    Affects Versions: 1.13.0
            Reporter: Alexey Serbin
[jira] [Created] (KUDU-3194) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) sometimes fails
Alexey Serbin created KUDU-3194:
---

             Summary: testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest) sometimes fails
                 Key: KUDU-3194
                 URL: https://issues.apache.org/jira/browse/KUDU-3194
             Project: Kudu
          Issue Type: Improvement
          Components: client, test
    Affects Versions: 1.13.0, 1.14.0
            Reporter: Alexey Serbin
         Attachments: test-output.txt.xz

The test scenario sometimes fails:
{noformat}
Time: 55.485
There was 1 failure:
1) testReadDataFrameAtSnapshot(org.apache.kudu.spark.kudu.DefaultSourceTest)
java.lang.AssertionError: expected:<100> but was:<99>
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.failNotEquals(Assert.java:835)
	at org.junit.Assert.assertEquals(Assert.java:647)
	at org.junit.Assert.assertEquals(Assert.java:633)
	at org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784)

FAILURES!!!
Tests run: 30,  Failures: 1
{noformat}
The full log is attached (RELEASE build); the relevant stack trace looks like the following:
{noformat}
23:53:48.683 [ERROR - main] (RetryRule.java:219) org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot: failed attempt 1
java.lang.AssertionError: expected:<100> but was:<99>
	at org.junit.Assert.fail(Assert.java:89) ~[junit-4.13.jar:4.13]
	at org.junit.Assert.failNotEquals(Assert.java:835) ~[junit-4.13.jar:4.13]
	at org.junit.Assert.assertEquals(Assert.java:647) ~[junit-4.13.jar:4.13]
	at org.junit.Assert.assertEquals(Assert.java:633) ~[junit-4.13.jar:4.13]
	at org.apache.kudu.spark.kudu.DefaultSourceTest.testReadDataFrameAtSnapshot(DefaultSourceTest.scala:784) ~[test/:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_141]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_141]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_141]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_141]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.jar:4.13]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.jar:4.13]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.jar:4.13]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.jar:4.13]
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) ~[junit-4.13.jar:4.13]
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) ~[junit-4.13.jar:4.13]
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54) ~[junit-4.13.jar:4.13]
	at org.apache.kudu.test.junit.RetryRule$RetryStatement.doOneAttempt(RetryRule.java:217) [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
	at org.apache.kudu.test.junit.RetryRule$RetryStatement.evaluate(RetryRule.java:234) [kudu-test-utils-1.13.0-SNAPSHOT.jar:1.13.0-SNAPSHOT]
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) [junit-4.13.jar:4.13]
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) [junit-4.13.jar:4.13]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) [junit-4.13.jar:4.13]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413) [junit-4.13.jar:4.13]
	at org.junit.runners.Suite.runChild(Suite.java:128) [junit-4.13.jar:4.13]
	at org.junit.runners.Suite.runChild(Suite.java:27) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) [junit-4.13.jar:4.13]
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
{noformat}
[jira] [Resolved] (KUDU-1728) Parallelize tablet copy operations
[ https://issues.apache.org/jira/browse/KUDU-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-1728. - Fix Version/s: 1.14.0 Resolution: Fixed > Parallelize tablet copy operations > -- > > Key: KUDU-1728 > URL: https://issues.apache.org/jira/browse/KUDU-1728 > Project: Kudu > Issue Type: Bug > Components: consensus, tablet >Reporter: Mike Percy >Priority: Major > Labels: roadmap-candidate > Fix For: 1.14.0 > > > Parallelize tablet copy operations. Right now all data is copied serially. We > may want to consider throttling on either side if we want to budget IO. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3182) 'last_known_addr' is not specified for single master Raft configuration
[ https://issues.apache.org/jira/browse/KUDU-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3182: Code Review: https://gerrit.cloudera.org/#/c/16340/ > 'last_known_addr' is not specified for single master Raft configuration > --- > > Key: KUDU-3182 > URL: https://issues.apache.org/jira/browse/KUDU-3182 > Project: Kudu > Issue Type: Task > Components: consensus, master >Reporter: Bankim Bhavsar >Assignee: Bankim Bhavsar >Priority: Major > > The 'last_known_addr' field is not persisted for a single-master Raft > configuration. This is okay as long as we have a single-master > configuration. However, when dynamically transitioning from a single-master > to a two-master configuration, the ChangeConfig() request to ADD_PEER > fails the validation in VerifyRaftConfig(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-1587: Fix Version/s: 1.13.0 Resolution: Fixed Status: Resolved (was: In Review) With this [commit|https://github.com/apache/kudu/commit/ee3bb83575a051c2feade1f8c159b2902a7160d5], it's now possible to specify a threshold for op apply times. While not yet enabled by default with a safe threshold for apply queue times, this makes it possible to engage the new behavior of rejecting write requests, if needed. To do so, set the tablet server's {{\-\-tablet_apply_pool_overload_threshold_ms}} flag to the desired value (anything greater than 0 turns on the new behavior). > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, > 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, > 1.11.1 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Fix For: 1.13.0 > > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
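The new behavior described in the resolution above can be turned on per tablet server via the flag it names. A minimal sketch of the invocation follows; the threshold value of 500 ms is purely illustrative (the comment notes there is no recommended default yet), and any other flags your deployment needs would be passed alongside it:

```shell
# Enable CoDel-like rejection of write requests when the apply queue is
# overloaded. Any value greater than 0 turns the behavior on; the 500 ms
# threshold below is an example, not a recommended default.
kudu-tserver --tablet_apply_pool_overload_threshold_ms=500
```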
[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-1587: Affects Version/s: 1.0.0 1.0.1 1.1.0 1.2.0 1.3.0 1.3.1 1.4.0 1.5.0 1.6.0 1.7.0 1.8.0 1.7.1 1.9.0 1.10.0 1.10.1 1.11.0 1.12.0 1.11.1 > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0, 1.3.1, 1.4.0, > 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, > 1.11.1 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster
[ https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186743#comment-17186743 ] Alexey Serbin commented on KUDU-3169: - This issue has been spotted elsewhere as well. It seems to be specific to the Java Kudu client only; the C++ and Python Kudu clients don't have this issue. Below is a snippet from the client-side log: {noformat} 20/07/20 00:26:41 INFO client.AsyncKuduClient: Invalidating location edd230034290421aa36bbf83c4b3b97e(tserver-00.local:7050) for tablet a3dbde5879d3486fa68f442dff1b86d5: Service unavailable: Scan request on kudu.tserver.TabletServerService from 10.80.34.23:54724 dropped due to backpressure. The service queue is full; it has 50 items. 20/07/20 00:26:42 WARN client.AsyncKuduScanner: a3dbde5879d3486fa68f442dff1b86d5@[592d79bf710046a88bf6da9799fe26d6(terver-01.local:7050),d8677f078c754b1dac4a1aad2c5c1c7e(tserver-01.local:7050)] pretends to not know KuduScanner(table=impala::t00.p00, tablet=null, scannerId="33e4c93f3ca84ef8b5cd40c4846573f7", scanRequestTimeout=3) org.apache.kudu.client.NonRecoverableException: Scanner 33e4c93f3ca84ef8b5cd40c4846573f7 not found (it may have expired) {noformat} The tablet server at {{tserver-00.local}} drops the RPC with the scan request, and the Kudu client proceeds to the next tablet server at {{tserver-01.local}}, sending a scan continuation (not a new scan) request there. The tablet server at {{tserver-01.local}} responds with a {{Status::NotFound}} status and the specific error code {{TabletServerErrorPB::SCANNER_EXPIRED}}, hinting that the scanner with identifier {{33e4c93f3ca84ef8b5cd40c4846573f7}} might have already expired (see [the server-side code|https://github.com/apache/kudu/blob/c590a05778443bb6112e831d0b0ad0dce4b74724/src/kudu/tserver/scanners.cc#L170-L175] for details).
The tablet server at {{tserver-01.local}} could not find the scanner because the client had started the scan operation with the tablet server at {{tserver-00.local}}, not with {{tserver-01.local}}. > kudu java client throws scanner expired error while processing large scan on > High-load cluster > --- > > Key: KUDU-3169 > URL: https://issues.apache.org/jira/browse/KUDU-3169 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1 >Reporter: mintao >Priority: Major > Labels: scalability, stability > > user submits a spark task to scan a kudu table with large amount records, > after just few minutes the job failed after 4 attempts, each attempt failed > with error : > {code:java} > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at > org.apache.kudu.client.KuduException.transformException(KuduException.java:110) > at > org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) > at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at > org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:109) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) Suppressed: > org.apache.kudu.client.KuduException$OriginalException: Original asynchronous > stack trace at > org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at > org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at > org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at >
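The failure mode described in the comment above can be illustrated with a toy model (plain Python, not Kudu client code; all class and variable names here are made up for illustration). Scanner state lives only on the tablet server where the scan was opened, so when the client fails over under backpressure and sends a scan *continuation* to a different replica, that replica reports the scanner as not found:

```python
# Toy model of the KUDU-3169 failure mode: scanner state is per-server,
# so retrying a scan *continuation* on a different replica fails.
# Names are illustrative; this is not Kudu client or server code.

class TabletServer:
    def __init__(self, name):
        self.name = name
        self.scanners = {}          # scanner_id -> next row-batch offset
        self.overloaded = False     # simulates a full service queue

    def new_scan(self, scanner_id):
        if self.overloaded:
            raise RuntimeError("Service unavailable: dropped due to backpressure")
        self.scanners[scanner_id] = 0

    def continue_scan(self, scanner_id):
        if self.overloaded:
            raise RuntimeError("Service unavailable: dropped due to backpressure")
        if scanner_id not in self.scanners:
            # Mirrors the server responding with SCANNER_EXPIRED.
            raise LookupError(f"Scanner {scanner_id} not found (it may have expired)")
        self.scanners[scanner_id] += 1
        return self.scanners[scanner_id]

ts0, ts1 = TabletServer("tserver-00"), TabletServer("tserver-01")
ts0.new_scan("33e4c93f")     # the scan is opened on ts0 only
ts0.overloaded = True        # ts0 starts dropping RPCs under backpressure

# Buggy failover: the client sends a *continuation* (not a new scan) to ts1,
# which never opened this scanner.
try:
    ts1.continue_scan("33e4c93f")
    error = None
except LookupError as e:
    error = str(e)
```

A correct failover would instead restart the scan (open a new scanner) on the replica it falls back to, rather than reusing the old scanner id.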
[jira] [Updated] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster
[ https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3169: Labels: scalability stability (was: ) > kudu java client throws scanner expired error while processing large scan on > High-load cluster > --- > > Key: KUDU-3169 > URL: https://issues.apache.org/jira/browse/KUDU-3169 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1 >Reporter: mintao >Priority: Major > Labels: scalability, stability > > user submits a spark task to scan a kudu table with large amount records, > after just few minutes the job failed after 4 attempts, each attempt failed > with error : > {code:java} > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at > org.apache.kudu.client.KuduException.transformException(KuduException.java:110) > at > org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) > at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at > org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:109) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) Suppressed: > org.apache.kudu.client.KuduException$OriginalException: Original asynchronous > stack trace at > org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at > org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at > org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at > org.apache.kudu.client.Connection.messageReceived(Connection.java:391) at > org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243) at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184) > at > org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > 
org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) > at > org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) > at > org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462) > at >
[jira] [Updated] (KUDU-3169) kudu java client throws scanner expired error while processing large scan on High-load cluster
[ https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3169: Affects Version/s: 1.9.0 1.10.0 1.10.1 1.11.0 1.12.0 1.11.1 > kudu java client throws scanner expired error while processing large scan on > High-load cluster > --- > > Key: KUDU-3169 > URL: https://issues.apache.org/jira/browse/KUDU-3169 > Project: Kudu > Issue Type: Bug > Components: client, java >Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1 >Reporter: mintao >Priority: Major > > user submits a spark task to scan a kudu table with large amount records, > after just few minutes the job failed after 4 attempts, each attempt failed > with error : > {code:java} > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) > org.apache.kudu.client.NonRecoverableException: Scanner > 4e34e6f821be42b889022ec681e235cc not found (it may have expired) at > org.apache.kudu.client.KuduException.transformException(KuduException.java:110) > at > org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) > at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at > org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:109) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) Suppressed: > org.apache.kudu.client.KuduException$OriginalException: Original asynchronous > stack trace at > org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341) at > org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263) at > org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152) at > org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148) at > org.apache.kudu.client.Connection.messageReceived(Connection.java:391) at > org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243) at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184) > at > org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > 
org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) > at > org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) > at >
[jira] [Resolved] (KUDU-3181) Compilation manager queue may have too many tasks
[ https://issues.apache.org/jira/browse/KUDU-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3181. - Fix Version/s: 1.13.0 Resolution: Fixed > Compilation manager queue may have too many tasks > - > > Key: KUDU-3181 > URL: https://issues.apache.org/jira/browse/KUDU-3181 > Project: Kudu > Issue Type: Bug > Components: codegen >Reporter: Li Zhiming >Priority: Major > Fix For: 1.13.0 > > Attachments: heap.svg > > > When a client frequently scans thousands of different columns, the > code_cache_hits rate is quite low. Compilation tasks are then frequently > submitted to the queue, but the compilation manager thread cannot consume > the queue quickly enough. The queue can accumulate tons of entries, each of > which retains a copy of the schema metadata, so a lot of memory is consumed > for a long time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178138#comment-17178138 ] Alexey Serbin edited comment on KUDU-1587 at 8/15/20, 2:55 AM: --- I implemented the requested functionality with the following changelists: * [a test scenario to simulate apply queue "overload"|https://gerrit.cloudera.org/#/c/16312/] * [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/] * [controlling the admission of write requests with CoDel-like approach|http://gerrit.cloudera.org:8080/16343] These follow the approach suggested by [~tlipcon] in the comment above, but with a small variation: when deciding whether to reject an incoming write request, the number of already rejected requests isn't taken into account. Instead, the criterion is how long the queue has been in the 'overloaded' state: the longer the queue stays overloaded, the greater the probability of rejection. was (Author: aserbin): I implemented the requested functionality with the following changelists: * [a test scenario to simulate apply queue "overload"|https://gerrit.cloudera.org/#/c/16312/] * [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/] * [controlling the admission of write requests with CoDel-like approach|http://gerrit.cloudera.org:8080/16343] > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. 
Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
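The CoDel-like admission policy described in the comment above can be sketched as follows. This is a minimal illustration, not Kudu's actual implementation: the 500 ms overload threshold, the linear ramp to 100% rejection, and all names are made up for the sketch; only the core idea (rejection probability grows with how long the queue has stayed overloaded) comes from the comment.

```python
import random

# Sketch of CoDel-style admission control for the apply queue: the longer
# the queue has continuously exceeded its time threshold ("overloaded"),
# the higher the probability of rejecting an incoming write request.
# The constants and the linear ramp are illustrative, not Kudu's policy.

class ApplyQueueAdmission:
    def __init__(self, overload_threshold_ms=500, ramp_ms=5000):
        self.overload_threshold_ms = overload_threshold_ms
        self.ramp_ms = ramp_ms           # time until 100% rejection
        self.overloaded_since_ms = None  # None => queue not overloaded

    def observe_queue_time(self, queue_time_ms, now_ms):
        """Track when the queue first crossed the overload threshold."""
        if queue_time_ms > self.overload_threshold_ms:
            if self.overloaded_since_ms is None:
                self.overloaded_since_ms = now_ms
        else:
            self.overloaded_since_ms = None  # queue recovered

    def rejection_probability(self, now_ms):
        if self.overloaded_since_ms is None:
            return 0.0
        overloaded_for_ms = now_ms - self.overloaded_since_ms
        return min(1.0, overloaded_for_ms / self.ramp_ms)

    def admit(self, now_ms, rng=random.random):
        """Probabilistically admit or reject an incoming write."""
        return rng() >= self.rejection_probability(now_ms)

adm = ApplyQueueAdmission()
adm.observe_queue_time(queue_time_ms=900, now_ms=0)  # queue becomes overloaded
p_early = adm.rejection_probability(now_ms=1_000)    # overloaded for 1 s
p_late = adm.rejection_probability(now_ms=10_000)    # overloaded for 10 s
```

Note that, matching the comment, the count of already-rejected requests never appears in the state: only the duration of the overloaded condition drives the rejection probability.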
[jira] [Commented] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178138#comment-17178138 ] Alexey Serbin commented on KUDU-1587: - I implemented the requested functionality with the following changelists: * [a test scenario to simulate apply queue "overload"|https://gerrit.cloudera.org/#/c/16312/] * [tracking the state of the apply queue|https://gerrit.cloudera.org/#/c/16332/] * [controlling the admission of write requests with CoDel-like approach|http://gerrit.cloudera.org:8080/16343] > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-1587: Code Review: http://gerrit.cloudera.org:8080/16343 > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175778#comment-17175778 ] Alexey Serbin edited comment on KUDU-3119 at 8/11/20, 6:44 PM: --- I changed the priority to BLOCKER in the context of cutting a new release soon. It would be great to clarify the following: # If this is a real race, could it affect data consistency, leading to data corruption or the like in the long run? # If this is a race that could cause a corruption (done either by the {{kudu}} CLI tool or by the kudu tablet server), I think it should be fixed before cutting the next release. was (Author: aserbin): I changed the priority to BLOCKER in the context of cutting a new release soon. It would be great to clarify on this: # If this is a real race, could this affect data consistency, leading to data corruption or alike in the long run? # If this is a race that could cause a corruption (done either by the {{kudu}} CLI tool or by kudu tablet server,) it should be fixed before cutting the upcoming release. 
> ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Blocker > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, > kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175778#comment-17175778 ] Alexey Serbin commented on KUDU-3119: - I changed the priority to BLOCKER in the context of cutting a new release soon. It would be great to clarify the following: # If this is a real race, could it affect data consistency, leading to data corruption or the like in the long run? # If this is a race that could cause a corruption (done either by the {{kudu}} CLI tool or by the kudu tablet server), it should be fixed before cutting the upcoming release. > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Blocker > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, > kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 
4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3119: Priority: Blocker (was: Minor) > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Blocker > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, > kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-1587) Memory-based backpressure is insufficient on seek-bound workloads
[ https://issues.apache.org/jira/browse/KUDU-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-1587: --- Assignee: Alexey Serbin > Memory-based backpressure is insufficient on seek-bound workloads > - > > Key: KUDU-1587 > URL: https://issues.apache.org/jira/browse/KUDU-1587 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 0.10.0 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Critical > Labels: roadmap-candidate > Attachments: graph.png, queue-time.png > > > I pushed a uniform random insert workload from a bunch of clients to the > point that the vast majority of bloom filters no longer fit in buffer cache, > and the compaction had fallen way behind. Thus, every inserted row turns into > 40+ seeks (due to non-compact data) and takes 400-500ms. In this kind of > workload, the current backpressure (based on memory usage) is insufficient to > prevent ridiculously long queues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172751#comment-17172751 ] Alexey Serbin commented on KUDU-3180: - [~awong] put together a great summary of the recent discussion. I just want to add my two cents, hoping it might be useful. It's about looking at this from a generic perspective, i.e. one ignorant of the compaction/flush implementation details :). During our recent discussion of this issue with [~awong] and [~granthenke], one observation was that using {{memory_size * time_since_last_flush}} as the simplest proxy for the cost function makes it easier (at least for me) to reason about an alternative policy that takes into account both the size and the age of the datasets to flush. The idea is that when flushes/compactions are starved relative to the rate of incoming updates, heavier chunks of data are more likely to be picked for flushing/compacting than smaller ones, even if the smaller ones have been around somewhat longer. However, super-old tiny data chunks are still picked up eventually, even if heavy updates keep arriving. So, picking the datasets with the highest values of the cost function among those which cross a pre-set threshold might be a model to think of. As for compactions vs. flushes, maybe it's possible to use a similar cost function with a coefficient below 1 to reflect the notion that occupying disk storage for a given time interval is cheaper than occupying the same amount of RAM. 
> kudu don't always prefer to flush MRS/DMS that anchor more memory > - > > Key: KUDU-3180 > URL: https://issues.apache.org/jira/browse/KUDU-3180 > Project: Kudu > Issue Type: Bug >Reporter: YifanZhang >Priority: Major > Attachments: image-2020-08-04-20-26-53-749.png, > image-2020-08-04-20-28-00-665.png > > > Current time-based flush policy always give a flush op a high score if we > haven't flushed for the tablet in a long time, that may lead to starvation of > ops that could free more memory. > We set -flush_threshold_mb=32, -flush_threshold_secs=1800 in a cluster, and > find that some small MRS/DMS flushes has a higher perf score than big MRS/DMS > flushes and compactions, which seems not so reasonable. > !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3178) An option to terminate connections which have been open for very long time
[ https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168126#comment-17168126 ] Alexey Serbin commented on KUDU-3178: - Ah, probably the essence was in _not_ terminating a connection, but in sending back an authn error that would lead to re-authentication (e.g., re-acquiring the authn token). Yes, that might be a good option for addressing the actual issue behind those long-lived connections. Thanks! > An option to terminate connections which have been open for very long time > -- > > Key: KUDU-3178 > URL: https://issues.apache.org/jira/browse/KUDU-3178 > Project: Kudu > Issue Type: Improvement > Components: master, security, tserver >Reporter: Alexey Serbin >Priority: Major > > A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} > and keep that connection open indefinitely by issuing some method at least > once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., call > {{Ping()}} method). This means there isn't a limit on how long an client can > have access to cluster once it's authenticated, unless {{kudu-master}} and > {{kudu-tserver}} processes are restarted. When fine-grained authorization if > enforced, this issue is really benign because such lingering clients are > unable to call any methods that require authz token to be provided. > It would be nice to address this by providing an option to terminate > connections which were established long time ago. Both the interval of the > maximum connection lifetime and whether to terminate over-the-TTL connections > should be controlled by flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3178) An option to terminate connections which have been open for very long time
[ https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168123#comment-17168123 ] Alexey Serbin commented on KUDU-3178: - Why 'instead'? That's exactly one of the ways I was thinking of implementing this :) > An option to terminate connections which have been open for very long time > -- > > Key: KUDU-3178 > URL: https://issues.apache.org/jira/browse/KUDU-3178 > Project: Kudu > Issue Type: Improvement > Components: master, security, tserver >Reporter: Alexey Serbin >Priority: Major > > A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} > and keep that connection open indefinitely by issuing some method at least > once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., call > {{Ping()}} method). This means there isn't a limit on how long an client can > have access to cluster once it's authenticated, unless {{kudu-master}} and > {{kudu-tserver}} processes are restarted. When fine-grained authorization if > enforced, this issue is really benign because such lingering clients are > unable to call any methods that require authz token to be provided. > It would be nice to address this by providing an option to terminate > connections which were established long time ago. Both the interval of the > maximum connection lifetime and whether to terminate over-the-TTL connections > should be controlled by flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3178) An option to terminate connections which have been open for very long time
[ https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3178: Summary: An option to terminate connections which have been open for very long time (was: Terminate connections which have been open for long time) > An option to terminate connections which have been open for very long time > -- > > Key: KUDU-3178 > URL: https://issues.apache.org/jira/browse/KUDU-3178 > Project: Kudu > Issue Type: Improvement > Components: master, security, tserver >Reporter: Alexey Serbin >Priority: Major > > A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} > and keep that connection open indefinitely by issuing some method at least > once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., call > {{Ping()}} method). This means there isn't a limit on how long an client can > have access to cluster once it's authenticated, unless {{kudu-master}} and > {{kudu-tserver}} processes are restarted. When fine-grained authorization if > enforced, this issue is really benign because such lingering clients are > unable to call any methods that require authz token to be provided. > It would be nice to address this by providing an option to terminate > connections which were established long time ago. Both the interval of the > maximum connection lifetime and whether to terminate over-the-TTL connections > should be controlled by flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3178) Terminate connections which have been open for longer than authn token expiration period
Alexey Serbin created KUDU-3178: --- Summary: Terminate connections which have been open for longer than authn token expiration period Key: KUDU-3178 URL: https://issues.apache.org/jira/browse/KUDU-3178 Project: Kudu Issue Type: Improvement Components: master, security, tserver Reporter: Alexey Serbin A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} and keep that connection open indefinitely by invoking some method at least once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., calling the {{Ping()}} method). This means there isn't a limit on how long a client can have access to the cluster once it's authenticated, unless the {{kudu-master}} and {{kudu-tserver}} processes are restarted. When fine-grained authorization is enforced, this issue is really benign because such lingering clients are unable to call any methods that require an authz token to be provided. It would be nice to address this by providing an option to terminate connections which were established a long time ago. Both the maximum connection lifetime and whether to terminate over-the-TTL connections should be controlled by flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3178) Terminate connections which have been open for long time
[ https://issues.apache.org/jira/browse/KUDU-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3178: Summary: Terminate connections which have been open for long time (was: Terminate connections which have been open for longer than authn token expiration period) > Terminate connections which have been open for long time > > > Key: KUDU-3178 > URL: https://issues.apache.org/jira/browse/KUDU-3178 > Project: Kudu > Issue Type: Improvement > Components: master, security, tserver >Reporter: Alexey Serbin >Priority: Major > > A Kudu client can open a connection to {{kudu-master}} or {{kudu-tserver}} > and keep that connection open indefinitely by issuing some method at least > once every {{\-\-rpc_default_keepalive_time_ms}} interval (e.g., call > {{Ping()}} method). This means there isn't a limit on how long an client can > have access to cluster once it's authenticated, unless {{kudu-master}} and > {{kudu-tserver}} processes are restarted. When fine-grained authorization if > enforced, this issue is really benign because such lingering clients are > unable to call any methods that require authz token to be provided. > It would be nice to address this by providing an option to terminate > connections which were established long time ago. Both the interval of the > maximum connection lifetime and whether to terminate over-the-TTL connections > should be controlled by flags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3173) Document time source options and recommendations
[ https://issues.apache.org/jira/browse/KUDU-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3173: Description: It's necessary to document existing time source options and recommendations for Kudu. Since the introduction of the [built-in NTP client|https://github.com/apache/kudu/commit/c103d51a52d00c3a9d062e06e20a5cc8c98df9a0] and the [{{auto}} time source|https://github.com/apache/kudu/commit/bd8e8f8b805bec5673590dffa67e48fbc9cfe208], more options are available while deploying Kudu clusters, but these are not properly documented yet. Probably, the best place to add that information is at the [configuration page|https://kudu.apache.org/docs/configuration.html]. was: It's necessary to document existing time source options and recommendations for Kudu. Since the introduction of the built-in NTP client and the {{auto}} time source, more options are available, but those are not documented. Probably, the proper place to add that information is at the [configuration page|https://kudu.apache.org/docs/configuration.html]. > Document time source options and recommendations > > > Key: KUDU-3173 > URL: https://issues.apache.org/jira/browse/KUDU-3173 > Project: Kudu > Issue Type: Task > Components: documentation >Reporter: Alexey Serbin >Priority: Major > > It's necessary to document existing time source options and recommendations > for Kudu. Since the introduction of the [built-in NTP > client|https://github.com/apache/kudu/commit/c103d51a52d00c3a9d062e06e20a5cc8c98df9a0] > and the [{{auto}} time > source|https://github.com/apache/kudu/commit/bd8e8f8b805bec5673590dffa67e48fbc9cfe208], > more options are available while deploying Kudu clusters, but these are not > properly documented yet. > Probably, the best place to add that information is at the [configuration > page|https://kudu.apache.org/docs/configuration.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3173) Document time source options and recommendations
Alexey Serbin created KUDU-3173: --- Summary: Document time source options and recommendations Key: KUDU-3173 URL: https://issues.apache.org/jira/browse/KUDU-3173 Project: Kudu Issue Type: Task Reporter: Alexey Serbin It's necessary to document existing time source options and recommendations for Kudu. Since the introduction of the built-in NTP client and the {{auto}} time source, more options are available, but those are not documented. Probably, the proper place to add that information is at the [configuration page|https://kudu.apache.org/docs/configuration.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3172) Enable hybrid clock and built-in NTP client in Docker by default
[ https://issues.apache.org/jira/browse/KUDU-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162899#comment-17162899 ] Alexey Serbin commented on KUDU-3172: - Another option is to set {{\-\-time_source=system_unsync}} if the whole dockerized Kudu cluster runs on a single host. > Enable hybrid clock and built-in NTP client in Docker by default > > > Key: KUDU-3172 > URL: https://issues.apache.org/jira/browse/KUDU-3172 > Project: Kudu > Issue Type: Improvement >Affects Versions: 1.12.0 >Reporter: Grant Henke >Assignee: Grant Henke >Priority: Minor > > Currently the docker entrypoint sets `--use_hybrid_clock=false` by default. > This can cause unusual issues when snapshot scans are needed. Now that the > built-in client is available we should switch to use that by default in the > docker image by setting `--time_source=auto`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155693#comment-17155693 ] Alexey Serbin commented on KUDU-3119: - One more TSAN trace: [^kudu-tool-test.3.txt.xz] > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Minor > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, > kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3119: Attachment: kudu-tool-test.3.txt.xz > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Minor > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.3.txt.xz, > kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155096#comment-17155096 ] Alexey Serbin commented on KUDU-3119: - It seems the issue started manifesting itself after https://github.com/apache/kudu/commit/98f44f4537ceddffedaf9afce26b634c4ab2142a > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Minor > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155078#comment-17155078 ] Alexey Serbin commented on KUDU-3119: - One more instance of the TSAN race report; the log is attached. [^kudu-tool-test.20200709.txt.xz] I guess there is a real race when trying to add a block at the same index from two different threads. Yes, there is a lock per index, but consider what happens when two threads use {{operator[]}} to access an element at an index which was not yet present: {noformat} bool LogBlockManager::AddLogBlock(LogBlockRefPtr lb) { // InsertIfNotPresent doesn't use move semantics, so instead we just // insert an empty scoped_refptr and assign into it down below rather // than using the utility function. int index = lb->block_id().id() & kBlockMapMask; std::lock_guard l(*managed_block_shards_[index].lock); auto& blocks_by_block_id = *managed_block_shards_[index].blocks_by_block_id; LogBlockRefPtr* entry_ptr = &blocks_by_block_id[lb->block_id()]; if (*entry_ptr) { // Already have an entry for this block ID. return false; } ... 
{noformat} > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Minor > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
[ https://issues.apache.org/jira/browse/KUDU-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3119: Attachment: kudu-tool-test.20200709.txt.xz > ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN > --- > > Key: KUDU-3119 > URL: https://issues.apache.org/jira/browse/KUDU-3119 > Project: Kudu > Issue Type: Bug > Components: CLI, test >Reporter: Alexey Serbin >Priority: Minor > Attachments: kudu-tool-test.20200709.txt.xz, kudu-tool-test.log.xz > > > Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} > reports races for TSAN builds: > {noformat} > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: > Failure > Failed > Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: > process exited with non-ze > ro status 66 > Google Test trace: > /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: > W0506 17:5 > 6:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true > I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided > I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log > directory (fs_wal_dir) as metad > ata directory > I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory > manager: real 0.007s > user 0.005s sys 0.002s > I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open > files per process limi > t of 1048576; it is already as high as it can go > I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm > with capacity 419430 > == > WARNING: ThreadSanitizer: data race (pid=4432) > ... > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-3156) Whether the CVE-2019-17543 vulnerability of lz affects kudu
[ https://issues.apache.org/jira/browse/KUDU-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3156. - Fix Version/s: n/a Resolution: Information Provided Kudu doesn't use the {{LZ4_compress_fast}} call, so it's not affected by CVE-2019-17543. > Whether the CVE-2019-17543 vulnerability of lz affects kudu > --- > > Key: KUDU-3156 > URL: https://issues.apache.org/jira/browse/KUDU-3156 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.8.0 >Reporter: yejiabao_h >Priority: Major > Fix For: n/a > > > LZ4 before 1.9.2 has a heap-based buffer overflow in LZ4_write32 (related to > LZ4_compress_destSize), affecting applications that call LZ4_compress_fast > with a large input. (This issue can also lead to data corruption.) NOTE: the > vendor states "only a few specific / uncommon usages of the API are at risk." > > Whether the CVE-2019-17543 vulnerability of lz affects kudu? if yes, what is > the impact? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3163) Long after restarting kudu-tserver nodes, follower replicas continue rejecting scan requests with 'Uninitialized: safe time has not yet been initialized' error
Alexey Serbin created KUDU-3163: --- Summary: Long after restarting kudu-tserver nodes, follower replicas continue rejecting scan requests with 'Uninitialized: safe time has not yet been initialized' error Key: KUDU-3163 URL: https://issues.apache.org/jira/browse/KUDU-3163 Project: Kudu Issue Type: Bug Components: tserver Reporter: Alexey Serbin Attachments: logs.tar.bz2 There was a report on a strange state of tablet replicas after some sort of rolling restart. ksck with checksum reported the tablet was fine, but follower replicas continued rejecting scan requests with {{Uninitialized: safe time has not yet been initialized}} error. It seems the issue went away after forcing tablet leader re-election. No new write operations (INSERT, UPDATE, DELETE) were issued against the tablet. As already mentioned, some nodes in the cluster were restarted, and before doing that {{\-\-follower_unavailable_considered_failed_sec}} flag was set to {{3600}}. At this time, I don't have a clear picture of what was going on, but I just wanted to dump available information. I need to do a root cause analysis to produce a clear description and diagnosis for the issue. The logs are attached (these are filtered tablet server logs containing the lines attributed only to the affected tablet: UUID {{c56432b0164e45d98175f26a54d65270}}). At the time when the logs were captured, {{hdp025}} hosted the leader replica of the tablet, while {{hdp014}} and {{hdp035}} hosted the follower ones. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152197#comment-17152197 ] Alexey Serbin commented on KUDU-3154: - This is one of the recent changelists that might have introduced the culprit: https://github.com/apache/kudu/commit/15af717e8c200d4e7a33a25171a6bfe58d70aa65 However, I cannot repro the failure when running it via dist-test: {noformat} $ ~/bin/run_32_dist_test.sh ./bin/ranger_client-test --gtest_filter=RangerClientTestBase.TestLogging [found] [hashed/size/to hash] [looked up/to lookup] [uploaded/size/to upload/size] [3628] [3626/978.5Mib/3627] [0/3626] [0/0b/0/0b] 600ms 0d9dad6f00b682663cbf81cd1c93a849ce3a96f4 ranger_client-test.0 [3629] [3628/1.01Gib/3628] [3628/3628] [1/12.4Kib/2/2.05Mib] 8.3s Hits: 3626 (1.01Gib) Misses : 2 (2.05Mib) Duration: 9.607s INFO:dist_test.client:Submitting job to http://dist-test.cloudera.org//submit_job INFO:dist_test.client:Submitted job aserbin.1594055658.80007 INFO:dist_test.client:Watch your results at http://dist-test.cloudera.org//job?job_id=aserbin.1594055658.80007 769.1s 32/32 tests complete {noformat} > RangerClientTestBase.TestLogging sometimes fails > > > Key: KUDU-3154 > URL: https://issues.apache.org/jira/browse/KUDU-3154 > Project: Kudu > Issue Type: Bug > Components: ranger, test >Affects Versions: 1.13.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ranger_client-test.txt.xz > > > The {{RangerClientTestBase.TestLogging}} scenario of the > {{ranger_client-test}} sometimes fails (all types of builds) with error > message like below: > {noformat} > src/kudu/ranger/ranger_client-test.cc:398: Failure > Failed > > Bad status: Timed out: timed out while in flight > > I0620 07:06:02.907177 1140 server.cc:247] Received an EOF from the > subprocess > I0620 07:06:02.910923 1137 server.cc:317] get failed, inbound queue shut > down: Aborted: > I0620 07:06:02.910964 1141 server.cc:380] outbound queue shut down: Aborted: > > I0620 
07:06:02.910995 1138 server.cc:317] get failed, inbound queue shut > down: Aborted: > I0620 07:06:02.910984 1139 server.cc:317] get failed, inbound queue shut > down: Aborted: > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Fix Version/s: 1.13.0 Resolution: Fixed Status: Resolved (was: In Review) > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: consensus, tserver >Reporter: William Berkeley >Assignee: Alexey Serbin >Priority: Major > Labels: performance, scalability > Fix For: 1.13.0 > > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 > _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22185,22194,22193,22188,22187,22186] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb8bff8 > 
kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() > 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync() > 0xaa3742 kudu::tablet::TabletReplica::SubmitWrite() > 0x92812d kudu::tserver::TabletServiceImpl::Write() >0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22192,22191] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() >0x1e13dec kudu::rpc::ResultTracker::TrackRpc() >0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4426] > 0x379ba0f710 >0x206d3d0 >0x212fd25 google::protobuf::Message::SpaceUsedLong() >0x211dee4 > google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong() > 0xb6658e kudu::consensus::LogCache::AppendOperations() > 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations() > 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation() > 0xb7c675 > kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() > 0xb8c147 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > {noformat} > {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to > take the lock to check the term and the Raft role. When many RPCs come in for > the same tablet, the contention can hog service threads and cause queue > overflows on busy systems. > Yugabyte switched their equivalent lock to be an atomic that allows them to > read the term and role wait-free. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3154) RangerClientTestBase.TestLogging sometimes fails
Alexey Serbin created KUDU-3154: --- Summary: RangerClientTestBase.TestLogging sometimes fails Key: KUDU-3154 URL: https://issues.apache.org/jira/browse/KUDU-3154 Project: Kudu Issue Type: Bug Components: ranger, test Affects Versions: 1.13.0 Reporter: Alexey Serbin Attachments: ranger_client-test.txt.xz The {{RangerClientTestBase.TestLogging}} scenario of {{ranger_client-test}} sometimes fails (in all build types) with an error message like the one below: {noformat} src/kudu/ranger/ranger_client-test.cc:398: Failure Failed Bad status: Timed out: timed out while in flight I0620 07:06:02.907177 1140 server.cc:247] Received an EOF from the subprocess I0620 07:06:02.910923 1137 server.cc:317] get failed, inbound queue shut down: Aborted: I0620 07:06:02.910964 1141 server.cc:380] outbound queue shut down: Aborted: I0620 07:06:02.910995 1138 server.cc:317] get failed, inbound queue shut down: Aborted: I0620 07:06:02.910984 1139 server.cc:317] get failed, inbound queue shut down: Aborted: {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Component/s: (was: perf) tserver consensus
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Status: In Review (was: In Progress)
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Code Review: https://gerrit.cloudera.org/#/c/16034/
[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2727: --- Assignee: Alexey Serbin
[jira] [Updated] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2727: Labels: performance scalability (was: )
[jira] [Commented] (KUDU-3129) ToolTest.TestHmsList can timeout
[ https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136729#comment-17136729 ] Alexey Serbin commented on KUDU-3129: - The test also times out in case of RELEASE builds: the log is attached. [^kudu-tool-test.2.txt.xz] > ToolTest.TestHmsList can timeout > > > Key: KUDU-3129 > URL: https://issues.apache.org/jira/browse/KUDU-3129 > Project: Kudu > Issue Type: Bug > Components: hms, test >Affects Versions: 1.12.0 >Reporter: Andrew Wong >Priority: Major > Attachments: kudu-tool-test.2.txt, kudu-tool-test.2.txt.xz > > > When running in TSAN mode, the test timed out, spending 10 minutes not really > doing anything. It isn't obvious why, but ToolTest.TestHmsList can timeout, > appearing to hang while running the HMS tool. > {code} > I0521 22:31:49.436857 4601 catalog_manager.cc:1161] Initializing in-progress > tserver states... > I0521 22:31:49.446161 4606 hms_notification_log_listener.cc:228] Skipping > Hive Metastore notification log poll: Service unavailable: Catalog manager is > not initialized. State: Starting > I0521 22:31:49.839709 4488 heartbeater.cc:325] Connected to a master server > at 127.0.89.254:42487 > I0521 22:31:49.845547 4559 master_service.cc:295] Got heartbeat from unknown > tserver (permanent_uuid: "cf9e08c4271e4d9aa28b1aacbd630908" instance_seqno: > 1590100304311876) as {username='slave'} at 127.0.89.193:33867; Asking this > server to re-register. > I0521 22:31:49.846786 4488 heartbeater.cc:416] Registering TS with master... > I0521 22:31:49.847297 4488 heartbeater.cc:465] Master 127.0.89.254:42487 > requested a full tablet report, sending... 
> I0521 22:31:49.849771 4559 ts_manager.cc:191] Registered new tserver with > Master: cf9e08c4271e4d9aa28b1aacbd630908 (127.0.89.193:43527) > I0521 22:31:49.852535 359 external_mini_cluster.cc:699] 1 TS(s) registered > with all masters > W0521 22:32:23.142868 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b060 after lost signal to thread 4531 > W0521 22:32:23.14 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b780 after lost signal to thread 4591 > W0521 22:32:28.996440 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b740 after lost signal to thread 4531 > W0521 22:32:28.996966 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b520 after lost signal to thread 4591 > W0521 22:33:05.743249 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002aae0 after lost signal to thread 4360 > W0521 22:33:05.743983 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002af00 after lost signal to thread 4486 > I0521 22:33:49.594769 4549 maintenance_manager.cc:326] P > c3cc85c33a5447b2aa520019fe162966: Scheduling > FlushMRSOp(): perf score=0.033386 > I0521 22:33:49.637208 4548 maintenance_manager.cc:525] P > c3cc85c33a5447b2aa520019fe162966: > FlushMRSOp() complete. Timing: real 0.042s > user 0.032s sys 0.008s Metrics: > {"bytes_written":6485,"cfile_init":1,"dirs.queue_time_us":675,"dirs.run_cpu_time_us":237,"dirs.run_wall_time_us":997,"drs_written":1,"lbm_read_time_us":231,"lbm_reads_lt_1ms":4,"lbm_write_time_us":1980,"lbm_writes_lt_1ms":27,"rows_written":5,"thread_start_us":953,"threads_started":2,"wal-append.queue_time_us":819} > I0521 22:33:49.639096 4549 maintenance_manager.cc:326] P > c3cc85c33a5447b2aa520019fe162966: Scheduling > UndoDeltaBlockGCOp(): 396 bytes on disk > I0521 22:33:49.640486 4548 maintenance_manager.cc:525] P > c3cc85c33a5447b2aa520019fe162966: > UndoDeltaBlockGCOp() complete. 
Timing: real > 0.001suser 0.001s sys 0.000s Metrics: > {"cfile_init":1,"lbm_read_time_us":269,"lbm_reads_lt_1ms":4} > W0521 22:34:17.794472 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002ade0 after lost signal to thread 4360 > W0521 22:34:17.795437 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002a7e0 after lost signal to thread 4486 > W0521 22:34:20.286921 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b2e0 after lost signal to thread 4531 > W0521 22:34:20.287376 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b140 after lost signal to thread 4591 > W0521 22:35:27.726336 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002af40 after lost signal to thread 4360 > W0521 22:35:27.727084 4380 debug-util.cc:402] Leaking SignalData structure > 0x7b080002a980 after lost signal to thread 4486 > W0521 22:36:12.250830 4545 debug-util.cc:402] Leaking SignalData structure > 0x7b080002b9c0 after lost signal to thread 4531 > W0521 22:36:12.251247 4545
[jira] [Updated] (KUDU-3129) ToolTest.TestHmsList can timeout
[ https://issues.apache.org/jira/browse/KUDU-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3129: Attachment: kudu-tool-test.2.txt.xz
[jira] [Resolved] (KUDU-3145) KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called
[ https://issues.apache.org/jira/browse/KUDU-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3145. - Fix Version/s: 1.13.0 Resolution: Fixed > KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called > - > > Key: KUDU-3145 > URL: https://issues.apache.org/jira/browse/KUDU-3145 > Project: Kudu > Issue Type: Sub-task >Reporter: zhaorenhai >Assignee: huangtianhua >Priority: Major > Fix For: 1.13.0 > > Time Spent: 20m > Remaining Estimate: 0h > > KUDU_LINK should be set before function APPEND_LINKER_FLAGS is called > > Because function APPEND_LINKER_FLAGS contains the following logic: > {code:java} > if ("${LINKER_FAMILY}" STREQUAL "gold") > if("${LINKER_VERSION}" VERSION_LESS "1.12" AND > "${KUDU_LINK}" STREQUAL "d") > message(WARNING "Skipping gold <1.12 with dynamic linking.") > continue() > endif() > endif() > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126243#comment-17126243 ] Alexey Serbin commented on KUDU-2727: - One more set of stack traces: {noformat} tids=[1324418] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb4236d kudu::consensus::Peer::SendNextRequest() 0xb43771 _ZN5boost6detail8function26void_function_obj_invoker0IZN4kudu9consensus4Peer13SignalRequestEbEUlvE_vE6invokeERNS1_15function_bufferE 0x1eb1d1d kudu::FunctionRunnable::Run() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[93293,93284,93285,93286,93287,93288,93289,93290,93291,93292,93304,93294,93295,93296,93297,93298,93299,93300,93301,93302,93303,93313,93322,93321,93320,93319,93318,93317,93316,93315,93314,93283,93312,93311,93310,93309,93308,93307,93306,93305] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[1324661] 0x7f61b79fc5e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7df8e kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone tids=[93383] 0x7f61b79fc5e0 0x7f61b79f8cf2 __pthread_cond_timedwait 0x1dfcfa9 kudu::ConditionVariable::WaitUntil() 0xb73bc7 
kudu::consensus::RaftConsensus::UpdateReplica() 0xb75128 kudu::consensus::RaftConsensus::Update() 0x92c5d1 kudu::tserver::ConsensusServiceImpl::UpdateConsensus() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7f61b79f4e25 start_thread 0x7f61b5cd234d __clone {noformat} Thread {{93383}} holds the lock while waiting on a condition variable, blocking many other threads. > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Priority: Major > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 > _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > 
tids=[22185,22194,22193,22188,22187,22186] > 0x379ba0f710 >0x1fb951a
[jira] [Comment Edited] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125466#comment-17125466 ] Alexey Serbin edited comment on KUDU-2727 at 6/4/20, 3:14 AM: -- Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866932,1866929] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c24c kudu::log::Log::AsyncAppendCommit() 0xaad489 
kudu::tablet::TransactionDriver::ApplyTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866928] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c493 kudu::log::Log::AsyncAppendReplicates() 0xb597e9 kudu::consensus::LogCache::AppendOperations() 0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations() 0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation() 0xb6f28c kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() 0xb7dff8 kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() {noformat} In the stacks above, thread {{1866928}} is holding a lock taken in {{RaftConsensus::Replicate()}} while waiting on a condition variable in {{Log::AsyncAppend()}}, calling {{entry_batch_queue_.BlockingPut(entry_batch.get())}}. 
was (Author: aserbin): Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8
[jira] [Assigned] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-2727: --- Assignee: (was: Mike Percy) > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Priority: Major > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 > _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22185,22194,22193,22188,22187,22186] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb8bff8 > kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() > 0xaaaef9 kudu::tablet::TransactionDriver::ExecuteAsync() > 0xaa3742 
kudu::tablet::TabletReplica::SubmitWrite() > 0x92812d kudu::tserver::TabletServiceImpl::Write() >0x1e28f3c kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[22192,22191] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() >0x1e13dec kudu::rpc::ResultTracker::TrackRpc() >0x1e28ef5 kudu::rpc::GeneratedServiceIf::Handle() >0x1e2986a kudu::rpc::ServicePool::RunThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4426] > 0x379ba0f710 >0x206d3d0 >0x212fd25 google::protobuf::Message::SpaceUsedLong() >0x211dee4 > google::protobuf::internal::GeneratedMessageReflection::SpaceUsedLong() > 0xb6658e kudu::consensus::LogCache::AppendOperations() > 0xb5c539 kudu::consensus::PeerMessageQueue::AppendOperations() > 0xb5c7c7 kudu::consensus::PeerMessageQueue::AppendOperation() > 0xb7c675 > kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() > 0xb8c147 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > {noformat} > {{kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm()}} needs to > take the lock to check the term and the Raft role. When many RPCs come in for > the same tablet, the contention can hog service threads and cause queue > overflows on busy systems. > Yugabyte switched their equivalent lock to be an atomic that allows them to > read the term and role wait-free. -- This message was sent by Atlassian Jira (v8.3.4#803005)
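The wait-free read mentioned in the last paragraph can be sketched by packing the term and role into a single atomic word, so a hot path like {{CheckLeadershipAndBindTerm()}} can read both without taking the consensus lock. The class, names, and packing scheme below are illustrative assumptions, not Kudu's or Yugabyte's actual code:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of a wait-free term/role read: both values are published together
// in one 64-bit word. Role is assumed to fit in the low 2 bits; the term
// occupies the remaining bits.
enum class RaftRole : uint64_t { kFollower = 0, kLeader = 1, kLearner = 2 };

class TermAndRole {
 public:
  // Writer side (still serialized, e.g. under the consensus lock):
  // publish term and role together with a single atomic store, so readers
  // never observe a term from one configuration and a role from another.
  void Set(uint64_t term, RaftRole role) {
    state_.store((term << 2) | static_cast<uint64_t>(role),
                 std::memory_order_release);
  }

  // Reader side: one atomic load, no lock acquisition, no spinning.
  void Get(uint64_t* term, RaftRole* role) const {
    const uint64_t s = state_.load(std::memory_order_acquire);
    *term = s >> 2;
    *role = static_cast<RaftRole>(s & 0x3);
  }

 private:
  std::atomic<uint64_t> state_{0};
};
```

Only the read path becomes lock-free; writers still serialize configuration changes as before, which is the part that matters for service threads piling up on the spinlock in the stacks above.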
[jira] [Commented] (KUDU-2727) Contention on the Raft consensus lock can cause tablet service queue overflows
[ https://issues.apache.org/jira/browse/KUDU-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125466#comment-17125466 ] Alexey Serbin commented on KUDU-2727: - Another set of stacks, just for more context (captured with code close to kudu 1.10.1): {noformat} tids=[1866940] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb68518 kudu::consensus::RaftConsensus::NotifyCommitIndex() 0xb4c9e7 kudu::consensus::PeerMessageQueue::NotifyObserversTask() 0xb47850 _ZN4kudu8internal7InvokerILi2ENS0_9BindStateINS0_15RunnableAdapterIMNS_9consensus16PeerMessageQueueEFvRKSt8functionIFvPNS4_24PeerMessageQueueObserverEEFvPS5_SC_EFvNS0_17UnretainedWrapperIS5_EEZNS5_34NotifyObserversOfCommitIndexChangeElEUlS8_E_EEESH_E3RunEPNS0_13BindStateBaseE 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1370336,1370326,1370327,1370328,1370329,1370330,1370331,1370332,1370333,1370334,1370335,1370323,1370337,1370338,1370339,1370340,1370341,1370342,1370343,1370344,1370345,1370346,1370353,1370361,1370360,1370359,1370358,1370357,1370356,1370355,1370354,1370325,1370352,1370351,1370350,1370349,1370348,1370347,1370322,1370324] 0x7fc8d67f95e0 0x1ec35f4 base::internal::SpinLockDelay() 0x1ec347c base::SpinLock::SlowLock() 0xb7deb8 kudu::consensus::RaftConsensus::CheckLeadershipAndBindTerm() 0xaab010 kudu::tablet::TransactionDriver::ExecuteAsync() 0xaa344c kudu::tablet::TabletReplica::SubmitWrite() 0x928fb0 kudu::tserver::TabletServiceImpl::Write() 0x1d2e8d9 kudu::rpc::GeneratedServiceIf::Handle() 0x1d2efd9 kudu::rpc::ServicePool::RunThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866932,1866929] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c24c kudu::log::Log::AsyncAppendCommit() 0xaad489 kudu::tablet::TransactionDriver::ApplyTask() 
0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() 0x7fc8d67f1e25 start_thread 0x7fc8d4acf34d __clone tids=[1866928] 0x7fc8d67f95e0 0x7fc8d67f5943 __pthread_cond_wait 0xb99b38 kudu::log::Log::AsyncAppend() 0xb9c493 kudu::log::Log::AsyncAppendReplicates() 0xb597e9 kudu::consensus::LogCache::AppendOperations() 0xb4fa24 kudu::consensus::PeerMessageQueue::AppendOperations() 0xb4fd45 kudu::consensus::PeerMessageQueue::AppendOperation() 0xb6f28c kudu::consensus::RaftConsensus::AppendNewRoundToQueueUnlocked() 0xb7dff8 kudu::consensus::RaftConsensus::Replicate() 0xaab8e7 kudu::tablet::TransactionDriver::Prepare() 0xaac009 kudu::tablet::TransactionDriver::PrepareTask() 0x1eaedbf kudu::ThreadPool::DispatchThread() 0x1ea4a84 kudu::Thread::SuperviseThread() {noformat} > Contention on the Raft consensus lock can cause tablet service queue overflows > -- > > Key: KUDU-2727 > URL: https://issues.apache.org/jira/browse/KUDU-2727 > Project: Kudu > Issue Type: Improvement > Components: perf >Reporter: William Berkeley >Assignee: Mike Percy >Priority: Major > > Here's stacks illustrating the phenomenon: > {noformat} > tids=[2201] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb4e68e kudu::consensus::Peer::SignalRequest() > 0xb9c0df kudu::consensus::PeerManager::SignalRequest() > 0xb8c178 kudu::consensus::RaftConsensus::Replicate() > 0xaab816 kudu::tablet::TransactionDriver::Prepare() > 0xaac0ed kudu::tablet::TransactionDriver::PrepareTask() >0x1fa37ed kudu::ThreadPool::DispatchThread() >0x1f9c2a1 kudu::Thread::SuperviseThread() > 0x379ba079d1 start_thread > 0x379b6e88fd clone > tids=[4515] > 0x379ba0f710 >0x1fb951a base::internal::SpinLockDelay() >0x1fb93b7 base::SpinLock::SlowLock() > 0xb74c60 kudu::consensus::RaftConsensus::NotifyCommitIndex() > 0xb59307 kudu::consensus::PeerMessageQueue::NotifyObserversTask() > 0xb54058 >
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Component/s: master > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > Add few verification steps related to the location assignment: > * the location assignment executable is present and executable > * the location assignment executable conforms with the expected interface: it > accepts one argument (IP address or DNS name) and outputs the assigned > location into the stdout > * the same DNS name/IP address assigned the same location > * the result location output into the stdout conforms with the format for > locations in Kudu > It's possible to implement these in {{kudu-master}} using group flag > validators: see the {{GROUP_FLAG_VALIDATOR}} macro. > Performing few verification steps mentioned above should help to avoid > situations when Kudu tablet servers cannot be registered with Kudu master if > the location assignment executable path is misspelled or the executable > behaves not as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
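One of the verification steps listed above — checking that the location printed by the mapping executable conforms to Kudu's location format, i.e. a '/'-separated path such as "/dc0/rack17" — could look roughly like the following. The function name and the accepted character set are assumptions for illustration; this is not the actual kudu-master validator:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Illustrative check: a valid location is assumed to be an absolute
// '/'-separated path with non-empty components drawn from [A-Za-z0-9_.-].
bool IsValidLocation(const std::string& loc) {
  if (loc.empty() || loc[0] != '/') {
    return false;  // must start with '/'
  }
  std::size_t component_len = 0;
  for (std::size_t i = 1; i < loc.size(); ++i) {
    const char c = loc[i];
    if (c == '/') {
      if (component_len == 0) {
        return false;  // empty component, e.g. "//rack17"
      }
      component_len = 0;
    } else if (std::isalnum(static_cast<unsigned char>(c)) ||
               c == '_' || c == '-' || c == '.') {
      ++component_len;
    } else {
      return false;  // character outside the assumed allowed set
    }
  }
  // The final component must be non-empty (rejects "/" and trailing '/').
  return component_len > 0;
}
```

A group flag validator in {{kudu-master}} could run such a check against the executable's output for a sample host before registering any tablet servers.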
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Description: Add a few verification steps related to the location assignment: * the location assignment executable is present and executable * the location assignment executable conforms to the expected interface: it accepts one argument (IP address or DNS name) and outputs the assigned location to stdout * the same DNS name/IP address is assigned the same location * the resulting location output to stdout conforms to the format for locations in Kudu It's possible to implement these in {{kudu-master}} using group flag validators: see the {{GROUP_FLAG_VALIDATOR}} macro. Performing the verification steps mentioned above should help avoid situations where Kudu tablet servers cannot be registered with the Kudu master if the location assignment executable path is misspelled or the executable does not behave as expected. > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > Add a few verification steps related to the location assignment: > * the location assignment executable is present and executable > * the location assignment executable conforms to the expected interface: it > accepts one argument (IP address or DNS name) and outputs the assigned > location to stdout > * the same DNS name/IP address is assigned the same location > * the resulting location output to stdout conforms to the format for > locations in Kudu > It's possible to implement these in {{kudu-master}} using group flag > validators: see the {{GROUP_FLAG_VALIDATOR}} macro. 
> Performing the verification steps mentioned above should help avoid > situations where Kudu tablet servers cannot be registered with the Kudu master if > the location assignment executable path is misspelled or the executable > does not behave as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2781) Hardening for location awareness command-line flag
[ https://issues.apache.org/jira/browse/KUDU-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2781: Labels: observability supportability (was: ) > Hardening for location awareness command-line flag > -- > > Key: KUDU-2781 > URL: https://issues.apache.org/jira/browse/KUDU-2781 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Labels: observability, supportability > > Add few verification steps related to the location assignment: > * the location assignment executable is present and executable > * the location assignment executable conforms with the expected interface: it > accepts one argument (IP address or DNS name) and outputs the assigned > location into the stdout > * the same DNS name/IP address assigned the same location > * the result location output into the stdout conforms with the format for > locations in Kudu > It's possible to implement these in {{kudu-master}} using group flag > validators: see the {{GROUP_FLAG_VALIDATOR}} macro. > Performing few verification steps mentioned above should help to avoid > situations when Kudu tablet servers cannot be registered with Kudu master if > the location assignment executable path is misspelled or the executable > behaves not as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KUDU-2169) Allow replicas that do not exist to vote
[ https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124314#comment-17124314 ] Alexey Serbin edited comment on KUDU-2169 at 6/2/20, 9:08 PM: -- Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica and then adds a new non-voter replica: that's when the replica to be evicted falls behind the WAL segment GC threshold or experiences a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A not being able to replicate/commit the change in the Raft configuration, as described. On the other hand, such a newly added replica D in the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not in the way this JIRA proposes it to be implemented. Closing as 'Won't Do'. was (Author: aserbin): Now we have the 3-4-3 replica management scheme, and we don't use 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica, and then adds a new non-voter replica: that's when the replica to be evicted fails behind WAL segment GC threshold or experience a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A, and replica A cannot replicate/commit the change in the Raft configuration as described. From the other side, such a newly replica D in case of the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not the way how this JIRA proposes. Closing as 'Won't Do'. 
> Allow replicas that do not exist to vote > > > Key: KUDU-2169 > URL: https://issues.apache.org/jira/browse/KUDU-2169 > Project: Kudu > Issue Type: Sub-task > Components: consensus >Reporter: Mike Percy >Priority: Major > Fix For: n/a > > > In certain scenarios it is desirable for replicas that do not exist on a > tablet server to be able to vote. After the implementation of KUDU-871, > tombstoned tablets are now able to vote. However, there are circumstances (at > least in a pre- KUDU-1097 world) where voters that do not have a copy of a > replica (running or tombstoned) would be needed to vote to ensure > availability in certain edge-case failure scenarios. > The quick justification for why it would be safe for a non-existent replica > to vote is that it would be equivalent to a replica that has simply not yet > replicated any WAL entries, in which case it would be legal to vote for any > candidate. Of course, a candidate would only ask such a replica to vote for > it if it believed that replica to be a voter in its config. > Some additional discussion can be found here: > https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted > What follows is an example of a scenario where "non-existent" replicas being > able to vote would be desired: > In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, > B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). > Before A is able to replicate this config change to B or D, A is partitioned > from a network perspective. However A writes this config change to its local > WAL. After this, the entire cluster is brought down, the network is restored, > and the entire cluster is restarted. However, B fails to come back online due > to a hardware failure. 
> The only way to automatically recover in this scenario is to allow D, which > has no concept of the tablet being discussed, to vote for A to become leader, > which will then tablet copy to D and make the tablet available for writes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2169) Allow replicas that do not exist to vote
[ https://issues.apache.org/jira/browse/KUDU-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2169. - Fix Version/s: n/a Resolution: Won't Do Now we have the 3-4-3 replica management scheme, and we don't use the 3-2-3 scheme anymore. With the 3-4-3 scheme there are scenarios where the system first evicts a replica and then adds a new non-voter replica: that's when the replica to be evicted falls behind the WAL segment GC threshold or experiences a disk failure. In very rare cases it might happen that a tablet ends up with leader replica A that cannot replicate/commit the change in the Raft configuration as described. On the other hand, such a newly added replica D in the 3-4-3 scheme is a non-voter, and it cannot vote by definition. In other words, some manual intervention would be necessary in the described scenario, but not in the way this JIRA proposes. Closing as 'Won't Do'. > Allow replicas that do not exist to vote > > > Key: KUDU-2169 > URL: https://issues.apache.org/jira/browse/KUDU-2169 > Project: Kudu > Issue Type: Sub-task > Components: consensus >Reporter: Mike Percy >Priority: Major > Fix For: n/a > > > In certain scenarios it is desirable for replicas that do not exist on a > tablet server to be able to vote. After the implementation of KUDU-871, > tombstoned tablets are now able to vote. However, there are circumstances (at > least in a pre- KUDU-1097 world) where voters that do not have a copy of a > replica (running or tombstoned) would be needed to vote to ensure > availability in certain edge-case failure scenarios. > The quick justification for why it would be safe for a non-existent replica > to vote is that it would be equivalent to a replica that has simply not yet > replicated any WAL entries, in which case it would be legal to vote for any > candidate. Of course, a candidate would only ask such a replica to vote for > it if it believed that replica to be a voter in its config. 
> Some additional discussion can be found here: > https://github.com/apache/kudu/blob/master/docs/design-docs/raft-tablet-copy.md#should-a-server-be-allowed-to-vote-if-it-does_not_exist-or-is-deleted > What follows is an example of a scenario where "non-existent" replicas being > able to vote would be desired: > In a 3-2-3 re-replication paradigm, the leader (A) of a 3-replica config \{A, > B, C\} evicts one replica (C). Then, the leader (A) adds a new voter (D). > Before A is able to replicate this config change to B or D, A is partitioned > from a network perspective. However A writes this config change to its local > WAL. After this, the entire cluster is brought down, the network is restored, > and the entire cluster is restarted. However, B fails to come back online due > to a hardware failure. > The only way to automatically recover in this scenario is to allow D, which > has no concept of the tablet being discussed, to vote for A to become leader, > which will then tablet copy to D and make the tablet available for writes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-1621) Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND session
[ https://issues.apache.org/jira/browse/KUDU-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-1621. - Fix Version/s: n/a Resolution: Won't Fix Automatically flushing data in the {{KuduSession}} might block, indeed. It seems the current approach of issuing a warning when data is not flushed is good enough: it's uniform across all flush modes and avoids the application hanging on close. > Flush lingering operations upon destruction of an AUTO_FLUSH_BACKGROUND > session > --- > > Key: KUDU-1621 > URL: https://issues.apache.org/jira/browse/KUDU-1621 > Project: Kudu > Issue Type: Improvement > Components: client >Affects Versions: 1.0.0 >Reporter: Alexey Serbin >Priority: Major > Fix For: n/a > > > In current implementation of AUTO_FLUSH_BACKGROUND mode, it's necessary to > call KuduSession::Flush() or KuduSession::FlushAsync() explicitly before > destroying/abandoning a session if it's desired to have any pending > operations flushed. > As [~adar] noticed during review of https://gerrit.cloudera.org/#/c/4432/ , > it might make sense to change this behavior to automatically flush any > pending operations upon closing Kudu AUTO_FLUSH_BACKGROUND session. That > would be more consistent with the semantics of the AUTO_FLUSH_BACKGROUND mode > and more user-friendly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
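The trade-off described in the resolution above — warn about unflushed data on close rather than block to flush it — can be sketched with a hypothetical buffered-writer class; this is not the real {{KuduSession}} implementation:

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Sketch of the close-time behavior chosen for KuduSession: the destructor
// never blocks to flush; it only warns about operations that were never
// flushed. The class and its members are hypothetical.
class BufferedWriter {
 public:
  void Apply(std::string op) { pending_.push_back(std::move(op)); }

  // Explicit flush: the application opts in to any blocking here.
  void Flush() {
    // (real code would send pending_ to the server and wait)
    pending_.clear();
  }

  ~BufferedWriter() {
    if (!pending_.empty()) {
      // Warn rather than flush: closing never hangs the application.
      std::cerr << "WARNING: destroying session with " << pending_.size()
                << " unflushed operation(s)\n";
      ++dropped_sessions;
    }
  }

  static int dropped_sessions;  // observable stand-in for the warning log

 private:
  std::vector<std::string> pending_;
};
int BufferedWriter::dropped_sessions = 0;
```

The design choice keeps destruction non-blocking and uniform across flush modes; callers that care about durability call Flush() (or FlushAsync()) before letting the session go out of scope.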
[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release
[ https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120608#comment-17120608 ] Alexey Serbin commented on KUDU-3131: - Hi [~RenhaiZhao], on the server where I do a lot of compilation and testing, the glibc version is {{2.12-1.149.el6_6.9}}. It's a really old installation: CentOS 6.6 > test rw_mutex-test hangs sometimes if build_type is release > --- > > Key: KUDU-3131 > URL: https://issues.apache.org/jira/browse/KUDU-3131 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Priority: Major > > Built and tested kudu on aarch64; in release mode one test sometimes hangs > (maybe a deadlock?). The console output is as follows: > [==] Running 2 tests from 1 test case. > [--] Global test environment set-up. > [--] 2 tests from Priorities/RWMutexTest > [ RUN ] Priorities/RWMutexTest.TestDeadlocks/0 > It seems to be OK in debug mode. > Now only this one test fails sometimes on aarch64. [~aserbin] [~adar], would > you please have a look at this? Or give us some suggestions, thanks very > much. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-368) Run local benchmarks under perf-stat
[ https://issues.apache.org/jira/browse/KUDU-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-368: -- Assignee: Alexey Serbin > Run local benchmarks under perf-stat > > > Key: KUDU-368 > URL: https://issues.apache.org/jira/browse/KUDU-368 > Project: Kudu > Issue Type: Improvement > Components: test >Affects Versions: M4.5 >Reporter: Todd Lipcon >Assignee: Alexey Serbin >Priority: Minor > Labels: benchmarks, perf > > Would be nice to run a lot of our nightly benchmarks under perf-stat so we > can see on regression what factors changed (eg instruction count, cycles, > stalled cycles, cache misses, etc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2604) Add label for tserver
[ https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118788#comment-17118788 ] Alexey Serbin commented on KUDU-2604: - [~granthenke], yes, I think the remaining functionality can be broken down into smaller JIRA items. At a higher level, I see the following pieces: * Define and assign tags to tablet servers. * Update the master's placement policies to take tags into account when adding/distributing replicas of tablets. * Add support in the C++ and Java clients: clients can specify a set of tags when creating tables. * The {{kudu cluster rebalance}} tool and the auto-rebalancer honor the tags when rebalancing the corresponding tables. The tool is also able to report on tablet replicas which are placed in a non-conforming way w.r.t. the tags specified for their tables (such non-conformantly placed replicas might appear during automatic re-replication; this is similar to what we have with the current placement policies). * The {{kudu cluster ksck}} CLI tool provides information on tags for tablet servers. We can create sub-tasks for these if we decide to implement this. > Add label for tserver > - > > Key: KUDU-2604 > URL: https://issues.apache.org/jira/browse/KUDU-2604 > Project: Kudu > Issue Type: New Feature >Reporter: Hong Shen >Priority: Major > Labels: location-awareness, rack-awareness > Fix For: n/a > > Attachments: image-2018-10-15-21-52-21-426.png > > > When the cluster is bigger and bigger, a big table with a lot of tablets will > be distributed across almost all the tservers; when a client writes a batch to the big > table, it may cache connections to lots of tservers, so the scalability may be > constrained. > If the tablets in one table or partition are only on a part of the tservers, a client > will only have to cache connections to that part's tservers. So we propose to > add labels to tservers; each tserver belongs to a unique label. 
Clients > specify a label when creating a table or adding a partition; the tablets will only be > created on the tservers with the specified label. If no label is specified, the default label > will be used. > It will also benefit: > 1 Tservers across data centers. > 2 Heterogeneous tservers, like different disks, CPUs or memory. > 3 Physical isolation, especially of IO: isolate some tables from others. > 4 Gated launch: upgrade tservers one label at a time. > In our product cluster, we have encountered the above issues and they need to be > resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005)
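The tag-aware placement piece from the breakdown above can be illustrated with a toy model. This is not Kudu code: the data layout and the `place_replicas` function are hypothetical, and a real policy would also consider racks, locations, and replication constraints. The sketch restricts the candidate tablet servers to those carrying the table's tag, then prefers the least-loaded ones:

```python
def place_replicas(tservers, table_tag, num_replicas):
    """Toy tag-aware placement: keep only the servers carrying the table's
    tag (or all servers when the table has no tag), then prefer the ones
    with the fewest replicas. Illustrative only, not Kudu's policy."""
    candidates = [ts for ts in tservers
                  if table_tag is None or table_tag in ts["tags"]]
    # Least-loaded first, as a crude balancing heuristic.
    candidates.sort(key=lambda ts: ts["replicas"])
    return [ts["uuid"] for ts in candidates[:num_replicas]]

tservers = [
    {"uuid": "ts-1", "tags": {"dc-east"}, "replicas": 10},
    {"uuid": "ts-2", "tags": {"dc-east"}, "replicas": 3},
    {"uuid": "ts-3", "tags": {"dc-west"}, "replicas": 0},
]
```

For a table tagged "dc-east", only ts-1 and ts-2 are eligible, so replicas never land in "dc-west" -- the isolation property the proposal is after.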
[jira] [Reopened] (KUDU-2604) Add label for tserver
[ https://issues.apache.org/jira/browse/KUDU-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reopened KUDU-2604: - It seems this JIRA item contains some useful ideas and details which are orthogonal to the current implementation of the rack awareness feature. If implemented, they might complement the overall functionality of the placement policies in Kudu. I'm removing the 'Duplicate of KUDU-1535' resolution. > Add label for tserver > - > > Key: KUDU-2604 > URL: https://issues.apache.org/jira/browse/KUDU-2604 > Project: Kudu > Issue Type: New Feature >Reporter: Hong Shen >Priority: Major > Labels: location-awareness, rack-awareness > Fix For: n/a > > Attachments: image-2018-10-15-21-52-21-426.png > > > When the cluster is bigger and bigger, a big table with a lot of tablets will > be distributed across almost all the tservers; when a client writes a batch to the big > table, it may cache connections to lots of tservers, so the scalability may be > constrained. > If the tablets in one table or partition are only on a part of the tservers, a client > will only have to cache connections to that part's tservers. So we propose to > add labels to tservers; each tserver belongs to a unique label. Clients > specify a label when creating a table or adding a partition; the tablets will only be > created on the tservers with the specified label. If no label is specified, the default label > will be used. > It will also benefit: > 1 Tservers across data centers. > 2 Heterogeneous tservers, like different disks, CPUs or memory. > 3 Physical isolation, especially of IO: isolate some tables from others. > 4 Gated launch: upgrade tservers one label at a time. > In our product cluster, we have encountered the above issues and they need to be > resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-1865) Create fast path for RespondSuccess() in KRPC
[ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117387#comment-17117387 ] Alexey Serbin commented on KUDU-1865: - Some more stacks captured from diagnostic logs for {{kudu-master}} process (kudu 1.10): {noformat} Stacks at 0516 18:53:00.042003 (service queue overflowed for kudu.master.MasterService): tids=[736230] 0x7f803a76a5e0 0xb6219e tcmalloc::ThreadCache::ReleaseToCentralCache() 0xb62530 tcmalloc::ThreadCache::Scavenge() 0xad8a27 kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock() 0xaa3a31 kudu::master::MasterServiceImpl::GetTableSchema() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736248,736245,736243,736242] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 0xac5814 kudu::master::CatalogManager::CheckOnline() 0xae5032 kudu::master::CatalogManager::GetTableSchema() 0xaa3a85 kudu::master::MasterServiceImpl::GetTableSchema() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736239,736229,736232,736233,736234,736235,736236,736237,736238,736240,736241,736244,736247] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 0xac5814 kudu::master::CatalogManager::CheckOnline() 0xaf102f kudu::master::CatalogManager::GetTableLocations() 0xaa36f8 kudu::master::MasterServiceImpl::GetTableLocations() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone tids=[736246,736231] 0x7f803a76a5e0 0x23c5b44 base::internal::SpinLockDelay() 0x23c59cc base::SpinLock::SlowLock() 
0xad8b7c kudu::master::CatalogManager::ScopedLeaderSharedLock::ScopedLeaderSharedLock() 0xaa369d kudu::master::MasterServiceImpl::GetTableLocations() 0x221aaa9 kudu::rpc::GeneratedServiceIf::Handle() 0x221b1a9 kudu::rpc::ServicePool::RunThread() 0x23a8f84 kudu::Thread::SuperviseThread() 0x7f803a762e25 start_thread 0x7f8038a4134d __clone {noformat} > Create fast path for RespondSuccess() in KRPC > - > > Key: KUDU-1865 > URL: https://issues.apache.org/jira/browse/KUDU-1865 > Project: Kudu > Issue Type: Improvement > Components: rpc >Reporter: Sailesh Mukil >Priority: Major > Labels: perfomance, rpc > Attachments: alloc-pattern.py, cross-thread.txt > > > A lot of RPCs just respond with RespondSuccess() which returns the exact > payload every time. This takes the same path as any other response by > ultimately calling Connection::QueueResponseForCall() which has a few small > allocations. These small allocations (and their corresponding deallocations) > are called quite frequently (once for every IncomingCall) and end up taking > quite some time in the kernel (traversing the free list, spin locks etc.) > This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20 > node cluster and found the following: > The exact % of time spent is hard to quantify from the profiles, but these > were the among the top 5 of the slowest stacks: > {code:java} > impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file] > impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown > source file] > impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source > file] > impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown > source file] > impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file] > impalad ! operator delete + 0x329 - [unknown source file] > impalad ! __gnu_cxx::new_allocator::deallocate + 0x4 - > new_allocator.h:110 > impalad ! 
std::_Vector_base std::allocator>::_M_deallocate + 0x5 - stl_vector.h:178 > impalad ! ~_Vector_base + 0x4 - stl_vector.h:160 > impalad ! ~vector - stl_vector.h:425'slices' vector > impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - > connection.cc:433 > impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133 > impalad !
[jira] [Commented] (KUDU-3131) test rw_mutex-test hangs sometimes if build_type is release
[ https://issues.apache.org/jira/browse/KUDU-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117008#comment-17117008 ] Alexey Serbin commented on KUDU-3131: - I cannot reproduce this on the x86_64 architecture and I don't have access to aarch64 at this point. I'd try to attach to the hung process with a debugger and see what's going on. [~huangtianhua], did you have a chance to try that? > test rw_mutex-test hangs sometimes if build_type is release > --- > > Key: KUDU-3131 > URL: https://issues.apache.org/jira/browse/KUDU-3131 > Project: Kudu > Issue Type: Sub-task >Reporter: huangtianhua >Priority: Major > > Built and tested kudu on aarch64; in release mode there is a test that hangs > sometimes (maybe a deadlock?). The console output is as follows: > [==] Running 2 tests from 1 test case. > [--] Global test environment set-up. > [--] 2 tests from Priorities/RWMutexTest > [ RUN ] Priorities/RWMutexTest.TestDeadlocks/0 > And it seems to be OK in debug mode. > Now only this one test fails sometimes on aarch64; [~aserbin] [~adar] would > you please have a look at this? Or give some suggestions to us, thanks very > much. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely
[ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110649#comment-17110649 ] Alexey Serbin commented on KUDU-2453: - There is a reproduction scenario for the issue described in this JIRA: https://gerrit.cloudera.org/#/c/15912/ > kudu should stop creating tablet infinitely > --- > > Key: KUDU-2453 > URL: https://issues.apache.org/jira/browse/KUDU-2453 > Project: Kudu > Issue Type: Bug > Components: master, tserver >Affects Versions: 1.4.0, 1.7.2 >Reporter: LiFu He >Priority: Major > > I have met this problem again on 2018/10/26. And now the kudu version is > 1.7.2. > - > We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and > there are some load on the kudu cluster. Then someone else created a big > table which had tens of thousands of tablets from impala-shell (that was a > mistake). > {code:java} > CREATE TABLE XXX( > ... >PRIMARY KEY (...) > ) > PARTITION BY HASH (...) PARTITIONS 100, > RANGE (...) > ( > PARTITION "2018-10-24" <= VALUES < "2018-10-24\000", > PARTITION "2018-10-25" <= VALUES < "2018-10-25\000", > ... > PARTITION "2018-12-07" <= VALUES < "2018-12-07\000" > ) > STORED AS KUDU > TBLPROPERTIES ('kudu.master_addresses'= '...'); > {code} > Here are the logs after creating table (only pick one tablet as example): > {code:java} > --Kudu-master log > ==e884bda6bbd3482f94c07ca0f34f99a4== > W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC > failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:50247 dropped due to backpressure. The service queue is > full; it has 512 items. 
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS > 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1) > ... > ==Be replaced by 0b144c00f35d48cca4d4981698faef72== > W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T > P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72 > ... > I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T > P f6c9a09da7ef4fc191cab6276b942ba3: Sending > DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4 > ... > I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 > on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by > 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) > ... > W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for > tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: > Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 > already in progress: creating tablet > ... > I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of > e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for > TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1) > ... > W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS > b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC > failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:59735 dropped due to backpressure. 
The service queue is > full; it has 512 items. > I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS > b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1) > ... > ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75== > W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T > P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75 > ... > --Kudu-tserver log > I1024 11:40:52.014571 137358
[jira] [Created] (KUDU-3124) A safer way to handle CreateTablet requests
Alexey Serbin created KUDU-3124: --- Summary: A safer way to handle CreateTablet requests Key: KUDU-3124 URL: https://issues.apache.org/jira/browse/KUDU-3124 Project: Kudu Issue Type: Improvement Components: master, tserver Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0, 1.2.0 Reporter: Alexey Serbin As of now, catalog manager (a part of kudu-master) sends {{CreateTabletRequest}} RPCs as soon as they are realized by {{CatalogManager::ProcessPendingAssignments()}} when processing the list of deferred DDL operations, and at this level there aren't any restrictions on how many of those might be in flight or sent to a particular tablet server (NOTE: there is the {{\-\-max_create_tablets_per_ts}} flag, but it works on a higher level and only during the initial creation of a table). The {{CreateTablet}} requests are sent asynchronously, and if the tablet isn't created within {{\-\-tablet_creation_timeout_ms}} milliseconds, the catalog manager replaces all the tablet replicas, generating a new tablet UUID and sending the corresponding {{CreateTabletRequest}} RPCs to a potentially different set of tablet servers. The corresponding {{DeleteTabletRequest}} RPCs (to remove the replicas of the stalled-during-creation tablet) are sent separately in an asynchronous way as well. There are at least two issues with this approach: # The {{\-\-max_create_tablets_per_ts}} threshold limits the number of concurrent requests hitting one tablet server only during the initial creation of a table. However, nothing limits how many requests to create a tablet replica might hit a tablet server when adding partitions to an existing table as a result of an ALTER TABLE request. # {{DeleteTabletRequest}} RPCs sometimes might not get into the RPC queues of the corresponding tablet servers, and the catalog manager stops retrying those after the {{\-\-unresponsive_ts_rpc_timeout_ms}} interval. 
This might spiral into a situation where requests to create replacement tablet replicas pass through and are executed by tablet servers, but the corresponding requests to delete tablet replicas cannot get through because of queue overflows, with the catalog manager eventually giving up retrying the latter ones. Eventually, tablet servers end up with a huge number of tablet replicas created, and they crash after running out of memory. The crashed tablet servers cannot start after that because they run out of memory again while trying to bootstrap the huge number of tablet replicas. See https://gerrit.cloudera.org/#/c/15912/ for the reproduction scenario and [KUDU-2453|https://issues.apache.org/jira/browse/KUDU-2453] for the corresponding issue reported some time ago. -- This message was sent by Atlassian Jira (v8.3.4#803005)
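The missing per-server cap on in-flight CreateTablet requests that the description argues for can be sketched as a toy admission-control step. This is illustrative only, not Kudu code; the data shapes and the `schedule_rpcs` function are hypothetical:

```python
def schedule_rpcs(pending, in_flight, max_per_ts):
    """Toy scheduler: admit a pending (tserver, rpc) pair only while the
    destination server has fewer than max_per_ts requests in flight.
    Requests over the cap stay pending instead of piling onto the
    server's RPC queue. Illustrative only, not Kudu code."""
    admitted = []
    for ts, rpc in pending:
        if in_flight.get(ts, 0) < max_per_ts:
            in_flight[ts] = in_flight.get(ts, 0) + 1
            admitted.append((ts, rpc))
    return admitted

in_flight = {}
pending = [("ts-1", "create-a"), ("ts-1", "create-b"),
           ("ts-1", "create-c"), ("ts-2", "create-d")]
admitted = schedule_rpcs(pending, in_flight, max_per_ts=2)
```

Here "create-c" is held back because ts-1 already has two requests in flight; without such a cap, every retried CreateTablet and DeleteTablet RPC competes for the same bounded service queue, which is what produces the "dropped due to backpressure" spiral described above.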
[jira] [Updated] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3000: Attachment: ksck_remote-test.01.txt.xz > RemoteKsckTest.TestChecksumSnapshot sometimes fails > --- > > Key: KUDU-3000 > URL: https://issues.apache.org/jira/browse/KUDU-3000 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz > > > The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test > sometimes fails with the following error message: > {noformat} > W1116 06:46:18.593114 3904 tablet_service.cc:2365] Rejecting scan request > for tablet 4ce9988aac744b > 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized > src/kudu/tools/ksck_remote-test.cc:407: Failure > Failed > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3000) RemoteKsckTest.TestChecksumSnapshot sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105098#comment-17105098 ] Alexey Serbin commented on KUDU-3000: - Another failure (probably, this time the root cause is different): {noformat} /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/ksck_remote-test.cc:407: Failure Failed Bad status: Aborted: 1 errors were detected {noformat} The log is attached. [^ksck_remote-test.01.txt.xz] > RemoteKsckTest.TestChecksumSnapshot sometimes fails > --- > > Key: KUDU-3000 > URL: https://issues.apache.org/jira/browse/KUDU-3000 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: Alexey Serbin >Priority: Major > Attachments: ksck_remote-test.01.txt.xz, ksck_remote-test.txt.xz > > > The {{TestChecksumSnapshot}} scenario of the {{RemoteKsckTest}} test > sometimes fails with the following error message: > {noformat} > W1116 06:46:18.593114 3904 tablet_service.cc:2365] Rejecting scan request > for tablet 4ce9988aac744b > 1bbde2772c66cce35d: Uninitialized: safe time has not yet been initialized > src/kudu/tools/ksck_remote-test.cc:407: Failure > Failed > {noformat} > Full log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3120) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout
Alexey Serbin created KUDU-3120: --- Summary: testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) sometimes fails with timeout Key: KUDU-3120 URL: https://issues.apache.org/jira/browse/KUDU-3120 Project: Kudu Issue Type: Bug Components: test Reporter: Alexey Serbin Attachments: test-output.txt.xz The subj sometimes fails due to timeout: {noformat} Time: 56.114 There was 1 failure: 1) testHiveMetastoreIntegration(org.apache.kudu.test.TestMiniKuduCluster) org.junit.runners.model.TestTimedOutException: test timed out after 5 milliseconds at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:255) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.kudu.test.cluster.MiniKuduCluster.sendRequestToCluster(MiniKuduCluster.java:162) at org.apache.kudu.test.cluster.MiniKuduCluster.start(MiniKuduCluster.java:235) at org.apache.kudu.test.cluster.MiniKuduCluster.access$300(MiniKuduCluster.java:72) at org.apache.kudu.test.cluster.MiniKuduCluster$MiniKuduClusterBuilder.build(MiniKuduCluster.java:697) at org.apache.kudu.test.TestMiniKuduCluster.testHiveMetastoreIntegration(TestMiniKuduCluster.java:106) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3119) ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN
Alexey Serbin created KUDU-3119: --- Summary: ToolTest.TestFsAddRemoveDataDirEndToEnd reports race under TSAN Key: KUDU-3119 URL: https://issues.apache.org/jira/browse/KUDU-3119 Project: Kudu Issue Type: Bug Components: CLI, test Reporter: Alexey Serbin Attachments: kudu-tool-test.log.xz Sometimes the {{TestFsAddRemoveDataDirEndToEnd}} scenario of the {{ToolTest}} reports races for TSAN builds: {noformat} /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:266: Failure Failed Bad status: Runtime error: /tmp/dist-test-taskIZqSmU/build/tsan/bin/kudu: process exited with non-zero status 66 Google Test trace: /data0/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/kudu-tool-test.cc:265: W0506 17:56:02.744191 4432 flags.cc:404] Enabled unsafe flag: --never_fsync=true I0506 17:56:02.780252 4432 fs_manager.cc:263] Metadata directory not provided I0506 17:56:02.780442 4432 fs_manager.cc:269] Using write-ahead log directory (fs_wal_dir) as metadata directory I0506 17:56:02.789638 4432 fs_manager.cc:399] Time spent opening directory manager: real 0.007s user 0.005s sys 0.002s I0506 17:56:02.789986 4432 env_posix.cc:1676] Not raising this process' open files per process limit of 1048576; it is already as high as it can go I0506 17:56:02.790426 4432 file_cache.cc:465] Constructed file cache lbm with capacity 419430 == WARNING: ThreadSanitizer: data race (pid=4432) ... {noformat} The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3117: Description: The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with the messages like below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} The log is attached. was: The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with the messages like below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} > TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails > --- > > Key: KUDU-3117 > URL: https://issues.apache.org/jira/browse/KUDU-3117 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.12.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: tablet_server_quiescing-itest.txt.xz > > > The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario > sometimes fails (TSAN builds) with the messages like below: > {noformat} > kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure > Failed > > Bad status: Timed out: Unable to find leader of tablet > 65dba49fff3f45678638414af2fe83e4 after 10.005s. 
Status message: Not found: > Error connecting to replica: Timed out: GetConsensusState RPC to > 127.0.177.65:42397 timed out after -0.003s (SENT) > kudu/util/test_util.cc:349: Failure > Failed > {noformat} > The log is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
[ https://issues.apache.org/jira/browse/KUDU-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3117: Attachment: tablet_server_quiescing-itest.txt.xz > TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails > --- > > Key: KUDU-3117 > URL: https://issues.apache.org/jira/browse/KUDU-3117 > Project: Kudu > Issue Type: Bug > Components: test >Affects Versions: 1.12.0 >Reporter: Alexey Serbin >Priority: Minor > Attachments: tablet_server_quiescing-itest.txt.xz > > > The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario > sometimes fails (TSAN builds) with the messages like below: > {noformat} > kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure > Failed > > Bad status: Timed out: Unable to find leader of tablet > 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: > Error connecting to replica: Timed out: GetConsensusState RPC to > 127.0.177.65:42397 timed out after -0.003s (SENT) > kudu/util/test_util.cc:349: Failure > Failed > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3117) TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails
Alexey Serbin created KUDU-3117: --- Summary: TServerQuiescingITest.TestMajorityQuiescingElectsLeader sometimes fails Key: KUDU-3117 URL: https://issues.apache.org/jira/browse/KUDU-3117 Project: Kudu Issue Type: Bug Components: test Affects Versions: 1.12.0 Reporter: Alexey Serbin The {{TServerQuiescingITest.TestMajorityQuiescingElectsLeader}} test scenario sometimes fails (TSAN builds) with the messages like below: {noformat} kudu/integration-tests/tablet_server_quiescing-itest.cc:228: Failure Failed Bad status: Timed out: Unable to find leader of tablet 65dba49fff3f45678638414af2fe83e4 after 10.005s. Status message: Not found: Error connecting to replica: Timed out: GetConsensusState RPC to 127.0.177.65:42397 timed out after -0.003s (SENT) kudu/util/test_util.cc:349: Failure Failed {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3115) Improve scalability of Kudu masters
Alexey Serbin created KUDU-3115: --- Summary: Improve scalability of Kudu masters Key: KUDU-3115 URL: https://issues.apache.org/jira/browse/KUDU-3115 Project: Kudu Issue Type: Improvement Reporter: Alexey Serbin Currently, multiple masters in a multi-master Kudu cluster are used only for high availability & fault tolerance use cases, but not for sharing the load among the available master nodes. For example, Kudu clients detect the current leader master upon connecting to the cluster and send all their subsequent requests to the leader master, so serving many more clients requires running masters on more powerful nodes. The current design assumes that masters store and process requests for metadata only, but that makes sense only up to some limit on the rate of incoming client requests. It would be great to achieve better 'horizontal' scalability for Kudu masters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099242#comment-17099242 ] Alexey Serbin commented on KUDU-3114: - Right, it's possible to disable coredumps for Kudu processes by adding {{\-\-disable_core_dumps}} even if the limit for core file size is set to non-zero. My point was that enabling/disabling coredumps per {{LOG(FATAL)}} instance is not feasible. Dumping a core file might make sense when troubleshooting an issue: e.g., to see what event triggered a request to allocate an unexpectedly large amount of space if there is a bug in computing the number of bytes to allocate. Probably, we can keep that for DEBUG builds only. I'm OK with keeping this JIRA item open (so, I'm re-opening it). Feel free to submit a patch to address the issue as needed. > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reopened KUDU-3114: - > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-3114. - Fix Version/s: n/a Resolution: Information Provided > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > Fix For: n/a > > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3114) tserver writes core dump when reporting 'out of space'
[ https://issues.apache.org/jira/browse/KUDU-3114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099075#comment-17099075 ] Alexey Serbin commented on KUDU-3114: - Thank you for reporting the issue. The way fatal inconsistencies are handled in Kudu doesn't provide per-call-site control over coredump behavior. That behavior is controlled at a different level: the environment that Kudu processes are run with (check {{ulimit -c}}). As a good operational practice, it's advisable to separate the location for core files (some directory on a system partition/volume?) from the directories where Kudu stores its data and WAL. Also, consider [enabling mini-dumps in Kudu|https://kudu.apache.org/docs/troubleshooting.html#crash_reporting] and disabling core files if dumping cores isn't feasible due to space limitations. > tserver writes core dump when reporting 'out of space' > -- > > Key: KUDU-3114 > URL: https://issues.apache.org/jira/browse/KUDU-3114 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.7.1 >Reporter: Balazs Jeszenszky >Priority: Major > > Fatal log has: > {code} > F0503 23:56:27.359544 40012 status_callback.cc:35] Enqueued commit operation > failed to write to WAL: IO error: Insufficient disk space to allocate 8388608 > bytes under path (39973171200 bytes available vs 39988335247 bytes > reserved) (error 28) > {code} > Generating a core file in this case yields no benefit, and potentially > compounds the problem. -- This message was sent by Atlassian Jira (v8.3.4#803005)
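For context, the coredump behavior discussed above hangs off the process's RLIMIT_CORE resource limit, which is what {{ulimit -c}} inspects. A minimal Python sketch (illustrative only, not Kudu's actual implementation) of disabling core files for the current process:

```python
# Illustrative sketch: disable core dumps by setting the RLIMIT_CORE soft
# limit to 0, which is the same effect `ulimit -c 0` has on a shell and its
# children. A flag such as --disable_core_dumps would plausibly do the same;
# this is a stand-in, not Kudu's code.
import resource

def disable_core_dumps():
    # Lower only the soft limit; keeping the hard limit allows a privileged
    # process to re-enable core files later if needed.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

disable_core_dumps()
soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
print(soft)  # 0: the kernel will not write a core file if this process crashes
```

This is also why per-{{LOG(FATAL)}} control is awkward: the limit applies to the whole process, not to individual crash sites.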
[jira] [Commented] (KUDU-3107) TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service queue is full
[ https://issues.apache.org/jira/browse/KUDU-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096864#comment-17096864 ] Alexey Serbin commented on KUDU-3107: - I think the problem is that the code doesn't do a proper conversion of the RPC-level status code into the application status code. I think the following is missing: {noformat} if (controller.status().IsRemoteError()) { const ErrorStatusPB* err = rpc->error_response(); CHECK(err && err->has_code() && (err->code() == ErrorStatusPB::ERROR_SERVER_TOO_BUSY || err->code() == ErrorStatusPB::ERROR_UNAVAILABLE)); } {noformat} > TestRpc.TestCancellationMultiThreads fail on ARM sometimes due to service > queue is full > --- > > Key: KUDU-3107 > URL: https://issues.apache.org/jira/browse/KUDU-3107 > Project: Kudu > Issue Type: Sub-task >Reporter: liusheng >Priority: Major > Attachments: rpc-test.txt > > > The test TestRpc.TestCancellationMultiThreads fails sometimes on an ARM machine > due to the "service queue full" error. 
Related error message: > {code:java} > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 318) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 319) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 320) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 321) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 324) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 332) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 334) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 335) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 336) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 337) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 338) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 339) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 340) > Call kudu.rpc.GenericCalculatorService.PushStrings from 127.0.0.1:41516 > (request call id 341) > F0416 13:01:38.616358 31937 rpc-test.cc:1471] Check failed: > controller.status().IsAborted() || controller.status().IsServiceUnavailable() > || controller.status().ok() Remote error: Service unavailable: PushStrings > request on kudu.rpc.GenericCalculatorService from 127.0.0.1:41516 dropped due > to backpressure. The service queue is full; it has 100 items. 
> *** Check failure stack trace: *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e86bbf) received by PID 27583 (TID 0x84b1f050) from > PID 27583; stack trace: *** > @ 0x93cf0464 raise at ??:0 > @ 0x93cf18b4 abort at ??:0 > @ 0x942c5fdc google::logging_fail() at ??:0 > @ 0x942c7d40 google::LogMessage::Fail() at ??:0 > @ 0x942c9c78 google::LogMessage::SendToLog() at ??:0 > @ 0x942c7874 google::LogMessage::Flush() at ??:0 > @ 0x942ca4fc google::LogMessageFatal::~LogMessageFatal() at ??:0 > @ 0xdcee4940 kudu::rpc::SendAndCancelRpcs() at ??:0 > @ 0xdcee4b98 > _ZZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvENKUlvE_clEv > at ??:0 > @ 0xdcee76bc > _ZSt13__invoke_implIvZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_ > at ??:0 > @ 0xdcee7484 > _ZSt8__invokeIZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS5_DpOS6_ > at ??:0 > @ 0xdcee8208 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEE9_M_invokeIJLm0DTcl8__invokespcl10_S_declvalIXT_ESt12_Index_tupleIJXspT_EEE > at ??:0 > @ 0xdcee8168 > _ZNSt6thread8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_EEEclEv > at ??:0 > @ 0xdcee8110 > _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN4kudu3rpc41TestRpc_TestCancellationMultiThreads_Test8TestBodyEvEUlvE_E6_M_runEv > at ??:0 > @ 0x93f22e94 (unknown) at ??:0 > @ 0x93e1e088 start_thread at ??:0 > @ 0x93d8e4ec (unknown) at ??:0 > {code} > The attachment is the full test log. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle (was: getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65) > getEndpointChannelBindings() isn't working as expected with BouncyCastle > > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected, throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm to uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
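The root cause above is a case-sensitive lookup of the certificate's signature algorithm name when computing the tls-server-end-point channel binding (RFC 5929). A minimal Python sketch of the case-insensitive fix; the mapping table and function names are illustrative stand-ins, not the actual SecurityUtil.java code:

```python
# Illustrative sketch of the tls-server-end-point channel binding (RFC 5929):
# hash the certificate with the digest used in its signature, except that MD5
# and SHA-1 signatures map to SHA-256. Table and function names are
# hypothetical, not Kudu's actual Java implementation.
import hashlib

SIGNATURE_TO_DIGEST = {
    "md5withrsa": "sha256",   # RFC 5929: MD5 and SHA-1 map to SHA-256
    "sha1withrsa": "sha256",
    "sha256withrsa": "sha256",
    "sha384withrsa": "sha384",
    "sha512withrsa": "sha512",
}

def endpoint_channel_binding(cert_der: bytes, sig_alg_name: str) -> bytes:
    # Normalizing the case makes the lookup robust to providers (e.g.
    # BouncyCastle 1.65) that report the name as "SHA256WITHRSA".
    digest_name = SIGNATURE_TO_DIGEST.get(sig_alg_name.lower())
    if digest_name is None:
        raise RuntimeError(
            "cert uses unknown signature algorithm: " + sig_alg_name)
    return hashlib.new(digest_name, cert_der).digest()
```

With the lookup normalized this way, both the mixed-case spelling "SHA256withRSA" and the all-uppercase "SHA256WITHRSA" resolve to the same digest.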
[jira] [Created] (KUDU-3111) IWYU processes freestanding headers
Alexey Serbin created KUDU-3111: --- Summary: IWYU processes freestanding headers Key: KUDU-3111 URL: https://issues.apache.org/jira/browse/KUDU-3111 Project: Kudu Issue Type: Improvement Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.8.0, 1.7.0, 1.12.0 Reporter: Alexey Serbin When working out of the compilation database, IWYU processes only associated headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files. It would be nice to make IWYU process so-called freestanding header files. [This thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] contains very useful information on the topic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
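For context, a compilation database contains one entry per translation unit, which is why only {{.cc}} files (and, by pairing, their associated {{.h}} headers) get examined: a freestanding header has no entry of its own for IWYU to pick up. An illustrative {{compile_commands.json}} fragment (paths and flags hypothetical):

```json
[
  {
    "directory": "/home/user/kudu/build/debug",
    "command": "clang++ -std=c++17 -I../../src -c ../../src/kudu/util/foo.cc",
    "file": "../../src/kudu/util/foo.cc"
  }
]
```

Here IWYU would analyze {{foo.cc}} and its paired {{foo.h}}, but a header with no matching {{.cc}} file never appears in the database and is skipped.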
[jira] [Updated] (KUDU-3111) Make IWYU processes freestanding headers
[ https://issues.apache.org/jira/browse/KUDU-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3111: Summary: Make IWYU processes freestanding headers (was: IWYU processes freestanding headers ) > Make IWYU processes freestanding headers > > > Key: KUDU-3111 > URL: https://issues.apache.org/jira/browse/KUDU-3111 > Project: Kudu > Issue Type: Improvement >Affects Versions: 1.7.0, 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, > 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > When working out of the compilation database, IWYU processes only associated > headers, i.e. {{.h}} files that pair with corresponding {{.cc}} files. It would > be nice to make IWYU process so-called freestanding header files. [This > thread|https://github.com/include-what-you-use/include-what-you-use/issues/268] > contains very useful information on the topic. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3007) ARM/aarch64 platform support
[ https://issues.apache.org/jira/browse/KUDU-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091778#comment-17091778 ] Alexey Serbin commented on KUDU-3007: - Yes, I'm planning to take a closer look this weekend. Thank you for the contribution! > ARM/aarch64 platform support > > > Key: KUDU-3007 > URL: https://issues.apache.org/jira/browse/KUDU-3007 > Project: Kudu > Issue Type: Improvement >Reporter: liusheng >Priority: Critical > > As an important alternative to the x86 architecture, the Aarch64 (ARM) architecture is > currently the dominant architecture in small devices like phones, IoT devices, > security cameras, drones, etc. Also, more and more hardware and cloud > vendors have started to provide ARM resources, such as AWS, Huawei, Packet, > Ampere, etc. Usually, ARM servers are lower cost than x86 > servers, and now more and more ARM servers have performance comparable with > x86 servers, and are even more efficient in some areas. > We propose to add an Aarch64 CI for Kudu to promote support for Kudu > on Aarch64 platforms. We are willing to provide machines for the current > CI system and manpower for managing the CI and fixing problems that occur. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables
[ https://issues.apache.org/jira/browse/KUDU-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2986. - Fix Version/s: 1.12.0 Resolution: Fixed > Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables > -- > > Key: KUDU-2986 > URL: https://issues.apache.org/jira/browse/KUDU-2986 > Project: Kudu > Issue Type: Bug > Components: CLI, client, master, metrics >Affects Versions: 1.11.0 >Reporter: YifanZhang >Assignee: LiFu He >Priority: Major > Fix For: 1.12.0 > > > When we upgraded the cluster with pre-1.11.0 tables, we got inconsistent > values for the 'live_row_count' metric of these tables: > When visiting masterURL:port/metrics, we got 0 for old tables, and got a > positive integer for an old table with a newly added partition, which is the > count of rows in the newly added partition. > When getting table statistics via the `kudu table statistics` CLI tool, we got 0 > for old tables and for the old table with a new partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Fix Version/s: 1.12.0 Resolution: Fixed Status: Resolved (was: In Review) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Description: With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. was: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. 
> getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > Fix For: 1.12.0 > > > With [BouncyCastle|https://www.bouncycastle.org] 1.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Status: In Review (was: In Progress) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Code Review: http://gerrit.cloudera.org:8080/15664 > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin reassigned KUDU-3106: --- Assignee: Alexey Serbin > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 (was: getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65) > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 2.65 converts the name of the certificate signature > algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65
[ https://issues.apache.org/jira/browse/KUDU-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-3106: Description: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 1.65 converts the name of the certificate signature algorithm uppercase. was: With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 2.65 converts the name of the certificate signature algorithm uppercase. > getEndpointChannelBindings() isn't working as expected with BouncyCastle 1.65 > - > > Key: KUDU-3106 > URL: https://issues.apache.org/jira/browse/KUDU-3106 > Project: Kudu > Issue Type: Bug > Components: client, java, security >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.7.1, > 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Priority: Major > > With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in > https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 > isn't working as expected throwing an exception: > {noformat} > java.lang.RuntimeException: cert uses unknown signature algorithm: > SHA256WITHRSA > {noformat} > It seems BouncyCastle 1.65 converts the name of the certificate signature > algorithm uppercase. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-3106) getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65
Alexey Serbin created KUDU-3106: --- Summary: getEndpointChannelBindings() isn't working as expected with BouncyCastle 2.65 Key: KUDU-3106 URL: https://issues.apache.org/jira/browse/KUDU-3106 Project: Kudu Issue Type: Bug Components: client, java, security Affects Versions: 1.11.1, 1.11.0, 1.10.1, 1.10.0, 1.9.0, 1.7.1, 1.8.0, 1.7.0, 1.6.0, 1.5.0, 1.4.0, 1.3.1, 1.3.0 Reporter: Alexey Serbin With [BouncyCastle|https://www.bouncycastle.org] 2.65 the code in https://github.com/apache/kudu/blob/25ae6c5108cc84289f69c467d862e298d3361ea8/java/kudu-client/src/main/java/org/apache/kudu/util/SecurityUtil.java#L136-L159 isn't working as expected throwing an exception: {noformat} java.lang.RuntimeException: cert uses unknown signature algorithm: SHA256WITHRSA {noformat} It seems BouncyCastle 2.65 converts the name of the certificate signature algorithm uppercase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-2573) Fully support Chrony in place of NTP
[ https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076583#comment-17076583 ] Alexey Serbin commented on KUDU-2573: - With this [changelist|https://gerrit.cloudera.org/#/c/15456/], the necessary piece of the documentation will be in 1.12 release notes. > Fully support Chrony in place of NTP > > > Key: KUDU-2573 > URL: https://issues.apache.org/jira/browse/KUDU-2573 > Project: Kudu > Issue Type: New Feature > Components: clock, master, tserver >Reporter: Grant Henke >Assignee: Alexey Serbin >Priority: Major > Labels: clock > > This is to track fully supporting Chrony in place of NTP. Given Chrony is the > default in RHEL7+, running Kudu with Chrony is likely to be more common. > The work should entail: > * identifying and fixing or documenting any differences or gaps > * removing the experimental warnings from the documentation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KUDU-2573) Fully support Chrony in place of NTP
[ https://issues.apache.org/jira/browse/KUDU-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin resolved KUDU-2573. - Fix Version/s: 1.12.0 Resolution: Fixed > Fully support Chrony in place of NTP > > > Key: KUDU-2573 > URL: https://issues.apache.org/jira/browse/KUDU-2573 > Project: Kudu > Issue Type: New Feature > Components: clock, master, tserver >Reporter: Grant Henke >Assignee: Alexey Serbin >Priority: Major > Labels: clock > Fix For: 1.12.0 > > > This is to track fully supporting Chrony in place of NTP. Given Chrony is the > default in RHEL7+, running Kudu with Chrony is likely to be more common. > The work should entail: > * identifying and fixing or documenting any differences or gaps > * removing the experimental warnings from the documentation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-2798) Fix logging on deleted TSK entries
[ https://issues.apache.org/jira/browse/KUDU-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Serbin updated KUDU-2798: Affects Version/s: 1.10.1 1.11.0 1.11.1 > Fix logging on deleted TSK entries > -- > > Key: KUDU-2798 > URL: https://issues.apache.org/jira/browse/KUDU-2798 > Project: Kudu > Issue Type: Task >Affects Versions: 1.8.0, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.11.0, 1.11.1 >Reporter: Alexey Serbin >Assignee: Alexey Serbin >Priority: Minor > Labels: newbie > > It seems the identifiers of the deleted TSK entries in the log lines below > need decoding: > {noformat} > I0312 15:17:14.808763 71553 catalog_manager.cc:4095] T > P f05d759af7824df9aafedcc106674182: > Generated new TSK 2 > I0312 15:17:14.811144 71553 catalog_manager.cc:4133] T > P f05d759af7824df9aafedcc106674182: Deleted > TSKs: �, � > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)