[jira] [Assigned] (KUDU-3567) Resource leakage related to HashedWheelTimer in AsyncKuduScanner
[ https://issues.apache.org/jira/browse/KUDU-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YifanZhang reassigned KUDU-3567:
--------------------------------
    Assignee: YifanZhang

> Resource leakage related to HashedWheelTimer in AsyncKuduScanner
> ----------------------------------------------------------------
>
>                 Key: KUDU-3567
>                 URL: https://issues.apache.org/jira/browse/KUDU-3567
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.18.0
>            Reporter: Alexey Serbin
>            Assignee: YifanZhang
>            Priority: Major
>
> With KUDU-3498 implemented in [8683b8bdb|https://github.com/apache/kudu/commit/8683b8bdb675db96aac52d75a31d00232f7b9fb8], there are now resource leak reports; see below.
>
> Overall, the way {{HashedWheelTimer}} is used to keep scanners alive directly contradicts the recommendation at [this documentation page|https://netty.io/4.1/api/io/netty/util/HashedWheelTimer.html]:
> {quote}*Do not create many instances.*
> HashedWheelTimer creates a new thread whenever it is instantiated and started. Therefore, you should make sure to create only one instance and share it across your application. One of the common mistakes, that makes your application unresponsive, is to create a new instance for every connection.
> {quote}
> A better way to implement the keep-alive feature for scanner objects in the Kudu Java client would probably be to reuse the {{HashedWheelTimer}} instance of the corresponding {{AsyncKuduClient}}, rather than creating a new timer instance (along with its thread) per {{AsyncKuduScanner}} object. At the very least, an instance of {{HashedWheelTimer}} should be properly released/shut down to avoid resource leaks (a running thread?) when {{AsyncKuduScanner}} objects are garbage-collected.
>
> For example, below is how the leak is reported when running {{TestKuduClient.testStrings}}:
> {noformat}
> 23:04:57.774 [ERROR - main] (ResourceLeakDetector.java:327) LEAK: HashedWheelTimer.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 
> Created at:
> 	io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:312)
> 	io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:251)
> 	io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:224)
> 	io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:203)
> 	io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:185)
> 	org.apache.kudu.client.AsyncKuduScanner.<init>(AsyncKuduScanner.java:296)
> 	org.apache.kudu.client.AsyncKuduScanner.<init>(AsyncKuduScanner.java:431)
> 	org.apache.kudu.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:260)
> 	org.apache.kudu.client.TestKuduClient.testStrings(TestKuduClient.java:692)
> 	sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	java.lang.reflect.Method.invoke(Method.java:498)
> 	org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> 	org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> 	org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
> 	org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
> 	java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	java.lang.Thread.run(Thread.java:748)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
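The fix direction suggested above (one timer owned by the client and shared by all scanners, shut down once with the client) can be sketched with JDK primitives. This is a minimal illustration, not the Kudu client API: the {{Client}} and {{Scanner}} classes are hypothetical stand-ins for {{AsyncKuduClient}}/{{AsyncKuduScanner}}, and a JDK {{ScheduledExecutorService}} stands in for Netty's {{HashedWheelTimer}} so the sketch has no external dependencies.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the "share one timer" pattern: the client owns the single timer
// for its lifetime and hands it to each scanner, so no per-scanner thread is
// created and there is exactly one shutdown point. Names are hypothetical.
public class SharedTimerSketch {
    // Stand-in for AsyncKuduClient: owns the only timer thread.
    static final class Client implements AutoCloseable {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        ScheduledExecutorService timer() { return timer; }

        Scanner newScanner() { return new Scanner(timer); }  // share, don't create

        @Override public void close() { timer.shutdownNow(); }  // single shutdown point
    }

    // Stand-in for AsyncKuduScanner: borrows the client's timer for keep-alives.
    static final class Scanner {
        private final ScheduledExecutorService sharedTimer;

        Scanner(ScheduledExecutorService sharedTimer) { this.sharedTimer = sharedTimer; }

        ScheduledExecutorService timer() { return sharedTimer; }

        void scheduleKeepAlive(Runnable keepAlive, long periodMs) {
            sharedTimer.scheduleAtFixedRate(keepAlive, periodMs, periodMs,
                                            TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        try (Client client = new Client()) {
            Scanner a = client.newScanner();
            Scanner b = client.newScanner();
            // Both scanners run keep-alives on the same timer thread; nothing
            // leaks when a scanner is garbage-collected before the client closes.
            a.scheduleKeepAlive(() -> {}, 10);
            b.scheduleKeepAlive(() -> {}, 10);
            Thread.sleep(50);
        }
    }
}
```

With this shape, the leak detector report above cannot occur: the only timer is released exactly once, when the client is closed.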
[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results
[ https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YifanZhang updated KUDU-3564:
-----------------------------
    Description: 
Reproduction steps, copied from the Slack channel:

{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
  id BIGINT,
  name STRING,
  age INT,
  PRIMARY KEY(id, name, age)
)
PARTITION BY HASH (id) PARTITIONS 4,
             HASH (name) PARTITIONS 4,
             RANGE (age)
(
  PARTITION 30 <= VALUES < 60,
  PARTITION 60 <= VALUES < 90
)
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90 <= VALUES < 120
  HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;

insert into age_table values (3, 'alex', 50);
insert into age_table values (12, 'bob', 100);

// only data in the range with the custom hash schema cannot be found with an "in" predicate:
sudo -u kudu kudu table scan default.age_table -columns=id,age -predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds
{code}

> Range specific hashing table when queried with InList predicate may lead to incorrect results
> ---------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3564
>                 URL: https://issues.apache.org/jira/browse/KUDU-3564
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.17.0
>            Reporter: YifanZhang
>            Priority: Major
>
[jira] [Created] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results
YifanZhang created KUDU-3564:
--------------------------------

             Summary: Range specific hashing table when queried with InList predicate may lead to incorrect results
                 Key: KUDU-3564
                 URL: https://issues.apache.org/jira/browse/KUDU-3564
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.17.0
            Reporter: YifanZhang


Reproduction steps, copied from the Slack channel:

{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
  id BIGINT,
  name STRING,
  age INT,
  PRIMARY KEY(id, name, age)
)
PARTITION BY HASH (id) PARTITIONS 4,
             HASH (name) PARTITIONS 4,
             RANGE (age)
(
  PARTITION 30 <= VALUES < 60,
  PARTITION 60 <= VALUES < 90
)
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90 <= VALUES < 120
  HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;

insert into age_table values (3, 'alex', 50);
insert into age_table values (12, 'bob', 100);

// only data in the range with the custom hash schema cannot be found with an "in" predicate:
sudo -u kudu kudu table scan default.age_table -columns=id,age -predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds
{code}
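The likely trap in the repro above is that the added range partition has a different hash schema (3x3 buckets) from the table-wide one (4x4 buckets), so the same key value lands in a different bucket depending on which range it falls into. A small sketch, with a stand-in hash function (not Kudu's actual MurmurHash2-based hashing), shows why partition pruning for an IN-list must be re-evaluated per range rather than computed once with the table-wide schema:

```java
import java.util.List;

// Illustrative only: the same id maps to different bucket indices under
// different bucket counts, so a pruner that reuses buckets computed for the
// 4-bucket ranges will target the wrong tablets in the 3-bucket range.
public class PerRangeHashSketch {
    // Stand-in hash; Kudu's real hashing differs, but the point is only that
    // the bucket index depends on the range's bucket count.
    static int bucket(long id, int numBuckets) {
        return Math.floorMod(Long.hashCode(id * 0x9E3779B97F4A7C15L), numBuckets);
    }

    public static void main(String[] args) {
        List<Long> inList = List.of(3L, 20L);  // the IN-list from the repro
        for (long id : inList) {
            System.out.printf("id=%d: bucket in 4-bucket ranges=%d, bucket in 3-bucket range=%d%n",
                    id, bucket(id, 4), bucket(id, 3));
        }
    }
}
```

If the IN-list pruning logic consults only one hash schema, rows stored under the other schema can be skipped entirely, matching the symptom that only data in the custom-hash range goes missing.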
[jira] [Comment Edited] (KUDU-3518) node error when impala query
[ https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778944#comment-17778944 ]

YifanZhang edited comment on KUDU-3518 at 10/24/23 6:17 AM:
------------------------------------------------------------
I see in profile_error_1.17.txt:
{code:java}
00:SCAN KUDU [member.qyexternaluserdetailinfo_new]
   predicates: shoptype NOT IN (35, 56), thirdnick != 
   kudu predicates: thirdnick IS NOT NULL, isDelete = CAST(0 AS INT), ownercorpid IN (x, ), mainshopnick = x
   mem-estimate=3.00MB mem-reservation=0B thread-reservation=1
   tuple-ids=0 row-size=32B cardinality=0
   in pipelines: 00(GETNEXT)
{code}
IIUC, "kudu predicates" are the predicates that should be pushed down to the Kudu scanner, and the column 'shopnick' is not among them. Since 'shopnick' is neither a predicate column nor a projection column, it shouldn't be scanned or otherwise used during query execution. So it's quite odd that the error refers to this column ("Invalid argument: No such column: shopnick"). It looks more like an Impala issue than a Kudu issue.

I think you could create an empty table to check whether the error is related to the table schema:
{code:sql}
create table new_empty_table like member.qyexternaluserdetailinfo_new;
-- query new_empty_table to see if the error happens
{code}


> node error when impala query
> ----------------------------
>
>                 Key: KUDU-3518
>                 URL: https://issues.apache.org/jira/browse/KUDU-3518
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.17.0
>         Environment: centos7.9
>            Reporter: Pain Sun
>            Priority: Major
>         Attachments: profile_error_1.17.txt, profile_success_1.16.txt, profile_success_1.17.txt
>
> Scanning Kudu with impala-4.3.0, there is a bug when reading a table with an empty string in a primary key field.
> sql:
> {code:sql}
> select count(distinct thirdnick)
> from member.qyexternaluserdetailinfo_new
> where (
>   mainshopnick = "xxx"
>   and ownercorpid in ("xxx", "")
>   and shoptype not in ("35", "56")
>   and isDelete = 0
>   and thirdnick != ""
>   and thirdnick is not null
> );
> {code}
> error: ERROR: Unable to open scanner for node with id '1' for Kudu table 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such column: shopnick
>
> If the sql is updated like this, there is no error:
> {code:sql}
> select count(distinct thirdnick)
> from member.qyexternaluserdetailinfo_new
> where (
>   mainshopnick = "xxx"
>   and ownercorpid in ("xxx", "")
>   and shopnick not in ('')
>   and shoptype not in ("35", "56")
>   and isDelete = 0
>   and thirdnick != ""
>   and thirdnick is not null
> );
> {code}
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
> There are 100 rows in this table; 28 contain an empty string.
> table schema like this: > ++---+-+-++--+---+---+-++ > | name | type | comment | primary_key | key_unique | nullable > | default_value | encoding | compression | block_size | > ++---+-+-++--+---+---+-++ > | mainshopnick | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | shopnick | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | ownercorpid | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | shoptype | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | clientid |
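The reasoning in the comment above can be made concrete: the set of columns a scan has any business touching is the union of the projected columns and the predicate columns. This sketch uses the column names from the report; it is purely illustrative, not Impala planner or Kudu scanner code.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: compute the columns a scan references and confirm
// 'shopnick' is not among them, which is what makes the error surprising.
public class ReferencedColumnsSketch {
    static Set<String> referencedColumns(Set<String> projection, Set<String> predicateColumns) {
        Set<String> referenced = new TreeSet<>(projection);
        referenced.addAll(predicateColumns);
        return referenced;
    }

    public static void main(String[] args) {
        Set<String> projection = Set.of("thirdnick");
        // Columns named by the predicates in the failing query's plan:
        Set<String> predicateColumns =
                Set.of("shoptype", "thirdnick", "isDelete", "ownercorpid", "mainshopnick");

        Set<String> referenced = referencedColumns(projection, predicateColumns);

        // 'shopnick' is in neither the projection nor any predicate, so the
        // scanner has no reason to look it up.
        System.out.println("referenced = " + referenced);
        System.out.println("references shopnick? " + referenced.contains("shopnick"));  // prints false
    }
}
```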
[jira] [Commented] (KUDU-3518) node error when impala query
[ https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778944#comment-17778944 ] YifanZhang commented on KUDU-3518: -- I see in profile_error_1.17.txt: {code:java} 00:SCAN KUDU [member.qyexternaluserdetailinfo_new] predicates: shoptype NOT IN (35, 56), thirdnick != kudu predicates: thirdnick IS NOT NULL, isDelete = CAST(0 AS INT), ownercorpid IN (x, ), mainshopnick = x mem-estimate=3.00MB mem-reservation=0B thread-reservation=1 tuple-ids=0 row-size=32B cardinality=0 in pipelines: 00(GETNEXT) {code} IIUC, kudu predicates mean the predicates that should be pushed down to the kudu scanner, and the column 'shopnick' is not in kudu predicates. Since the column 'shopnick' is neither a predicate column nor a projection column, it shouldn't be scanned and shouldn't be used in the query execution. So it's quite weird that the error is related to this column "Invalid argument: No such column: shopnick". > node error when impala query > > > Key: KUDU-3518 > URL: https://issues.apache.org/jira/browse/KUDU-3518 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.17.0 > Environment: centos7.9 >Reporter: Pain Sun >Priority: Major > Attachments: profile_error_1.17.txt, profile_success_1.16.txt, > profile_success_1.17.txt > > > Scan kudu with impala-4.3.0 ,there is a bug when reading a table with an > empty string in primary key field. 
> sql: > select > count(distinct thirdnick) > from > member.qyexternaluserdetailinfo_new > where > ( > mainshopnick = "xxx" > and ownercorpid in ("xxx", "") > and shoptype not in ("35", "56") > and isDelete = 0 > and thirdnick != "" > and thirdnick is not null > ); > > error:ERROR: Unable to open scanner for node with id '1' for Kudu table > 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such > column: shopnick > > If update sql like this: > select > count(distinct thirdnick) > from > member.qyexternaluserdetailinfo_new > where > ( > mainshopnick = "xxx" > and ownercorpid in ("xxx", "") > and shopnick not in ('') > and shoptype not in ("35", "56") > and isDelete = 0 > and thirdnick != "" > and thirdnick is not null > ); > no error. > > this error appears in kudu-1.17.0 ,but kudu-1.16.0 is good. > > There is 100 items in this table ,28 items where empty string. > table schema like this: > ++---+-+-++--+---+---+-++ > | name | type | comment | primary_key | key_unique | nullable > | default_value | encoding | compression | block_size | > ++---+-+-++--+---+---+-++ > | mainshopnick | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | shopnick | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | ownercorpid | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | shoptype | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | clientid | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | thirdnick | string | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | id | bigint | | true | true | false > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | receivermobile | string | | false | | true > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | thirdrealname | string | | false | | true > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | remark | string | | false | | true > | | 
AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | createtime | timestamp | | false | | true > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | updatetime | timestamp | | false | | true > | | AUTO_ENCODING | DEFAULT_COMPRESSION | 0 | > | isdelete
[jira] [Commented] (KUDU-3518) node error when impala query
[ https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775760#comment-17775760 ]

YifanZhang commented on KUDU-3518:
----------------------------------
[~MadBeeDo] Does this issue only affect this specific table? Is it possible to reproduce it again?


> node error when impala query
> ----------------------------
>
>                 Key: KUDU-3518
>                 URL: https://issues.apache.org/jira/browse/KUDU-3518
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.17.0
>         Environment: centos7.9
>            Reporter: Pain Sun
>            Priority: Major
>
> Scanning Kudu with impala-4.3.0, there is a bug when reading a table with an empty string in a primary key field.
> sql:
> {code:sql}
> select count(distinct thirdnick)
> from member.qyexternaluserdetailinfo_new
> where (
>   mainshopnick = "xxx"
>   and ownercorpid in ("xxx", "")
>   and shoptype not in ("35", "56")
>   and isDelete = 0
>   and thirdnick != ""
>   and thirdnick is not null
> );
> {code}
> error: ERROR: Unable to open scanner for node with id '1' for Kudu table 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such column: shopnick
>
> If the sql is updated like this, there is no error:
> {code:sql}
> select count(distinct thirdnick)
> from member.qyexternaluserdetailinfo_new
> where (
>   mainshopnick = "xxx"
>   and ownercorpid in ("xxx", "")
>   and shopnick not in ('')
>   and shoptype not in ("35", "56")
>   and isDelete = 0
>   and thirdnick != ""
>   and thirdnick is not null
> );
> {code}
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
> There are 100 rows in this table; 28 contain an empty string.
> table schema like this:
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | name           | type      | comment | primary_key | key_unique | nullable | default_value | encoding      | compression         | block_size |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | mainshopnick   | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shopnick       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | ownercorpid    | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shoptype       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | clientid       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdnick      | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | id             | bigint    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | receivermobile | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdrealname  | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | remark         | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | createtime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | updatetime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | isdelete       | int       |         | false       |            | true     | 0             | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | buyernick      | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
[jira] [Updated] (KUDU-3502) Linker errors on ARM based Macs
[ https://issues.apache.org/jira/browse/KUDU-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3502: - Description: When building Kudu(RELEASE mode) on a Mac M1 machine, I met this linker error: {code:bash} [ 95%] Linking CXX executable ../../../bin/kudu-master Undefined symbols for architecture arm64: "_nghttp2_http2_strerror", referenced from: _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o) "_nghttp2_is_fatal", referenced from: _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) _http2_recv in libcurl.a(libcurl_la-http2.o) _http2_send in libcurl.a(libcurl_la-http2.o) _on_frame_recv in libcurl.a(libcurl_la-http2.o) "_nghttp2_pack_settings_payload", referenced from: _Curl_http2_request_upgrade in libcurl.a(libcurl_la-http2.o) "_nghttp2_priority_spec_init", referenced from: _h2_pri_spec in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_del", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_new", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_error_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_on_begin_headers_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_on_data_chunk_recv_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_on_frame_recv_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_on_header_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_on_stream_close_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_callbacks_set_send_callback", referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_client_new", 
referenced from: _Curl_http2_setup in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_del", referenced from: _http2_disconnect in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_get_remote_settings", referenced from: _on_frame_recv in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_get_stream_user_data", referenced from: _on_frame_recv in libcurl.a(libcurl_la-http2.o) _on_data_chunk_recv in libcurl.a(libcurl_la-http2.o) _on_stream_close in libcurl.a(libcurl_la-http2.o) _on_begin_headers in libcurl.a(libcurl_la-http2.o) _on_header in libcurl.a(libcurl_la-http2.o) _data_source_read_callback in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_mem_recv", referenced from: _h2_process_pending_input in libcurl.a(libcurl_la-http2.o) _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) _http2_recv in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_resume_data", referenced from: _Curl_http2_done_sending in libcurl.a(libcurl_la-http2.o) _http2_send in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_send", referenced from: _Curl_http2_done in libcurl.a(libcurl_la-http2.o) _http2_send in libcurl.a(libcurl_la-http2.o) _h2_session_send in libcurl.a(libcurl_la-http2.o) _http2_conncheck in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_set_local_window_size", referenced from: _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_set_stream_user_data", referenced from: _Curl_http2_done in libcurl.a(libcurl_la-http2.o) _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) _on_frame_recv in libcurl.a(libcurl_la-http2.o) _on_stream_close in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_upgrade", referenced from: _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_want_read", referenced from: _should_close_session in libcurl.a(libcurl_la-http2.o) "_nghttp2_session_want_write", referenced from: _should_close_session in libcurl.a(libcurl_la-http2.o) _http2_getsock in libcurl.a(libcurl_la-http2.o) _http2_perform_getsock in 
libcurl.a(libcurl_la-http2.o) "_nghttp2_strerror", referenced from: _h2_process_pending_input in libcurl.a(libcurl_la-http2.o) _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) _http2_recv in libcurl.a(libcurl_la-http2.o) _http2_conncheck in libcurl.a(libcurl_la-http2.o) "_nghttp2_submit_ping", referenced from: _http2_conncheck in libcurl.a(libcurl_la-http2.o) "_nghttp2_submit_priority", referenced from: _h2_session_send in libcurl.a(libcurl_la-http2.o) "_nghttp2_submit_request", referenced from: _http2_send in libcurl.a(libcurl_la-http2.o) "_nghttp2_submit_rst_stream", referenced from: _Curl_http2_done in
[jira] [Updated] (KUDU-3502) Linker errors on ARM based Macs
[ https://issues.apache.org/jira/browse/KUDU-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

YifanZhang updated KUDU-3502:
-----------------------------
    Summary: Linker errors on ARM based Macs  (was: Linker errors on ARM basedMacs)

> Linker errors on ARM based Macs
> -------------------------------
>
>                 Key: KUDU-3502
>                 URL: https://issues.apache.org/jira/browse/KUDU-3502
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: YifanZhang
>            Priority: Major
>
> When building Kudu (RELEASE mode) on an M1 Mac, I hit this linker error:
> {code:java}
> Undefined symbols for architecture arm64:
>   "_nghttp2_http2_strerror", referenced from:
>       _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o)
>   "_nghttp2_is_fatal", referenced from:
>       _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
>       _http2_recv in libcurl.a(libcurl_la-http2.o)
>       _http2_send in libcurl.a(libcurl_la-http2.o)
>       _on_frame_recv in libcurl.a(libcurl_la-http2.o)
> {code}
[jira] [Created] (KUDU-3502) Linker errors on ARM basedMacs
YifanZhang created KUDU-3502: Summary: Linker errors on ARM basedMacs Key: KUDU-3502 URL: https://issues.apache.org/jira/browse/KUDU-3502 Project: Kudu Issue Type: Bug Reporter: YifanZhang When building Kudu(RELEASE mode) on a Mac M1 machine, I met this linker error: {code:java} Undefined symbols for architecture arm64: "_nghttp2_http2_strerror", referenced from: _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o) "_nghttp2_is_fatal", referenced from: _Curl_http2_switched in libcurl.a(libcurl_la-http2.o) _http2_recv in libcurl.a(libcurl_la-http2.o) _http2_send in libcurl.a(libcurl_la-http2.o) _on_frame_recv in libcurl.a(libcurl_la-http2.o) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-3463) KuduMaster leader consumes too much memory
[ https://issues.apache.org/jira/browse/KUDU-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713068#comment-17713068 ]

YifanZhang commented on KUDU-3463:
----------------------------------
[~weizisheng] The metadata of deleted tables and tablets should be deleted from both memory and disk if 'enable_metadata_cleanup_for_deleted_tables_and_tablets' is set to true. To delete it from disk, we remove the corresponding entries from the sys.catalog table where the metadata is stored: [https://github.com/apache/kudu/blob/a3a7c97be031f8fc32402e430eff1a89c19dbdfb/src/kudu/master/catalog_manager.cc#L6099]

Did you find that the data on disk did not decrease after deleting tables?


> KuduMaster leader consumes too much memory
> ------------------------------------------
>
>                 Key: KUDU-3463
>                 URL: https://issues.apache.org/jira/browse/KUDU-3463
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Weizisheng
>            Priority: Major
>         Attachments: heap321.txt
>
> We recently hit a suspected memory leak on a cluster with 3 masters and 4 tservers, 800 tables and 3000 tablets. The leader master consumes 50GB of memory while the other two only use about a tenth of that...
> Over the last 5 days, the leader's memory usage grew by 3%+.
> Attachment: pprof heap
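The cleanup described in the comment above removes a deleted table's metadata from two places: the master's in-memory maps and the persisted sys.catalog entries. A minimal sketch of that two-sided cleanup, with hypothetical names (this is not the actual CatalogManager API, and plain maps stand in for the real structures):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: models the behavior behind
// --enable_metadata_cleanup_for_deleted_tables_and_tablets -- metadata for a
// deleted table must be dropped from memory AND from the persisted catalog,
// otherwise the leader master's memory keeps growing with dead entries.
public class MetadataCleanupSketch {
    final Map<String, String> tableMetaById = new HashMap<>();  // in-memory state
    final Map<String, String> sysCatalog = new HashMap<>();     // stand-in for on-disk sys.catalog

    void createTable(String tableId, String metadata) {
        tableMetaById.put(tableId, metadata);
        sysCatalog.put(tableId, metadata);
    }

    void cleanupDeletedTable(String tableId) {
        tableMetaById.remove(tableId);  // frees master memory
        sysCatalog.remove(tableId);     // deletes the persisted entry as well
    }
}
```

Dropping only the in-memory copy would shrink the resident footprint but leave the on-disk catalog growing; the linked catalog_manager.cc change deletes the persisted entries too.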
[jira] [Commented] (KUDU-3463) KuduMaster leader consumes too much memory
[ https://issues.apache.org/jira/browse/KUDU-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712196#comment-17712196 ]

YifanZhang commented on KUDU-3463:
----------------------------------
[~weizisheng] This issue seems related to KUDU-3097 and KUDU-3344. The fix for KUDU-3344 will be included in the Kudu 1.17.0 release.

I don't think this is a memory leak in the Kudu leader master; rather, too much table and tablet metadata is retained in memory. Maybe you can try picking the changes from KUDU-3344 and setting --enable_metadata_cleanup_for_deleted_tables_and_tablets=true on the Kudu masters, to see whether memory usage can be reduced.


> KuduMaster leader consumes too much memory
> ------------------------------------------
>
>                 Key: KUDU-3463
>                 URL: https://issues.apache.org/jira/browse/KUDU-3463
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Weizisheng
>            Priority: Major
>         Attachments: heap321.txt
>
> We recently hit a suspected memory leak on a cluster with 3 masters and 4 tservers, 800 tables and 3000 tablets. The leader master consumes 50GB of memory while the other two only use about a tenth of that...
> Over the last 5 days, the leader's memory usage grew by 3%+.
> Attachment: pprof heap
[jira] [Resolved] (KUDU-3451) Memory leak in scan_token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang resolved KUDU-3451.
------------------------------
Fix Version/s: NA
Resolution: Fixed

> Memory leak in scan_token-test
> ------------------------------
>
>                 Key: KUDU-3451
>                 URL: https://issues.apache.org/jira/browse/KUDU-3451
>             Project: Kudu
>          Issue Type: Bug
>          Components: test
>            Reporter: YifanZhang
>            Assignee: Marton Greber
>            Priority: Major
>             Fix For: NA
>
>         Attachments: scan_token-test.txt.gz
>
>
> We have recently seen intermittent failures in scan_token-test; the full test log is attached.
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)
>
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
> {code}
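A leak report of this shape is consistent with a builder that allocates objects with `new` and hands raw-owning pointers to the caller through an out-vector, where some test path never deletes them. The sketch below illustrates that ownership pattern and a `unique_ptr`-based alternative; `ScanToken`, `BuildTokensRaw`, and `BuildTokensOwned` are hypothetical illustrative names, not the actual Kudu client API, and this is not a claim about what the KUDU-3451 fix did.

```cpp
#include <memory>
#include <vector>

// Illustrative stand-in for kudu::client::KuduScanToken; not the real type.
struct ScanToken {
    int id;
};

// Pattern the ASAN trace above is consistent with: the builder allocates
// tokens with `new` and appends raw-owning pointers to the caller's vector.
// If the caller never deletes them, ASAN reports a direct leak at the
// `new` expression (here, 2 objects).
void BuildTokensRaw(std::vector<ScanToken*>* out) {
    for (int i = 0; i < 2; ++i) {
        out->push_back(new ScanToken{i});
    }
}

// A leak-proof alternative: encode ownership in the type, so the tokens
// are released automatically when the vector is destroyed.
void BuildTokensOwned(std::vector<std::unique_ptr<ScanToken>>* out) {
    for (int i = 0; i < 2; ++i) {
        out->push_back(std::make_unique<ScanToken>(ScanToken{i}));
    }
}
```

With the raw-pointer variant, every caller (including each test body) must run an explicit cleanup loop such as `for (ScanToken* t : tokens) delete t;`; with the owned variant no cleanup code is needed, which is why such leaks tend to disappear once ownership is expressed in the type.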
[jira] [Updated] (KUDU-3451) Memory leak in scan_token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Description: We found test failures in scan_token-test sometimes recently. I've attached the full test log.
[jira] [Updated] (KUDU-3451) Memory leak in scan_token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Summary: Memory leak in scan_token-test (was: Memory leak in scan-token-test)
[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Description: We found test failures in scan-token-test sometimes recently. I've attached the full test log.
[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Attachment: scan_token-test.txt.gz
[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Attachment: (was: scan_token-test.txt)
[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test
[ https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3451:
-----------------------------
Attachment: scan_token-test.txt
[jira] [Created] (KUDU-3451) Memory leak in scan-token-test
YifanZhang created KUDU-3451:
--------------------------------

             Summary: Memory leak in scan-token-test
                 Key: KUDU-3451
                 URL: https://issues.apache.org/jira/browse/KUDU-3451
             Project: Kudu
          Issue Type: Bug
          Components: test
            Reporter: YifanZhang

We found test failures in scan-token-test sometimes recently: [http://dist-test.cloudera.org/job?job_id=jenkins-slave.1676788752.1415999]
[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact
[ https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633680#comment-17633680 ] YifanZhang commented on KUDU-3367: -- [~Koppa] [~laiyingchun] Ah, indeed, this GC operation relies on live row counting. I agree that we do need to GC deleted rows on tablets that don't support live row counting. > Delta file with full of delete op can not be schedule to compact > > > Key: KUDU-3367 > URL: https://issues.apache.org/jira/browse/KUDU-3367 > Project: Kudu > Issue Type: New Feature > Components: compaction >Reporter: dengke >Assignee: dengke >Priority: Major > Attachments: image-2022-05-09-14-13-16-525.png, > image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, > image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, > image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, > image-2022-05-09-14-32-51-573.png, image-2022-11-14-11-02-33-685.png > > > If we get a REDO delta file full of delete ops, which means there are no update > ops in the file, the current compaction algorithm will not schedule the file for > compaction. If such files exist and accumulate for a period of time, they will > greatly affect our scan speed. However, processing such files on every > compaction reduces compaction performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KUDU-3384) DRS-level scan optimization leads to failed scans
[ https://issues.apache.org/jira/browse/KUDU-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568882#comment-17568882 ] YifanZhang commented on KUDU-3384: -- The failure occurred at [cfile_set.cc#L445|https://github.com/apache/kudu/blob/dc4031f693382df08c0fab1d0c5ac6bc3c203c35/src/kudu/tablet/cfile_set.cc#L445]: we want to increment the primary key to set a new exclusive upper bound so that it can be used to simplify existing predicates. The boundary case described in this issue was not considered when implementing this optimization; I think we can fall back to not setting a new upper bound if we can't increment the primary key. > DRS-level scan optimization leads to failed scans > - > > Key: KUDU-3384 > URL: https://issues.apache.org/jira/browse/KUDU-3384 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Assignee: YifanZhang >Priority: Major > > Recently, a new DRS-level optimization for scan operations has been > introduced with changelist > [936d7edc4|https://github.com/apache/kudu/commit/936d7edc4e4b69d2e1f1dffc96760cb3fd57a934]. > The newly introduced DRS-level optimization leads to scan failures when all > of the following turns true: > * all the primary key columns are of integer types > * the table has no hash partitioning > * the table contains a row with all primary key columns set to > {{INT\{x}_MAX}} correspondingly > * the scan request is to scan all the table's data > I suspect that some of the conditions above might be relaxed, but I have a > test case that reproduces the issue as described. See [this gerrit review > item|http://gerrit.cloudera.org:8080/18757] for the reproduction scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010)
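A sketch of the fallback suggested in the comment above (a hypothetical illustration only: `IncrementKey` and the plain `std::vector<int64_t>` key representation are assumptions for the example, not Kudu's actual key-encoding code in cfile_set.cc):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Increment a composite all-integer primary key to turn an inclusive bound
// into an exclusive one: add 1 to the least significant key column, carrying
// into more significant columns on overflow. Returns false when every column
// is already at INT64_MAX -- the boundary case in this issue -- so the caller
// can fall back to not setting a new upper bound at all.
bool IncrementKey(std::vector<int64_t>* key) {
  for (auto it = key->rbegin(); it != key->rend(); ++it) {
    if (*it < INT64_MAX) {
      ++(*it);
      return true;
    }
    *it = INT64_MIN;  // wrap this column and carry into the next one
  }
  return false;  // key was (INT64_MAX, ..., INT64_MAX): cannot increment
}
```

For example, incrementing (5, INT64_MAX) yields (6, INT64_MIN), while a key of all INT64_MAX values triggers the fallback.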
[jira] [Assigned] (KUDU-3384) DRS-level scan optimization leads to failed scans
[ https://issues.apache.org/jira/browse/KUDU-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3384: Assignee: YifanZhang > DRS-level scan optimization leads to failed scans > - > > Key: KUDU-3384 > URL: https://issues.apache.org/jira/browse/KUDU-3384 > Project: Kudu > Issue Type: Bug > Components: tserver >Affects Versions: 1.17.0 >Reporter: Alexey Serbin >Assignee: YifanZhang >Priority: Major > > Recently, a new DRS-level optimization for scan operations has been > introduced with changelist > [936d7edc4|https://github.com/apache/kudu/commit/936d7edc4e4b69d2e1f1dffc96760cb3fd57a934]. > The newly introduced DRS-level optimization leads to scan failures when all > of the following turns true: > * all the primary key columns are of integer types > * the table has no hash partitioning > * the table contains a row with all primary key columns set to > {{INT\{x}_MAX}} correspondingly > * the scan request is to scan all the table's data > I suspect that some of the conditions above might be relaxed, but I have a > test case that reproduces the issue as described. See [this gerrit review > item|http://gerrit.cloudera.org:8080/18757] for the reproduction scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KUDU-3306) String column types in range partitions lead to issues while copying tables
[ https://issues.apache.org/jira/browse/KUDU-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3306: Assignee: YifanZhang (was: Mahesh Reddy) > String column types in range partitions lead to issues while copying tables > --- > > Key: KUDU-3306 > URL: https://issues.apache.org/jira/browse/KUDU-3306 > Project: Kudu > Issue Type: Bug > Components: CLI, partition >Reporter: Bankim Bhavsar >Assignee: YifanZhang >Priority: Major > > Range partitions with string column types leads to issues while creating > destination table. > {noformat} > create TABLE test3 ( > created_time STRING PRIMARY KEY > ) > PARTITION BY RANGE (created_time) > ( > PARTITION VALUE = "2020-01-01", > PARTITION VALUE = "2021-01-01" > ) > STORED as kudu; > # kudu table describe master-1 impala::default.test3 > TABLE impala::default.test3 ( > created_time STRING NOT NULL, > PRIMARY KEY (created_time) > ) > RANGE (created_time) ( > PARTITION "2020-01-01" <= VALUES < "2020-01-01\000", > PARTITION "2021-01-01" <= VALUES < "2021-01-01\000" > ) > OWNER root > REPLICAS 3 > # kudu table copy master-1 impala::default.test3 master-1 > -dst_table=kudu_test4 -write_type="" > Invalid argument: Error creating table kudu_test4 on the master: overlapping > range partitions: first range partition: "\000��\004\000\000\000\1" <= > VALUES < "2021-01-01\000", second range partition: > "\000��\004\000\000\000\1" <= VALUES < "2021-01-01\000" > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
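The `\000` suffixes in the `kudu table describe` output above come from turning an inclusive single-value string partition into a half-open range. A minimal sketch of that encoding idea (assuming plain `std::string` bounds, not Kudu's actual partition-key encoding):

```cpp
#include <cassert>
#include <string>

// A single-value string range partition such as PARTITION VALUE = "2020-01-01"
// is representable as "2020-01-01" <= VALUES < "2020-01-01\0": appending a
// zero byte yields the smallest string strictly greater than the value. The
// copy failure above suggests the bounds are mangled when such ranges are
// re-encoded for the destination table rather than passed through verbatim.
std::string ExclusiveUpperBound(const std::string& inclusive_value) {
  return inclusive_value + '\0';
}
```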
[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact
[ https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544095#comment-17544095 ] YifanZhang commented on KUDU-3367: -- Maybe related to KUDU-1625. > Delta file with full of delete op can not be schedule to compact > > > Key: KUDU-3367 > URL: https://issues.apache.org/jira/browse/KUDU-3367 > Project: Kudu > Issue Type: New Feature > Components: compaction >Reporter: dengke >Assignee: dengke >Priority: Major > Attachments: image-2022-05-09-14-13-16-525.png, > image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, > image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, > image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, > image-2022-05-09-14-32-51-573.png > > > If we get a REDO delta file full of delete ops, which means there are no update > ops in the file, the current compaction algorithm will not schedule the file for > compaction. If such files exist and accumulate for a period of time, they will > greatly affect our scan speed. However, processing such files on every > compaction reduces compaction performance. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact
[ https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543808#comment-17543808 ] YifanZhang commented on KUDU-3367: -- I'm curious whether setting `tablet_history_max_age_sec` to a small value is helpful in your case. If so, will DeletedRowsetGCOp be scheduled and empty RowSets be deleted in time? > Delta file with full of delete op can not be schedule to compact > > > Key: KUDU-3367 > URL: https://issues.apache.org/jira/browse/KUDU-3367 > Project: Kudu > Issue Type: New Feature > Components: compaction >Reporter: dengke >Assignee: dengke >Priority: Major > Attachments: image-2022-05-09-14-13-16-525.png, > image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, > image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, > image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, > image-2022-05-09-14-32-51-573.png > > > If we get a REDO delta file full of delete ops, which means there are no update > ops in the file, the current compaction algorithm will not schedule the file for > compaction. If such files exist and accumulate for a period of time, they will > greatly affect our scan speed. However, processing such files on every > compaction reduces compaction performance. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (KUDU-3364) Add TimerThread to ThreadPool to support a category of problem
[ https://issues.apache.org/jira/browse/KUDU-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538251#comment-17538251 ] YifanZhang edited comment on KUDU-3364 at 5/17/22 3:24 PM: --- [~shenxingwuying] I still have some questions about the motivation. {quote}The two ways may conflict with each other as operations race, because the rebalance tool's logic is a little complex and auto rebalance runs on the master. {quote} If we worry that these two ways of rebalancing will interfere with each other, simply disabling auto rebalance and running the rebalance tool may be a solution. You mean we need tools to manually trigger long-running tasks, like rebalancing and compactions. Maybe we can do this with a tool that can be executed asynchronously, like sending an asynchronous RPC or something? Why do we need a TimerThread? was (Author: zhangyifan27): [~shenxingwuying] I still have some questions about the motivation. > Add TimerThread to ThreadPool to support a category of problem > -- > > Key: KUDU-3364 > URL: https://issues.apache.org/jira/browse/KUDU-3364 > Project: Kudu > Issue Type: New Feature >Reporter: shenxingwuying >Assignee: shenxingwuying >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > h1. Scenarios > In general, I am talking about a category of problem. > There are some periodic tasks or automatically triggered scheduling tasks in > Kudu: for example, automatic rebalancing of cluster data, some GC tasks, and > compaction tasks. > They are implemented with a Kudu Thread, std::thread, or ThreadPool, with the > actual task scheduled periodically or triggered by an internal strategy. > They are all internal; we can't control them from outside. > In fact, we need a method under our control to trigger the above types of > actions. > Some scenarios are significant. > Below are examples: > > h2. data rebalance > There are two ways to rebalance: > 1. enable auto rebalance > 2. 
use the rebalance tool, as in 1.14 and before. > The two ways may conflict with each other as operations race, because the > rebalance tool's logic is a little complex and auto rebalance runs on the > master. > In the future, auto rebalance on the master will become very stable and become > the main way to rebalance data. At the same time, admins need an external way > to trigger a rebalance, just like auto rebalance. > But for now, auto rebalance runs in a thread on a fixed time period. > Although we could add an API to MasterService, that API would be synchronous > and very costly; we need an asynchronous method to trigger the rebalance. > h2. auto compaction > Another example is auto compaction. > I have found that the compaction strategy is not always effective, so maybe we > need a method controlled by admin users to trigger compaction. > If we could trigger a RowSetInCompaction, we would not need to restart the kudu > cluster. > h1. My Solution > Add a timer to ThreadPool. The timer is a worker thread that schedules tasks > to the specified thread according to time. > We can allow only SERIAL ThreadPoolTokens to enable the TimerThread. > Pseudo code expresses my intention: > {code:java} > // code placeholder > class TimerThread { > class Task { > ThreadPoolToken token; > std::function f; > }; > > void Schedule(Task task, int delay_ms) { > tasks_.insert(...); > } > void RunLoop() { > while (...) { > SleepFor(100ms); > tasks = FindTasks(); > for (auto task : tasks) { > token = task.token; > token->Submit(task.f); > tasks_.erase... > } > } > } > scoped_refptr thread_; > std::multimap tasks; > }; > class ThreadPool{ > ... > TimerThread* timer_; > ... > }; > class ThreadPoolToken { > void Scheduler(); > };{code} > This scheme is compatible with the previous ThreadPool, and the timer is > nullptr by default. > For periodic tasks, we can use a Control ThreadPool with a timer to refactor > some code to make it clearer, avoiding the problem of too many single-purpose > threads in the past. 
-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (KUDU-3364) Add TimerThread to ThreadPool to support a category of problem
[ https://issues.apache.org/jira/browse/KUDU-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538251#comment-17538251 ] YifanZhang commented on KUDU-3364: -- [~shenxingwuying] I still have some questions about the motivation. > Add TimerThread to ThreadPool to support a category of problem > -- > > Key: KUDU-3364 > URL: https://issues.apache.org/jira/browse/KUDU-3364 > Project: Kudu > Issue Type: New Feature >Reporter: shenxingwuying >Assignee: shenxingwuying >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > h1. Scenarios > In general, I am talking about a category of problem. > There are some periodic tasks or automatically triggered scheduling tasks in > Kudu: for example, automatic rebalancing of cluster data, some GC tasks, and > compaction tasks. > They are implemented with a Kudu Thread, std::thread, or ThreadPool, with the > actual task scheduled periodically or triggered by an internal strategy. > They are all internal; we can't control them from outside. > In fact, we need a method under our control to trigger the above types of > actions. > Some scenarios are significant. > Below are examples: > > h2. data rebalance > There are two ways to rebalance: > 1. enable auto rebalance > 2. use the rebalance tool, as in 1.14 and before. > The two ways may conflict with each other as operations race, because the > rebalance tool's logic is a little complex and auto rebalance runs on the > master. > In the future, auto rebalance on the master will become very stable and become > the main way to rebalance data. At the same time, admins need an external way > to trigger a rebalance, just like auto rebalance. > But for now, auto rebalance runs in a thread on a fixed time period. > Although we could add an API to MasterService, that API would be synchronous > and very costly; we need an asynchronous method to trigger the rebalance. > h2. 
auto compaction > Another example is auto compaction. > I have found that the compaction strategy is not always effective, so maybe we > need a method controlled by admin users to trigger compaction. > If we could trigger a RowSetInCompaction, we would not need to restart the kudu > cluster. > h1. My Solution > Add a timer to ThreadPool. The timer is a worker thread that schedules tasks > to the specified thread according to time. > We can allow only SERIAL ThreadPoolTokens to enable the TimerThread. > Pseudo code expresses my intention: > {code:java} > // code placeholder > class TimerThread { > class Task { > ThreadPoolToken token; > std::function f; > }; > > void Schedule(Task task, int delay_ms) { > tasks_.insert(...); > } > void RunLoop() { > while (...) { > SleepFor(100ms); > tasks = FindTasks(); > for (auto task : tasks) { > token = task.token; > token->Submit(task.f); > tasks_.erase... > } > } > } > scoped_refptr thread_; > std::multimap tasks; > }; > class ThreadPool{ > ... > TimerThread* timer_; > ... > }; > class ThreadPoolToken { > void Scheduler(); > };{code} > This scheme is compatible with the previous ThreadPool, and the timer is > nullptr by default. > For periodic tasks, we can use a Control ThreadPool with a timer to refactor > some code to make it clearer, avoiding the problem of too many single-purpose > threads in the past. -- This message was sent by Atlassian Jira (v8.20.7#820007)
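The pseudocode above can be fleshed out into a small self-contained sketch. This is an assumed design for illustration only: it runs due tasks inline on the timer thread rather than submitting to Kudu's real `ThreadPoolToken`, and uses `std::multimap` keyed by deadline, as the pseudocode suggests:

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <thread>

// A single worker thread holds a time-ordered multimap of tasks and fires
// each one once its deadline has passed. A real implementation would hand
// the task off to a (SERIAL) ThreadPoolToken instead of running it inline.
class TimerThread {
 public:
  TimerThread() : worker_([this] { RunLoop(); }) {}

  ~TimerThread() {
    {
      std::lock_guard<std::mutex> l(mu_);
      stopped_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  void Schedule(std::function<void()> f, int delay_ms) {
    auto when = std::chrono::steady_clock::now() +
                std::chrono::milliseconds(delay_ms);
    {
      std::lock_guard<std::mutex> l(mu_);
      tasks_.emplace(when, std::move(f));
    }
    cv_.notify_one();  // wake the worker in case this deadline is earliest
  }

 private:
  void RunLoop() {
    std::unique_lock<std::mutex> l(mu_);
    while (!stopped_) {
      if (tasks_.empty()) {
        cv_.wait(l);  // nothing scheduled; sleep until Schedule() or stop
        continue;
      }
      auto next = tasks_.begin()->first;
      if (cv_.wait_until(l, next) == std::cv_status::timeout) {
        auto task = std::move(tasks_.begin()->second);
        tasks_.erase(tasks_.begin());
        l.unlock();
        task();  // inline here; the proposal would Submit() to a token
        l.lock();
      }
      // On notify (new task or stop), loop back and re-evaluate.
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::multimap<std::chrono::steady_clock::time_point,
                std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread worker_;  // declared last so the other members exist first
};
```

Waiting on a condition variable instead of the pseudocode's 100 ms polling loop avoids both the wake-up latency and the idle wake-ups of `SleepFor(100ms)`.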
[jira] [Commented] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
[ https://issues.apache.org/jira/browse/KUDU-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516304#comment-17516304 ] YifanZhang commented on KUDU-3354: -- {code:java} I0221 04:22:12.031422 8866 maintenance_manager.cc:382] P c4e995dc9e264d6fbcd01aacff4212bd: Scheduling CompactRowSetsOp(64bf8251dd594197b493a8a5cd2e3e9c): perf score=1279494840443460300019357683777979923760955028182284084266695498710755807790412216956702667800111812783791998911358244116400294695993868288.00 I0221 04:22:12.032940 8806 tablet.cc:1898] T 64bf8251dd594197b493a8a5cd2e3e9c P c4e995dc9e264d6fbcd01aacff4212bd: Compaction resulted in no output rows (all input rows were GCed!) Removing all input rowsets. {code} It seems that the maintenance manager sometimes schedules strange compaction ops, as shown in the log above; these ops block flush ops because the tserver in the test is configured with '--maintenance_manager_num_threads=1' (the default value). > Flaky test: > DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota > -- > > Key: KUDU-3354 > URL: https://issues.apache.org/jira/browse/KUDU-3354 > Project: Kudu > Issue Type: Bug >Reporter: YifanZhang >Priority: Major > Attachments: write_limit-itest.txt > > > The test > `DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota` > sometimes fails (at least in debug mode). The output is: > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229: > Failure > Value of: s.IsIOError() > Actual: false > Expected: true > OK > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429: > Failure > Expected: TestSizeLimit() doesn't generate new fatal failures in the current > thread. > Actual: it does. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
[ https://issues.apache.org/jira/browse/KUDU-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3354: - Attachment: write_limit-itest.txt > Flaky test: > DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota > -- > > Key: KUDU-3354 > URL: https://issues.apache.org/jira/browse/KUDU-3354 > Project: Kudu > Issue Type: Bug >Reporter: YifanZhang >Priority: Major > Attachments: write_limit-itest.txt > > > The test > `DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota` > sometimes fails (at least in debug mode). The output is: > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229: > Failure > Value of: s.IsIOError() > Actual: false > Expected: true > OK > /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429: > Failure > Expected: TestSizeLimit() doesn't generate new fatal failures in the current > thread. > Actual: it does. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
YifanZhang created KUDU-3354: Summary: Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota Key: KUDU-3354 URL: https://issues.apache.org/jira/browse/KUDU-3354 Project: Kudu Issue Type: Bug Reporter: YifanZhang The test `DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota` sometimes fails (at least in debug mode). The output is: /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229: Failure Value of: s.IsIOError() Actual: false Expected: true OK /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429: Failure Expected: TestSizeLimit() doesn't generate new fatal failures in the current thread. Actual: it does. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode
[ https://issues.apache.org/jira/browse/KUDU-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang resolved KUDU-3328. -- Fix Version/s: 1.16.0 Resolution: Fixed > Disable move replicas to tablet servers in maintenance mode > --- > > Key: KUDU-3328 > URL: https://issues.apache.org/jira/browse/KUDU-3328 > Project: Kudu > Issue Type: Improvement >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > Fix For: 1.16.0 > > > When some tablet servers are put in maintenance mode, new replicas are not > expected to be added to these tservers, but we can still run `kudu cluster > rebalance` or `kudu tablet change_config move_replica` to move replicas to > the tservers under maintenance. These operations should be disabled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang resolved KUDU-3346. -- Fix Version/s: 1.16.0 Resolution: Fixed > Rebalance fails when trying to decommission tserver on a rack-aware cluster > --- > > Key: KUDU-3346 > URL: https://issues.apache.org/jira/browse/KUDU-3346 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Georgiana Ogrean >Assignee: YifanZhang >Priority: Major > Fix For: 1.16.0 > > Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z > > > When following the steps [in the > docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html] > for decommissioning a tserver, the rebalance job fails with: > {code:java} > Invalid argument: ignored tserver is not reported among know > tservers > {code} > Steps followed: > 1. Checked that ksck passes. > 2. Put the tserver to be decommissioned in maintenance mode. > {code:java} > sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES > 5ae499b1b870419daabb0e8da90ef233 {code} > 3. Ran rebalance with {{-ignored_tservers}} and > {{-move_replicas_from_ignored_tservers}} flags. > {code:java} > sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES > -move_replicas_from_ignored_tservers > -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code} > The logs for the rebalace command are attached. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3346: Assignee: YifanZhang > Rebalance fails when trying to decommission tserver on a rack-aware cluster > --- > > Key: KUDU-3346 > URL: https://issues.apache.org/jira/browse/KUDU-3346 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Georgiana Ogrean >Assignee: YifanZhang >Priority: Major > Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z > > > When following the steps [in the > docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html] > for decommissioning a tserver, the rebalance job fails with: > {code:java} > Invalid argument: ignored tserver is not reported among know > tservers > {code} > Steps followed: > 1. Checked that ksck passes. > 2. Put the tserver to be decommissioned in maintenance mode. > {code:java} > sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES > 5ae499b1b870419daabb0e8da90ef233 {code} > 3. Ran rebalance with {{-ignored_tservers}} and > {{-move_replicas_from_ignored_tservers}} flags. > {code:java} > sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES > -move_replicas_from_ignored_tservers > -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code} > The logs for the rebalace command are attached. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster
[ https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464941#comment-17464941 ] YifanZhang commented on KUDU-3346: -- I think there is something wrong when populating `ClusterInfo::tservers_to_empty`, because sometimes the `ClusterRawInfo` only contains tservers/tablets info of a specific location. I plan to fix it. > Rebalance fails when trying to decommission tserver on a rack-aware cluster > --- > > Key: KUDU-3346 > URL: https://issues.apache.org/jira/browse/KUDU-3346 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.15.0 >Reporter: Georgiana Ogrean >Priority: Major > Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z > > > When following the steps [in the > docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html] > for decommissioning a tserver, the rebalance job fails with: > {code:java} > Invalid argument: ignored tserver is not reported among know > tservers > {code} > Steps followed: > 1. Checked that ksck passes. > 2. Put the tserver to be decommissioned in maintenance mode. > {code:java} > sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES > 5ae499b1b870419daabb0e8da90ef233 {code} > 3. Ran rebalance with {{-ignored_tservers}} and > {{-move_replicas_from_ignored_tservers}} flags. > {code:java} > sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES > -move_replicas_from_ignored_tservers > -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code} > The logs for the rebalace command are attached. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KUDU-3344) Master could do some garbage collection work in CatalogManagerBgTasks thread
YifanZhang created KUDU-3344: Summary: Master could do some garbage collection work in CatalogManagerBgTasks thread Key: KUDU-3344 URL: https://issues.apache.org/jira/browse/KUDU-3344 Project: Kudu Issue Type: Improvement Components: master Reporter: YifanZhang The Kudu master currently retains all tables'/tablets' metadata in memory and on disk; deleted tables and tablets are marked with the REMOVED/DELETED/REPLACED state but are not really deleted. This could lead to huge memory usage as described in KUDU-3097. I think it's a good idea to clean them up in the CatalogManagerBgTasks thread. But because the data deletion tasks are done asynchronously by tablet servers, it is uncertain when the metadata can be safely deleted. Besides, we could also clean up dead tablet servers from the master's in-memory map in this thread, as I mentioned in KUDU-2915. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (KUDU-3097) whether master load deleted entries into memory could be configuable
[ https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459186#comment-17459186 ] YifanZhang commented on KUDU-3097: -- The [master design doc|https://github.com/apache/kudu/blob/master/docs/design-docs/master.md] mentioned we could have a new background task to clean up 'deleted'-state tables/tablets from the in-memory map and the SysCatalogTable. Is it safe to do that, or why do we need to keep these deleted tables/tablets? > whether master load deleted entries into memory could be configuable > > > Key: KUDU-3097 > URL: https://issues.apache.org/jira/browse/KUDU-3097 > Project: Kudu > Issue Type: New Feature >Reporter: wangningito >Assignee: wangningito >Priority: Major > Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, > set-09.svg > > > The tablet of master is not under control of MVCC. > The deleted entries like table structure, deleted tablet ids would be load > into memory. > For those who use the massive columns or lots of tablets and frequently > switch table, it may result in some unnecessary memory usage. > By the way, the memory usage is different between leader and follower in > master. It may result in imbalance among master cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (KUDU-3097) whether master load deleted entries into memory could be configuable
[ https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458956#comment-17458956 ] YifanZhang commented on KUDU-3097: -- We have also noticed that the memory usage of the Kudu master can sometimes be very high. I collected sampled heap usage, and it shows that the master allocated too much memory for storing `SysTabletEntryPB`. [^set-09.svg] The number of deleted tablets keeps growing in a typical usage scenario where we keep only the latest partitions and delete the historical tablets. Maybe it is not so necessary to keep all tablets (including deleted ones) in memory? > whether master load deleted entries into memory could be configuable > > > Key: KUDU-3097 > URL: https://issues.apache.org/jira/browse/KUDU-3097 > Project: Kudu > Issue Type: New Feature >Reporter: wangningito >Assignee: wangningito >Priority: Major > Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, > set-09.svg > > > The tablet of master is not under control of MVCC. > The deleted entries like table structure, deleted tablet ids would be load > into memory. > For those who use the massive columns or lots of tablets and frequently > switch table, it may result in some unnecessary memory usage. > By the way, the memory usage is different between leader and follower in > master. It may result in imbalance among master cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3097) whether master load deleted entries into memory could be configuable
[ https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3097: - Attachment: set-09.svg > whether master load deleted entries into memory could be configuable > > > Key: KUDU-3097 > URL: https://issues.apache.org/jira/browse/KUDU-3097 > Project: Kudu > Issue Type: New Feature >Reporter: wangningito >Assignee: wangningito >Priority: Major > Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, > set-09.svg > > > The tablet of master is not under control of MVCC. > The deleted entries like table structure, deleted tablet ids would be load > into memory. > For those who use the massive columns or lots of tablets and frequently > switch table, it may result in some unnecessary memory usage. > By the way, the memory usage is different between leader and follower in > master. It may result in imbalance among master cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang resolved KUDU-3341. -- Fix Version/s: 1.16.0 Resolution: Fixed > Catalog Manager should stop retrying DeleteTablet when receive > WRONG_SERVER_UUID error > -- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > Fix For: 1.16.0 > > > Sometimes a tablet server could be shut down because of detected disk > failures, and this server would be re-added to the cluster with all its data > cleared. > Replicas could be re-replicated after > {{--follower_unavailable_considered_failed_sec}} seconds. The master then > sends DeleteTablet RPCs to this tserver, but receives either an RPC > failure (tserver was shut down) or a WRONG_SERVER_UUID error (tserver started > with a new uuid), and keeps retrying to delete tablets after > {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour). > It's not necessary to retry on WRONG_SERVER_UUID errors, because > the server uuid can only be corrected by restarting the tablet server; at > that time, full tablet reports would be sent to the master and any outdated > replicas could finally be deleted. -- This message was sent by Atlassian Jira (v8.20.1#820001)
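The fix described in this ticket boils down to a retry-decision rule. The sketch below illustrates that rule only; the enum values and function name are invented for this example and are not Kudu's actual error codes or catalog-manager API.

```cpp
#include <cassert>

// Illustrative error outcomes for a DeleteTablet RPC, loosely modeled on
// the situations described in KUDU-3341.
enum class TsError { NONE, RPC_FAILURE, WRONG_SERVER_UUID };

// Sketch of the resolved behavior: keep retrying on transient RPC failures
// (the tserver may come back with the same uuid), but give up immediately
// on WRONG_SERVER_UUID, since only a tserver restart -- which triggers a
// full tablet report -- can reconcile the stale replica.
bool ShouldRetryDeleteTablet(TsError err) {
  switch (err) {
    case TsError::RPC_FAILURE:
      return true;   // transient: the same server may become reachable again
    case TsError::WRONG_SERVER_UUID:
      return false;  // retrying cannot succeed; wait for re-registration
    default:
      return false;  // success: nothing left to do
  }
}
```

The design point is that WRONG_SERVER_UUID is a permanent condition from the master's perspective, so retrying for `--unresponsive_ts_rpc_timeout_ms` only wastes RPC traffic.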
[jira] [Assigned] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode
[ https://issues.apache.org/jira/browse/KUDU-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3328: Assignee: YifanZhang > Disable move replicas to tablet servers in maintenance mode > --- > > Key: KUDU-3328 > URL: https://issues.apache.org/jira/browse/KUDU-3328 > Project: Kudu > Issue Type: Improvement >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > > When some tablet servers are put in maintenance mode, new replicas are not > expected to be added to these tservers, but we can still run `kudu cluster > rebalance` or `kudu tablet change_config move_replica` to move replicas to > the tservers under maintenance. These operations should be disabled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3341: - Description: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and if any, outdated replicas could be deleted finally. was: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{--follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and if any, outdated replicas could be deleted finally. 
> Catalog Manager should stop retrying DeleteTablet when receive > WRONG_SERVER_UUID error > -- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). > It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and if any, outdated > replicas could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3341: Assignee: YifanZhang > Catalog Manager should stop retrying DeleteTablet when receive > WRONG_SERVER_UUID error > -- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{--follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). > It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and if any, outdated > replicas could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3341: - Description: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{--follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and if any, outdated replicas could be deleted finally. was: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and outdated replicas could be deleted finally. 
> Catalog Manager should stop retrying DeleteTablet when receive > WRONG_SERVER_UUID error > -- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{--follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). > It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and if any, outdated > replicas could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3341: - Summary: Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error (was: Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR) > Catalog Manager should stop retrying DeleteTablet when receive > WRONG_SERVER_UUID error > -- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). > It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and outdated replicas > could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3341: - Description: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and outdated replicas could be deleted finally. was: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{--follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and outdated replicas could be deleted finally. 
> Catalog Manager should stop retrying DeleteTablet when receive > WRONG_UUID_ERROR > --- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). > It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and outdated replicas > could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR
[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3341: - Component/s: master Description: Sometimes a tablet server could be shutdown because of detected disk failures, and this server would be re-added to the cluster with all data cleared. Replicas could be replicated after {{--follower_unavailable_considered_failed_sec}} seconds. And then master send DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and keep retrying to delete tablets after {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because the server uuid could only be corrected by restarting the tablet server, at that time full tablet reports would sent to master and outdated replicas could be deleted finally. Summary: Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR (was: Catalog Manager should stop retrying DeleteTablet when receive WRON) > Catalog Manager should stop retrying DeleteTablet when receive > WRONG_UUID_ERROR > --- > > Key: KUDU-3341 > URL: https://issues.apache.org/jira/browse/KUDU-3341 > Project: Kudu > Issue Type: Improvement > Components: master >Reporter: YifanZhang >Priority: Minor > > Sometimes a tablet server could be shutdown because of detected disk > failures, and this server would be re-added to the cluster with all data > cleared. > Replicas could be replicated after > {{--follower_unavailable_considered_failed_sec}} seconds. And then master > send DeleteTablet RPCs to this tserver, but receive either a RPC > failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started > with a new uuid), and keep retrying to delete tablets after > {{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour). 
> It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because > the server uuid could only be corrected by restarting the tablet server, at > that time full tablet reports would sent to master and outdated replicas > could be deleted finally. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRON
YifanZhang created KUDU-3341: Summary: Catalog Manager should stop retrying DeleteTablet when receive WRON Key: KUDU-3341 URL: https://issues.apache.org/jira/browse/KUDU-3341 Project: Kudu Issue Type: Improvement Reporter: YifanZhang -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (KUDU-2915) Support to delete dead tservers from CLI
[ https://issues.apache.org/jira/browse/KUDU-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449978#comment-17449978 ] YifanZhang commented on KUDU-2915: -- I think it would be good to introduce a tool to unregister a dead tablet server from the master's in-memory state. On the other hand, I also want to know whether it is safe or reasonable to make the master take the initiative to forget a tablet server that has been in the 'dead' state for 'a long time' with no replicas running on it. If the same tablet server comes back again, the master re-registers it in its in-memory state. Would there be any problems with that? > Support to delete dead tservers from CLI > > > Key: KUDU-2915 > URL: https://issues.apache.org/jira/browse/KUDU-2915 > Project: Kudu > Issue Type: Improvement > Components: CLI, ops-tooling >Affects Versions: 1.10.0 >Reporter: Hexin >Assignee: Hexin >Priority: Major > Labels: supportability > > Sometimes the nodes in the cluster will crash due to machine problems such as > disk corruption, which can be very common. However, if there are some dead > tservers, ksck result will always show error (e.g. Not all Tablet Servers are > reachable) although all tables have recovered to be healthy. > The only way now to get the healthy status of ksck is to restart all masters > one by one. In some cases, for example, if the machine has completely > corrupted, we hope to get healthy status of ksck without restarting, since > after restarting masters the cluster will take some time to recover, during > which it will have influence on scanning or upserting to tables. The recovery > time can be long, which mainly depends on the scale of the cluster. This problem > can be serious and annoying, especially when tservers crash with high frequency > in a large cluster. > It’s valuable if we have an easier way to delete dead tservers from master, I > will support a kudu command to realize it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
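The question raised in the comment above (master automatically forgetting long-dead, empty tservers) can be stated as a small predicate. This is purely an illustrative sketch of that policy; the struct, function, and threshold are hypothetical and do not correspond to actual Kudu master code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of what the master tracks per tablet server.
struct TsDescriptor {
  int64_t seconds_since_heartbeat;  // time since last heartbeat
  int num_live_replicas;            // replicas the master still maps to it
};

// Sketch of the proposed auto-forget policy: only unregister a server that
// is both long-dead and hosts no replicas. If the server ever heartbeats
// again, the normal registration path simply re-adds it, which is why the
// operation is plausibly safe.
bool ShouldForgetDeadTserver(const TsDescriptor& ts,
                             int64_t dead_threshold_secs) {
  return ts.seconds_since_heartbeat > dead_threshold_secs &&
         ts.num_live_replicas == 0;
}
```

The "no replicas" condition matters: forgetting a dead server that is still the last known holder of some replica would lose the master's ability to clean that replica up later.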
[jira] [Created] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode
YifanZhang created KUDU-3328: Summary: Disable move replicas to tablet servers in maintenance mode Key: KUDU-3328 URL: https://issues.apache.org/jira/browse/KUDU-3328 Project: Kudu Issue Type: Improvement Reporter: YifanZhang When some tablet servers are put in maintenance mode, new replicas are not expected to be added to these tservers, but we can still run `kudu cluster rebalance` or `kudu tablet change_config move_replica` to move replicas to the tservers under maintenance. These operations should be disabled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
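The check requested in KUDU-3328 amounts to rejecting a move whose destination is under maintenance. The sketch below shows only the shape of that check; the function and the set of maintenance uuids are illustrative, not Kudu's actual tooling API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Sketch: replica-move tooling (e.g. the rebalancer or a manual
// change_config move) should refuse a destination tserver that the
// operator has placed in maintenance mode.
bool CanMoveReplicaTo(
    const std::string& dst_ts_uuid,
    const std::unordered_set<std::string>& maintenance_ts_uuids) {
  // A tserver in maintenance must not receive new replicas.
  return maintenance_ts_uuids.count(dst_ts_uuid) == 0;
}
```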
[jira] [Commented] (KUDU-2064) Overall log cache usage doesn't respect the limit
[ https://issues.apache.org/jira/browse/KUDU-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351716#comment-17351716 ] YifanZhang commented on KUDU-2064: -- I also found that actual log cache usage exceeded the log_cache_size_limit/global_log_cache_limit in a tserver's mem-tracker page (kudu version 1.12.0): ||Id ||Parent ||Limit ||Current Consumption ||Peak Consumption || |root|none|none|44.97G|76.44G| |block_cache-sharded_lru_cache|root|none|40.01G|40.02G| |server|root|none|2.50G|26.29G| |log_cache|root|1.00G|2.46G|10.89G| |log_cache:adbee30f32664a48bc24f80b1e53d425:cbcc9aa7ac9c4167a7ba0b540c95c83a|log_cache|128.00M|854.01M|858.10M| |log_cache:adbee30f32664a48bc24f80b1e53d425:4b2cbe4fd0d64e7d998a8abddbc1fb47|log_cache|128.00M|793.87M|794.58M| |log_cache:adbee30f32664a48bc24f80b1e53d425:ea0d65bc2f384757b2259a19829fab9c|log_cache|128.00M|254.86M|429.48M| |log_cache:adbee30f32664a48bc24f80b1e53d425:65065df878a64d1bae52fcd0bf6a2e45|log_cache|128.00M|215.48M|392.56M| But the tablet that consumes the largest log cache is TOMBSTONED; I'm not sure whether the cache is actually occupied or the MemTracker was simply not updated. 
I also saw some kernel_stack_watchdog traces in the log: {code:java} W0526 11:35:35.414122 27289 kernel_stack_watchdog.cc:198] Thread 190027 stuck at /home/zhangyifan8/work/kudu-xm/src/kudu/consensus/log.cc:405 for 118ms: Kernel stack: [] futex_wait_queue_me+0xc6/0x130 [] futex_wait+0x17b/0x280 [] do_futex+0x106/0x5a0 [] SyS_futex+0x80/0x180 [] system_call_fastpath+0x1c/0x21 [] 0x User stack: @ 0x7fe923e72370 (unknown) @ 0x2318d54 kudu::RowOperationsPB::~RowOperationsPB() @ 0x20d0300 kudu::tserver::WriteRequestPB::SharedDtor() @ 0x20d37a8 kudu::tserver::WriteRequestPB::~WriteRequestPB() @ 0x2095703 kudu::consensus::ReplicateMsg::SharedDtor() @ 0x209b038 kudu::consensus::ReplicateMsg::~ReplicateMsg() @ 0xc3d617 kudu::consensus::LogCache::EvictSomeUnlocked() @ 0xc3e052 _ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_9consensus8LogCache16AppendOperationsERKSt6vectorI13scoped_refptrINS5_19RefCountedReplicateEESaISA_EERKSt8functionIS4_EEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_ @ 0xc89ea9 kudu::log::Log::AppendThread::HandleBatches() @ 0xc8a7ad kudu::log::Log::AppendThread::ProcessQueue() @ 0x2295cfe kudu::ThreadPool::DispatchThread() @ 0x228ecaf kudu::Thread::SuperviseThread() @ 0x7fe923e6adc5 start_thread @ 0x7fe92214c73d __clone {code} This often happens when there is a large number of write requests and results in slow writes. > Overall log cache usage doesn't respect the limit > - > > Key: KUDU-2064 > URL: https://issues.apache.org/jira/browse/KUDU-2064 > Project: Kudu > Issue Type: Bug > Components: log >Affects Versions: 1.4.0 >Reporter: Jean-Daniel Cryans >Priority: Major > Labels: data-scalability > > Looking at a fairly loaded machine (10TB of data in LBM, close to 10k > tablets), I can see in the mem-trackers page that the log cache is using > 1.83GB, that it peaked at 2.82GB, with a 1GB limit. It's consistent on other > similarly loaded tservers. It's unexpected. > Looking at the per-tablet breakdown, they all have between 0 and a handful of > MBs. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
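The overshoot reported in KUDU-2064 is easier to see with a toy model: if eviction only runs on append and cannot touch entries that are still needed (e.g. by a lagging follower, or held by a stale tracker), usage can sit above the configured limit indefinitely. Everything below is illustrative; `ToyLogCache`, its methods, and the pinning mechanism are invented for this sketch and are not Kudu's `LogCache` API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy per-tserver log cache with a soft global limit.
class ToyLogCache {
 public:
  explicit ToyLogCache(size_t global_limit) : limit_(global_limit) {}

  // Append an op and then try to evict; mirrors an evict-on-append design.
  void Append(size_t op_bytes) {
    ops_.push_back(op_bytes);
    usage_ += op_bytes;
    EvictSomeUnpinned();
  }

  // Mark the n oldest ops as unevictable (e.g. a follower still needs them).
  void PinOldest(size_t n) { pinned_ = n; }

  size_t usage() const { return usage_; }

 private:
  void EvictSomeUnpinned() {
    // Evict the oldest unpinned ops until under the limit -- or until only
    // pinned ops remain, at which point usage stays above the limit.
    while (usage_ > limit_ && ops_.size() > pinned_) {
      usage_ -= ops_[pinned_];
      ops_.erase(ops_.begin() + pinned_);
    }
  }

  size_t limit_;
  std::deque<size_t> ops_;
  size_t usage_ = 0;
  size_t pinned_ = 0;
};
```

With a 100-byte limit, two 60-byte appends stay under the cap because the first op is evicted; but once the remaining ops are pinned, a further append leaves usage at 120 bytes, over the limit, which is the qualitative behavior the mem-tracker table above shows at cluster scale.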
[jira] [Comment Edited] (KUDU-3271) Tablet server crashed when handle scan request
[ https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447 ] YifanZhang edited comment on KUDU-3271 at 4/6/21, 11:34 AM: [~awong] I have attached the INFO log of that day related to the tablet being scanned. It was about 16:34 when the tablet server crashed. At that time a user executed the query `select count(1) from xxx`. An application deletes all records from this table and reloads new data every day. But we failed to reproduce this problem by executing the same query today. :( We set the tserver flag `--tablet_history_max_age_sec=10` because users don't usually need to read historical data. was (Author: zhangyifan27): [~awong] I have attached the INFO log of that day related to the being scanned tablet. It was about 16:34 when the tablet server crashed. At that time a user executed the query `select count(1) from xxx`. An application deletes all records from this table and reloads new data every day. But we failed to reporduce this problem by executing the same query today. > Tablet server crashed when handle scan request > -- > > Key: KUDU-3271 > URL: https://issues.apache.org/jira/browse/KUDU-3271 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: YifanZhang >Priority: Major > Attachments: tablet-52a743.log > > > We found that one of kudu tablet server crashed when handle scan request. The > scanned table didn't have any row operations at that time. This issue only > came up once so far. > Coredump stack is: > {code:java} > Program terminated with signal 11, Segmentation fault. > (gdb) bt > #0 kudu::tablet::DeltaApplier::HasNext (this=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84 > #1 0x02185900 in kudu::UnionIterator::HasNext (this=) > at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051 > #2 0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner > (this=0x4fea140, scanner_id=...) 
at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195 > #3 0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, > __in_chrg=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179 > #4 kudu::tserver::TabletServiceImpl::HandleContinueScanRequest > (this=this@entry=0x60edef0, req=req@entry=0x9582e880, > rpc_context=rpc_context@entry=0x8151d7800, > result_collector=result_collector@entry=0x7f2d721679f0, > has_more_results=has_more_results@entry=0x7f2d721678f9, > error_code=error_code@entry=0x7f2d721678fc) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737 > #5 0x009fb009 in kudu::tserver::TabletServiceImpl::Scan > (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907 > #6 0x0210f019 in operator() (__args#2=0x8151d7800, > __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at > /usr/include/c++/4.8.2/functional:2471 > #7 kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 > #8 0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) > at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 > #9 0x0228ecaf in operator() (this=0xc1a58c28) at > /usr/include/c++/4.8.2/functional:2471 > #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at > /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 > 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 > 0x7f2de4e6873d in clone () from /lib64/libc.so.6 > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KUDU-3271) Tablet server crashed when handle scan request
[ https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447 ] YifanZhang edited comment on KUDU-3271 at 4/6/21, 11:21 AM: [~awong] I have attached the INFO log of that day related to the being scanned tablet. It was about 16:34 when the tablet server crashed. At that time a user executed the query `select count(1) from xxx`. An application deletes all records from this table and reloads new data every day. But we failed to reporduce this problem by executing the same query today. was (Author: zhangyifan27): [~awong] I have attached the INFO log of that day related to the being scanned tablet. It was about 16:34 when the tablet server crashed. At that time a user executed the query `select count(1) from xxx`. But we failed to reporduce this problem by executing the same query today. > Tablet server crashed when handle scan request > -- > > Key: KUDU-3271 > URL: https://issues.apache.org/jira/browse/KUDU-3271 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: YifanZhang >Priority: Major > Attachments: tablet-52a743.log > > > We found that one of kudu tablet server crashed when handle scan request. The > scanned table didn't have any row operations at that time. This issue only > came up once so far. > Coredump stack is: > {code:java} > Program terminated with signal 11, Segmentation fault. > (gdb) bt > #0 kudu::tablet::DeltaApplier::HasNext (this=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84 > #1 0x02185900 in kudu::UnionIterator::HasNext (this=) > at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051 > #2 0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner > (this=0x4fea140, scanner_id=...) 
at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195 > #3 0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, > __in_chrg=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179 > #4 kudu::tserver::TabletServiceImpl::HandleContinueScanRequest > (this=this@entry=0x60edef0, req=req@entry=0x9582e880, > rpc_context=rpc_context@entry=0x8151d7800, > result_collector=result_collector@entry=0x7f2d721679f0, > has_more_results=has_more_results@entry=0x7f2d721678f9, > error_code=error_code@entry=0x7f2d721678fc) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737 > #5 0x009fb009 in kudu::tserver::TabletServiceImpl::Scan > (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907 > #6 0x0210f019 in operator() (__args#2=0x8151d7800, > __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at > /usr/include/c++/4.8.2/functional:2471 > #7 kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 > #8 0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) > at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 > #9 0x0228ecaf in operator() (this=0xc1a58c28) at > /usr/include/c++/4.8.2/functional:2471 > #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at > /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 > 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 > 0x7f2de4e6873d in clone () from /lib64/libc.so.6 > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3271) Tablet server crashed when handle scan request
[ https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447 ] YifanZhang commented on KUDU-3271: -- [~awong] I have attached the INFO log of that day related to the being scanned tablet. It was about 16:34 when the tablet server crashed. At that time a user executed the query `select count(1) from xxx`. But we failed to reporduce this problem by executing the same query today. > Tablet server crashed when handle scan request > -- > > Key: KUDU-3271 > URL: https://issues.apache.org/jira/browse/KUDU-3271 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.12.0 >Reporter: YifanZhang >Priority: Major > Attachments: tablet-52a743.log > > > We found that one of kudu tablet server crashed when handle scan request. The > scanned table didn't have any row operations at that time. This issue only > came up once so far. > Coredump stack is: > {code:java} > Program terminated with signal 11, Segmentation fault. > (gdb) bt > #0 kudu::tablet::DeltaApplier::HasNext (this=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84 > #1 0x02185900 in kudu::UnionIterator::HasNext (this=) > at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051 > #2 0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner > (this=0x4fea140, scanner_id=...) 
at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195 > #3 0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, > __in_chrg=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179 > #4 kudu::tserver::TabletServiceImpl::HandleContinueScanRequest > (this=this@entry=0x60edef0, req=req@entry=0x9582e880, > rpc_context=rpc_context@entry=0x8151d7800, > result_collector=result_collector@entry=0x7f2d721679f0, > has_more_results=has_more_results@entry=0x7f2d721678f9, > error_code=error_code@entry=0x7f2d721678fc) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737 > #5 0x009fb009 in kudu::tserver::TabletServiceImpl::Scan > (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907 > #6 0x0210f019 in operator() (__args#2=0x8151d7800, > __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at > /usr/include/c++/4.8.2/functional:2471 > #7 kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 > #8 0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) > at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 > #9 0x0228ecaf in operator() (this=0xc1a58c28) at > /usr/include/c++/4.8.2/functional:2471 > #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at > /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 > 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 > 0x7f2de4e6873d in clone () from /lib64/libc.so.6 > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3271) Tablet server crashed when handle scan request
[ https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3271:
- Attachment: tablet-52a743.log
[jira] [Created] (KUDU-3271) Tablet server crashed when handle scan request
YifanZhang created KUDU-3271:
Summary: Tablet server crashed when handle scan request
Key: KUDU-3271
URL: https://issues.apache.org/jira/browse/KUDU-3271
Project: Kudu
Issue Type: Bug
Affects Versions: 1.12.0
Reporter: YifanZhang
[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3198:
- Description:
We recently got an error when deleting full rows from a table with 64 columns using Spark SQL; however, if we drop a column from the table, the error does not appear. The error is:
{code:java}
Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: Unknown row operation type (error 0){code}
I tested this by deleting a full row from a table with 64 columns using the Java client 1.12.0/1.13.0. If some columns of the row are set to NULL, I got this error:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE should not have a value for column: c63 STRING NULLABLE (error 0)
{code}
If values are set for all columns, I got an error like:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
{code}
I also tested tables with different numbers of columns. The weird thing is that I could delete full rows from a table with 8/16/32/63/65 columns, but couldn't do so if the table has 64/128 columns.

> Unable to delete a full row from a table with 64 columns when using java client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
> Issue Type: Bug
> Components: java
> Affects Versions: 1.12.0, 1.13.0
> Reporter: YifanZhang
> Priority: Major
[jira] [Comment Edited] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148 ] YifanZhang edited comment on KUDU-3198 at 9/28/20, 10:50 AM:
--
As the Java client allows DELETE operations with extra columns set, while the C++ client doesn't support this, I found there may be a problem in encodeRow: [https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]
The Java API doc of BitSet.clear(int fromIndex, int toIndex) says:
{quote}public void clear(int fromIndex, int toIndex)
Sets the bits from the specified {{fromIndex}} (inclusive) to the specified {{toIndex}} (exclusive) to {{false}}.
{quote}
It seems that the last non-key field would not be cleared. But why does it work well with non-64-column tables?
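The inclusive/exclusive semantics quoted above are easy to check in isolation. The sketch below is not Kudu's actual encodeRow code; the column layout and the off-by-one call are hypothetical, purely to illustrate how an exclusive `toIndex` can leave the last of 64 column bits set:

```java
import java.util.BitSet;

public class BitSetClearDemo {
    public static void main(String[] args) {
        // Hypothetical 64-column schema: bit i set means column i carries a value.
        final int numColumns = 64;
        BitSet columnsBitSet = new BitSet(numColumns);
        columnsBitSet.set(0, numColumns);  // all 64 column bits set

        // Suppose column 0 is the key and we try to clear the non-key columns
        // with an upper bound that forgets toIndex is EXCLUSIVE:
        int firstNonKey = 1;
        columnsBitSet.clear(firstNonKey, numColumns - 1);

        // Bit 63 (the last non-key column, e.g. "c63") is still set:
        System.out.println(columnsBitSet.get(numColumns - 1));  // prints "true"

        // The correct call clears it, because toIndex=64 is exclusive:
        columnsBitSet.clear(firstNonKey, numColumns);
        System.out.println(columnsBitSet.get(numColumns - 1));  // prints "false"
    }
}
```

Note that with exactly 64 columns the highest bit sits on a 64-bit word boundary of the BitSet's backing array, which is one reason a lingering bit 63 could behave differently from the 63- or 65-column cases.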
[jira] [Commented] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148 ] YifanZhang commented on KUDU-3198:
[jira] [Comment Edited] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
[ https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148 ] YifanZhang edited comment on KUDU-3198 at 9/28/20, 10:43 AM:
[jira] [Created] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client
YifanZhang created KUDU-3198:
Summary: Unable to delete a full row from a table with 64 columns when using java client
Key: KUDU-3198
URL: https://issues.apache.org/jira/browse/KUDU-3198
Project: Kudu
Issue Type: Bug
Components: java
Affects Versions: 1.13.0, 1.12.0
Reporter: YifanZhang
[jira] [Closed] (KUDU-2879) Build hangs in DEBUG type on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/KUDU-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang closed KUDU-2879. Resolution: Cannot Reproduce
> Build hangs in DEBUG type on Ubuntu 18.04
> -
>
> Key: KUDU-2879
> URL: https://issues.apache.org/jira/browse/KUDU-2879
> Project: Kudu
> Issue Type: Improvement
> Reporter: Yingchun Lai
> Priority: Major
> Attachments: config.diff, config.log
>
> A few months ago, I reported this issue on Slack: [https://getkudu.slack.com/archives/C0CPXJ3CH/p1549942641041600]
> I switched to the RELEASE build type back then and hadn't tried a DEBUG build in my Ubuntu environment since.
> Now, when I tried a DEBUG build to check 1.10.0-RC2, this issue occurred again.
[jira] [Resolved] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang resolved KUDU-3180.
-- Fix Version/s: 1.13.0 Resolution: Fixed
> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
> Issue Type: Improvement
> Reporter: YifanZhang
> Assignee: YifanZhang
> Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2020-08-04-20-26-53-749.png, image-2020-08-04-20-28-00-665.png
>
> The current time-based flush policy always gives a flush op a high score if we haven't flushed the tablet in a long time, which may lead to starvation of ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, and found that some small MRS/DMS flushes had a higher perf score than big MRS/DMS flushes and compactions, which doesn't seem reasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!
[jira] [Assigned] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3180: Assignee: YifanZhang
[jira] [Comment Edited] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173104#comment-17173104 ] YifanZhang edited comment on KUDU-3180 at 8/10/20, 10:57 AM:
--
If we lower {{-memory_pressure_percentage}}, we should also lower {{-block_cache_capacity_mb}} accordingly, and then we may not make full use of the memory. In fact, for most of the day the memory usage of our Kudu servers is not very high (about 50%), but there are a lot of inserts/updates for an hour or two and memory usage grows significantly. At those times Kudu did prioritize flushing big MRSs/DMSs, but OOM still occurred sometimes, even though we had tuned {{-maintenance_manager_num_threads}} up to 20.
After we tuned {{-flush_threshold_secs}} down to 1800 (it was 3600 before), we could avoid the OOMs, but I found that the {{average_diskrowset_height}} of most tablets became larger, which means those tablets need more compaction. In general we want to prioritize flushes so we can free more memory, but we also don't want to end up with more small DRSs, so prioritizing bigger MRS/DMS flushes would help.
Maybe we could use {{max(memory_size, time_since_last_flush)}} to define the perf improvement of a mem-store flush, so that both big mem-stores and long-lived mem-stores are flushed with priority.
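The tradeoff discussed in this thread can be sketched as a toy scoring function. This is not Kudu's maintenance-manager code; the class name, method names, and sample sizes/ages are hypothetical, only illustrating why a purely time-based score can starve big mem-stores while a memory-weighted score does not:

```java
// Hypothetical flush-candidate scoring, contrasting the two policies
// discussed in the comments above.
final class FlushScore {
    // Time-based policy: the score grows with staleness alone, so a tiny,
    // long-idle MRS can outrank a huge, recently written one.
    static double timeBased(double secsSinceLastFlush) {
        return secsSinceLastFlush;
    }

    // Alternative: weight staleness by anchored memory so big mem-stores are
    // flushed first, while long-lived small ones still age toward a flush.
    static double memoryWeighted(double memBytes, double secsSinceLastFlush) {
        return memBytes * secsSinceLastFlush;
    }

    public static void main(String[] args) {
        // A 1 MB MRS idle for 30 minutes vs. a 512 MB MRS written 1 minute ago.
        double smallOld = timeBased(1800);
        double bigNew = timeBased(60);
        System.out.println(smallOld > bigNew);        // prints "true": small MRS wins

        double smallOldW = memoryWeighted(1 << 20, 1800);
        double bigNewW = memoryWeighted(512L << 20, 60);
        System.out.println(bigNewW > smallOldW);      // prints "true": big MRS wins
    }
}
```

A `max(memory_size, time_since_last_flush)` variant, as suggested in the comment, would need the two operands normalized to comparable units before taking the maximum.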
[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173104#comment-17173104 ] YifanZhang commented on KUDU-3180:
[jira] [Updated] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3180:
- Issue Type: Improvement (was: Bug)
[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172925#comment-17172925 ] YifanZhang commented on KUDU-3180:
--
Thanks [~aserbin]. I agree that using {{memory_size * time_since_last_flush}} instead of just {{time_since_last_flush}} to pick which MRS should be flushed is an easy way to improve the current policy. Also, if we prefer flushes over compactions, the current policy ensures that once an MRS grows over {{flush_threshold_mb}}, a flush is more likely to be selected than a compaction.
[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
[ https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172082#comment-17172082 ] YifanZhang commented on KUDU-3180: -- [~awong] Thanks for your comments! The kudu cluster shown in the screenshot is 1.11.1 version, and I also found mem-stores in a 1.12.0 cluster that anchor 0B WAL on /maintenance-manager page, maybe the log size is less than 1B and this should be a interger so becomes 0B. We mainly want to tradeoff memory used by mem-stores and rowset size on disk. If we flush frequently we could get some small DRSs and need to do more compactions, if we don't flush frequently mem-stores will anchor more memory. So define a cost function based on the time since last flush and memory used might be useful. It's not always true that older or larger mem-stores anchor more WAL bytes as far as I saw on /maintenance-manager page, so maybe we shouldn't always use WAL bytes anchored to determine what to flush. In our cases, we are running low on memory now, I think that is more common than low on WAL disk space because OS allocate memory for various ops. If we want to free more WAL disk space, lower --log_target_replay_size_mb should be effective. > kudu don't always prefer to flush MRS/DMS that anchor more memory > - > > Key: KUDU-3180 > URL: https://issues.apache.org/jira/browse/KUDU-3180 > Project: Kudu > Issue Type: Bug >Reporter: YifanZhang >Priority: Major > Attachments: image-2020-08-04-20-26-53-749.png, > image-2020-08-04-20-28-00-665.png > > > Current time-based flush policy always give a flush op a high score if we > haven't flushed for the tablet in a long time, that may lead to starvation of > ops that could free more memory. > We set -flush_threshold_mb=32, -flush_threshold_secs=1800 in a cluster, and > find that some small MRS/DMS flushes has a higher perf score than big MRS/DMS > flushes and compactions, which seems not so reasonable. 
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327! -- This message was sent by Atlassian Jira (v8.3.4#803005)
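The cost function floated in the comment above could look something like the following sketch. This is a hypothetical illustration, not Kudu's actual maintenance-manager scoring; the function name, weighting, and defaults are assumptions chosen to match the flags mentioned (-flush_threshold_mb=32, -flush_threshold_secs=1800):

```python
# Hypothetical sketch of a flush perf score that combines memory anchored by
# a mem-store with time since the last flush, so a purely time-based score
# cannot starve flushes of large mem-stores. Not Kudu's real implementation.
def flush_perf_score(mem_anchored_bytes, secs_since_flush,
                     flush_threshold_mb=32, flush_threshold_secs=1800):
    """Return a score; higher means flush sooner."""
    # Memory component: how far past the size threshold this mem-store is.
    mem_score = mem_anchored_bytes / (flush_threshold_mb * 1024 * 1024)
    # Time component: ramps up toward the time threshold but only scales the
    # memory component, instead of dominating the score outright.
    time_score = min(secs_since_flush / flush_threshold_secs, 1.0)
    return mem_score * (1.0 + time_score)

# A 64 MiB mem-store flushed a minute ago outranks a 1 MiB mem-store that
# merely hasn't been flushed in an hour.
big_recent = flush_perf_score(64 * 1024 * 1024, 60)
small_old = flush_perf_score(1 * 1024 * 1024, 3600)
assert big_recent > small_old
```

Under this shape of scoring, a small stale mem-store can at most double its memory-based score, which addresses the starvation described in the issue.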
[jira] [Created] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory
YifanZhang created KUDU-3180: Summary: kudu don't always prefer to flush MRS/DMS that anchor more memory Key: KUDU-3180 URL: https://issues.apache.org/jira/browse/KUDU-3180 Project: Kudu Issue Type: Bug Reporter: YifanZhang Attachments: image-2020-08-04-20-26-53-749.png, image-2020-08-04-20-28-00-665.png Current time-based flush policy always give a flush op a high score if we haven't flushed for the tablet in a long time, that may lead to starvation of ops that could free more memory. We set -flush_threshold_mb=32, -flush_threshold_secs=1800 in a cluster, and find that some small MRS/DMS flushes has a higher perf score than big MRS/DMS flushes and compactions, which seems not so reasonable. !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3150) UI tables page sorts tablet count column incorrectly.
[ https://issues.apache.org/jira/browse/KUDU-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139177#comment-17139177 ] YifanZhang commented on KUDU-3150: -- I think this was fixed in 1d1a85804b8ce132021661a8fdb053141c2781c, but wasn't cherry-picked into 1.12. > UI tables page sorts tablet count column incorrectly. > -- > > Key: KUDU-3150 > URL: https://issues.apache.org/jira/browse/KUDU-3150 > Project: Kudu > Issue Type: Bug > Components: ui >Reporter: Grant Henke >Priority: Major > Labels: beginner, supportability > > It looks like the tables page in the master web UI sorts the "Tablet Count" > column incorrectly. I think it must be sorting lexicographically instead of > numerically. This was especially evident when 5.49k tablets was not sorted to > the top in a cluster recently. 
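The lexicographic-vs-numeric sorting bug described in KUDU-3150 is easy to reproduce in a few lines. The column values below are made up for the demo; the parsing helper `to_number` is an illustration, not code from the Kudu web UI:

```python
# Sorting formatted tablet counts as strings orders them lexicographically,
# so "5.49k" does not sort above smaller plain numbers.
counts = ["118", "23", "5.49k", "7"]

# Lexicographic (string) sort compares character by character.
lex = sorted(counts, reverse=True)

# Numeric sort: parse the "k" suffix first, then compare as numbers.
def to_number(s):
    return float(s[:-1]) * 1000 if s.endswith("k") else float(s)

num = sorted(counts, key=to_number, reverse=True)
print(lex)  # ['7', '5.49k', '23', '118']
print(num)  # ['5.49k', '118', '23', '7']
```

The string sort puts "7" first because '7' > '5' > '2' > '1' as characters, which is exactly why 5.49k tablets was not sorted to the top.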
[jira] [Reopened] (KUDU-2879) Build hangs in DEBUG type on Ubuntu 18.04
[ https://issues.apache.org/jira/browse/KUDU-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reopened KUDU-2879: -- I hit this issue again. I can build DEBUG type of kudu 1.12.0 successfully on CentOS 7.3, but when I try to run any binary in build/debug/bin, it somehow hangs. The pstack is: {code:java} #0 0x7febae90ce40 in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:81 #1 0x7febadb08849 in base::internal::SpinLockDelay(int volatile*, int, int) () from /usr/lib64/libprofiler.so.0 #2 0x7febadb087cf in SpinLock::SlowLock() () from /usr/lib64/libprofiler.so.0 #3 0x7febabd1ef08 in tcmalloc::ThreadCache::InitModule() () from /usr/lib64/libtcmalloc.so.4 #4 0x7febabd1effd in tcmalloc::ThreadCache::CreateCacheIfNecessary() () from /usr/lib64/libtcmalloc.so.4 #5 0x7febabd2d325 in tc_calloc () from /usr/lib64/libtcmalloc.so.4 #6 0x7febabaf4550 in _dlerror_run (operate=operate@entry=0x7febabaf3ff0 , args=args@entry=0x7fffc9085dd0) at dlerror.c:141 #7 0x7febabaf4058 in __dlsym (handle=, name=) at dlsym.c:70 #8 0x7febac4925ae in (anonymous namespace)::dlsym_or_die (sym=0x7febac5b29eb "dlopen") at /home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:74 #9 0x7febac4926d2 in (anonymous namespace)::InitIfNecessary () at /home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:100 #10 0x7febac49280a in dl_iterate_phdr (callback=0x7feba9c2f280 <_Unwind_IteratePhdrCallback>, data=0x7fffc9085ed0) at /home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:158 #11 0x7feba9c2fbbf in _Unwind_Find_FDE (pc=0x7feba9c2df87 <_Unwind_Backtrace+55>, bases=bases@entry=0x7fffc9086228) at ../../../libgcc/unwind-dw2-fde-dip.c:461 #12 0x7feba9c2cd2c in uw_frame_state_for (context=context@entry=0x7fffc9086180, fs=fs@entry=0x7fffc9085fd0) at ../../../libgcc/unwind-dw2.c:1245 #13 0x7feba9c2d6ed in uw_init_context_1 (context=context@entry=0x7fffc9086180, outer_cfa=outer_cfa@entry=0x7fffc9086430, 
outer_ra=0x7febadb071da) at ../../../libgcc/unwind-dw2.c:1566 #14 0x7feba9c2df88 in _Unwind_Backtrace (trace=0x7febadb07410, trace_argument=0x7fffc9086430) at ../../../libgcc/unwind.inc:283 #15 0x7febadb071da in ?? () from /usr/lib64/libprofiler.so.0 #16 0x7febadb078e4 in GetStackTrace(void**, int, int) () from /usr/lib64/libprofiler.so.0 #17 0x7febabd1c386 in tcmalloc::PageHeap::GrowHeap(unsigned long) () from /usr/lib64/libtcmalloc.so.4 #18 0x7febabd1c613 in tcmalloc::PageHeap::New(unsigned long) () from /usr/lib64/libtcmalloc.so.4 #19 0x7febabd1b139 in tcmalloc::CentralFreeList::Populate() () from /usr/lib64/libtcmalloc.so.4 #20 0x7febabd1b338 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**) () from /usr/lib64/libtcmalloc.so.4 #21 0x7febabd1b3d0 in tcmalloc::CentralFreeList::RemoveRange(void**, void**, int) () from /usr/lib64/libtcmalloc.so.4 #22 0x7febabd1e2a7 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int) () from /usr/lib64/libtcmalloc.so.4 #23 0x7febabd2ce16 in tcmalloc::allocate_full_malloc_oom(unsigned long) () from /usr/lib64/libtcmalloc.so.4 #24 0x7feba98bdb6d in __fopen_internal (filename=0x7feba8105f37 "/proc/filesystems", mode=0x7feba8105da1 "r", is32=1) at iofopen.c:69 #25 0x7feba80f7956 in selinuxfs_exists () from /usr/lib64/libselinux.so.1 #26 0x7feba80efc28 in init_lib () from /usr/lib64/libselinux.so.1 #27 0x7febb3d69973 in call_init (env=0x7fffc9086828, argv=0x7fffc9086818, argc=1, l=) at dl-init.c:82 #28 _dl_init (main_map=0x7febb3f7d150, argc=1, argv=0x7fffc9086818, env=0x7fffc9086828) at dl-init.c:131 #29 0x7febb3d5b15a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2 #30 0x0001 in ?? () #31 0x7fffc9087195 in ?? () #32 0x in ?? () {code} It works well when built with the RELEASE type, and this didn't happen when building kudu 1.11.1 with the DEBUG type. 
> Build hangs in DEBUG type on Ubuntu 18.04 > - > > Key: KUDU-2879 > URL: https://issues.apache.org/jira/browse/KUDU-2879 > Project: Kudu > Issue Type: Improvement >Reporter: Yingchun Lai >Priority: Major > Attachments: config.diff, config.log > > > A few months ago, I reported this issue on Slack: > [https://getkudu.slack.com/archives/C0CPXJ3CH/p1549942641041600] > I switched to the RELEASE type since then, and haven't tried a DEBUG build in > my Ubuntu environment. > Now, when I tried a DEBUG build to check 1.10.0-RC2, this issue occurred > again. 
[jira] [Commented] (KUDU-3121) Allow users to pick the next best op
[ https://issues.apache.org/jira/browse/KUDU-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106875#comment-17106875 ] YifanZhang commented on KUDU-3121: -- Maybe this issue is similar to KUDU-2824; now we can use the 'kudu table set_flag' tool to give some tables a higher priority in MM compaction. Also, according to [~wdberkeley_impala_f7d4]'s suggestion in [https://gerrit.cloudera.org/c/12852/], we could improve the maintenance manager by accounting for how often tablets are read or written. > Allow users to pick the next best op > > > Key: KUDU-3121 > URL: https://issues.apache.org/jira/browse/KUDU-3121 > Project: Kudu > Issue Type: New Feature > Components: ops-tooling >Reporter: Andrew Wong >Priority: Major > > Time and again, we'll see a case where the maintenance manager scheduler > thread is, for whatever reason, scheduling an op that is actually not that > helpful. KUDU-2929, KUDU-3002, and KUDU-1400 come to mind. > It might be convenient in some cases to temporarily (maybe for a single round > of scheduling) give a specific tablet or op priority. 
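The idea of letting a per-table priority influence which op the scheduler picks next could be sketched as follows. This is a hypothetical illustration in the spirit of the 'kudu table set_flag' suggestion above; the function, tuple layout, and weighting are assumptions, not the maintenance manager's actual scoring code:

```python
# Hypothetical sketch: a user-set table priority nudges the maintenance
# manager's choice of next op by scaling each op's base perf score.
def pick_next_op(ops):
    """ops: list of (name, base_score, table_priority) tuples.
    Returns the op with the highest priority-adjusted score."""
    return max(ops, key=lambda op: op[1] * (1 + op[2]))

ops = [
    ("compact_tablet_a", 10.0, 0),  # default priority
    ("compact_tablet_b", 6.0, 1),   # priority raised by the operator
]
# 6.0 * (1 + 1) = 12.0 beats 10.0 * (1 + 0) = 10.0, so the prioritized
# tablet wins this scheduling round despite its lower base score.
print(pick_next_op(ops)[0])  # compact_tablet_b
```

A multiplicative bump like this lets an operator temporarily favor a tablet without completely overriding the scheduler's own cost model.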
[jira] [Commented] (KUDU-3108) Tablet server crashes when handle diffscan request
[ https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088805#comment-17088805 ] YifanZhang commented on KUDU-3108: -- [~granthenke] Thanks for your reply. The OS version is CentOS 7.3. Some non-default configurations of the tablet server are: {code:java} log_target_replay_size_mb = 128 maintenance_manager_num_threads = 10 maintenance_manager_polling_interval_ms = 50 memory_limit_hard_bytes = 107374182475 memory_limit_soft_percentage = 85 memory_pressure_percentage = 80 num_tablets_to_open_simultaneously = 20 redact = none rpc_authentication = disabled rpc_bind_addresses = 0.0.0.0:14100 rpc_encryption = disabled rpc_num_service_threads = 128 rpc_service_queue_length = 1024 server_thread_pool_max_thread_count = 128 tablet_history_max_age_sec = 10 tablet_transaction_memory_limit_mb = 1024 unlock_experimental_flags = true vmodule = maintenance=2 {code} `tablet_history_max_age_sec` was set to 10. I ran the first full backup job right after setting the tables' history_max_age_sec configuration. The setting seemed to succeed with no timeouts or other errors, and the first full backup jobs succeeded. I ran an incremental backup job on these tables after about a day and a half. The only non-default flag of the backup job is --numParallelBackups 10. I have tried running this incremental backup job once more, and the crash happened again. Because it's a production cluster I didn't try many times. Besides, there are row delete operations on the backed-up tables all the time. I hope the provided information helps. 
> Tablet server crashes when handle diffscan request > --- > > Key: KUDU-3108 > URL: https://issues.apache.org/jira/browse/KUDU-3108 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: YifanZhang >Priority: Major > > When we did an incremental backup for tables in a cluster with 20 tservers, > 3 tservers crashed, coredump stacks are the same: > {code:java} > Unable to find source-code formatter for language: shell. Available languages > are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, > groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, > perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, > yamlProgram terminated with signal 11, Segmentation fault.Program terminated > with signal 11, Segmentation fault. > #0 kudu::Schema::Compare > (this=0x25b883680, lhs=..., rhs=...) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 > 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file > or directory. > Missing separate debuginfos, use: debuginfo-install > bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 > cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 > cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 > elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 > keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 > libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 > libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 > libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 > ncurses-libs-5.9-13.20130511.el7.x86_64 > nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 > openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 > systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 > zlib-1.2.7-17.el7.x86_64 > (gdb) bt > #0 kudu::Schema::Compare > (this=0x25b883680, lhs=..., rhs=...) 
at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 > #1 0x01da51fb in kudu::MergeIterator::RefillHotHeap > (this=this@entry=0x78f6ec500) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720 > #2 0x01da622b in kudu::MergeIterator::AdvanceAndReheap > (this=this@entry=0x78f6ec500, state=0xd1661a000, > num_rows_to_advance=num_rows_to_advance@entry=1) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690 > #3 0x01da7927 in kudu::MergeIterator::MaterializeOneRow > (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, > dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894 > #4 0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, > dst=0x7f0d5cc9ffc0) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796 > #5 0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock > (this=, dst=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499 > #6 0x0095475c in > kudu::tserver::TabletServiceImpl::HandleContinueScanRequest > (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, >
[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request
[ https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3108: - Description: When we did an incremental backup for tables in a cluster with 20 tservers, 3 tservers crashed, coredump stacks are the same: {code:java} Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yamlProgram terminated with signal 11, Segmentation fault.Program terminated with signal 11, Segmentation fault. #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or directory. Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) bt #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 #1 0x01da51fb in kudu::MergeIterator::RefillHotHeap (this=this@entry=0x78f6ec500) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720 #2 0x01da622b in kudu::MergeIterator::AdvanceAndReheap (this=this@entry=0x78f6ec500, state=0xd1661a000, num_rows_to_advance=num_rows_to_advance@entry=1) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690 #3 0x01da7927 in kudu::MergeIterator::MaterializeOneRow (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894 #4 0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, dst=0x7f0d5cc9ffc0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796 #5 0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock (this=, dst=) at /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499 #6 0x0095475c in kudu::tserver::TabletServiceImpl::HandleContinueScanRequest (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565 #7 0x00966564 in kudu::tserver::TabletServiceImpl::HandleNewScanRequest (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, scanner_id=scanner_id@entry=0x7f0d5cca0940, snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476 #8 0x00967f4b in kudu::tserver::TabletServiceImpl::Scan (this=0x53b5a90, req=0x2a15c240, 
resp=0x56f9be6c0, context=0x5e512a460) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674 #9 0x01d2e449 in operator() (__args#2=0x5e512a460, __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at /usr/include/c++/4.8.2/functional:2471 #10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 #11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 #12 0x01e9e924 in operator() (this=0x90fb52e8) at /home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771 #13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657 #14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0 #15 0x7f103956673d in clone () from /lib64/libc.so.6 {code} Before we
[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request
[ https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3108: - Description: When we did an incremental backup for tables in a cluster with 20 tservers, 3 tservers crashed, coredump stacks are the same: {code} Unable to find source-code formatter for language: shell. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yamlProgram terminated with signal 11, Segmentation fault.Program terminated with signal 11, Segmentation fault. #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or directory. Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) bt #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 #1 0x01da51fb in kudu::MergeIterator::RefillHotHeap (this=this@entry=0x78f6ec500) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720 #2 0x01da622b in kudu::MergeIterator::AdvanceAndReheap (this=this@entry=0x78f6ec500, state=0xd1661a000, num_rows_to_advance=num_rows_to_advance@entry=1) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690 #3 0x01da7927 in kudu::MergeIterator::MaterializeOneRow (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894 #4 0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, dst=0x7f0d5cc9ffc0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796 #5 0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock (this=, dst=) at /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499 #6 0x0095475c in kudu::tserver::TabletServiceImpl::HandleContinueScanRequest (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565 #7 0x00966564 in kudu::tserver::TabletServiceImpl::HandleNewScanRequest (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, scanner_id=scanner_id@entry=0x7f0d5cca0940, snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476 #8 0x00967f4b in kudu::tserver::TabletServiceImpl::Scan (this=0x53b5a90, req=0x2a15c240, 
resp=0x56f9be6c0, context=0x5e512a460) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674 #9 0x01d2e449 in operator() (__args#2=0x5e512a460, __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at /usr/include/c++/4.8.2/functional:2471 #10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 #11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 #12 0x01e9e924 in operator() (this=0x90fb52e8) at /home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771 #13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657 #14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0 #15 0x7f103956673d in clone () from /lib64/libc.so.6 {code} was: When we
[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request
[ https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3108: - Summary: Tablet server crashes when handle diffscan request (was: Tablet server crashes when handle scan request ) > Tablet server crashes when handle diffscan request > --- > > Key: KUDU-3108 > URL: https://issues.apache.org/jira/browse/KUDU-3108 > Project: Kudu > Issue Type: Bug >Affects Versions: 1.10.1 >Reporter: YifanZhang >Priority: Major > > When we use KuduBackup Spark job to backup tables in a cluster with 20 > tservers, 3 tservers crashed, coredump stacks are the same: > {code:java} > Program terminated with signal 11, Segmentation fault.Program terminated with > signal 11, Segmentation fault. > #0 kudu::Schema::Compare > (this=0x25b883680, lhs=..., rhs=...) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 > 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file > or directory. > Missing separate debuginfos, use: debuginfo-install > bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 > cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 > cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 > elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 > keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 > libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 > libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 > libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 > ncurses-libs-5.9-13.20130511.el7.x86_64 > nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 > openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 > systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 > zlib-1.2.7-17.el7.x86_64 > (gdb) bt > #0 kudu::Schema::Compare > (this=0x25b883680, lhs=..., rhs=...) 
at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 > #1 0x01da51fb in kudu::MergeIterator::RefillHotHeap > (this=this@entry=0x78f6ec500) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720 > #2 0x01da622b in kudu::MergeIterator::AdvanceAndReheap > (this=this@entry=0x78f6ec500, state=0xd1661a000, > num_rows_to_advance=num_rows_to_advance@entry=1) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690 > #3 0x01da7927 in kudu::MergeIterator::MaterializeOneRow > (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, > dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894 > #4 0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, > dst=0x7f0d5cc9ffc0) at > /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796 > #5 0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock > (this=, dst=) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499 > #6 0x0095475c in > kudu::tserver::TabletServiceImpl::HandleContinueScanRequest > (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, > rpc_context=rpc_context@entry=0x5e512a460, > result_collector=result_collector@entry=0x7f0d5cca0a00, > has_more_results=has_more_results@entry=0x7f0d5cca0886, > error_code=error_code@entry=0x7f0d5cca0888) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565 > #7 0x00966564 in > kudu::tserver::TabletServiceImpl::HandleNewScanRequest > (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, > rpc_context=rpc_context@entry=0x5e512a460, > result_collector=result_collector@entry=0x7f0d5cca0a00, > scanner_id=scanner_id@entry=0x7f0d5cca0940, > snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, > has_more_results=has_more_results@entry=0x7f0d5cca0886, > error_code=error_code@entry=0x7f0d5cca0888) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476 > #8 0x00967f4b 
in kudu::tserver::TabletServiceImpl::Scan > (this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460) at > /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674 > #9 0x01d2e449 in operator() (__args#2=0x5e512a460, > __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at > /usr/include/c++/4.8.2/functional:2471 > #10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 > #11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) > at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 > #12 0x01e9e924 in operator() (this=0x90fb52e8) at >
[jira] [Updated] (KUDU-3108) Tablet server crashes when handle scan request
[ https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3108: - Description: When we use KuduBackup Spark job to backup tables in a cluster with 20 tservers, 3 tservers crashed, coredump stacks are the same: {code:java} Program terminated with signal 11, Segmentation fault.Program terminated with signal 11, Segmentation fault. #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or directory. Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) bt #0 kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267 #1 0x01da51fb in kudu::MergeIterator::RefillHotHeap (this=this@entry=0x78f6ec500) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720 #2 0x01da622b in kudu::MergeIterator::AdvanceAndReheap (this=this@entry=0x78f6ec500, state=0xd1661a000, num_rows_to_advance=num_rows_to_advance@entry=1) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690 #3 0x01da7927 in kudu::MergeIterator::MaterializeOneRow (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894 #4 0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, dst=0x7f0d5cc9ffc0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796 #5 0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock (this=, dst=) at /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499 #6 0x0095475c in kudu::tserver::TabletServiceImpl::HandleContinueScanRequest (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565 #7 0x00966564 in kudu::tserver::TabletServiceImpl::HandleNewScanRequest (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, scanner_id=scanner_id@entry=0x7f0d5cca0940, snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476 #8 0x00967f4b in kudu::tserver::TabletServiceImpl::Scan (this=0x53b5a90, req=0x2a15c240, 
resp=0x56f9be6c0, context=0x5e512a460) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674 #9 0x01d2e449 in operator() (__args#2=0x5e512a460, __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at /usr/include/c++/4.8.2/functional:2471 #10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139 #11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225 #12 0x01e9e924 in operator() (this=0x90fb52e8) at /home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771 #13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657 #14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0 #15 0x7f103956673d in clone () from /lib64/libc.so.6 {code} was: When we use KuduBackup{{}} Spark job to backup tables in a cluster with 20 tservers, 3 tservers crashed, coredump stacks are the same: {code:java} [Thread debugging using libthread_db enabled][Thread debugging using libthread_db enabled]Using host libthread_db library "/lib64/libthread_db.so.1".Missing
[jira] [Created] (KUDU-3108) Tablet server crashes when handle scan request
YifanZhang created KUDU-3108: Summary: Tablet server crashes when handling a scan request Key: KUDU-3108 URL: https://issues.apache.org/jira/browse/KUDU-3108 Project: Kudu Issue Type: Bug Affects Versions: 1.10.1 Reporter: YifanZhang

When we used the KuduBackup Spark job to back up tables in a cluster with 20 tservers, 3 tservers crashed with identical coredump stacks:

{code:java}
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /home/work/app/kudu/zjyprc-hadoop/tablet_server/package/libstdc++.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b3/d9128bcf6786292a339a477953167d0ddab5ba.debug
Core was generated by `/home/work/app/kudu/zjyprc-hadoop/tablet_server/package/kudu_tablet_server -tse'.
Program terminated with signal 11, Segmentation fault.
#0  kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
#1  0x01da51fb in kudu::MergeIterator::RefillHotHeap (this=this@entry=0x78f6ec500) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
#2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap (this=this@entry=0x78f6ec500, state=0xd1661a000, num_rows_to_advance=num_rows_to_advance@entry=1) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
#3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
#4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, dst=0x7f0d5cc9ffc0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
#5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock (this=, dst=) at /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
#6  0x0095475c in kudu::tserver::TabletServiceImpl::HandleContinueScanRequest (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
#7  0x00966564 in kudu::tserver::TabletServiceImpl::HandleNewScanRequest (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, scanner_id=scanner_id@entry=0x7f0d5cca0940, snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
#8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan (this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
#9  0x01d2e449 in operator() (__args#2=0x5e512a460, __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at /usr/include/c++/4.8.2/functional:2471
#10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#12 0x01e9e924 in operator() (this=0x90fb52e8) at
{code}
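The crash occurs in Schema::Compare, invoked from the MergeIterator's heap maintenance (RefillHotHeap and AdvanceAndReheap in frames #1 and #2). For readers unfamiliar with that code path: a merge iterator performs a k-way merge of sorted sub-iterators by keeping them in a min-heap ordered by each sub-iterator's next row. A minimal, self-contained sketch of that pattern follows; the names are hypothetical and this is not the actual Kudu implementation:

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A cursor over one sorted input, analogous to one sub-iterator in the
// merge; next() plays the role the row comparator (Schema::Compare in
// frame #0) consumes when ordering the heap.
struct Cursor {
    const std::vector<int>* rows;
    std::size_t pos;
    int next() const { return (*rows)[pos]; }
};

// Min-heap ordering on each cursor's next row.
struct CursorGreater {
    bool operator()(const Cursor& a, const Cursor& b) const {
        return a.next() > b.next();
    }
};

// Merge several sorted inputs into one sorted output, advancing one
// cursor at a time and re-inserting it into the heap (the same
// advance-and-reheap shape seen in the backtrace).
std::vector<int> MergeSorted(const std::vector<std::vector<int>>& inputs) {
    std::priority_queue<Cursor, std::vector<Cursor>, CursorGreater> heap;
    for (const auto& in : inputs) {
        if (!in.empty()) heap.push(Cursor{&in, 0});
    }
    std::vector<int> out;
    while (!heap.empty()) {
        Cursor c = heap.top();
        heap.pop();
        out.push_back(c.next());
        if (++c.pos < c.rows->size()) heap.push(c);  // re-heap advanced cursor
    }
    return out;
}
```

Because every heap operation calls the row comparator, a comparator whose state is invalid (the implausible `this=0x25b883680` in frame #0 may hint at that) makes the comparison call the place where the segfault surfaces, even if the root cause lies elsewhere.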
[jira] [Updated] (KUDU-3098) leadership change during tablet_copy process may lead to an isolated replica
[ https://issues.apache.org/jira/browse/KUDU-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3098: - Summary: leadership change during tablet_copy process may lead to an isolated replica (was: leader change during 'add_peer' process for a tablet may lead to an isolated replica)

> leadership change during tablet_copy process may lead to an isolated replica
> ---
>
> Key: KUDU-3098
> URL: https://issues.apache.org/jira/browse/KUDU-3098
> Project: Kudu
> Issue Type: Bug
> Components: consensus, master
> Affects Versions: 1.10.1
> Reporter: YifanZhang
> Priority: Major
>
> Lately we found some tablets in a cluster with a very large
> "time_since_last_leader_heartbeat" metric; they are LEARNER/NON_VOTER replicas and
> seemingly cannot become VOTERs for a long time.
> These replicas were created during the rebalance/tablet_copy process. After a new
> copy session began from the leader to the newly added NON_VOTER peer, leadership
> changed and the old leader aborted the uncommitted CHANGE_CONFIG_OP operation.
> Eventually the tablet_copy session ended, but the new leader knew nothing about the
> new peer.
> The master didn't delete this newly added replica because it has a larger
> opid_index than the latest reported committed config. See the comments in
> CatalogManager::ProcessTabletReport:
> {code:java}
> // 5. Tombstone a replica that is no longer part of the Raft config (and
> // not already tombstoned or deleted outright).
> //
> // If the report includes a committed raft config, we only tombstone if
> // the opid_index is strictly less than the latest reported committed
> // config. This prevents us from spuriously deleting replicas that have
> // just been added to the committed config and are in the process of copying.
> {code}
> Maybe we shouldn't use opid_index to determine whether replicas are in the process
> of copying.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
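To make the quoted rule concrete: the master tombstones a reported replica only when the opid_index of its committed config is strictly less than the opid_index of the latest reported committed config. A minimal sketch of that predicate, with hypothetical names rather than the actual CatalogManager code:

```cpp
#include <cstdint>

// Hypothetical sketch of the rule quoted from
// CatalogManager::ProcessTabletReport: tombstone only when the replica's
// reported committed config index is *strictly less than* the latest
// reported committed config index. A replica just added to the config
// (larger opid_index, still copying) is deliberately spared.
bool ShouldTombstone(int64_t replica_committed_opid_index,
                     int64_t latest_reported_committed_opid_index) {
    return replica_committed_opid_index < latest_reported_committed_opid_index;
}
```

Under this rule, a replica created mid-copy carries a larger opid_index than the committed config and so is never tombstoned, which is how the orphaned replica described above survives indefinitely.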
[jira] [Created] (KUDU-3098) leader change during 'add_peer' process for a tablet may lead to an isolated replica
YifanZhang created KUDU-3098: Summary: leader change during 'add_peer' process for a tablet may lead to an isolated replica Key: KUDU-3098 URL: https://issues.apache.org/jira/browse/KUDU-3098 Project: Kudu Issue Type: Bug Components: consensus, master Affects Versions: 1.10.1 Reporter: YifanZhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Attachment: master_leader.log

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.10.1
> Reporter: YifanZhang
> Priority: Major
> Attachments: master_leader.log, ts25.info.gz, ts26.log.gz
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the ksck
> output looks like:
>
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas  | Current term | Config index | Committed?
> ---------------+-----------+--------------+--------------+------------
>  master        | A B C*    |              |              | Yes
>  A             | A B C*    | 5            | -1           | Yes
>  B             | A B C     | 5            | -1           | Yes
>  C             | A B C* D~ | 5            | 54649        | No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas  | Current term | Config index | Committed?
> ---------------+-----------+--------------+--------------+------------
>  master        | A B* C    |              |              | Yes
>  A             | A B* C    | 5            | 5            | Yes
>  B             | A B* C D~ | 5            | 96176        | No
>  C             | A B* C    | 5            | 5            | Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas  | Current term | Config index | Committed?
> ---------------+-----------+--------------+--------------+------------
>  master        | A B C*    |              |              | Yes
>  A             | A B C*    | 1            | -1           | Yes
>  B             | A B C*    | 1            | -1           | Yes
>  C             | A B C* D~ | 1            | 2            | No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas  | Current term | Config index | Committed?
> ---------------+-----------+--------------+--------------+------------
>  master        | A* B C    |              |              | Yes
>  A             | A* B C D~ | 1            | 1991         | No
>  B             | A* B C    | 1            | 4            | Yes
>  C             | A* B C    | 1            | 4            | Yes{code}
> These tablets couldn't recover for a couple of days until we restarted kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Attachment: ts25.info.gz
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Attachment: ts26.log.gz
[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069389#comment-17069389 ] YifanZhang commented on KUDU-3082: -- Unfortunately, most logs were cleaned up due to expiration before I could analyze them. Now I have partial logs for tablet 7404240f458f462d92b6588d07583a52 (full logs on ts26 and partial logs on ts25); I'll attach them in a moment. The logs on ts27, and the leader master logs from before the ts27 restart, were completely cleaned up :( I also kept some fragmented logs from the master, though I'm not sure whether they are helpful. I think the state of ts27 was abnormal when the problem occurred, because some replicas couldn't communicate with their leader on ts27.

{code:java}
I0313 03:50:14.118202 99494 raft_consensus.cc:1149] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [term 2 LEADER]: Rejecting Update request from peer 47af52df1adc47e1903eb097e9c88f2e for earlier term 1. Current term is 2. Ops: []
I0313 03:50:14.250483 56182 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 55445, Time since last communication: 0.000s
I0313 03:50:14.327806 56430 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "d1952499f94a4e6087bee28466fcb09f" member_type: VOTER last_known_addr { host: "kudu-ts25" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 54648, Time since last communication: 0.000s
I0313 03:50:14.330118 56430 consensus_queue.cc:689] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been garbage collected. The follower will never be able to catch up (Not found: Failed to read ops 54649..55444: Segment 157 which contained index 54649 has been GCed)
I0313 03:50:14.330137 56430 consensus_queue.cc:544] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been garbage collected. The replica will never be able to catch up
I0313 03:50:14.335949 99494 consensus_queue.cc:206] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Queue going to LEADER mode. State: All replicated index: 0, Majority replicated index: 55446, Committed index: 55446, Last appended: 2.55446, Last appended by leader: 55445, Current term: 2, Majority size: 2, State: 0, Mode: LEADER, active raft config: opid_index: 55447 OBSOLETE_local: false peers { permanent_uuid: "7380d797d2ea49e88d71091802fb1c81" member_type: VOTER last_known_addr { host: "kudu-ts26" port: 14100 } } peers { permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 } }
I0313 03:50:14.336225 56182 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55447, Last known committed idx: 55446, Time since last communication: 0.000s
W0313 03:50:14.336508 98349 consensus_peers.cc:458] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 -> Peer 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27:14100): Couldn't send request to peer 47af52df1adc47e1903eb097e9c88f2e. Status: Illegal state: Rejecting Update request from peer 7380d797d2ea49e88d71091802fb1c81 for term 2. Could not prepare a single transaction due to: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
{code}

Judging from the above logs on ts26, it rejected the Update request from peer 47af52d, and its own Update requests to that peer also failed. This may mean that the config change operation for replica 47af52d failed but the pending config wasn't cleared. This case may be similar to KUDU-1338.
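The "RaftConfig change currently pending. Only one is allowed at a time" rejection reflects the Raft invariant that a replica allows at most one in-flight configuration change. The hypothesis above, that an aborted change left the pending config set, can be illustrated with a toy guard; the names are hypothetical and this is not Kudu's implementation:

```cpp
// Hypothetical sketch of the invariant behind the log message
// "RaftConfig change currently pending. Only one is allowed at a time."
// If an aborted change never clears pending_, every later attempt is
// rejected -- the wedged state this comment hypothesizes.
class ConfigChangeGuard {
public:
    // Returns false while another change is still pending.
    bool TryBeginChange() {
        if (pending_) return false;  // reject: one change at a time
        pending_ = true;
        return true;
    }
    void CommitChange() { pending_ = false; }
    // Must run when a change is aborted; skipping it wedges the replica.
    void AbortChange() { pending_ = false; }
    bool HasPending() const { return pending_; }

private:
    bool pending_ = false;
};
```

In this model, if AbortChange is never invoked after the failed CHANGE_CONFIG_OP, TryBeginChange keeps returning false and every subsequent update is rejected, matching the repeated log lines above.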
[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067541#comment-17067541 ] YifanZhang commented on KUDU-3082: -- [~aihai] It seems to be a different problem: what I encountered was not a checksum error but a consensus mismatch error.
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Component/s: consensus
[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063045#comment-17063045 ] YifanZhang commented on KUDU-3082: -- Sorry, I forgot to mention: the cluster version is 1.10.1.
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Affects Version/s: 1.10.1
> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.10.1
> Reporter: YifanZhang
> Priority: Major
> [...]
[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang updated KUDU-3082: - Description: Lately we found a few tablets in one of our clusters are unhealthy, the ksck output is like: [...] These tablets couldn't recover for a couple of days until we restart kudu-ts27. I found so many duplicated logs in kudu-ts27 are like: [...] There seems to be some RaftConfig change operations that somehow cannot
[jira] [Created] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time
YifanZhang created KUDU-3082: Summary: tablets in "CONSENSUS_MISMATCH" state for a long time Key: KUDU-3082 URL: https://issues.apache.org/jira/browse/KUDU-3082 Project: Kudu Issue Type: Bug Reporter: YifanZhang Lately we found a few tablets in one of our clusters are unhealthy, the ksck output is like: [...]
[jira] [Created] (KUDU-3069) Support to alter the number of hash buckets for newly added range partitions
YifanZhang created KUDU-3069: Summary: Support to alter the number of hash buckets for newly added range partitions Key: KUDU-3069 URL: https://issues.apache.org/jira/browse/KUDU-3069 Project: Kudu Issue Type: Improvement Components: client, master Reporter: YifanZhang Currently a table in Kudu has an immutable HashBucketSchema once it is created. Sometimes we can't accurately predict the growth of data, and after a period of time the number of hash buckets becomes too small for the data of a time range. Would it be possible to support altering the number of hash buckets for newly added range partitions? It would have no effect on old data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
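The motivation behind this request can be sketched numerically. The following is a standalone back-of-envelope illustration with made-up ingest volumes (not Kudu client code): with a fixed hash-bucket count per range partition, the rows each bucket (i.e. each tablet of the range) must hold grow linearly with data volume, while letting newer range partitions carry more buckets restores the per-tablet load.

```python
def rows_per_bucket(range_rows, num_buckets):
    """Rows each hash bucket (one tablet of the range) must hold."""
    return range_rows // num_buckets

# Hypothetical numbers: ingest per range partition grows 8x over time.
early = rows_per_bucket(80_000_000, 8)       # at table creation
later = rows_per_bucket(640_000_000, 8)      # same 8 buckets after growth
resized = rows_per_bucket(640_000_000, 64)   # a new range with 64 buckets

assert early == 10_000_000
assert later == 80_000_000    # 8x more rows per tablet: scans and compactions suffer
assert resized == 10_000_000  # per-tablet load restored for the new range
```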
[jira] [Comment Edited] (KUDU-2992) Limit concurrent alter request of a table
[ https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984778#comment-16984778 ] YifanZhang edited comment on KUDU-2992 at 12/5/19 10:10 AM: I tried to reproduce this case by deleting many tablets at the same time, such as dropping a big table or some big partitions, and found many logs in the master like:
{code:java}
$ grep "Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b" kudu_master.c3-hadoop-kudu-prc-ct02.bj.work.log.INFO.20191128-213038.14672
I1129 11:21:42.995760 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:43.501857 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:44.394129 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.001634 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.618881 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:46.610380 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.086390 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.972025 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:48.973754 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:49.514094 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:50.040673 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.057112 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.800305 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
{code}
That means the master would send delete tablet requests when receiving reports from 'deleted' tablets; maybe we could do something to prevent this kind of duplicate request.

was (Author: zhangyifan27): I tried to reproduced this case by deleting many tablets at the same time, such as dropping a big table or some big partitions, I found many logs in master are like:
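The deduplication the commenter suggests could, in outline, track which delete requests are already in flight and skip re-sending one for the same replica until the previous attempt resolves. A minimal standalone sketch of that bookkeeping (the class and method names are hypothetical, not the actual catalog manager code):

```python
class DeleteRequestTracker:
    """Skip duplicate DeleteTablet RPCs while one is already in flight."""

    def __init__(self):
        self.in_flight = set()  # (tablet_id, tserver_uuid) pairs

    def should_send(self, tablet_id, tserver_uuid):
        key = (tablet_id, tserver_uuid)
        if key in self.in_flight:
            return False        # a delete for this replica is still pending
        self.in_flight.add(key)
        return True

    def on_response(self, tablet_id, tserver_uuid):
        # Called on success or terminal failure; allows a fresh retry.
        self.in_flight.discard((tablet_id, tserver_uuid))

tracker = DeleteRequestTracker()
assert tracker.should_send("71c50a73", "ts02") is True    # first report: send
assert tracker.should_send("71c50a73", "ts02") is False   # repeated report: skip
tracker.on_response("71c50a73", "ts02")
assert tracker.should_send("71c50a73", "ts02") is True    # retry allowed after ack
```

In the real system the tracking would also need a timeout so a lost response doesn't block deletion forever.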
[jira] [Assigned] (KUDU-2992) Limit concurrent alter request of a table
[ https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-2992: Assignee: YifanZhang
> Limit concurrent alter request of a table
> -
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Reporter: Yingchun Lai
> Assignee: YifanZhang
> Priority: Major
>
> One of our production environment clusters had an accident some days ago. One user has a table whose partition schema looks like:
> {code:java}
> HASH (uuid) PARTITIONS 80,
> RANGE (date_hour) (
>   PARTITION 2019102900 <= VALUES < 2019102901,
>   PARTITION 2019102901 <= VALUES < 2019102902,
>   PARTITION 2019102902 <= VALUES < 2019102903,
>   PARTITION 2019102903 <= VALUES < 2019102904,
>   PARTITION 2019102904 <= VALUES < 2019102905,
>   ...)
> {code}
> He tried to remove many outdated partitions at once via SparkSQL, but it returned a timeout error at first; then he tried again and again, and SparkSQL failed again and again. Then the cluster became unstable, with memory usage and CPU load increasing.
>
> I found many logs like:
> {code:java}
> W1030 17:29:53.382287  7588 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714799 (+     0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714808 (+     9us) service_pool.cc:221] Handling call
> 1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
> 1030 17:29:53.382205 (+     1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382208 (+     3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667360}
> W1030 17:29:53.382300  7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min).
> Client timeout 2 ms (30 s)
> W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 2 ms (30 s)
> W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714825 (+     0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714833 (+     8us) service_pool.cc:221] Handling call
> 1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
> 1030 17:29:53.382241 (+     2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382244 (+     3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667378}
> ...{code}
> That means 'Acquired tablet manager lock' took much time, right?
> {code:java}
> Status TSTabletManager::BeginReplicaStateTransition(
>     const string& tablet_id,
>     const string& reason,
>     scoped_refptr<TabletReplica>* replica,
>     scoped_refptr<TransitionInProgressDeleter>* deleter,
>     TabletServerErrorPB::Code* error_code) {
>   // Acquire the lock in exclusive mode as we'll add an entry to the
>   // transition_in_progress_ map.
>   std::lock_guard lock(lock_);
>   TRACE("Acquired tablet manager lock");
>   RETURN_NOT_OK(CheckRunningUnlocked(error_code));
>   ...
> }{code}
> But I think the root cause is that the Kudu master sends too many duplicate 'alter table/delete tablet' requests to the tserver.
I found more info in master's log: > {code:java} > $ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" > kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | > egrep "attempt = 1\)" > I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for > TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1) > I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for > TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1) > I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for > TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1) > I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for > TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1) > I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for > TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1) > I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of > 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC
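As a sanity check on the trace quoted above, the reported `tablet-delete.queue_time_us` of 213667360 microseconds is indeed the 213667 ms (about 3.56 minutes) in the warning line, hundreds of seconds beyond the 30 s client timeout:

```python
queue_time_us = 213_667_360          # from the Metrics line in the trace
ms = queue_time_us // 1000
minutes = queue_time_us / 1_000_000 / 60

assert ms == 213_667                 # matches "took 213667 ms"
assert round(minutes, 2) == 3.56     # matches "(3.56 min)"
client_timeout_s = 30                # "(30 s)" from the same log line
assert minutes * 60 > 7 * client_timeout_s  # far beyond the client timeout
```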
[jira] [Commented] (KUDU-2992) Limit concurrent alter request of a table
[ https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984778#comment-16984778 ] YifanZhang commented on KUDU-2992: -- I tried to reproduced this case by deleting many tablets at the same time, such as dropping a big table or some big partitions, I found many logs in master are like: [...] That means the master would send delete tablet requests when receiving reports from 'deleted' tablets, maybe we could do something to prevent this kind of duplicated requests.
> Limit concurrent alter request of a table
> -
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
> Issue Type:
[jira] [Assigned] (KUDU-3006) RebalanceIgnoredTserversTest.Basic is flaky
[ https://issues.apache.org/jira/browse/KUDU-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-3006: Assignee: YifanZhang > RebalanceIgnoredTserversTest.Basic is flaky > --- > > Key: KUDU-3006 > URL: https://issues.apache.org/jira/browse/KUDU-3006 > Project: Kudu > Issue Type: Bug >Reporter: Hao Hao >Assignee: YifanZhang >Priority: Minor > Attachments: rebalancer_tool-test.1.txt > > > RebalanceIgnoredTserversTest.Basic of the rebalancer_tool-test sometimes > fails with an error like below. I attached full test log. > {noformat} > /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/rebalancer_tool-test.cc:350: > Failure > Value of: out > Expected: has substring "2dd9365c71c54e5d83294b31046c5478 | 0" > Actual: "Per-server replica distribution summary for tservers_to_empty:\n > Server UUID| Replica > Count\n--+---\n > 2dd9365c71c54e5d83294b31046c5478 | 1\n\nPer-server replica distribution > summary:\n Statistic | > Value\n---+--\n Minimum Replica Count | 0\n > Maximum Replica Count | 1\n Average Replica Count | 0.50\n\nPer-table > replica distribution summary:\n Replica Skew | > Value\n--+--\n Minimum | 1\n Maximum | 1\n > Average | 1.00\n\n\nrebalancing is complete: cluster is balanced > (moved 0 replicas)\n" (of type std::string) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KUDU-3006) RebalanceIgnoredTserversTest.Basic is flaky
[ https://issues.apache.org/jira/browse/KUDU-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979945#comment-16979945 ] YifanZhang commented on KUDU-3006: -- I reproduced this error and printed some logs to debug, some logs were like: {code:java} I1122 15:54:18.806080 28648 rebalancer_tool.cc:190] replacing replicas on healthy ignored tservers I1122 15:54:18.825372 28648 rebalancer_tool.cc:1438] tablet 6274e4d23add454d97ed7b2d7208a097: not considering replicas for movement since the tablet's status is 'CONSENSUS_MISMATCH' {code} That means a replica is not healthy so the rebalancer tool would not move it, I'll try to fix it. > RebalanceIgnoredTserversTest.Basic is flaky > --- > > Key: KUDU-3006 > URL: https://issues.apache.org/jira/browse/KUDU-3006 > Project: Kudu > Issue Type: Bug >Reporter: Hao Hao >Priority: Minor > Attachments: rebalancer_tool-test.1.txt > > > RebalanceIgnoredTserversTest.Basic of the rebalancer_tool-test sometimes > fails with an error like below. I attached full test log. > {noformat} > /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/rebalancer_tool-test.cc:350: > Failure > Value of: out > Expected: has substring "2dd9365c71c54e5d83294b31046c5478 | 0" > Actual: "Per-server replica distribution summary for tservers_to_empty:\n > Server UUID| Replica > Count\n--+---\n > 2dd9365c71c54e5d83294b31046c5478 | 1\n\nPer-server replica distribution > summary:\n Statistic | > Value\n---+--\n Minimum Replica Count | 0\n > Maximum Replica Count | 1\n Average Replica Count | 0.50\n\nPer-table > replica distribution summary:\n Replica Skew | > Value\n--+--\n Minimum | 1\n Maximum | 1\n > Average | 1.00\n\n\nrebalancing is complete: cluster is balanced > (moved 0 replicas)\n" (of type std::string) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables
YifanZhang created KUDU-2986: Summary: Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables Key: KUDU-2986 URL: https://issues.apache.org/jira/browse/KUDU-2986 Project: Kudu Issue Type: Bug Components: client, master, metrics Affects Versions: 1.11.0 Reporter: YifanZhang When we upgraded a cluster with pre-1.11.0 tables, we got inconsistent values for the 'live_row_count' metric of these tables: when visiting masterURL:port/metrics, we got 0 for old tables, and a positive integer for an old table with a newly added partition, which is the count of rows in the newly added partition. When getting table statistics via the `kudu table statistics` CLI tool, we got 0 both for old tables and for the old table with a new partition.
[jira] [Comment Edited] (KUDU-2914) Rebalance tool support moving replicas from some specific tablet servers
[ https://issues.apache.org/jira/browse/KUDU-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941809#comment-16941809 ] YifanZhang edited comment on KUDU-2914 at 10/1/19 1:09 PM: --- Thanks for [~aserbin]'s suggestion, it's very useful. I have two questions. If we let the master replace all replicas at the tablet server, how do we know when the whole replacement process ends? Do we need to keep checking whether all replicas have been removed? And is it possible to mark multiple tablet servers as 'decommissioned'? I mean marking two or more replicas of a tablet with the REPLACE attribute. > Rebalance tool support moving replicas from some specific tablet servers > > > Key: KUDU-2914 > URL: https://issues.apache.org/jira/browse/KUDU-2914 > Project: Kudu > Issue Type: Improvement > Components: CLI >Reporter: YifanZhang >Assignee: YifanZhang >Priority: Minor > > When we need to remove some tservers from a Kudu cluster (maybe just to save > resources or to replace these servers with new ones), it's better to move all > replicas on these tservers to other tservers in the cluster in advance, > instead of waiting for the replicas to be kicked out and new ones to be > created elsewhere. This can be achieved by the rebalance tool supporting a > 'blacklist_tservers' option.
[jira] [Commented] (KUDU-2914) Rebalance tool support moving replicas from some specific tablet servers
[ https://issues.apache.org/jira/browse/KUDU-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941809#comment-16941809 ] YifanZhang commented on KUDU-2914: -- Thanks for [~aserbin]'s suggestions, they're very useful. I have two questions. If we let the master replace all replicas at the tablet server, how do we know when the whole replacement process ends? Do we need to keep checking whether all replicas have been removed? And is it possible to mark multiple tablet servers as 'decommissioned'? I mean marking two or more replicas of a tablet with the REPLACE attribute.
[jira] [Assigned] (KUDU-2934) Bad merge behavior for some metrics
[ https://issues.apache.org/jira/browse/KUDU-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YifanZhang reassigned KUDU-2934: Assignee: YifanZhang > Bad merge behavior for some metrics > --- > > Key: KUDU-2934 > URL: https://issues.apache.org/jira/browse/KUDU-2934 > Project: Kudu > Issue Type: Bug > Components: metrics >Affects Versions: 1.11.0 >Reporter: Yingchun Lai >Assignee: YifanZhang >Priority: Minor > > We added a feature to merge metrics in commit > fe6e5cc0c9c1573de174d1ce7838b449373ae36e ([metrics] Merge metrics by the same > attribute). For AtomicGauge-type metrics, we sum up the merged metrics, which > works for almost all metrics in Kudu. > But I found a metric that cannot be merged this simply, namely > "average_diskrowset_height", because it's an "average" value.