[jira] [Assigned] (KUDU-3567) Resource leakage related to HashedWheelTimer in AsyncKuduScanner

2024-06-13 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3567:


Assignee: YifanZhang

> Resource leakage related to HashedWheelTimer in AsyncKuduScanner
> 
>
> Key: KUDU-3567
> URL: https://issues.apache.org/jira/browse/KUDU-3567
> Project: Kudu
>  Issue Type: Bug
>  Components: client, java
>Affects Versions: 1.18.0
>Reporter: Alexey Serbin
>Assignee: YifanZhang
>Priority: Major
>
> With KUDU-3498 implemented in 
> [8683b8bdb|https://github.com/apache/kudu/commit/8683b8bdb675db96aac52d75a31d00232f7b9fb8],
> resource leak reports now show up; see below.
> Overall, the way how {{HashedWheelTimer}} is used for keeping scanners alive 
> is in direct contradiction with the recommendation at [this documentation 
> page|https://netty.io/4.1/api/io/netty/util/HashedWheelTimer.html]:
> {quote}*Do not create many instances.*
> HashedWheelTimer creates a new thread whenever it is instantiated and 
> started. Therefore, you should make sure to create only one instance and 
> share it across your application. One of the common mistakes, that makes your 
> application unresponsive, is to create a new instance for every connection.
> {quote}
> A better way of implementing the keep-alive feature for scanner objects in 
> the Kudu Java client would probably be to reuse the {{HashedWheelTimer}} 
> instance from the corresponding {{AsyncKuduClient}} instance rather than 
> creating a new timer instance (along with its worker thread) per 
> {{AsyncKuduScanner}} object.  At the very least, the {{HashedWheelTimer}} 
> instance should be properly released/shut down to avoid leaking resources (a 
> running thread?) when {{AsyncKuduScanner}} objects are garbage-collected.
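> For illustration only, here is a minimal sketch of the shared-timer pattern 
> the Netty documentation recommends. It relies only on Netty's public 
> {{Timer}} API; the {{SharedKeepAliveTimer}} helper and its methods are 
> hypothetical and are not part of the Kudu Java client:
> {code:java}
> import java.util.concurrent.TimeUnit;
> 
> import io.netty.util.HashedWheelTimer;
> import io.netty.util.Timeout;
> import io.netty.util.Timer;
> import io.netty.util.TimerTask;
> 
> // Hypothetical helper: one timer (and therefore one worker thread) shared by
> // all scanners, instead of one HashedWheelTimer per AsyncKuduScanner.
> final class SharedKeepAliveTimer {
>   private static final Timer TIMER = new HashedWheelTimer();
> 
>   // Schedule a single keep-alive task; the caller re-arms it as needed
>   // (e.g. the task could invoke scanner.keepAlive() and reschedule itself).
>   static Timeout schedule(TimerTask task, long delayMs) {
>     return TIMER.newTimeout(task, delayMs, TimeUnit.MILLISECONDS);
>   }
> 
>   // Stop the shared timer exactly once on shutdown so its worker thread is
>   // released and the timer is not reported as leaked.
>   static void shutdown() {
>     TIMER.stop();
>   }
> }
> {code}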
> Below is how the leak is reported when running 
> {{TestKuduClient.testStrings}}:
> {noformat}
> 23:04:57.774 [ERROR - main] (ResourceLeakDetector.java:327) LEAK: 
> HashedWheelTimer.release() was not called before it's garbage-collected. See 
> https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records:
> Created at:
>   io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:312)
>   io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:251)
>   io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:224)
>   io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:203)
>   io.netty.util.HashedWheelTimer.(HashedWheelTimer.java:185)
>   org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:296)
>   org.apache.kudu.client.AsyncKuduScanner.(AsyncKuduScanner.java:431)
>   
> org.apache.kudu.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:260)
>   org.apache.kudu.client.TestKuduClient.testStrings(TestKuduClient.java:692)
>   sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   java.lang.reflect.Method.invoke(Method.java:498)
>   
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>  
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-03 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3564:
-
Description: 
Repro steps copied from the Slack channel:
 
{code:sql}
// create the table and data in Impala:
CREATE TABLE age_table
(
id BIGINT,
name STRING,
age INT,
PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
HASH (name) PARTITIONS 4,
range (age)
( 
PARTITION 30 <= VALUES < 60,
PARTITION 60 <= VALUES < 90
) 
STORED AS KUDU 
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;


insert into age_table values (3, 'alex', 50);
insert into age_table values (12, 'bob', 100);

// only with an IN predicate, data in the range with custom hash partitioning cannot be found:
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds {code}
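
For comparison, the same IN-list scan can be issued through the Kudu Java client. This is a sketch of the repro only (not a fix): the master address and the Kudu-side table name below are assumptions to adjust for the actual cluster; for a table created from Impala the underlying Kudu table is typically named {{impala::default.age_table}}.
{code:java}
import java.util.Arrays;

import org.apache.kudu.Schema;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class InListScanRepro {
  public static void main(String[] args) throws Exception {
    // Assumed master address and table name; adjust for the actual cluster.
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    try {
      KuduTable table = client.openTable("impala::default.age_table");
      Schema schema = table.getSchema();
      // Project id and age, and push down the predicate: id IN (3, 20).
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Arrays.asList("id", "age"))
          .addPredicate(KuduPredicate.newInListPredicate(
              schema.getColumn("id"), Arrays.asList(3L, 20L)))
          .build();
      while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
          System.out.println(row.rowToString());
        }
      }
    } finally {
      client.close();
    }
  }
}
{code}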

  was:
Repro steps copied from the Slack channel:
 
create the table and data in Impala:
// create the table and data in Impala:
CREATE TABLE age_table
(
  id BIGINT,
  name STRING,
  age INT,
  PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
 HASH (name) PARTITIONS 4,
range (age)
(
  PARTITION 30 <= VALUES < 60,
  PARTITION 60 <= VALUES < 90
)   
STORED AS KUDU  
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;

insert into age_table values (3, 'alex',  50);
insert into age_table values (12, 'bob',  100);

// only with an IN predicate, data in the range with custom hash partitioning cannot be found:
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds


> Range specific hashing table when queried with InList predicate may lead to 
> incorrect results
> -
>
> Key: KUDU-3564
> URL: https://issues.apache.org/jira/browse/KUDU-3564
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: YifanZhang
>Priority: Major
>
> Repro steps copied from the Slack channel:
>  
> {code:sql}
> // create the table and data in Impala:
> CREATE TABLE age_table
> (
> id BIGINT,
> name STRING,
> age INT,
> PRIMARY KEY(id,name,age)
> )
> PARTITION BY HASH (id) PARTITIONS 4,
> HASH (name) PARTITIONS 4,
> range (age)
> ( 
> PARTITION 30 <= VALUES < 60,
> PARTITION 60 <= VALUES < 90
> ) 
> STORED AS KUDU 
> TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');
> ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
> HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;
> insert into age_table values (3, 'alex', 50);
> insert into age_table values (12, 'bob', 100);
> // only with an IN predicate, data in the range with custom hash partitioning cannot be found:
> sudo -u kudu kudu table scan  default.age_table -columns=id,age 
> -predicates='["AND", ["IN", "id", [3,20]]]'
> (int64 id=3, int32 age=50)
> Total count 1 cost 0.0178102 seconds {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3564) Range specific hashing table when queried with InList predicate may lead to incorrect results

2024-04-03 Thread YifanZhang (Jira)
YifanZhang created KUDU-3564:


 Summary: Range specific hashing table when queried with InList 
predicate may lead to incorrect results
 Key: KUDU-3564
 URL: https://issues.apache.org/jira/browse/KUDU-3564
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: YifanZhang


Repro steps copied from the Slack channel:
 
create the table and data in Impala:
// create the table and data in Impala:
CREATE TABLE age_table
(
  id BIGINT,
  name STRING,
  age INT,
  PRIMARY KEY(id,name,age)
)
PARTITION BY HASH (id) PARTITIONS 4,
 HASH (name) PARTITIONS 4,
range (age)
(
  PARTITION 30 <= VALUES < 60,
  PARTITION 60 <= VALUES < 90
)   
STORED AS KUDU  
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

ALTER TABLE age_table ADD RANGE PARTITION 90<= VALUES <120
HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3;

insert into age_table values (3, 'alex',  50);
insert into age_table values (12, 'bob',  100);

// only with an IN predicate, data in the range with custom hash partitioning cannot be found:
sudo -u kudu kudu table scan  default.age_table -columns=id,age 
-predicates='["AND", ["IN", "id", [3,20]]]'
(int64 id=3, int32 age=50)
Total count 1 cost 0.0178102 seconds



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KUDU-3518) node error when impala query

2023-10-24 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778944#comment-17778944
 ] 

YifanZhang edited comment on KUDU-3518 at 10/24/23 6:17 AM:


I see in profile_error_1.17.txt:
{code:java}
00:SCAN KUDU [member.qyexternaluserdetailinfo_new]
   predicates: shoptype NOT IN (35, 56), thirdnick != 

   kudu predicates: thirdnick IS NOT NULL, isDelete = CAST(0 AS INT), 
ownercorpid IN (x, ), mainshopnick = 
x
   mem-estimate=3.00MB mem-reservation=0B thread-reservation=1
   tuple-ids=0 row-size=32B cardinality=0
   in pipelines: 00(GETNEXT) {code}
IIUC, "kudu predicates" are the predicates that get pushed down to the Kudu 
scanner, and the column 'shopnick' is not among them. Since 'shopnick' is 
neither a predicate column nor a projected column, it shouldn't be scanned or 
used in query execution at all, so it's quite strange that the error refers to 
this column: "Invalid argument: No such 
column: shopnick".

This looks more like an Impala issue than a Kudu issue. You could try creating 
an empty table to check whether the error is related to the table schema:
{code:sql}
create table new_empty_table like member.qyexternaluserdetailinfo_new;
-- query new_empty_table to see if the error happens{code}


was (Author: zhangyifan27):
I see in profile_error_1.17.txt:
{code:java}
00:SCAN KUDU [member.qyexternaluserdetailinfo_new]
   predicates: shoptype NOT IN (35, 56), thirdnick != 

   kudu predicates: thirdnick IS NOT NULL, isDelete = CAST(0 AS INT), 
ownercorpid IN (x, ), mainshopnick = 
x
   mem-estimate=3.00MB mem-reservation=0B thread-reservation=1
   tuple-ids=0 row-size=32B cardinality=0
   in pipelines: 00(GETNEXT) {code}
IIUC, "kudu predicates" are the predicates that get pushed down to the Kudu 
scanner, and the column 'shopnick' is not among them. Since 'shopnick' is 
neither a predicate column nor a projected column, it shouldn't be scanned or 
used in query execution at all, so it's quite strange that the error refers to 
this column: "Invalid argument: No such 
column: shopnick".

 

> node error when impala query
> 
>
> Key: KUDU-3518
> URL: https://issues.apache.org/jira/browse/KUDU-3518
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
> Environment: centos7.9
>Reporter: Pain Sun
>Priority: Major
> Attachments: profile_error_1.17.txt, profile_success_1.16.txt, 
> profile_success_1.17.txt
>
>
> When scanning Kudu with impala-4.3.0, there is a bug when reading a table 
> with an empty string in a primary key field.
> sql:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
>  
> error:ERROR: Unable to open scanner for node with id '1' for Kudu table 
> 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such 
> column: shopnick
>  
> If the SQL is updated like this:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shopnick not in ('')
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
> no error.
>  
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
>  
> There are 100 rows in this table; 28 of them have an empty string.
> The table schema is like this:
> ++---+-+-++--+---+---+-++
> | name           | type      | comment | primary_key | key_unique | nullable 
> | default_value | encoding      | compression         | block_size |
> ++---+-+-++--+---+---+-++
> | mainshopnick   | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shopnick       | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | ownercorpid    | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shoptype       | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | clientid       | 

[jira] [Commented] (KUDU-3518) node error when impala query

2023-10-24 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778944#comment-17778944
 ] 

YifanZhang commented on KUDU-3518:
--

I see in profile_error_1.17.txt:
{code:java}
00:SCAN KUDU [member.qyexternaluserdetailinfo_new]
   predicates: shoptype NOT IN (35, 56), thirdnick != 

   kudu predicates: thirdnick IS NOT NULL, isDelete = CAST(0 AS INT), 
ownercorpid IN (x, ), mainshopnick = 
x
   mem-estimate=3.00MB mem-reservation=0B thread-reservation=1
   tuple-ids=0 row-size=32B cardinality=0
   in pipelines: 00(GETNEXT) {code}
IIUC, "kudu predicates" are the predicates that get pushed down to the Kudu 
scanner, and the column 'shopnick' is not among them. Since 'shopnick' is 
neither a predicate column nor a projected column, it shouldn't be scanned or 
used in query execution at all, so it's quite strange that the error refers to 
this column: "Invalid argument: No such 
column: shopnick".

 

> node error when impala query
> 
>
> Key: KUDU-3518
> URL: https://issues.apache.org/jira/browse/KUDU-3518
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
> Environment: centos7.9
>Reporter: Pain Sun
>Priority: Major
> Attachments: profile_error_1.17.txt, profile_success_1.16.txt, 
> profile_success_1.17.txt
>
>
> When scanning Kudu with impala-4.3.0, there is a bug when reading a table 
> with an empty string in a primary key field.
> sql:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
>  
> error:ERROR: Unable to open scanner for node with id '1' for Kudu table 
> 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such 
> column: shopnick
>  
> If the SQL is updated like this:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shopnick not in ('')
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
> no error.
>  
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
>  
> There are 100 rows in this table; 28 of them have an empty string.
> The table schema is like this:
> ++---+-+-++--+---+---+-++
> | name           | type      | comment | primary_key | key_unique | nullable 
> | default_value | encoding      | compression         | block_size |
> ++---+-+-++--+---+---+-++
> | mainshopnick   | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shopnick       | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | ownercorpid    | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shoptype       | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | clientid       | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdnick      | string    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | id             | bigint    |         | true        | true       | false    
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | receivermobile | string    |         | false       |            | true     
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdrealname  | string    |         | false       |            | true     
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | remark         | string    |         | false       |            | true     
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | createtime     | timestamp |         | false       |            | true     
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | updatetime     | timestamp |         | false       |            | true     
> |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | isdelete     

[jira] [Commented] (KUDU-3518) node error when impala query

2023-10-16 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775760#comment-17775760
 ] 

YifanZhang commented on KUDU-3518:
--

[~MadBeeDo] Does this issue only affect this specific table? Is it possible to 
reproduce it again?

> node error when impala query
> 
>
> Key: KUDU-3518
> URL: https://issues.apache.org/jira/browse/KUDU-3518
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
> Environment: centos7.9
>Reporter: Pain Sun
>Priority: Major
>
> When scanning Kudu with impala-4.3.0, there is a bug when reading a table 
> with an empty string in a primary key field.
> sql:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
>  
> error:ERROR: Unable to open scanner for node with id '1' for Kudu table 
> 'impala::member.qyexternaluserdetailinfo_new': Invalid argument: No such 
> column: shopnick
>  
> If the SQL is updated like this:
> select
>     count(distinct thirdnick)
> from
>     member.qyexternaluserdetailinfo_new
> where
>     (
>         mainshopnick = "xxx"
>         and ownercorpid in ("xxx", "")
>         and shopnick not in ('')
>         and shoptype not in ("35", "56")
>         and isDelete = 0
>         and thirdnick != ""
>         and thirdnick is not null
>     );
> no error.
>  
> This error appears in kudu-1.17.0, but kudu-1.16.0 is fine.
>  
> There are 100 rows in this table; 28 of them have an empty string.
> The table schema is like this:
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | name           | type      | comment | primary_key | key_unique | nullable | default_value | encoding      | compression         | block_size |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+
> | mainshopnick   | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shopnick       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | ownercorpid    | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | shoptype       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | clientid       | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdnick      | string    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | id             | bigint    |         | true        | true       | false    |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | receivermobile | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | thirdrealname  | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | remark         | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | createtime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | updatetime     | timestamp |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | isdelete       | int       |         | false       |            | true     | 0             | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> | buyernick      | string    |         | false       |            | true     |               | AUTO_ENCODING | DEFAULT_COMPRESSION | 0          |
> +----------------+-----------+---------+-------------+------------+----------+---------------+---------------+---------------------+------------+



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KUDU-3502) Linker errors on ARM based Macs

2023-08-09 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3502:
-
Description: 
When building Kudu (RELEASE mode) on a Mac M1 machine, I hit this linker error:
{code:bash}
[ 95%] Linking CXX executable ../../../bin/kudu-master
Undefined symbols for architecture arm64:
  "_nghttp2_http2_strerror", referenced from:
      _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_is_fatal", referenced from:
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
      _http2_recv in libcurl.a(libcurl_la-http2.o)
      _http2_send in libcurl.a(libcurl_la-http2.o)
      _on_frame_recv in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_pack_settings_payload", referenced from:
      _Curl_http2_request_upgrade in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_priority_spec_init", referenced from:
      _h2_pri_spec in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_del", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_new", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_error_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_on_begin_headers_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_on_data_chunk_recv_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_on_frame_recv_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_on_header_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_on_stream_close_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_callbacks_set_send_callback", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_client_new", referenced from:
      _Curl_http2_setup in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_del", referenced from:
      _http2_disconnect in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_get_remote_settings", referenced from:
      _on_frame_recv in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_get_stream_user_data", referenced from:
      _on_frame_recv in libcurl.a(libcurl_la-http2.o)
      _on_data_chunk_recv in libcurl.a(libcurl_la-http2.o)
      _on_stream_close in libcurl.a(libcurl_la-http2.o)
      _on_begin_headers in libcurl.a(libcurl_la-http2.o)
      _on_header in libcurl.a(libcurl_la-http2.o)
      _data_source_read_callback in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_mem_recv", referenced from:
      _h2_process_pending_input in libcurl.a(libcurl_la-http2.o)
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
      _http2_recv in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_resume_data", referenced from:
      _Curl_http2_done_sending in libcurl.a(libcurl_la-http2.o)
      _http2_send in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_send", referenced from:
      _Curl_http2_done in libcurl.a(libcurl_la-http2.o)
      _http2_send in libcurl.a(libcurl_la-http2.o)
      _h2_session_send in libcurl.a(libcurl_la-http2.o)
      _http2_conncheck in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_set_local_window_size", referenced from:
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_set_stream_user_data", referenced from:
      _Curl_http2_done in libcurl.a(libcurl_la-http2.o)
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
      _on_frame_recv in libcurl.a(libcurl_la-http2.o)
      _on_stream_close in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_upgrade", referenced from:
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_want_read", referenced from:
      _should_close_session in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_session_want_write", referenced from:
      _should_close_session in libcurl.a(libcurl_la-http2.o)
      _http2_getsock in libcurl.a(libcurl_la-http2.o)
      _http2_perform_getsock in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_strerror", referenced from:
      _h2_process_pending_input in libcurl.a(libcurl_la-http2.o)
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
      _http2_recv in libcurl.a(libcurl_la-http2.o)
      _http2_conncheck in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_submit_ping", referenced from:
      _http2_conncheck in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_submit_priority", referenced from:
      _h2_session_send in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_submit_request", referenced from:
      _http2_send in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_submit_rst_stream", referenced from:
      _Curl_http2_done in 

[jira] [Updated] (KUDU-3502) Linker errors on ARM based Macs

2023-08-09 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3502:
-
Summary: Linker errors on ARM based Macs  (was: Linker errors on ARM 
basedMacs)

> Linker errors on ARM based Macs
> ---
>
> Key: KUDU-3502
> URL: https://issues.apache.org/jira/browse/KUDU-3502
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
>
> When building Kudu (RELEASE mode) on a Mac M1 machine, I hit this linker error:
> {code:java}
> Undefined symbols for architecture arm64:
>   "_nghttp2_http2_strerror", referenced from:
>       _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o)
>   "_nghttp2_is_fatal", referenced from:
>       _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
>       _http2_recv in libcurl.a(libcurl_la-http2.o)
>       _http2_send in libcurl.a(libcurl_la-http2.o)
>       _on_frame_recv in libcurl.a(libcurl_la-http2.o) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KUDU-3502) Linker errors on ARM basedMacs

2023-08-09 Thread YifanZhang (Jira)
YifanZhang created KUDU-3502:


 Summary: Linker errors on ARM basedMacs
 Key: KUDU-3502
 URL: https://issues.apache.org/jira/browse/KUDU-3502
 Project: Kudu
  Issue Type: Bug
Reporter: YifanZhang


When building Kudu (RELEASE mode) on a Mac M1 machine, I hit this linker error:
{code:java}
Undefined symbols for architecture arm64:
  "_nghttp2_http2_strerror", referenced from:
      _http2_handle_stream_close in libcurl.a(libcurl_la-http2.o)
  "_nghttp2_is_fatal", referenced from:
      _Curl_http2_switched in libcurl.a(libcurl_la-http2.o)
      _http2_recv in libcurl.a(libcurl_la-http2.o)
      _http2_send in libcurl.a(libcurl_la-http2.o)
      _on_frame_recv in libcurl.a(libcurl_la-http2.o) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3463) KuduMaster leader consumes too much memory

2023-04-17 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17713068#comment-17713068
 ] 

YifanZhang commented on KUDU-3463:
--

[~weizisheng] The metadata of deleted tables and tablets should be deleted from 
both memory and disk if 
'enable_metadata_cleanup_for_deleted_tables_and_tablets' is set to true. To 
delete it from persistent storage, we delete the corresponding entries from the 
sys.catalog table where the metadata is stored: 
[https://github.com/apache/kudu/blob/a3a7c97be031f8fc32402e430eff1a89c19dbdfb/src/kudu/master/catalog_manager.cc#L6099].
Did you find that the on-disk data did not decrease after deleting tables?

> KuduMaster leader consumes too much memory
> --
>
> Key: KUDU-3463
> URL: https://issues.apache.org/jira/browse/KUDU-3463
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Weizisheng
>Priority: Major
> Attachments: heap321.txt
>
>
> We recently faced a suspected memory leak on a cluster with 3 masters and 
> 4 tservers, 800 tables and 3000 tablets. The leader master consumes 50GB of 
> memory while the other two only take one tenth of that ...
> Over the last 5 days, the leader's memory usage has grown 3%+.
> pprof heap output is attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3463) KuduMaster leader consumes too much memory

2023-04-13 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712196#comment-17712196
 ] 

YifanZhang commented on KUDU-3463:
--

[~weizisheng] This issue seems related to KUDU-3097 and KUDU-3344. The fix for 
KUDU-3344 will be included in the Kudu 1.17.0 release. 

I don't think this is a memory leak in the Kudu leader master; the cause is 
that metadata for too many tables and tablets is kept in memory. Maybe you can 
try cherry-picking the changes for KUDU-3344 and setting 
--enable_metadata_cleanup_for_deleted_tables_and_tablets=true for the Kudu 
masters, to see if the memory usage can be reduced.

> KuduMaster leader consumes too much memory
> --
>
> Key: KUDU-3463
> URL: https://issues.apache.org/jira/browse/KUDU-3463
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: Weizisheng
>Priority: Major
> Attachments: heap321.txt
>
>
> We recently faced a suspected memory leak on a cluster with 3 masters and 
> 4 tservers, 800 tables and 3000 tablets. The leader master consumes 50GB of 
> memory while the other two only take one tenth of that ...
> Over the last 5 days, the leader's memory usage has grown 3%+.
> pprof heap output is attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KUDU-3451) Memory leak in scan_token-test

2023-02-22 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang resolved KUDU-3451.
--
Fix Version/s: NA
   Resolution: Fixed

> Memory leak in scan_token-test
> --
>
> Key: KUDU-3451
> URL: https://issues.apache.org/jira/browse/KUDU-3451
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: YifanZhang
>Assignee: Marton Greber
>Priority: Major
> Fix For: NA
>
> Attachments: scan_token-test.txt.gz
>
>
> We have recently seen occasional test failures in scan_token-test. I've 
> attached the full test log.
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool 
> testing::internal::HandleExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main 
> (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)Direct leak of 16 byte(s) in 2 
> object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> 

[jira] [Updated] (KUDU-3451) Memory leak in scan_token-test

2023-02-22 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Description: 
We have recently seen occasional test failures in scan_token-test. I've 
attached the full test log.

The ASAN test output is:
{code:java}
Direct leak of 16 byte(s) in 2 object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4afd01 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
    #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
    #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
    #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
    #10 0x7fcab3bb160c in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool 
(testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #11 0x7fcab3bb160c in bool 
testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool 
(testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
    #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
    #14 0x7fcac70c94a8 in main 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
    #15 0x7fcaaf308bf6 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)Direct leak of 16 byte(s) in 2 
object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4ae967 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
    #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
   

[jira] [Updated] (KUDU-3451) Memory leak in scan_token-test

2023-02-22 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Summary: Memory leak in scan_token-test  (was: Memory leak in 
scan-token-test)

> Memory leak in scan_token-test
> --
>
> Key: KUDU-3451
> URL: https://issues.apache.org/jira/browse/KUDU-3451
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: YifanZhang
>Assignee: Marton Greber
>Priority: Major
> Attachments: scan_token-test.txt.gz
>
>
> We have recently seen occasional test failures in scan-token-test. I've 
> attached the full test log.
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool 
> testing::internal::HandleExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main 
> (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)Direct leak of 16 byte(s) in 2 
> object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> 

[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test

2023-02-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Description: 
We have recently seen occasional test failures in scan-token-test. I've 
attached the full test log.

The ASAN test output is:
{code:java}
Direct leak of 16 byte(s) in 2 object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4afd01 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
    #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
    #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
    #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
    #10 0x7fcab3bb160c in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool 
(testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #11 0x7fcab3bb160c in bool 
testing::internal::HandleExceptionsInMethodIfSupported(testing::internal::UnitTestImpl*, bool 
(testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
    #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
    #14 0x7fcac70c94a8 in main 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
    #15 0x7fcaaf308bf6 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)Direct leak of 16 byte(s) in 2 
object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4ae967 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
    #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
   

[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test

2023-02-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Attachment: scan_token-test.txt.gz

> Memory leak in scan-token-test
> --
>
> Key: KUDU-3451
> URL: https://issues.apache.org/jira/browse/KUDU-3451
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: YifanZhang
>Priority: Major
> Attachments: scan_token-test.txt.gz
>
>
> We have recently seen occasional test failures in scan-token-test: 
> [http://dist-test.cloudera.org/job?job_id=jenkins-slave.1676788752.1415999] 
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool 
> testing::internal::HandleExceptionsInMethodIfSupported  bool>(testing::internal::UnitTestImpl*, bool 
> (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main 
> (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)Direct leak of 16 byte(s) in 2 
> object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector  std::allocator >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) 
> 

[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test

2023-02-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Attachment: (was: scan_token-test.txt)

> Memory leak in scan-token-test
> --
>
> Key: KUDU-3451
> URL: https://issues.apache.org/jira/browse/KUDU-3451
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: YifanZhang
>Priority: Major
>
> We have recently seen occasional test failures in scan-token-test: 
> [http://dist-test.cloudera.org/job?job_id=jenkins-slave.1676788752.1415999] 
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void 
> testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool 
> testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main 
> (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)
> 
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> 

[jira] [Updated] (KUDU-3451) Memory leak in scan-token-test

2023-02-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3451:
-
Attachment: scan_token-test.txt

> Memory leak in scan-token-test
> --
>
> Key: KUDU-3451
> URL: https://issues.apache.org/jira/browse/KUDU-3451
> Project: Kudu
>  Issue Type: Bug
>  Components: test
>Reporter: YifanZhang
>Priority: Major
>
> We have recently seen occasional test failures in scan-token-test: 
> [http://dist-test.cloudera.org/job?job_id=jenkins-slave.1676788752.1415999] 
> The ASAN test output is:
> {code:java}
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4afd01 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #5 0x7fcab3bb10ec in void 
> testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #6 0x7fcab3ba5bda in testing::Test::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
>     #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
>     #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
>     #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
>     #10 0x7fcab3bb160c in bool 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
>     #11 0x7fcab3bb160c in bool 
> testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
>     #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
>     #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
>     #14 0x7fcac70c94a8 in main 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
>     #15 0x7fcaaf308bf6 in __libc_start_main 
> (/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)
> 
> Direct leak of 16 byte(s) in 2 object(s) allocated from:
>     #0 0x493e48 in operator new(unsigned long) 
> /home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
>     #1 0x7fcabfc846dd in 
> kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
>     #2 0x7fcabfaf38aa in 
> kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
>     #3 0x4ae967 in 
> kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
>     #4 0x7fcab3bb10ec in void 
> testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
> 

[jira] [Created] (KUDU-3451) Memory leak in scan-token-test

2023-02-21 Thread YifanZhang (Jira)
YifanZhang created KUDU-3451:


 Summary: Memory leak in scan-token-test
 Key: KUDU-3451
 URL: https://issues.apache.org/jira/browse/KUDU-3451
 Project: Kudu
  Issue Type: Bug
  Components: test
Reporter: YifanZhang


We have recently seen occasional test failures in scan-token-test: 
[http://dist-test.cloudera.org/job?job_id=jenkins-slave.1676788752.1415999] 

The ASAN test output is:
{code:java}
Direct leak of 16 byte(s) in 2 object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4afd01 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:748:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2674
    #7 0x7fcab3ba5d9c in testing::TestInfo::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2853
    #8 0x7fcab3ba6376 in testing::TestSuite::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:3012
    #9 0x7fcab3ba677b in testing::internal::UnitTestImpl::RunAllTests() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5870
    #10 0x7fcab3bb160c in bool 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #11 0x7fcab3bb160c in bool 
testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #12 0x7fcab3ba5e62 in testing::UnitTest::Run() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:5444
    #13 0x7fcac70caf91 in RUN_ALL_TESTS() 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/installed/uninstrumented/include/gtest/gtest.h:2293:73
    #14 0x7fcac70c94a8 in main 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/util/test_main.cc:109:10
    #15 0x7fcaaf308bf6 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)

Direct leak of 16 byte(s) in 2 object(s) allocated from:
    #0 0x493e48 in operator new(unsigned long) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/llvm-11.0.0.src/projects/compiler-rt/lib/asan/asan_new_delete.cpp:99
    #1 0x7fcabfc846dd in 
kudu::client::KuduScanTokenBuilder::Data::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-internal.cc:616:49
    #2 0x7fcabfaf38aa in 
kudu::client::KuduScanTokenBuilder::Build(std::vector<kudu::client::KuduScanToken*, std::allocator<kudu::client::KuduScanToken*> >*) 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/client.cc:2337:17
    #3 0x4ae967 in 
kudu::client::ScanTokenTest_TestScanTokensWithQueryId_Test::TestBody() 
/home/jenkins-slave/workspace/kudu-master/2/src/kudu/client/scan_token-test.cc:724:5
    #4 0x7fcab3bb10ec in void 
testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2599
    #5 0x7fcab3bb10ec in void 
testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) 
/home/jenkins-slave/workspace/kudu-master/2/thirdparty/src/googletest-release-1.12.1/googletest/src/gtest.cc:2635
    #6 0x7fcab3ba5bda in testing::Test::Run() 

[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact

2022-11-14 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17633680#comment-17633680
 ] 

YifanZhang commented on KUDU-3367:
--

[~Koppa] [~laiyingchun] Ah, indeed, this GC operation relies on live row 
counting. I agree that we do need to GC deleted rows on tablets that don't 
support live row counting.

> Delta file with full of delete op can not be schedule to compact
> 
>
> Key: KUDU-3367
> URL: https://issues.apache.org/jira/browse/KUDU-3367
> Project: Kudu
>  Issue Type: New Feature
>  Components: compaction
>Reporter: dengke
>Assignee: dengke
>Priority: Major
> Attachments: image-2022-05-09-14-13-16-525.png, 
> image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, 
> image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, 
> image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, 
> image-2022-05-09-14-32-51-573.png, image-2022-11-14-11-02-33-685.png
>
>
> If we get a REDO delta file full of delete ops, which means there are no update 
> ops in the file, the current compaction algorithm will not schedule the file for 
> compaction. If such files exist and accumulate over a period of time, they will 
> greatly affect our scan speed. However, processing such files on every 
> compaction reduces compaction performance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3384) DRS-level scan optimization leads to failed scans

2022-07-20 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568882#comment-17568882
 ] 

YifanZhang commented on KUDU-3384:
--

A failure occurred at 
[cfile_set.cc#L445|https://github.com/apache/kudu/blob/dc4031f693382df08c0fab1d0c5ac6bc3c203c35/src/kudu/tablet/cfile_set.cc#L445]: 
we want to increment the primary key to set a new exclusive upper bound so that 
it can be used to simplify existing predicates. The boundary case described in 
this issue was not considered when implementing this optimization; I think we 
can fall back to not setting a new upper bound if we can't increment the primary 
key. A rough sketch of the idea is below.
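For illustration only, here is a minimal, hypothetical sketch of that fallback
(it does not reproduce the actual cfile_set.cc code): treat the primary key as a
tuple of integer columns, try to compute its successor, and keep the original
bound when every column is already at its maximum.

{code:c++}
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper, not the real Kudu code: treat the primary key as a
// vector of int64 column values and compute the smallest key that is strictly
// greater than 'key'. Returns false when no such key exists (every component
// is already INT64_MAX); the caller should then keep the original inclusive
// bound instead of setting a tighter exclusive upper bound. On failure the
// key contents are clobbered and should be ignored.
bool TryIncrementKey(std::vector<int64_t>* key) {
  for (auto it = key->rbegin(); it != key->rend(); ++it) {
    if (*it < std::numeric_limits<int64_t>::max()) {
      ++(*it);  // bump this component and we are done
      return true;
    }
    // This component overflows: wrap it to the minimum and carry leftwards.
    *it = std::numeric_limits<int64_t>::min();
  }
  return false;  // the key was already the maximum representable key
}
{code}

With this shape, the row with all key columns at their maximum simply skips the
upper-bound tightening instead of failing the scan.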

> DRS-level scan optimization leads to failed scans
> -
>
> Key: KUDU-3384
> URL: https://issues.apache.org/jira/browse/KUDU-3384
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Assignee: YifanZhang
>Priority: Major
>
> Recently, a new DRS-level optimization for scan operations has been 
> introduced with changelist 
> [936d7edc4|https://github.com/apache/kudu/commit/936d7edc4e4b69d2e1f1dffc96760cb3fd57a934].
> The newly introduced DRS-level optimization leads to scan failures when all 
> of the following are true:
>  * all the primary key columns are of integer types
>  * the table has no hash partitioning
>  * the table contains a row with all primary key columns set to 
> {{INT\{x}_MAX}} correspondingly
>  * the scan request is to scan all the table's data
> I suspect that some of the conditions above might be relaxed, but I have a 
> test case that reproduces the issue as described.  See [this gerrit review 
> item|http://gerrit.cloudera.org:8080/18757] for the reproduction scenario.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KUDU-3384) DRS-level scan optimization leads to failed scans

2022-07-20 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3384:


Assignee: YifanZhang

> DRS-level scan optimization leads to failed scans
> -
>
> Key: KUDU-3384
> URL: https://issues.apache.org/jira/browse/KUDU-3384
> Project: Kudu
>  Issue Type: Bug
>  Components: tserver
>Affects Versions: 1.17.0
>Reporter: Alexey Serbin
>Assignee: YifanZhang
>Priority: Major
>
> Recently, a new DRS-level optimization for scan operations has been 
> introduced with changelist 
> [936d7edc4|https://github.com/apache/kudu/commit/936d7edc4e4b69d2e1f1dffc96760cb3fd57a934].
> The newly introduced DRS-level optimization leads to scan failures when all 
> of the following are true:
>  * all the primary key columns are of integer types
>  * the table has no hash partitioning
>  * the table contains a row with all primary key columns set to 
> {{INT\{x}_MAX}} correspondingly
>  * the scan request is to scan all the table's data
> I suspect that some of the conditions above might be relaxed, but I have a 
> test case that reproduces the issue as described.  See [this gerrit review 
> item|http://gerrit.cloudera.org:8080/18757] for the reproduction scenario.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KUDU-3306) String column types in range partitions lead to issues while copying tables

2022-07-14 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3306:


Assignee: YifanZhang  (was: Mahesh Reddy)

> String column types in range partitions lead to issues while copying tables
> ---
>
> Key: KUDU-3306
> URL: https://issues.apache.org/jira/browse/KUDU-3306
> Project: Kudu
>  Issue Type: Bug
>  Components: CLI, partition
>Reporter: Bankim Bhavsar
>Assignee: YifanZhang
>Priority: Major
>
> Range partitions with string column types lead to issues while creating the 
> destination table.
> {noformat}
> create TABLE test3 (
> created_time STRING PRIMARY KEY
> )
> PARTITION BY RANGE (created_time) 
> (
> PARTITION VALUE = "2020-01-01",
> PARTITION VALUE = "2021-01-01"
> )
> STORED as kudu;
> # kudu table describe master-1 impala::default.test3
> TABLE impala::default.test3 (
> created_time STRING NOT NULL,
> PRIMARY KEY (created_time)
> )
> RANGE (created_time) (
> PARTITION "2020-01-01" <= VALUES < "2020-01-01\000",
> PARTITION "2021-01-01" <= VALUES < "2021-01-01\000"
> )
> OWNER root
> REPLICAS 3
> # kudu table copy master-1 impala::default.test3 master-1 
> -dst_table=kudu_test4 -write_type=""
> Invalid argument: Error creating table kudu_test4 on the master: overlapping 
> range partitions: first range partition: "\000��\004\000\000\000\1" <= 
> VALUES < "2021-01-01\000", second range partition: 
> "\000��\004\000\000\000\1" <= VALUES < "2021-01-01\000"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact

2022-05-30 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544095#comment-17544095
 ] 

YifanZhang commented on KUDU-3367:
--

Maybe related to KUDU-1625.

> Delta file with full of delete op can not be schedule to compact
> 
>
> Key: KUDU-3367
> URL: https://issues.apache.org/jira/browse/KUDU-3367
> Project: Kudu
>  Issue Type: New Feature
>  Components: compaction
>Reporter: dengke
>Assignee: dengke
>Priority: Major
> Attachments: image-2022-05-09-14-13-16-525.png, 
> image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, 
> image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, 
> image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, 
> image-2022-05-09-14-32-51-573.png
>
>
> If we get a REDO delta file full of delete ops, which means there are no update 
> ops in the file, the current compaction algorithm will not schedule the file for 
> compaction. If such files exist and accumulate over a period of time, they will 
> greatly affect our scan speed. However, processing such files on every 
> compaction reduces compaction performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KUDU-3367) Delta file with full of delete op can not be schedule to compact

2022-05-30 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543808#comment-17543808
 ] 

YifanZhang commented on KUDU-3367:
--

I'm curious whether setting `tablet_history_max_age_sec` to a small value is 
helpful in your case. If so, will DeletedRowsetGCOp be scheduled and the empty 
RowSets be deleted in time?
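To spell out the assumption behind that question, here is a simplified sketch of
how I understand the eligibility check (not the actual maintenance-manager code;
names and fields are invented): a fully deleted rowset only becomes a
DeletedRowsetGCOp candidate once its deletions fall behind the ancient history
mark, which trails "now" by --tablet_history_max_age_sec.

{code:c++}
#include <cstdint>

// Simplified, hypothetical predicate: a rowset whose rows are all deleted is
// GC-eligible only when the latest deletion is older than the ancient history
// mark (now minus --tablet_history_max_age_sec).
bool IsRowsetGcEligible(int64_t now_secs,
                        int64_t tablet_history_max_age_sec,
                        int64_t latest_delete_secs,
                        bool all_rows_deleted) {
  const int64_t ancient_history_mark = now_secs - tablet_history_max_age_sec;
  return all_rows_deleted && latest_delete_secs < ancient_history_mark;
}
{code}

So with a smaller --tablet_history_max_age_sec the mark advances sooner, and the
empty rowsets could be GCed earlier.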

> Delta file with full of delete op can not be schedule to compact
> 
>
> Key: KUDU-3367
> URL: https://issues.apache.org/jira/browse/KUDU-3367
> Project: Kudu
>  Issue Type: New Feature
>  Components: compaction
>Reporter: dengke
>Assignee: dengke
>Priority: Major
> Attachments: image-2022-05-09-14-13-16-525.png, 
> image-2022-05-09-14-16-31-828.png, image-2022-05-09-14-18-05-647.png, 
> image-2022-05-09-14-19-56-933.png, image-2022-05-09-14-21-47-374.png, 
> image-2022-05-09-14-23-43-973.png, image-2022-05-09-14-26-45-313.png, 
> image-2022-05-09-14-32-51-573.png
>
>
> If we get a REDO delta file full of delete ops, which means there are no update 
> ops in the file, the current compaction algorithm will not schedule the file for 
> compaction. If such files exist and accumulate over a period of time, they will 
> greatly affect our scan speed. However, processing such files on every 
> compaction reduces compaction performance.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (KUDU-3364) Add TimerThread to ThreadPool to support a category of problem

2022-05-17 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538251#comment-17538251
 ] 

YifanZhang edited comment on KUDU-3364 at 5/17/22 3:24 PM:
---

[~shenxingwuying] I still have some questions about the motivation.

 
{quote}The two ways may conflict with each other in an operation race, because the 
rebalance tool's logic is a little complex in the tool while auto-rebalancing 
runs in the master.
{quote}
If we worry that these two ways of rebalancing interfere with each other, simply 
disabling auto-rebalancing and running the rebalance tool may be a solution. 

You mean we need tools to manually trigger long-running tasks, like rebalancing 
and compactions. Maybe we can do this with a tool that executes asynchronously, 
e.g. by sending an asynchronous RPC or something similar? Why do we need a 
TimerThread?

 


was (Author: zhangyifan27):
[~shenxingwuying] I still have some questions about the motivation.

> Add TimerThread to ThreadPool to support a category of problem
> --
>
> Key: KUDU-3364
> URL: https://issues.apache.org/jira/browse/KUDU-3364
> Project: Kudu
>  Issue Type: New Feature
>Reporter: shenxingwuying
>Assignee: shenxingwuying
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h1. Scenarios
> In general, I am talking about a category of problem.
> There are some periodic tasks or automatically triggered scheduling tasks in 
> Kudu, for example automatic rebalancing of cluster data, some GC tasks, and 
> compaction tasks.
> They are implemented with a Kudu Thread, an std::thread, or a ThreadPool; the 
> actual task is either scheduled periodically or triggered by some internal 
> strategy.
> They are all internal, so we cannot influence them from the outside.
> In fact, we need a method under our control to trigger the above types of 
> actions. Some scenarios where this matters are described below.
>  
> h2. Data rebalance
> There are two ways to rebalance:
> 1. enable auto-rebalancing
> 2. use the rebalance tool, as in 1.14 and before.
> The two ways may conflict with each other in an operation race, because the 
> rebalance tool's logic is a little complex in the tool while auto-rebalancing 
> runs in the master.
> In the future, auto-rebalancing in the master will become very stable and the 
> main way to rebalance data. At the same time, administrators need an external 
> way to trigger a rebalance, just like auto-rebalancing does.
> But right now auto-rebalancing runs in a thread on a fixed time period.
> Although we could add an API to MasterService, that API would be synchronous 
> and costly; we need an asynchronous method to trigger the rebalance.
> h2. Auto compaction
> Another example is auto compaction.
> I have found that the compaction strategy is not always effective, so maybe we 
> need a method controlled by admin users to trigger compaction.
> If we can trigger a RowSetInCompaction on demand, we do not need to restart 
> the Kudu cluster.
> h1. My Solution
> Add a timer to ThreadPool. This timer is a worker thread that schedules tasks 
> onto the specified thread according to time.
> We can require that only SERIAL ThreadPoolTokens may enable the TimerThread.
> Pseudo code expresses my intention:
> {code:java}
> // Code placeholder (pseudo code).
> class TimerThread {
>   // A deferred task bound to a serial thread pool token.
>   struct Task {
>     ThreadPoolToken* token;
>     std::function<void()> f;
>   };
>
>   // Register a task to run after 'delay_ms' milliseconds.
>   void Schedule(Task task, int delay_ms) {
>     tasks_.insert(...);
>   }
>
>   // Timer loop: periodically pick up due tasks and submit them to their tokens.
>   void RunLoop() {
>     while (...) {
>       SleepFor(MonoDelta::FromMilliseconds(100));
>       auto due_tasks = FindTasks();
>       for (auto& task : due_tasks) {
>         ThreadPoolToken* token = task.token;
>         token->Submit(task.f);
>         tasks_.erase(...);
>       }
>     }
>   }
>
>   scoped_refptr<Thread> thread_;
>   std::multimap<MonoTime, Task> tasks_;
> };
>
> class ThreadPool {
>   ...
>   TimerThread* timer_;
>   ...
> };
>
> class ThreadPoolToken {
>   void Scheduler();
> };{code}
> This scheme is compatible with the previous ThreadPool, and the timer is 
> nullptr by default.
> For periodic tasks, we can use a control ThreadPool with a timer to refactor 
> some code to make it clearer and to avoid the past problem of having too many 
> single-purpose threads.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KUDU-3364) Add TimerThread to ThreadPool to support a category of problem

2022-05-17 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538251#comment-17538251
 ] 

YifanZhang commented on KUDU-3364:
--

[~shenxingwuying] I still have some questions about the motivation.

> Add TimerThread to ThreadPool to support a category of problem
> --
>
> Key: KUDU-3364
> URL: https://issues.apache.org/jira/browse/KUDU-3364
> Project: Kudu
>  Issue Type: New Feature
>Reporter: shenxingwuying
>Assignee: shenxingwuying
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h1. Scenarios
> In general, I am talking about a category of problem.
> There are some periodic tasks or automatically triggered scheduling tasks in 
> Kudu, for example automatic rebalancing of cluster data, some GC tasks, and 
> compaction tasks.
> They are implemented with a Kudu Thread, an std::thread, or a ThreadPool; the 
> actual task is either scheduled periodically or triggered by some internal 
> strategy.
> They are all internal, so we cannot influence them from the outside.
> In fact, we need a method under our control to trigger the above types of 
> actions. Some scenarios where this matters are described below.
>  
> h2. Data rebalance
> There are two ways to rebalance:
> 1. enable auto-rebalancing
> 2. use the rebalance tool, as in 1.14 and before.
> The two ways may conflict with each other in an operation race, because the 
> rebalance tool's logic is a little complex in the tool while auto-rebalancing 
> runs in the master.
> In the future, auto-rebalancing in the master will become very stable and the 
> main way to rebalance data. At the same time, administrators need an external 
> way to trigger a rebalance, just like auto-rebalancing does.
> But right now auto-rebalancing runs in a thread on a fixed time period.
> Although we could add an API to MasterService, that API would be synchronous 
> and costly; we need an asynchronous method to trigger the rebalance.
> h2. Auto compaction
> Another example is auto compaction.
> I have found that the compaction strategy is not always effective, so maybe we 
> need a method controlled by admin users to trigger compaction.
> If we can trigger a RowSetInCompaction on demand, we do not need to restart 
> the Kudu cluster.
> h1. My Solution
> Add a timer to ThreadPool. This timer is a worker thread that schedules tasks 
> onto the specified thread according to time.
> We can require that only SERIAL ThreadPoolTokens may enable the TimerThread.
> Pseudo code expresses my intention:
> {code:java}
> // Code placeholder (pseudo code).
> class TimerThread {
>   // A deferred task bound to a serial thread pool token.
>   struct Task {
>     ThreadPoolToken* token;
>     std::function<void()> f;
>   };
>
>   // Register a task to run after 'delay_ms' milliseconds.
>   void Schedule(Task task, int delay_ms) {
>     tasks_.insert(...);
>   }
>
>   // Timer loop: periodically pick up due tasks and submit them to their tokens.
>   void RunLoop() {
>     while (...) {
>       SleepFor(MonoDelta::FromMilliseconds(100));
>       auto due_tasks = FindTasks();
>       for (auto& task : due_tasks) {
>         ThreadPoolToken* token = task.token;
>         token->Submit(task.f);
>         tasks_.erase(...);
>       }
>     }
>   }
>
>   scoped_refptr<Thread> thread_;
>   std::multimap<MonoTime, Task> tasks_;
> };
>
> class ThreadPool {
>   ...
>   TimerThread* timer_;
>   ...
> };
>
> class ThreadPoolToken {
>   void Scheduler();
> };{code}
> This scheme is compatible with the previous ThreadPool, and the timer is 
> nullptr by default.
> For periodic tasks, we can use a control ThreadPool with a timer to refactor 
> some code to make it clearer and to avoid the past problem of having too many 
> single-purpose threads.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota

2022-04-02 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516304#comment-17516304
 ] 

YifanZhang commented on KUDU-3354:
--

{code:java}
I0221 04:22:12.031422  8866 maintenance_manager.cc:382] P 
c4e995dc9e264d6fbcd01aacff4212bd: Scheduling 
CompactRowSetsOp(64bf8251dd594197b493a8a5cd2e3e9c): perf 
score=1279494840443460300019357683777979923760955028182284084266695498710755807790412216956702667800111812783791998911358244116400294695993868288.00
I0221 04:22:12.032940  8806 tablet.cc:1898] T 64bf8251dd594197b493a8a5cd2e3e9c 
P c4e995dc9e264d6fbcd01aacff4212bd: Compaction resulted in no output rows (all 
input rows were GCed!)  Removing all input rowsets. {code}
It seems that the maintenance manager sometimes schedules strange compaction 
ops, as shown in the log above; these ops block flush ops because in the test 
the tserver is configured with '--maintenance_manager_num_threads=1' (the 
default value).

> Flaky test: 
> DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
> --
>
> Key: KUDU-3354
> URL: https://issues.apache.org/jira/browse/KUDU-3354
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
> Attachments: write_limit-itest.txt
>
>
> The test 
> `DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota`
>  sometimes fails (at least in debug mode). The output is:
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229:
>  Failure
> Value of: s.IsIOError()
>   Actual: false
> Expected: true
> OK
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429:
>  Failure
> Expected: TestSizeLimit() doesn't generate new fatal failures in the current 
> thread.
>   Actual: it does.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota

2022-04-02 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3354:
-
Attachment: write_limit-itest.txt

> Flaky test: 
> DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
> --
>
> Key: KUDU-3354
> URL: https://issues.apache.org/jira/browse/KUDU-3354
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
> Attachments: write_limit-itest.txt
>
>
> The test 
> `DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota`
>  sometimes fails (at least in debug mode). The output is:
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229:
>  Failure
> Value of: s.IsIOError()
>   Actual: false
> Expected: true
> OK
> /home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429:
>  Failure
> Expected: TestSizeLimit() doesn't generate new fatal failures in the current 
> thread.
>   Actual: it does.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KUDU-3354) Flaky test: DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota

2022-02-22 Thread YifanZhang (Jira)
YifanZhang created KUDU-3354:


 Summary: Flaky test: 
DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota
 Key: KUDU-3354
 URL: https://issues.apache.org/jira/browse/KUDU-3354
 Project: Kudu
  Issue Type: Bug
Reporter: YifanZhang


The test 
`DisableWriteWhenExceedingQuotaTest.TestDisableWritePrivilegeWhenExceedingSizeQuota`
 sometimes fails (at least in debug mode). The output is:

/home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:229:
 Failure
Value of: s.IsIOError()
  Actual: false
Expected: true
OK
/home/jenkins-slave/workspace/kudu-master/0/src/kudu/integration-tests/write_limit-itest.cc:429:
 Failure
Expected: TestSizeLimit() doesn't generate new fatal failures in the current 
thread.
  Actual: it does.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode

2022-01-06 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang resolved KUDU-3328.
--
Fix Version/s: 1.16.0
   Resolution: Fixed

> Disable move replicas to tablet servers in maintenance mode
> ---
>
> Key: KUDU-3328
> URL: https://issues.apache.org/jira/browse/KUDU-3328
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
> Fix For: 1.16.0
>
>
> When some tablet servers are put in maintenance mode, new replicas are not 
> expected to be added to these tservers, but we can still run `kudu cluster 
> rebalance` or `kudu tablet change_config move_replica` to move replicas to 
> the tservers under maintenance. These operations should be disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster

2022-01-04 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang resolved KUDU-3346.
--
Fix Version/s: 1.16.0
   Resolution: Fixed

> Rebalance fails when trying to decommission tserver on a rack-aware cluster
> ---
>
> Key: KUDU-3346
> URL: https://issues.apache.org/jira/browse/KUDU-3346
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Georgiana Ogrean
>Assignee: YifanZhang
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z
>
>
> When following the steps [in the 
> docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html]
>  for decommissioning a tserver, the rebalance job fails with:
> {code:java}
> Invalid argument: ignored tserver  is not reported among know 
> tservers 
> {code}
> Steps followed:
> 1. Checked that ksck passes.
> 2. Put the tserver to be decommissioned in maintenance mode.
> {code:java}
> sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES 
> 5ae499b1b870419daabb0e8da90ef233 {code}
> 3. Ran rebalance with {{-ignored_tservers}} and 
> {{-move_replicas_from_ignored_tservers}} flags.
> {code:java}
> sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES 
> -move_replicas_from_ignored_tservers 
> -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code}
> The logs for the rebalance command are attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster

2021-12-24 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3346:


Assignee: YifanZhang

> Rebalance fails when trying to decommission tserver on a rack-aware cluster
> ---
>
> Key: KUDU-3346
> URL: https://issues.apache.org/jira/browse/KUDU-3346
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Georgiana Ogrean
>Assignee: YifanZhang
>Priority: Major
> Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z
>
>
> When following the steps [in the 
> docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html]
>  for decommissioning a tserver, the rebalance job fails with:
> {code:java}
> Invalid argument: ignored tserver  is not reported among know 
> tservers 
> {code}
> Steps followed:
> 1. Checked that ksck passes.
> 2. Put the tserver to be decommissioned in maintenance mode.
> {code:java}
> sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES 
> 5ae499b1b870419daabb0e8da90ef233 {code}
> 3. Ran rebalance with {{-ignored_tservers}} and 
> {{-move_replicas_from_ignored_tservers}} flags.
> {code:java}
> sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES 
> -move_replicas_from_ignored_tservers 
> -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code}
> The logs for the rebalance command are attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KUDU-3346) Rebalance fails when trying to decommission tserver on a rack-aware cluster

2021-12-24 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17464941#comment-17464941
 ] 

YifanZhang commented on KUDU-3346:
--

I think there is something wrong when populating 
`ClusterInfo::tservers_to_empty`, because sometimes the `ClusterRawInfo` only 
contains tserver/tablet info for a specific location. I plan to fix it.
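To illustrate the intended fix (a rough sketch with made-up types, not the real
rebalancer code): the set of tservers to empty should be derived from
cluster-wide information, so that an ignored tserver is still recognized even
when a location-scoped snapshot is being processed.

{code:c++}
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-ins for the rebalancer's data structures;
// the real Kudu types and fields differ.
struct TServerSummary { std::string uuid; };
struct ClusterRawInfo { std::vector<TServerSummary> tserver_summaries; };

// Sketch: build the set of tservers to empty from the cluster-wide raw info
// (all locations), not from a per-location slice of it.
std::unordered_set<std::string> TServersToEmpty(
    const ClusterRawInfo& cluster_wide_info,
    const std::unordered_set<std::string>& ignored_tservers) {
  std::unordered_set<std::string> to_empty;
  for (const auto& ts : cluster_wide_info.tserver_summaries) {
    if (ignored_tservers.count(ts.uuid) > 0) {
      to_empty.insert(ts.uuid);
    }
  }
  return to_empty;
}
{code}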

> Rebalance fails when trying to decommission tserver on a rack-aware cluster
> ---
>
> Key: KUDU-3346
> URL: https://issues.apache.org/jira/browse/KUDU-3346
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Georgiana Ogrean
>Priority: Major
> Attachments: rebalance_ignored_tserver_1c.log.Z, rebalance_v1.log.Z
>
>
> When following the steps [in the 
> docs|https://docs.cloudera.com/runtime/7.2.0/administering-kudu/topics/kudu-decommissioning-or-permanently-removing-tablet-server-from-cluster.html]
>  for decommissioning a tserver, the rebalance job fails with:
> {code:java}
> Invalid argument: ignored tserver  is not reported among know 
> tservers 
> {code}
> Steps followed:
> 1. Checked that ksck passes.
> 2. Put the tserver to be decommissioned in maintenance mode.
> {code:java}
> sudo -u kudu kudu tserver state enter_maintenance $MASTER_ADDRESSES 
> 5ae499b1b870419daabb0e8da90ef233 {code}
> 3. Ran rebalance with {{-ignored_tservers}} and 
> {{-move_replicas_from_ignored_tservers}} flags.
> {code:java}
> sudo -u kudu kudu cluster rebalance $MASTER_ADDRESSES 
> -move_replicas_from_ignored_tservers 
> -ignored_tservers=5ae499b1b870419daabb0e8da90ef233 -v=1{code}
> The logs for the rebalance command are attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KUDU-3344) Master could do some garbage collection work in CatalogManagerBgTasks thread

2021-12-17 Thread YifanZhang (Jira)
YifanZhang created KUDU-3344:


 Summary: Master could do some garbage collection work in 
CatalogManagerBgTasks thread
 Key: KUDU-3344
 URL: https://issues.apache.org/jira/browse/KUDU-3344
 Project: Kudu
  Issue Type: Improvement
  Components: master
Reporter: YifanZhang


The Kudu master currently retains all table/tablet metadata in memory and on 
disk; deleted tables and tablets are marked with the REMOVED/DELETED/REPLACED 
state but never actually removed. This can lead to huge memory usage, as 
described in KUDU-3097.

I think it's a good idea to clean them up in the CatalogManagerBgTasks thread. 
However, because the data deletion tasks are performed asynchronously by tablet 
servers, it is uncertain when the metadata can be safely deleted.
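As a purely illustrative sketch (the names below are hypothetical, not the
actual catalog manager API), such a pass might only erase entries whose
asynchronous deletions have already been confirmed by the tablet servers:

{code:c++}
#include <string>
#include <vector>

// Hypothetical stand-ins for catalog manager internals; the real types and
// method names differ.
struct TabletMetadataEntry {
  std::string tablet_id;
  bool marked_deleted;         // REMOVED/DELETED/REPLACED state
  bool deletion_acked_by_all;  // every replica's deletion confirmed by tservers
};

// One pass of a CatalogManagerBgTasks-style cleanup: an entry is erased from
// the in-memory map and the sys catalog only once it is certain that the
// asynchronous deletions on the tablet servers have completed.
template <typename Catalog>
void RunDeletedMetadataGcPass(Catalog* catalog) {
  for (const TabletMetadataEntry& e : catalog->DeletedTabletEntries()) {
    if (e.marked_deleted && e.deletion_acked_by_all) {
      catalog->EraseFromMemoryAndSysCatalog(e.tablet_id);
    }
  }
}
{code}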

Besides, we could also clean up dead tablet servers from the master's in-memory 
map in this thread, as I mentioned in KUDU-2915.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KUDU-3097) whether master load deleted entries into memory could be configuable

2021-12-14 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459186#comment-17459186
 ] 

YifanZhang commented on KUDU-3097:
--

The [master design 
doc|https://github.com/apache/kudu/blob/master/docs/design-docs/master.md] 
mentions that we could have a new background task to clean up 'deleted' state 
tables/tablets from the in-memory map and the SysCatalogTable. Is it safe to do 
that, or why do we need to keep these deleted tables/tablets?

> whether master load deleted entries into memory could be configuable
> 
>
> Key: KUDU-3097
> URL: https://issues.apache.org/jira/browse/KUDU-3097
> Project: Kudu
>  Issue Type: New Feature
>Reporter: wangningito
>Assignee: wangningito
>Priority: Major
> Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, 
> set-09.svg
>
>
> The master's tablet is not under the control of MVCC.
> Deleted entries, such as table schemas and deleted tablet ids, are loaded 
> into memory.
> For users with massive numbers of columns or lots of tablets who frequently 
> switch tables, this may result in some unnecessary memory usage.
> By the way, the memory usage differs between the leader and the followers 
> among the masters, which may result in imbalance across the master cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KUDU-3097) whether master load deleted entries into memory could be configuable

2021-12-13 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458956#comment-17458956
 ] 

YifanZhang commented on KUDU-3097:
--

We have also noticed that the memory usage of the Kudu master can sometimes be 
very high. I collected sampled heap usage, and it shows that the master 
allocated too much memory for storing `SysTabletEntryPB`.  [^set-09.svg] 

The number of deleted tablets keeps growing in a typical usage scenario where 
we keep only the latest partitions and delete the historical ones. Maybe it is 
not really necessary to keep all tablets (including deleted ones) in memory?

> whether master load deleted entries into memory could be configuable
> 
>
> Key: KUDU-3097
> URL: https://issues.apache.org/jira/browse/KUDU-3097
> Project: Kudu
>  Issue Type: New Feature
>Reporter: wangningito
>Assignee: wangningito
>Priority: Major
> Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, 
> set-09.svg
>
>
> The master's tablet is not under the control of MVCC.
> Deleted entries, such as table schemas and deleted tablet ids, are loaded 
> into memory.
> For users with massive numbers of columns or lots of tablets who frequently 
> switch tables, this may result in some unnecessary memory usage.
> By the way, the memory usage differs between the leader and the followers 
> among the masters, which may result in imbalance across the master cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3097) whether master load deleted entries into memory could be configuable

2021-12-13 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3097:
-
Attachment: set-09.svg

> whether master load deleted entries into memory could be configuable
> 
>
> Key: KUDU-3097
> URL: https://issues.apache.org/jira/browse/KUDU-3097
> Project: Kudu
>  Issue Type: New Feature
>Reporter: wangningito
>Assignee: wangningito
>Priority: Major
> Attachments: image-2020-05-28-19-41-05-485.png, screenshot-1.png, 
> set-09.svg
>
>
> The master's tablet is not under the control of MVCC.
> Deleted entries, such as table schemas and deleted tablet ids, are loaded 
> into memory.
> For users with massive numbers of columns or lots of tablets who frequently 
> switch tables, this may result in some unnecessary memory usage.
> By the way, the memory usage differs between the leader and the followers 
> among the masters, which may result in imbalance across the master cluster.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

2021-12-03 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang resolved KUDU-3341.
--
Fix Version/s: 1.16.0
   Resolution: Fixed

> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
> Fix For: 1.16.0
>
>
> Sometimes a tablet server can be shut down because of detected disk 
> failures, and this server is then re-added to the cluster with all data 
> cleared.
> Its replicas can be re-replicated after 
> {{\-\-follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC 
> failure (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> started with a new uuid), and keeps retrying to delete the tablets until 
> {{{}--unresponsive_ts_rpc_timeout_ms{}}} (default 1 hour) expires.
> It's not really necessary to retry when receiving WRONG_SERVER_UUID errors, 
> because the server uuid can only be corrected by restarting the tablet server; 
> at that point full tablet reports are sent to the master and any outdated 
> replicas can finally be deleted.
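For illustration, a minimal hypothetical sketch of the proposed behavior (the
names below are invented, not the real catalog manager code): treat
WRONG_SERVER_UUID as terminal for the DeleteTablet retry loop instead of
rescheduling the RPC.

{code:c++}
// Invented names for illustration; the real TabletServerErrorPB codes and
// retry plumbing in the catalog manager look different.
enum class TSErrorCode { UNKNOWN_ERROR, WRONG_SERVER_UUID };

// A DeleteTablet response handler could decide whether to reschedule the RPC.
// A plain RPC failure is worth retrying (the tserver may simply be down), but
// WRONG_SERVER_UUID is terminal: only a tserver restart, followed by a full
// tablet report, can reconcile the stale replica, so retrying is pointless.
bool ShouldRetryDeleteTablet(bool rpc_failed, TSErrorCode code) {
  if (rpc_failed) {
    return true;
  }
  return code != TSErrorCode::WRONG_SERVER_UUID;
}
{code}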



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode

2021-12-03 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3328:


Assignee: YifanZhang

> Disable move replicas to tablet servers in maintenance mode
> ---
>
> Key: KUDU-3328
> URL: https://issues.apache.org/jira/browse/KUDU-3328
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
>
> When some tablet servers are put in maintenance mode, new replicas are not 
> expected to be added to these tservers, but we can still run `kudu cluster 
> rebalance` or `kudu tablet change_config move_replica` to move replicas to 
> the tservers under maintenance. These operations should be disabled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-
Description: 
Sometimes a tablet server could be shutdown because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas could be replicated after  
{{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master 
send DeleteTablet RPCs to this tserver, but receive either a RPC 
failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with 
a new uuid), and keep retrying to delete tablets after 
{{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour).

It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because 
the server uuid could only be corrected by restarting the tablet server, at 
that time full tablet reports would sent to master and if any, outdated 
replicas could be deleted finally.

  was:
Sometimes a tablet server could be shutdown because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas could be replicated after  
{{--follower_unavailable_considered_failed_sec}} seconds. And then master send 
DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was 
shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and 
keep retrying to delete tablets after 
{{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour).

It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because 
the server uuid could only be corrected by restarting the tablet server, at 
that time full tablet reports would sent to master and if any, outdated 
replicas could be deleted finally.


> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server can be shut down because of detected disk 
> failures, and this server is then re-added to the cluster with all data 
> cleared.
> Its replicas can be re-replicated after 
> {{\-\-follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC 
> failure (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> started with a new uuid), and keeps retrying to delete the tablets until 
> {{{}--unresponsive_ts_rpc_timeout_ms{}}} (default 1 hour) expires.
> It's not really necessary to retry when receiving WRONG_SERVER_UUID errors, 
> because the server uuid can only be corrected by restarting the tablet server; 
> at that point full tablet reports are sent to the master and any outdated 
> replicas can finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3341:


Assignee: YifanZhang

> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server can be shut down because of detected disk 
> failures, and this server is then re-added to the cluster with all data 
> cleared.
> Its replicas can be re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC 
> failure (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> started with a new uuid), and keeps retrying to delete the tablets until 
> {{{}--unresponsive_ts_rpc_timeout_ms{}}} (default 1 hour) expires.
> It's not really necessary to retry when receiving WRONG_SERVER_UUID errors, 
> because the server uuid can only be corrected by restarting the tablet server; 
> at that point full tablet reports are sent to the master and any outdated 
> replicas can finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-
Description: 
Sometimes a tablet server could be shutdown because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas could be replicated after  
{{--follower_unavailable_considered_failed_sec}} seconds. And then master send 
DeleteTablet RPCs to this tserver, but receive either a RPC failure(tserver was 
shutdown) or a WRONG_SERVER_UUID error(tserver started with a new uuid), and 
keep retrying to delete tablets after 
{{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour).

It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because 
the server uuid could only be corrected by restarting the tablet server, at 
that time full tablet reports would sent to master and if any, outdated 
replicas could be deleted finally.

  was:
Sometimes a tablet server could be shutdown because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas could be replicated after  
{{\-\-follower_unavailable_considered_failed_sec}} seconds. And then master 
send DeleteTablet RPCs to this tserver, but receive either a RPC 
failure(tserver was shutdown) or a WRONG_SERVER_UUID error(tserver started with 
a new uuid), and keep retrying to delete tablets after 
{{{}--unresponsive_ts_rpc_timeout_ms{}}}(default 1 hour).

It's not so necessary to retry when receive WRONG_SERVER_UUID errors, because 
the server uuid could only be corrected by restarting the tablet server, at 
that time full tablet reports would sent to master and outdated replicas could 
be deleted finally.


> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server could be shut down because of detected disk 
> failures, and this server would be re-added to the cluster with all data 
> cleared.
> Replicas would be re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC failure 
> (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> restarted with a new uuid), and keeps retrying to delete the tablets until 
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
> the server uuid can only be corrected by restarting the tablet server; at 
> that time, full tablet reports would be sent to the master and any outdated 
> replicas could finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-
Summary: Catalog Manager should stop retrying DeleteTablet when receive 
WRONG_SERVER_UUID error  (was: Catalog Manager should stop retrying 
DeleteTablet when receive WRONG_UUID_ERROR)

> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server could be shut down because of detected disk 
> failures, and this server would be re-added to the cluster with all data 
> cleared.
> Replicas would be re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC failure 
> (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> restarted with a new uuid), and keeps retrying to delete the tablets until 
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
> the server uuid can only be corrected by restarting the tablet server; at 
> that time, full tablet reports would be sent to the master and outdated 
> replicas could finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-
Description: 
Sometimes a tablet server could be shut down because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas would be re-replicated after 
{{--follower_unavailable_considered_failed_sec}} seconds. The master then sends 
DeleteTablet RPCs to this tserver, but receives either an RPC failure (the 
tserver was shut down) or a WRONG_SERVER_UUID error (the tserver restarted with 
a new uuid), and keeps retrying to delete the tablets until 
{{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.

It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
the server uuid can only be corrected by restarting the tablet server; at that 
time, full tablet reports would be sent to the master and outdated replicas 
could finally be deleted.

  was:
Sometimes a tablet server could be shut down because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas would be re-replicated after 
{{--follower_unavailable_considered_failed_sec}} seconds. The master then sends 
DeleteTablet RPCs to this tserver, but receives either an RPC failure (the 
tserver was shut down) or a WRONG_SERVER_UUID error (the tserver restarted with 
a new uuid), and keeps retrying to delete the tablets until 
{{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.

It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
the server uuid can only be corrected by restarting the tablet server; at that 
time, full tablet reports would be sent to the master and outdated replicas 
could finally be deleted.


> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_UUID_ERROR
> ---
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server could be shut down because of detected disk 
> failures, and this server would be re-added to the cluster with all data 
> cleared.
> Replicas would be re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC failure 
> (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> restarted with a new uuid), and keeps retrying to delete the tablets until 
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
> the server uuid can only be corrected by restarting the tablet server; at 
> that time, full tablet reports would be sent to the master and outdated 
> replicas could finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_UUID_ERROR

2021-11-29 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-
Component/s: master
Description: 
Sometimes a tablet server could be shut down because of detected disk failures, 
and this server would be re-added to the cluster with all data cleared.

Replicas would be re-replicated after 
{{--follower_unavailable_considered_failed_sec}} seconds. The master then sends 
DeleteTablet RPCs to this tserver, but receives either an RPC failure (the 
tserver was shut down) or a WRONG_SERVER_UUID error (the tserver restarted with 
a new uuid), and keeps retrying to delete the tablets until 
{{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.

It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
the server uuid can only be corrected by restarting the tablet server; at that 
time, full tablet reports would be sent to the master and outdated replicas 
could finally be deleted.
Summary: Catalog Manager should stop retrying DeleteTablet when receive 
WRONG_UUID_ERROR  (was: Catalog Manager should stop retrying DeleteTablet when 
receive WRON)

> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_UUID_ERROR
> ---
>
> Key: KUDU-3341
> URL: https://issues.apache.org/jira/browse/KUDU-3341
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: YifanZhang
>Priority: Minor
>
> Sometimes a tablet server could be shut down because of detected disk 
> failures, and this server would be re-added to the cluster with all data 
> cleared.
> Replicas would be re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC failure 
> (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver 
> restarted with a new uuid), and keeps retrying to delete the tablets until 
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> It's not necessary to retry when receiving WRONG_SERVER_UUID errors, because 
> the server uuid can only be corrected by restarting the tablet server; at 
> that time, full tablet reports would be sent to the master and outdated 
> replicas could finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRON

2021-11-29 Thread YifanZhang (Jira)
YifanZhang created KUDU-3341:


 Summary: Catalog Manager should stop retrying DeleteTablet when 
receive WRON
 Key: KUDU-3341
 URL: https://issues.apache.org/jira/browse/KUDU-3341
 Project: Kudu
  Issue Type: Improvement
Reporter: YifanZhang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KUDU-2915) Support to delete dead tservers from CLI

2021-11-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449978#comment-17449978
 ] 

YifanZhang commented on KUDU-2915:
--

I think it would be good to introduce a tool to unregister a dead tablet server 
from the master's in-memory state.

On the other hand, I also want to know whether it would be safe or reasonable 
to make the master take the initiative to forget a tablet server that has been 
in the 'dead' state for 'a long time' with no replicas running on it. If the 
same tablet server comes back again, the master would re-register it in its 
in-memory state. Would that cause any problems?
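
Just to sketch the policy I have in mind (hypothetical names only, not actual 
Kudu master code): forget a tserver only after it has been dead for a 
configurable grace period and it hosts no live replicas.
{code:java}
// Hypothetical sketch; names invented for illustration.
public class DeadTserverRetentionPolicy {
  private final long forgetDeadTserverAfterMs;

  DeadTserverRetentionPolicy(long forgetDeadTserverAfterMs) {
    this.forgetDeadTserverAfterMs = forgetDeadTserverAfterMs;
  }

  /** True if the master may drop this tserver from its in-memory state. */
  boolean shouldForget(long deadForMs, int liveReplicasHosted) {
    return deadForMs >= forgetDeadTserverAfterMs && liveReplicasHosted == 0;
  }

  public static void main(String[] args) {
    DeadTserverRetentionPolicy p = new DeadTserverRetentionPolicy(24L * 3600 * 1000);
    System.out.println(p.shouldForget(25L * 3600 * 1000, 0)); // true: dead long enough, no replicas
    System.out.println(p.shouldForget(25L * 3600 * 1000, 3)); // false: still hosts replicas
  }
}
{code}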

> Support to delete dead tservers from CLI
> 
>
> Key: KUDU-2915
> URL: https://issues.apache.org/jira/browse/KUDU-2915
> Project: Kudu
>  Issue Type: Improvement
>  Components: CLI, ops-tooling
>Affects Versions: 1.10.0
>Reporter: Hexin
>Assignee: Hexin
>Priority: Major
>  Labels: supportability
>
> Sometimes nodes in the cluster will crash due to machine problems such as 
> disk corruption, which can be very common. However, if there are some dead 
> tservers, the ksck result will always show an error (e.g. "Not all Tablet 
> Servers are reachable") even though all tables have recovered to a healthy 
> state.
> Currently, the only way to get a healthy ksck status is to restart all 
> masters one by one. In some cases, for example when the machine is completely 
> corrupted, we would like to get a healthy ksck status without restarting, 
> since after restarting the masters the cluster takes some time to recover, 
> during which scans and upserts to tables are affected. The recovery time can 
> be long, depending mainly on the scale of the cluster. This problem can be 
> serious and annoying, especially when tservers crash frequently in a large 
> cluster.
> It would be valuable to have an easier way to delete dead tservers from the 
> master; I will add a kudu command to realize it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KUDU-3328) Disable move replicas to tablet servers in maintenance mode

2021-10-20 Thread YifanZhang (Jira)
YifanZhang created KUDU-3328:


 Summary: Disable move replicas to tablet servers in maintenance 
mode
 Key: KUDU-3328
 URL: https://issues.apache.org/jira/browse/KUDU-3328
 Project: Kudu
  Issue Type: Improvement
Reporter: YifanZhang


When some tablet servers are put in maintenance mode, new replicas are not 
expected to be added to these tservers, but we can still run `kudu cluster 
rebalance` or `kudu tablet change_config move_replica` to move replicas to the 
tservers under maintenance. These operations should be disabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2064) Overall log cache usage doesn't respect the limit

2021-05-26 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351716#comment-17351716
 ] 

YifanZhang commented on KUDU-2064:
--

I also found that the actual log cache usage exceeded 
log_cache_size_limit/global_log_cache_limit on a tserver's mem-tracker page 
(Kudu version 1.12.0):
||Id||Parent||Limit||Current Consumption||Peak Consumption||
|root|none|none|44.97G|76.44G|
|block_cache-sharded_lru_cache|root|none|40.01G|40.02G|
|server|root|none|2.50G|26.29G|
|log_cache|root|1.00G|2.46G|10.89G|
|log_cache:adbee30f32664a48bc24f80b1e53d425:cbcc9aa7ac9c4167a7ba0b540c95c83a|log_cache|128.00M|854.01M|858.10M|
|log_cache:adbee30f32664a48bc24f80b1e53d425:4b2cbe4fd0d64e7d998a8abddbc1fb47|log_cache|128.00M|793.87M|794.58M|
|log_cache:adbee30f32664a48bc24f80b1e53d425:ea0d65bc2f384757b2259a19829fab9c|log_cache|128.00M|254.86M|429.48M|
|log_cache:adbee30f32664a48bc24f80b1e53d425:65065df878a64d1bae52fcd0bf6a2e45|log_cache|128.00M|215.48M|392.56M|

But the tablet that consumes the largest log cache is TOMBSTONED; I'm not sure 
whether the cache is actually still occupied or the MemTracker just isn't updated.

I also saw some kernel_stack_watchdog traces in the log:
{code:java}
W0526 11:35:35.414122 27289 kernel_stack_watchdog.cc:198] Thread 190027 stuck 
at /home/zhangyifan8/work/kudu-xm/src/kudu/consensus/log.cc:405 for 118ms:
Kernel stack:
[] futex_wait_queue_me+0xc6/0x130
[] futex_wait+0x17b/0x280
[] do_futex+0x106/0x5a0
[] SyS_futex+0x80/0x180
[] system_call_fastpath+0x1c/0x21
[] 0x

User stack:
@ 0x7fe923e72370  (unknown)
@  0x2318d54  kudu::RowOperationsPB::~RowOperationsPB()
@  0x20d0300  kudu::tserver::WriteRequestPB::SharedDtor()
@  0x20d37a8  kudu::tserver::WriteRequestPB::~WriteRequestPB()
@  0x2095703  kudu::consensus::ReplicateMsg::SharedDtor()
@  0x209b038  kudu::consensus::ReplicateMsg::~ReplicateMsg()
@   0xc3d617  kudu::consensus::LogCache::EvictSomeUnlocked()
@   0xc3e052  
_ZNSt17_Function_handlerIFvRKN4kudu6StatusEEZNS0_9consensus8LogCache16AppendOperationsERKSt6vectorI13scoped_refptrINS5_19RefCountedReplicateEESaISA_EERKSt8functionIS4_EEUlS3_E_E9_M_invokeERKSt9_Any_dataS3_
@   0xc89ea9  kudu::log::Log::AppendThread::HandleBatches()
@   0xc8a7ad  kudu::log::Log::AppendThread::ProcessQueue()
@  0x2295cfe  kudu::ThreadPool::DispatchThread()
@  0x228ecaf  kudu::Thread::SuperviseThread()
@ 0x7fe923e6adc5  start_thread
@ 0x7fe92214c73d  __clone
{code}
This often happens when there is a large number of write requests, and it 
results in slow writes.

 

> Overall log cache usage doesn't respect the limit
> -
>
> Key: KUDU-2064
> URL: https://issues.apache.org/jira/browse/KUDU-2064
> Project: Kudu
>  Issue Type: Bug
>  Components: log
>Affects Versions: 1.4.0
>Reporter: Jean-Daniel Cryans
>Priority: Major
>  Labels: data-scalability
>
> Looking at a fairly loaded machine (10TB of data in LBM, close to 10k 
> tablets), I can see in the mem-trackers page that the log cache is using 
> 1.83GB, that it peaked at 2.82GB, with a 1GB limit. It's consistent on other 
> similarly loaded tservers. It's unexpected.
> Looking at the per-tablet breakdown, they all have between 0 and a handful of 
> MBs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3271) Tablet server crashed when handle scan request

2021-04-06 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447
 ] 

YifanZhang edited comment on KUDU-3271 at 4/6/21, 11:34 AM:


[~awong] I have attached the INFO log of that day related to the tablet being 
scanned. The tablet server crashed at about 16:34. At that time a user executed 
the query `select count(1) from xxx`.

An application deletes all records from this table and reloads new data every 
day, but we failed to reproduce this problem by executing the same query 
today. :(

We set the tserver flag `--tablet_history_max_age_sec=10` because users don't 
usually need to read historical data.

 


was (Author: zhangyifan27):
[~awong] I have attached the INFO log of that day related to the tablet being 
scanned. The tablet server crashed at about 16:34. At that time a user executed 
the query `select count(1) from xxx`. An application deletes all records from 
this table and reloads new data every day, but we failed to reproduce this 
problem by executing the same query today.

 

> Tablet server crashed when handle scan request
> --
>
> Key: KUDU-3271
> URL: https://issues.apache.org/jira/browse/KUDU-3271
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: YifanZhang
>Priority: Major
> Attachments: tablet-52a743.log
>
>
> We found that one of the Kudu tablet servers crashed while handling a scan 
> request. The scanned table didn't have any row operations at that time. This 
> issue has only come up once so far.
> Coredump stack is:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0  kudu::tablet::DeltaApplier::HasNext (this=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84
> #1  0x02185900 in kudu::UnionIterator::HasNext (this=) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051
> #2  0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner 
> (this=0x4fea140, scanner_id=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195
> #3  0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, 
> __in_chrg=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179
> #4  kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x60edef0, req=req@entry=0x9582e880, 
> rpc_context=rpc_context@entry=0x8151d7800,     
> result_collector=result_collector@entry=0x7f2d721679f0, 
> has_more_results=has_more_results@entry=0x7f2d721678f9, 
> error_code=error_code@entry=0x7f2d721678fc)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737
> #5  0x009fb009 in kudu::tserver::TabletServiceImpl::Scan 
> (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907
> #6  0x0210f019 in operator() (__args#2=0x8151d7800, 
> __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at 
> /usr/include/c++/4.8.2/functional:2471
> #7  kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
> #8  0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
> #9  0x0228ecaf in operator() (this=0xc1a58c28) at 
> /usr/include/c++/4.8.2/functional:2471
> #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 
> 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 
> 0x7f2de4e6873d in clone () from /lib64/libc.so.6
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3271) Tablet server crashed when handle scan request

2021-04-06 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447
 ] 

YifanZhang edited comment on KUDU-3271 at 4/6/21, 11:21 AM:


[~awong] I have attached the INFO log of that day related to the tablet being 
scanned. The tablet server crashed at about 16:34. At that time a user executed 
the query `select count(1) from xxx`. An application deletes all records from 
this table and reloads new data every day, but we failed to reproduce this 
problem by executing the same query today.

 


was (Author: zhangyifan27):
[~awong] I have attached the INFO log of that day related to the tablet being 
scanned. The tablet server crashed at about 16:34. At that time a user executed 
the query `select count(1) from xxx`, but we failed to reproduce this problem 
by executing the same query today.

> Tablet server crashed when handle scan request
> --
>
> Key: KUDU-3271
> URL: https://issues.apache.org/jira/browse/KUDU-3271
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: YifanZhang
>Priority: Major
> Attachments: tablet-52a743.log
>
>
> We found that one of the Kudu tablet servers crashed while handling a scan 
> request. The scanned table didn't have any row operations at that time. This 
> issue has only come up once so far.
> Coredump stack is:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0  kudu::tablet::DeltaApplier::HasNext (this=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84
> #1  0x02185900 in kudu::UnionIterator::HasNext (this=) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051
> #2  0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner 
> (this=0x4fea140, scanner_id=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195
> #3  0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, 
> __in_chrg=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179
> #4  kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x60edef0, req=req@entry=0x9582e880, 
> rpc_context=rpc_context@entry=0x8151d7800,     
> result_collector=result_collector@entry=0x7f2d721679f0, 
> has_more_results=has_more_results@entry=0x7f2d721678f9, 
> error_code=error_code@entry=0x7f2d721678fc)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737
> #5  0x009fb009 in kudu::tserver::TabletServiceImpl::Scan 
> (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907
> #6  0x0210f019 in operator() (__args#2=0x8151d7800, 
> __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at 
> /usr/include/c++/4.8.2/functional:2471
> #7  kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
> #8  0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
> #9  0x0228ecaf in operator() (this=0xc1a58c28) at 
> /usr/include/c++/4.8.2/functional:2471
> #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 
> 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 
> 0x7f2de4e6873d in clone () from /lib64/libc.so.6
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3271) Tablet server crashed when handle scan request

2021-04-06 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315447#comment-17315447
 ] 

YifanZhang commented on KUDU-3271:
--

[~awong] I have attached the INFO log of that day related to the tablet being 
scanned. The tablet server crashed at about 16:34. At that time a user executed 
the query `select count(1) from xxx`, but we failed to reproduce this problem 
by executing the same query today.

> Tablet server crashed when handle scan request
> --
>
> Key: KUDU-3271
> URL: https://issues.apache.org/jira/browse/KUDU-3271
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: YifanZhang
>Priority: Major
> Attachments: tablet-52a743.log
>
>
> We found that one of the Kudu tablet servers crashed while handling a scan 
> request. The scanned table didn't have any row operations at that time. This 
> issue has only come up once so far.
> Coredump stack is:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0  kudu::tablet::DeltaApplier::HasNext (this=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84
> #1  0x02185900 in kudu::UnionIterator::HasNext (this=) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051
> #2  0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner 
> (this=0x4fea140, scanner_id=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195
> #3  0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, 
> __in_chrg=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179
> #4  kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x60edef0, req=req@entry=0x9582e880, 
> rpc_context=rpc_context@entry=0x8151d7800,     
> result_collector=result_collector@entry=0x7f2d721679f0, 
> has_more_results=has_more_results@entry=0x7f2d721678f9, 
> error_code=error_code@entry=0x7f2d721678fc)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737
> #5  0x009fb009 in kudu::tserver::TabletServiceImpl::Scan 
> (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907
> #6  0x0210f019 in operator() (__args#2=0x8151d7800, 
> __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at 
> /usr/include/c++/4.8.2/functional:2471
> #7  kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
> #8  0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
> #9  0x0228ecaf in operator() (this=0xc1a58c28) at 
> /usr/include/c++/4.8.2/functional:2471
> #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 
> 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 
> 0x7f2de4e6873d in clone () from /lib64/libc.so.6
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3271) Tablet server crashed when handle scan request

2021-04-06 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3271:
-
Attachment: tablet-52a743.log

> Tablet server crashed when handle scan request
> --
>
> Key: KUDU-3271
> URL: https://issues.apache.org/jira/browse/KUDU-3271
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.12.0
>Reporter: YifanZhang
>Priority: Major
> Attachments: tablet-52a743.log
>
>
> We found that one of the Kudu tablet servers crashed while handling a scan 
> request. The scanned table didn't have any row operations at that time. This 
> issue has only come up once so far.
> Coredump stack is:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> (gdb) bt
> #0  kudu::tablet::DeltaApplier::HasNext (this=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84
> #1  0x02185900 in kudu::UnionIterator::HasNext (this=) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051
> #2  0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner 
> (this=0x4fea140, scanner_id=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195
> #3  0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, 
> __in_chrg=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179
> #4  kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x60edef0, req=req@entry=0x9582e880, 
> rpc_context=rpc_context@entry=0x8151d7800,     
> result_collector=result_collector@entry=0x7f2d721679f0, 
> has_more_results=has_more_results@entry=0x7f2d721678f9, 
> error_code=error_code@entry=0x7f2d721678fc)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737
> #5  0x009fb009 in kudu::tserver::TabletServiceImpl::Scan 
> (this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907
> #6  0x0210f019 in operator() (__args#2=0x8151d7800, 
> __args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at 
> /usr/include/c++/4.8.2/functional:2471
> #7  kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
> #8  0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
> #9  0x0228ecaf in operator() (this=0xc1a58c28) at 
> /usr/include/c++/4.8.2/functional:2471
> #10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 
> 0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 
> 0x7f2de4e6873d in clone () from /lib64/libc.so.6
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3271) Tablet server crashed when handle scan request

2021-03-31 Thread YifanZhang (Jira)
YifanZhang created KUDU-3271:


 Summary: Tablet server crashed when handle scan request
 Key: KUDU-3271
 URL: https://issues.apache.org/jira/browse/KUDU-3271
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.12.0
Reporter: YifanZhang


We found that one of the Kudu tablet servers crashed while handling a scan 
request. The scanned table didn't have any row operations at that time. This 
issue has only come up once so far.

Coredump stack is:
{code:java}
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  kudu::tablet::DeltaApplier::HasNext (this=) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tablet/delta_applier.cc:84
#1  0x02185900 in kudu::UnionIterator::HasNext (this=) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:1051
#2  0x00a2ea8f in kudu::tserver::ScannerManager::UnregisterScanner 
(this=0x4fea140, scanner_id=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.cc:195
#3  0x009e7adf in ~ScopedUnregisterScanner (this=0x7f2d72167610, 
__in_chrg=) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/scanners.h:179
#4  kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
(this=this@entry=0x60edef0, req=req@entry=0x9582e880, 
rpc_context=rpc_context@entry=0x8151d7800,     
result_collector=result_collector@entry=0x7f2d721679f0, 
has_more_results=has_more_results@entry=0x7f2d721678f9, 
error_code=error_code@entry=0x7f2d721678fc)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2737
#5  0x009fb009 in kudu::tserver::TabletServiceImpl::Scan 
(this=0x60edef0, req=0x9582e880, resp=0xb87b16de0, context=0x8151d7800)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1907
#6  0x0210f019 in operator() (__args#2=0x8151d7800, 
__args#1=0xb87b16de0, __args#0=, this=0x4e0c7708) at 
/usr/include/c++/4.8.2/functional:2471
#7  kudu::rpc::GeneratedServiceIf::Handle (this=0x60edef0, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#8  0x0210fcd9 in kudu::rpc::ServicePool::RunThread (this=0x50fb9e0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#9  0x0228ecaf in operator() (this=0xc1a58c28) at 
/usr/include/c++/4.8.2/functional:2471
#10 kudu::Thread::SuperviseThread (arg=0xc1a58c00) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:674#11 
0x7f2de6b8adc5 in start_thread () from /lib64/libpthread.so.0#12 
0x7f2de4e6873d in clone () from /lib64/libc.so.6
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-10-09 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3198:
-
Description: 
We recently got an error when deleting full rows from a table with 64 columns 
using Spark SQL; however, if we drop a column from the table, the error does 
not appear. The error is:
{code:java}
Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
Unknown row operation type (error 0){code}
I tested this by deleting a full row from a table with 64 columns using Java 
client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, 
server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE should 
not have a value for column: c63 STRING NULLABLE (error 0)
{code}
If the row has values set for all columns, I got an error like:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
{code}
I also tested tables with different numbers of columns. The weird thing is that 
I could delete full rows from tables with 8/16/32/63/65 columns, but couldn't 
do this if the table has 64/128 columns.
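
For reference, a minimal repro sketch along the lines of the test above. The 
master address, table name, and column names are hypothetical; it assumes a 
64-column table with an INT32 "key" plus nullable STRING columns c1..c63.
{code:java}
import org.apache.kudu.client.*;

// Hedged repro sketch: names are made up; only the symptom described above is shown.
public class DeleteFullRowRepro {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master-1:7051").build();
    try {
      KuduTable table = client.openTable("test_64_columns");
      KuduSession session = client.newSession();   // default AUTO_FLUSH_SYNC

      Delete delete = table.newDelete();
      PartialRow row = delete.getRow();
      row.addInt("key", 1);                 // primary key column
      for (int i = 1; i < 64; i++) {
        row.addString("c" + i, "v");        // a value set on every non-key column
      }
      OperationResponse resp = session.apply(delete);
      if (resp.hasRowError()) {
        // With 64 columns this is where "Not enough data for column ..." shows up.
        System.err.println(resp.getRowError());
      }
    } finally {
      client.close();
    }
  }
}
{code}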

  was:
We recently got an error when deleting full rows from a table with 64 columns 
using Spark SQL; however, if we delete a column, the error does not appear. The 
error is:
{code:java}
Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
Unknown row operation type (error 0){code}
I tested this by deleting a full row from a table with 64 columns using Java 
client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, 
server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE should 
not have a value for column: c63 STRING NULLABLE (error 0)
{code}
If the row has values set for all columns, I got an error like:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
{code}
I also tested tables with different numbers of columns. The weird thing is that 
I could delete full rows from tables with 8/16/32/63/65 columns, but couldn't 
do this if the table has 64/128 columns.


> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.12.0, 1.13.0
>Reporter: YifanZhang
>Priority: Major
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we drop a column from the table, the error does 
> not appear. The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested tables with different numbers of columns. The weird thing is 
> that I could delete full rows from tables with 8/16/32/63/65 columns, but 
> couldn't do this if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-09-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148
 ] 

YifanZhang edited comment on KUDU-3198 at 9/28/20, 10:50 AM:
-

As the Java client allows DELETE operations with extra columns set while the 
C++ client doesn't support this, I found there may be a problem in encodeRow:

[https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]

java api doc of BitSet.clear(int fromIndex, int toIndex):
{quote}public void clear(int fromIndex, int toIndex)

Sets the bits from the specified {{fromIndex}} (inclusive) to the specified 
{{toIndex}} (exclusive) to {{false}}.
{quote}
It seems that the last non-key field would not be cleared. But why does it work 
fine with non-64-column tables?
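
To make the exclusive {{toIndex}} concrete, here is a tiny standalone demo. The 
column counts are just illustrative and this is not the kudu-client code itself; 
it only shows that if the last column index is passed as {{toIndex}}, that 
column's bit survives the clear.
{code:java}
import java.util.BitSet;

// Standalone demo of BitSet.clear(fromIndex, toIndex) semantics only.
public class BitSetClearDemo {
  public static void main(String[] args) {
    int numColumns = 64;      // hypothetical table width
    int numKeyColumns = 1;    // hypothetical single-column primary key

    BitSet columnsBitSet = new BitSet(numColumns);
    columnsBitSet.set(0, numColumns);  // every column marked as set

    // toIndex is exclusive, so bit 63 (the last non-key column) stays set.
    columnsBitSet.clear(numKeyColumns, numColumns - 1);

    System.out.println(columnsBitSet);  // prints {0, 63}
  }
}
{code}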


was (Author: zhangyifan27):
As the Java client allows DELETE operations with extra columns set while the 
C++ client doesn't support this, I found there may be a problem in encodeRow:

[https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]

java api doc of BitSet.clear(int fromIndex, int toIndex):
{quote}public void clear(int fromIndex, int toIndex)

Sets the bits from the specified {{fromIndex}} (inclusive) to the specified 
{{toIndex}} (exclusive) to {{false}}.
{quote}
But why does it work well with non-64-column tables?

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.12.0, 1.13.0
>Reporter: YifanZhang
>Priority: Major
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column, the error does not appear. 
> The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested tables with different numbers of columns. The weird thing is 
> that I could delete full rows from tables with 8/16/32/63/65 columns, but 
> couldn't do this if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-09-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148
 ] 

YifanZhang commented on KUDU-3198:
--

As the Java client allows DELETE operations with extra columns set while the 
C++ client doesn't support this, I found there may be a problem in encodeRow:

[https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]

java api doc of BitSet.clear(int fromIndex, int toIndex):
{quote}public void clear(int fromIndex, int toIndex)Sets the bits from the 
specified {{fromIndex}} (inclusive) to the specified {{toIndex}} (exclusive) to 
{{false}}.{quote}
But why does it work well with non-64-column tables?

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.12.0, 1.13.0
>Reporter: YifanZhang
>Priority: Major
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column, the error does not appear. 
> The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested tables with different numbers of columns. The weird thing is 
> that I could delete full rows from tables with 8/16/32/63/65 columns, but 
> couldn't do this if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-09-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203148#comment-17203148
 ] 

YifanZhang edited comment on KUDU-3198 at 9/28/20, 10:43 AM:
-

As the Java client allows DELETE operations with extra columns set while the 
C++ client doesn't support this, I found there may be a problem in encodeRow:

[https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]

java api doc of BitSet.clear(int fromIndex, int toIndex):
{quote}public void clear(int fromIndex, int toIndex)

Sets the bits from the specified {{fromIndex}} (inclusive) to the specified 
{{toIndex}} (exclusive) to {{false}}.
{quote}
But why does it work well with non-64-column tables?


was (Author: zhangyifan27):
As the Java client allows DELETE operations with extra columns set while the 
C++ client doesn't support this, I found there may be a problem in encodeRow:

[https://github.com/apache/kudu/blob/07eb5a97f17046f6ee61b2a28bdfbe578d3f6d2b/java/kudu-client/src/main/java/org/apache/kudu/client/Operation.java#L365-L373]

java api doc of BitSet.clear(int fromIndex, int toIndex):
{quote}public void clear(int fromIndex, int toIndex)Sets the bits from the 
specified {{fromIndex}} (inclusive) to the specified {{toIndex}} (exclusive) to 
{{false}}.{quote}
But why does it work well with non-64-column tables?

> Unable to delete a full row from a table with 64 columns when using java 
> client
> ---
>
> Key: KUDU-3198
> URL: https://issues.apache.org/jira/browse/KUDU-3198
> Project: Kudu
>  Issue Type: Bug
>  Components: java
>Affects Versions: 1.12.0, 1.13.0
>Reporter: YifanZhang
>Priority: Major
>
> We recently got an error when deleting full rows from a table with 64 columns 
> using Spark SQL; however, if we delete a column, the error does not appear. 
> The error is:
> {code:java}
> Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
> Unknown row operation type (error 0){code}
> I tested this by deleting a full row from a table with 64 columns using Java 
> client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, 
> server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE 
> should not have a value for column: c63 STRING NULLABLE (error 0)
> {code}
> If the row has values set for all columns, I got an error like:
> {code:java}
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
> status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
> {code}
> I also tested tables with different numbers of columns. The weird thing is 
> that I could delete full rows from tables with 8/16/32/63/65 columns, but 
> couldn't do this if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3198) Unable to delete a full row from a table with 64 columns when using java client

2020-09-28 Thread YifanZhang (Jira)
YifanZhang created KUDU-3198:


 Summary: Unable to delete a full row from a table with 64 columns 
when using java client
 Key: KUDU-3198
 URL: https://issues.apache.org/jira/browse/KUDU-3198
 Project: Kudu
  Issue Type: Bug
  Components: java
Affects Versions: 1.13.0, 1.12.0
Reporter: YifanZhang


We recently got an error when deleting full rows from a table with 64 columns 
using Spark SQL; however, if we delete a column, the error does not appear. The 
error is:
{code:java}
Failed to write at least 1000 rows to Kudu; Sample errors: Not implemented: 
Unknown row operation type (error 0){code}
I tested this by deleting a full row from a table with 64 columns using Java 
client 1.12.0/1.13.0. If the row has some columns set to NULL, I got an error:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, 
server=d584b3407ea444519e91b32f2744b162, status=Invalid argument: DELETE should 
not have a value for column: c63 STRING NULLABLE (error 0)
{code}
If the row has values set for all columns, I got an error like:
{code:java}
Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, 
status=Corruption: Not enough data for column: c63 STRING NULLABLE (error 0)
{code}
I also tested tables with different numbers of columns. The weird thing is that 
I could delete full rows from tables with 8/16/32/63/65 columns, but couldn't 
do this if the table has 64/128 columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (KUDU-2879) Build hangs in DEBUG type on Ubuntu 18.04

2020-09-06 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang closed KUDU-2879.

Resolution: Cannot Reproduce

> Build hangs in DEBUG type on Ubuntu 18.04
> -
>
> Key: KUDU-2879
> URL: https://issues.apache.org/jira/browse/KUDU-2879
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Yingchun Lai
>Priority: Major
> Attachments: config.diff, config.log
>
>
> A few months ago, I reported this issue on Slack: 
> [https://getkudu.slack.com/archives/C0CPXJ3CH/p1549942641041600]
> I switched to the RELEASE build type from then on, and haven't tried a DEBUG 
> build in my Ubuntu environment since.
> Now, when I tried a DEBUG build to check 1.10.0-RC2, this issue occurred 
> again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-14 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang resolved KUDU-3180.
--
Fix Version/s: 1.13.0
   Resolution: Fixed

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Major
> Fix For: 1.13.0
>
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-14 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3180:


Assignee: YifanZhang

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-10 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173104#comment-17173104
 ] 

YifanZhang edited comment on KUDU-3180 at 8/10/20, 10:57 AM:
-

If we lower {{-memory_pressure_percentage}}, we should also lower 
{{-block_cache_capacity_mb}} accordingly, and then we may not make full use of 
the memory resources.

In fact, for most of the day the memory usage of our Kudu servers is not very 
high (about 50%), but there are a lot of inserts/updates during an hour or two 
and the memory usage grows significantly. At that time Kudu did flush big 
MRSs/DMSs in priority, but sometimes OOM still occurred, even though we have 
tuned {{-maintenance_manager_num_threads}} to 20. After we tuned 
{{-flush_threshold_secs}} to 1800 (it was 3600 before), we could avoid OOM, but 
I found that the {{average_diskrowset_height}} of most tablets became larger, 
which means these tablets need more compaction.

In general we want to prioritize flushes so we can free more memory, but we 
also don't want to get more small DRSs. So maybe prioritizing bigger MRS/DMS 
flushes would help.

Maybe we could use {{max(memory_size, time_since_last_flush)}} to define the 
perf improvement of a mem-store flush, so that both big mem-stores and 
long-lived mem-stores could be flushed in priority, as in the sketch below.
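
Just to illustrate the idea, here is a rough sketch of both scoring variants 
discussed in this thread. The names and the normalization against the 
thresholds are invented for illustration; this is not Kudu's actual 
maintenance-manager code.
{code:java}
// Rough sketch of the max() and product scoring ideas; not actual Kudu code.
public class FlushScoreSketch {
  // Score relative to the configured thresholds so that either a big
  // mem-store or a long-unflushed mem-store ranks high.
  static double maxScore(long anchoredBytes, long secsSinceFlush,
                         long thresholdBytes, long thresholdSecs) {
    double sizeScore = (double) anchoredBytes / thresholdBytes;
    double ageScore = (double) secsSinceFlush / thresholdSecs;
    return Math.max(sizeScore, ageScore);
  }

  // Alternative: weight age by anchored memory so tiny-but-old mem-stores
  // no longer outrank big ones.
  static double productScore(long anchoredBytes, long secsSinceFlush) {
    return (double) anchoredBytes * secsSinceFlush;
  }

  public static void main(String[] args) {
    long mb = 1024 * 1024;
    // A 2 MB mem-store unflushed for 1800s vs. a 64 MB one unflushed for 600s.
    System.out.println(maxScore(2 * mb, 1800, 32 * mb, 1800));   // 1.0
    System.out.println(maxScore(64 * mb, 600, 32 * mb, 1800));   // 2.0
    System.out.println(productScore(2 * mb, 1800) < productScore(64 * mb, 600)); // true
  }
}
{code}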

 


was (Author: zhangyifan27):
If we lower {{-memory_pressure_percentage}}, we should also lower 
{{-block_cache_capacity_mb}} accordingly, which may not make full use of the 
memory resources.

In fact, for most of the day the memory usage of our Kudu servers is not very 
high (about 50%), but there are a lot of inserts/updates during an hour or two 
and the memory usage grows significantly. At that time Kudu did flush big 
MRSs/DMSs in priority, but sometimes OOM still occurred, even though we have 
tuned {{-maintenance_manager_num_threads}} to 20. After we tuned 
{{-flush_threshold_secs}} to 1800 (it was 3600 before), we could avoid OOM, but 
I found that the {{average_diskrowset_height}} of most tablets became larger, 
which means these tablets need more compaction.

In general we want to prioritize flushes so we can free more memory, but we 
also don't want to get more small DRSs. So maybe prioritizing bigger MRS/DMS 
flushes would help.

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-07 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173104#comment-17173104
 ] 

YifanZhang commented on KUDU-3180:
--

If we lower {{-memory_pressure_percentage}}, we should also lower 
{{-block_cache_capacity_mb}} accordingly, which may not make full use of the 
memory resources.

In fact, for most of the day the memory usage of our Kudu servers is not very 
high (about 50%), but there are a lot of inserts/updates during an hour or two 
and the memory usage grows significantly. At that time Kudu did flush big 
MRSs/DMSs in priority, but sometimes OOM still occurred, even though we have 
tuned {{-maintenance_manager_num_threads}} to 20. After we tuned 
{{-flush_threshold_secs}} to 1800 (it was 3600 before), we could avoid OOM, but 
I found that the {{average_diskrowset_height}} of most tablets became larger, 
which means these tablets need more compaction.

In general we want to prioritize flushes so we can free more memory, but we 
also don't want to get more small DRSs. So maybe prioritizing bigger MRS/DMS 
flushes would help.

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-07 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3180:
-
Issue Type: Improvement  (was: Bug)

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Improvement
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-07 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172925#comment-17172925
 ] 

YifanZhang commented on KUDU-3180:
--

Thanks [~aserbin].

I agree that using {{memory_size * time_since_last_flush}} instead of just 
{{time_since_last_flush}} to pick which MRS should be flushed is an easy way to 
improve the current policy. Also, if we prefer flushes to compactions, the 
current policy ensures that if an MRS is over {{flush_threshold_mb}}, a flush 
will be more likely to be selected than a compaction.

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> we haven't flushed the tablet in a long time, which may lead to starvation of 
> ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, 
> and found that some small MRS/DMS flushes have a higher perf score than big 
> MRS/DMS flushes and compactions, which seems unreasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-06 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172082#comment-17172082
 ] 

YifanZhang commented on KUDU-3180:
--

[~awong] Thanks for your comments!

The Kudu cluster shown in the screenshot is version 1.11.1, and I also found 
mem-stores in a 1.12.0 cluster that anchor 0B of WAL on the /maintenance-manager 
page; maybe the anchored log size is less than 1B and, since it is reported as 
an integer, it shows up as 0B.

We mainly want to trade off the memory used by mem-stores against rowset size 
on disk. If we flush frequently we get some small DRSs and need to do more 
compactions; if we don't flush frequently, mem-stores anchor more memory. So 
defining a cost function based on the time since the last flush and the memory 
used might be useful.

From what I saw on the /maintenance-manager page, it's not always true that 
older or larger mem-stores anchor more WAL bytes, so maybe we shouldn't always 
use anchored WAL bytes to determine what to flush.

In our case we are running low on memory, which I think is more common than 
running low on WAL disk space because the OS allocates memory for various ops. 
If we want to free more WAL disk space, lowering --log_target_replay_size_mb 
should be effective. 

> kudu don't always prefer to flush MRS/DMS that anchor more memory
> -
>
> Key: KUDU-3180
> URL: https://issues.apache.org/jira/browse/KUDU-3180
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
> Attachments: image-2020-08-04-20-26-53-749.png, 
> image-2020-08-04-20-28-00-665.png
>
>
> The current time-based flush policy always gives a flush op a high score if 
> the tablet hasn't been flushed for a long time, which may lead to starvation 
> of ops that could free more memory.
> We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, and 
> found that some small MRS/DMS flushes had a higher perf score than big MRS/DMS 
> flushes and compactions, which does not seem reasonable.
> !image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3180) kudu don't always prefer to flush MRS/DMS that anchor more memory

2020-08-04 Thread YifanZhang (Jira)
YifanZhang created KUDU-3180:


 Summary: kudu don't always prefer to flush MRS/DMS that anchor 
more memory
 Key: KUDU-3180
 URL: https://issues.apache.org/jira/browse/KUDU-3180
 Project: Kudu
  Issue Type: Bug
Reporter: YifanZhang
 Attachments: image-2020-08-04-20-26-53-749.png, 
image-2020-08-04-20-28-00-665.png

The current time-based flush policy always gives a flush op a high score if 
the tablet hasn't been flushed for a long time, which may lead to starvation of 
ops that could free more memory.

We set -flush_threshold_mb=32 and -flush_threshold_secs=1800 in a cluster, and 
found that some small MRS/DMS flushes had a higher perf score than big MRS/DMS 
flushes and compactions, which does not seem reasonable.

!image-2020-08-04-20-26-53-749.png|width=1424,height=317!!image-2020-08-04-20-28-00-665.png|width=1414,height=327!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3150) UI tables page sorts tablet count column incorrectly.

2020-06-18 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139177#comment-17139177
 ] 

YifanZhang commented on KUDU-3150:
--

I think this was fixed in 1d1a85804b8ce132021661a8fdb053141c2781c, but it 
wasn't cherry-picked into 1.12.

> UI tables page sorts tablet count column incorrectly. 
> --
>
> Key: KUDU-3150
> URL: https://issues.apache.org/jira/browse/KUDU-3150
> Project: Kudu
>  Issue Type: Bug
>  Components: ui
>Reporter: Grant Henke
>Priority: Major
>  Labels: beginner, supportability
>
> It looks like the tables page in the master web UI sorts the "Tablet Count" 
> column incorrectly. I think it must be sorting lexicographically instead of 
> numerically. This was especially evident recently when 5.49k tablets was not 
> sorted to the top in a cluster.
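
For illustration, a small standalone C++ example of the difference described 
above (the web UI itself sorts in JavaScript; this snippet only contrasts 
lexicographic vs. numeric ordering of the same counts):
{code}
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Formatted "Tablet Count" cells as they appear in the table.
  std::vector<std::string> cells = {"5.49k", "993", "1.2k", "87"};
  // Lexicographic descending sort (what the UI effectively did): "993" wins.
  std::sort(cells.begin(), cells.end(), std::greater<std::string>());
  for (const auto& c : cells) std::cout << c << ' ';   // 993 87 5.49k 1.2k
  std::cout << '\n';

  // Numeric descending sort on the raw counts (the intended behavior).
  std::vector<int> counts = {5490, 993, 1200, 87};
  std::sort(counts.begin(), counts.end(), std::greater<int>());
  for (int c : counts) std::cout << c << ' ';          // 5490 1200 993 87
  std::cout << '\n';
  return 0;
}
{code}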



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KUDU-2879) Build hangs in DEBUG type on Ubuntu 18.04

2020-05-25 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reopened KUDU-2879:
--

I hit this issue again.

I can build the DEBUG type of Kudu 1.12.0 successfully on CentOS 7.3, but when 
I try to run any binary in build/debug/bin, it hangs.

The pstack is:
{code:java}
#0  0x7febae90ce40 in __nanosleep_nocancel () at 
../sysdeps/unix/syscall-template.S:81
#1  0x7febadb08849 in base::internal::SpinLockDelay(int volatile*, int, 
int) () from /usr/lib64/libprofiler.so.0
#2  0x7febadb087cf in SpinLock::SlowLock() () from 
/usr/lib64/libprofiler.so.0
#3  0x7febabd1ef08 in tcmalloc::ThreadCache::InitModule() () from 
/usr/lib64/libtcmalloc.so.4
#4  0x7febabd1effd in tcmalloc::ThreadCache::CreateCacheIfNecessary() () 
from /usr/lib64/libtcmalloc.so.4
#5  0x7febabd2d325 in tc_calloc () from /usr/lib64/libtcmalloc.so.4
#6  0x7febabaf4550 in _dlerror_run (operate=operate@entry=0x7febabaf3ff0 
, args=args@entry=0x7fffc9085dd0) at dlerror.c:141
#7  0x7febabaf4058 in __dlsym (handle=, name=) at dlsym.c:70
#8  0x7febac4925ae in (anonymous namespace)::dlsym_or_die 
(sym=0x7febac5b29eb "dlopen") at 
/home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:74
#9  0x7febac4926d2 in (anonymous namespace)::InitIfNecessary () at 
/home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:100
#10 0x7febac49280a in dl_iterate_phdr (callback=0x7feba9c2f280 
<_Unwind_IteratePhdrCallback>, data=0x7fffc9085ed0) at 
/home/zhangyifan8/work/kudu/src/kudu/util/debug/unwind_safeness.cc:158
#11 0x7feba9c2fbbf in _Unwind_Find_FDE (pc=0x7feba9c2df87 
<_Unwind_Backtrace+55>, bases=bases@entry=0x7fffc9086228) at 
../../../libgcc/unwind-dw2-fde-dip.c:461
#12 0x7feba9c2cd2c in uw_frame_state_for 
(context=context@entry=0x7fffc9086180, fs=fs@entry=0x7fffc9085fd0) at 
../../../libgcc/unwind-dw2.c:1245
#13 0x7feba9c2d6ed in uw_init_context_1 
(context=context@entry=0x7fffc9086180, 
outer_cfa=outer_cfa@entry=0x7fffc9086430, outer_ra=0x7febadb071da) at 
../../../libgcc/unwind-dw2.c:1566
#14 0x7feba9c2df88 in _Unwind_Backtrace (trace=0x7febadb07410, 
trace_argument=0x7fffc9086430) at ../../../libgcc/unwind.inc:283
#15 0x7febadb071da in ?? () from /usr/lib64/libprofiler.so.0
#16 0x7febadb078e4 in GetStackTrace(void**, int, int) () from 
/usr/lib64/libprofiler.so.0
#17 0x7febabd1c386 in tcmalloc::PageHeap::GrowHeap(unsigned long) () from 
/usr/lib64/libtcmalloc.so.4
#18 0x7febabd1c613 in tcmalloc::PageHeap::New(unsigned long) () from 
/usr/lib64/libtcmalloc.so.4
#19 0x7febabd1b139 in tcmalloc::CentralFreeList::Populate() () from 
/usr/lib64/libtcmalloc.so.4
#20 0x7febabd1b338 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, 
void**, void**) () from /usr/lib64/libtcmalloc.so.4
#21 0x7febabd1b3d0 in tcmalloc::CentralFreeList::RemoveRange(void**, 
void**, int) () from /usr/lib64/libtcmalloc.so.4
#22 0x7febabd1e2a7 in tcmalloc::ThreadCache::FetchFromCentralCache(unsigned 
int, int) () from /usr/lib64/libtcmalloc.so.4
#23 0x7febabd2ce16 in tcmalloc::allocate_full_malloc_oom(unsigned long) () 
from /usr/lib64/libtcmalloc.so.4
#24 0x7feba98bdb6d in __fopen_internal (filename=0x7feba8105f37 
"/proc/filesystems", mode=0x7feba8105da1 "r", is32=1) at iofopen.c:69
#25 0x7feba80f7956 in selinuxfs_exists () from /usr/lib64/libselinux.so.1
#26 0x7feba80efc28 in init_lib () from /usr/lib64/libselinux.so.1
#27 0x7febb3d69973 in call_init (env=0x7fffc9086828, argv=0x7fffc9086818, 
argc=1, l=) at dl-init.c:82
#28 _dl_init (main_map=0x7febb3f7d150, argc=1, argv=0x7fffc9086818, 
env=0x7fffc9086828) at dl-init.c:131
#29 0x7febb3d5b15a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#30 0x0001 in ?? ()
#31 0x7fffc9087195 in ?? ()
#32 0x in ?? ()
{code}
It works well when built with the RELEASE type, and this did not happen when 
building Kudu 1.11.1 with the DEBUG type.

> Build hangs in DEBUG type on Ubuntu 18.04
> -
>
> Key: KUDU-2879
> URL: https://issues.apache.org/jira/browse/KUDU-2879
> Project: Kudu
>  Issue Type: Improvement
>Reporter: Yingchun Lai
>Priority: Major
> Attachments: config.diff, config.log
>
>
> A few months ago, I reported this issue on Slack: 
> [https://getkudu.slack.com/archives/C0CPXJ3CH/p1549942641041600]
> I have used the RELEASE type since then and haven't tried a DEBUG build in my 
> Ubuntu environment.
> Now, when I tried a DEBUG build to check 1.10.0-RC2, this issue occurred 
> again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3121) Allow users to pick the next best op

2020-05-13 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106875#comment-17106875
 ] 

YifanZhang commented on KUDU-3121:
--

Maybe this issue is similar to KUDU-2824; now we can use the 'kudu table 
set_flag' tool to give some tables a high priority in MM compaction.

Also, following [~wdberkeley_impala_f7d4]'s suggestion in 
[https://gerrit.cloudera.org/c/12852/], we could improve the maintenance 
manager by accounting for how often tablets are read or written.
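
For illustration only, a minimal sketch of the kind of workload-aware weighting 
such a change could apply; every name, parameter, and constant here is 
hypothetical, not Kudu's actual scheduling code:
{code}
#include <cstdint>

// Illustrative sketch only: weight an op's raw perf score by a per-table
// priority and by how much read/write traffic the tablet has seen recently,
// so that hot, high-priority tablets get maintenance work scheduled sooner.
double WeightedPerfScore(double raw_perf_score,
                         int table_priority,
                         uint64_t recent_reads,
                         uint64_t recent_writes) {
  const double priority_factor = 1.0 + 0.1 * table_priority;
  const double workload_factor =
      1.0 + static_cast<double>(recent_reads + recent_writes) / 1e6;
  return raw_perf_score * priority_factor * workload_factor;
}
{code}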

> Allow users to pick the next best op
> 
>
> Key: KUDU-3121
> URL: https://issues.apache.org/jira/browse/KUDU-3121
> Project: Kudu
>  Issue Type: New Feature
>  Components: ops-tooling
>Reporter: Andrew Wong
>Priority: Major
>
> Time and again, we'll see a case where the maintenance manager scheduler 
> thread is, for whatever reason, scheduling an op that is actually not that 
> helpful. KUDU-2929, KUDU-3002, and KUDU-1400 come to mind.
> It might be convenient in some cases to temporarily (maybe for a single round 
> of scheduling) give a specific tablet or op priority.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3108) Tablet server crashes when handle diffscan request

2020-04-21 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088805#comment-17088805
 ] 

YifanZhang commented on KUDU-3108:
--

[~granthenke] Thanks for your reply.

The OS version is CentOS 7.3.

Some non-default configurations of tablet server are:
{code:java}
log_target_replay_size_mb = 128
maintenance_manager_num_threads = 10
maintenance_manager_polling_interval_ms = 50
memory_limit_hard_bytes = 107374182475
memory_limit_soft_percentage = 85
memory_pressure_percentage = 80
num_tablets_to_open_simultaneously = 20
redact = none
rpc_authentication = disabled
rpc_bind_addresses = 0.0.0.0:14100
rpc_encryption = disabled
rpc_num_service_threads = 128
rpc_service_queue_length = 1024
server_thread_pool_max_thread_count = 128
tablet_history_max_age_sec = 10
tablet_transaction_memory_limit_mb = 1024
unlock_experimental_flags = true
vmodule = maintenance=2      
{code}
`tablet_history_max_age_sec` was set to 10.

I ran the first full backup job right after setting the tables' 
history_max_age_sec configuration. The setting seemed to succeed, with no 
timeouts or other errors, and the first full backup jobs succeeded. I ran an 
incremental backup job of these tables after about a day and a half.

The non-default flag of the backup job is --numParallelBackups 10.

I tried running this incremental backup job once more, and the crash happened 
again. Because it's a production cluster, I didn't try many more times.

Besides, there are row delete operations on the backed-up tables all the time.

I hope the provided information helps.

> Tablet server crashes when handle diffscan request 
> ---
>
> Key: KUDU-3108
> URL: https://issues.apache.org/jira/browse/KUDU-3108
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> When we did an incremental backup of tables in a cluster with 20 tservers, 
> 3 tservers crashed; the coredump stacks are the same:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> #0  kudu::Schema::Compare 
> (this=0x25b883680, lhs=..., rhs=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
> 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file 
> or directory.
> Missing separate debuginfos, use: debuginfo-install 
> bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 
> elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 
> keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
> libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 
> libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
> libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 
> ncurses-libs-5.9-13.20130511.el7.x86_64 
> nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
> openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
> systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 
> zlib-1.2.7-17.el7.x86_64
> (gdb) bt
> #0  kudu::Schema::Compare 
> (this=0x25b883680, lhs=..., rhs=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
> #1  0x01da51fb in kudu::MergeIterator::RefillHotHeap 
> (this=this@entry=0x78f6ec500) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
> #2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap 
> (this=this@entry=0x78f6ec500, state=0xd1661a000, 
> num_rows_to_advance=num_rows_to_advance@entry=1)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
> #3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow 
> (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, 
> dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
> #4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, 
> dst=0x7f0d5cc9ffc0) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
> #5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock 
> (this=, dst=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
> #6  0x0095475c in 
> kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720,     
> 

[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request

2020-04-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3108:
-
Description: 
When we did an incremental backup of tables in a cluster with 20 tservers, 
3 tservers crashed; the coredump stacks are the same:
{code:java}
Program terminated with signal 11, Segmentation fault.
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or 
directory.
Missing separate debuginfos, use: debuginfo-install 
bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 
elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 
libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 
ncurses-libs-5.9-13.20130511.el7.x86_64 
nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
#1  0x01da51fb in kudu::MergeIterator::RefillHotHeap 
(this=this@entry=0x78f6ec500) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
#2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap 
(this=this@entry=0x78f6ec500, state=0xd1661a000, 
num_rows_to_advance=num_rows_to_advance@entry=1)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
#3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow 
(this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, 
dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
#4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, 
dst=0x7f0d5cc9ffc0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
#5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock 
(this=, dst=) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
#6  0x0095475c in 
kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
(this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
has_more_results=has_more_results@entry=0x7f0d5cca0886,     
error_code=error_code@entry=0x7f0d5cca0888) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
#7  0x00966564 in 
kudu::tserver::TabletServiceImpl::HandleNewScanRequest 
(this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
scanner_id=scanner_id@entry=0x7f0d5cca0940,     
snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, 
has_more_results=has_more_results@entry=0x7f0d5cca0886, 
error_code=error_code@entry=0x7f0d5cca0888)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
#8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan 
(this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
#9  0x01d2e449 in operator() (__args#2=0x5e512a460, 
__args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at 
/usr/include/c++/4.8.2/functional:2471
#10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#12 0x01e9e924 in operator() (this=0x90fb52e8) at 
/home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771
#13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657
#14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x7f103956673d in clone () from /lib64/libc.so.6
{code}
Before we 

[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request

2020-04-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3108:
-
Description: 
When we did an incremental backup of tables in a cluster with 20 tservers, 
3 tservers crashed; the coredump stacks are the same:
{code}
Program terminated with signal 11, Segmentation fault.
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or 
directory.
Missing separate debuginfos, use: debuginfo-install 
bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 
elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 
libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 
ncurses-libs-5.9-13.20130511.el7.x86_64 
nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
#1  0x01da51fb in kudu::MergeIterator::RefillHotHeap 
(this=this@entry=0x78f6ec500) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
#2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap 
(this=this@entry=0x78f6ec500, state=0xd1661a000, 
num_rows_to_advance=num_rows_to_advance@entry=1)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
#3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow 
(this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, 
dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
#4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, 
dst=0x7f0d5cc9ffc0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
#5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock 
(this=, dst=) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
#6  0x0095475c in 
kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
(this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
has_more_results=has_more_results@entry=0x7f0d5cca0886,     
error_code=error_code@entry=0x7f0d5cca0888) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
#7  0x00966564 in 
kudu::tserver::TabletServiceImpl::HandleNewScanRequest 
(this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
scanner_id=scanner_id@entry=0x7f0d5cca0940,     
snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, 
has_more_results=has_more_results@entry=0x7f0d5cca0886, 
error_code=error_code@entry=0x7f0d5cca0888)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
#8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan 
(this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
#9  0x01d2e449 in operator() (__args#2=0x5e512a460, 
__args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at 
/usr/include/c++/4.8.2/functional:2471
#10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#12 0x01e9e924 in operator() (this=0x90fb52e8) at 
/home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771
#13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657
#14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x7f103956673d in clone () from /lib64/libc.so.6
{code}

  was:
When we 

[jira] [Updated] (KUDU-3108) Tablet server crashes when handle diffscan request

2020-04-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3108:
-
Summary: Tablet server crashes when handle diffscan request   (was: Tablet 
server crashes when handle scan request )

> Tablet server crashes when handle diffscan request 
> ---
>
> Key: KUDU-3108
> URL: https://issues.apache.org/jira/browse/KUDU-3108
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> When we used the KuduBackup Spark job to back up tables in a cluster with 20 
> tservers, 3 tservers crashed; the coredump stacks are the same:
> {code:java}
> Program terminated with signal 11, Segmentation fault.
> #0  kudu::Schema::Compare 
> (this=0x25b883680, lhs=..., rhs=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
> 267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file 
> or directory.
> Missing separate debuginfos, use: debuginfo-install 
> bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
> cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 
> elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 
> keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
> libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 
> libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
> libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 
> ncurses-libs-5.9-13.20130511.el7.x86_64 
> nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
> openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
> systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 
> zlib-1.2.7-17.el7.x86_64
> (gdb) bt
> #0  kudu::Schema::Compare 
> (this=0x25b883680, lhs=..., rhs=...) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
> #1  0x01da51fb in kudu::MergeIterator::RefillHotHeap 
> (this=this@entry=0x78f6ec500) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
> #2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap 
> (this=this@entry=0x78f6ec500, state=0xd1661a000, 
> num_rows_to_advance=num_rows_to_advance@entry=1)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
> #3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow 
> (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, 
> dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
> #4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, 
> dst=0x7f0d5cc9ffc0) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
> #5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock 
> (this=, dst=) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
> #6  0x0095475c in 
> kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
> (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720,     
> rpc_context=rpc_context@entry=0x5e512a460, 
> result_collector=result_collector@entry=0x7f0d5cca0a00, 
> has_more_results=has_more_results@entry=0x7f0d5cca0886,     
> error_code=error_code@entry=0x7f0d5cca0888) at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
> #7  0x00966564 in 
> kudu::tserver::TabletServiceImpl::HandleNewScanRequest 
> (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240,    
>  rpc_context=rpc_context@entry=0x5e512a460, 
> result_collector=result_collector@entry=0x7f0d5cca0a00, 
> scanner_id=scanner_id@entry=0x7f0d5cca0940,     
> snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, 
> has_more_results=has_more_results@entry=0x7f0d5cca0886, 
> error_code=error_code@entry=0x7f0d5cca0888)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
> #8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan 
> (this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460)    at 
> /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
> #9  0x01d2e449 in operator() (__args#2=0x5e512a460, 
> __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at 
> /usr/include/c++/4.8.2/functional:2471
> #10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call= out>) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
> #11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) 
> at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
> #12 0x01e9e924 in operator() (this=0x90fb52e8) at 
> 

[jira] [Updated] (KUDU-3108) Tablet server crashes when handle scan request

2020-04-17 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3108:
-
Description: 
When we used the KuduBackup Spark job to back up tables in a cluster with 20 
tservers, 3 tservers crashed; the coredump stacks are the same:
{code:java}
Program terminated with signal 11, Segmentation fault.
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or 
directory.
Missing separate debuginfos, use: debuginfo-install 
bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 
cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 
cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 
elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 
libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 
libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 
ncurses-libs-5.9-13.20130511.el7.x86_64 
nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 
openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 
systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 
zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  kudu::Schema::Compare 
(this=0x25b883680, lhs=..., rhs=...) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
#1  0x01da51fb in kudu::MergeIterator::RefillHotHeap 
(this=this@entry=0x78f6ec500) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
#2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap 
(this=this@entry=0x78f6ec500, state=0xd1661a000, 
num_rows_to_advance=num_rows_to_advance@entry=1)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
#3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow 
(this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, 
dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
#4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, 
dst=0x7f0d5cc9ffc0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
#5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock 
(this=, dst=) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
#6  0x0095475c in 
kudu::tserver::TabletServiceImpl::HandleContinueScanRequest 
(this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
has_more_results=has_more_results@entry=0x7f0d5cca0886,     
error_code=error_code@entry=0x7f0d5cca0888) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
#7  0x00966564 in 
kudu::tserver::TabletServiceImpl::HandleNewScanRequest 
(this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240,     
rpc_context=rpc_context@entry=0x5e512a460, 
result_collector=result_collector@entry=0x7f0d5cca0a00, 
scanner_id=scanner_id@entry=0x7f0d5cca0940,     
snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, 
has_more_results=has_more_results@entry=0x7f0d5cca0886, 
error_code=error_code@entry=0x7f0d5cca0888)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
#8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan 
(this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460)    at 
/home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
#9  0x01d2e449 in operator() (__args#2=0x5e512a460, 
__args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at 
/usr/include/c++/4.8.2/functional:2471
#10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) 
at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#12 0x01e9e924 in operator() (this=0x90fb52e8) at 
/home/zhangyifan8/work/kudu-xm/thirdparty/installed/uninstrumented/include/boost/function/function_template.hpp:771
#13 kudu::Thread::SuperviseThread (arg=0x90fb52c0) at 
/home/zhangyifan8/work/kudu-xm/src/kudu/util/thread.cc:657
#14 0x7f103b20cdc5 in start_thread () from /lib64/libpthread.so.0
#15 0x7f103956673d in clone () from /lib64/libc.so.6
{code}

  was:
When we used the KuduBackup Spark job to back up tables in a cluster with 20 
tservers, 3 tservers crashed; the coredump stacks are the same:
{code:java}
[Thread debugging using libthread_db enabled][Thread debugging using 
libthread_db enabled]Using host libthread_db library 
"/lib64/libthread_db.so.1".Missing 

[jira] [Created] (KUDU-3108) Tablet server crashes when handle scan request

2020-04-17 Thread YifanZhang (Jira)
YifanZhang created KUDU-3108:


 Summary: Tablet server crashes when handle scan request 
 Key: KUDU-3108
 URL: https://issues.apache.org/jira/browse/KUDU-3108
 Project: Kudu
  Issue Type: Bug
Affects Versions: 1.10.1
Reporter: YifanZhang


When we used the KuduBackup Spark job to back up tables in a cluster with 20 
tservers, 3 tservers crashed; the coredump stacks are the same:
{code:java}
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /home/work/app/kudu/zjyprc-hadoop/tablet_server/package/libstdc++.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b3/d9128bcf6786292a339a477953167d0ddab5ba.debug
Core was generated by `/home/work/app/kudu/zjyprc-hadoop/tablet_server/package/kudu_tablet_server -tse'.
Program terminated with signal 11, Segmentation fault.
#0  kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
267 /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-gssapi-2.1.26-20.el7_2.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 cyrus-sasl-md5-2.1.26-20.el7_2.x86_64 cyrus-sasl-plain-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libdb-5.3.21-19.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-6.el7.x86_64 ncurses-libs-5.9-13.20130511.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.8.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  kudu::Schema::Compare (this=0x25b883680, lhs=..., rhs=...) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/rowblock.h:267
#1  0x01da51fb in kudu::MergeIterator::RefillHotHeap (this=this@entry=0x78f6ec500) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:720
#2  0x01da622b in kudu::MergeIterator::AdvanceAndReheap (this=this@entry=0x78f6ec500, state=0xd1661a000, num_rows_to_advance=num_rows_to_advance@entry=1) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:690
#3  0x01da7927 in kudu::MergeIterator::MaterializeOneRow (this=this@entry=0x78f6ec500, dst=dst@entry=0x7f0d5cc9ffc0, dst_row_idx=dst_row_idx@entry=0x7f0d5cc9fbb0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:894
#4  0x01da7de3 in kudu::MergeIterator::NextBlock (this=0x78f6ec500, dst=0x7f0d5cc9ffc0) at /home/zhangyifan8/work/kudu-xm/src/kudu/common/generic_iterators.cc:796
#5  0x00a9ff19 in kudu::tablet::Tablet::Iterator::NextBlock (this=, dst=) at /home/zhangyifan8/work/kudu-xm/src/kudu/tablet/tablet.cc:2499
#6  0x0095475c in kudu::tserver::TabletServiceImpl::HandleContinueScanRequest (this=this@entry=0x53b5a90, req=req@entry=0x7f0d5cca0720, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2565
#7  0x00966564 in kudu::tserver::TabletServiceImpl::HandleNewScanRequest (this=this@entry=0x53b5a90, replica=0xf5c0189c0, req=req@entry=0x2a15c240, rpc_context=rpc_context@entry=0x5e512a460, result_collector=result_collector@entry=0x7f0d5cca0a00, scanner_id=scanner_id@entry=0x7f0d5cca0940, snap_timestamp=snap_timestamp@entry=0x7f0d5cca0950, has_more_results=has_more_results@entry=0x7f0d5cca0886, error_code=error_code@entry=0x7f0d5cca0888) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:2476
#8  0x00967f4b in kudu::tserver::TabletServiceImpl::Scan (this=0x53b5a90, req=0x2a15c240, resp=0x56f9be6c0, context=0x5e512a460) at /home/zhangyifan8/work/kudu-xm/src/kudu/tserver/tablet_service.cc:1674
#9  0x01d2e449 in operator() (__args#2=0x5e512a460, __args#1=0x56f9be6c0, __args#0=, this=0x497ecdd8) at /usr/include/c++/4.8.2/functional:2471
#10 kudu::rpc::GeneratedServiceIf::Handle (this=0x53b5a90, call=) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_if.cc:139
#11 0x01d2eb49 in kudu::rpc::ServicePool::RunThread (this=0x2ab69560) at /home/zhangyifan8/work/kudu-xm/src/kudu/rpc/service_pool.cc:225
#12 0x01e9e924 in operator() (this=0x90fb52e8) at 

[jira] [Updated] (KUDU-3098) leadership change during tablet_copy process may lead to an isolate replica

2020-03-31 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3098:
-
Summary: leadership change during tablet_copy process may lead to an 
isolate replica  (was: leader change during 'add_peer' process for a tablet may 
lead to an isolate replica)

> leadership change during tablet_copy process may lead to an isolate replica
> ---
>
> Key: KUDU-3098
> URL: https://issues.apache.org/jira/browse/KUDU-3098
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus, master
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found some tablets in a cluster with a very large 
> "time_since_last_leader_heartbeat" metric; they are LEARNER/NON_VOTER and it 
> seems they couldn't become VOTER for a long time.
> These replicas were created during the rebalance/tablet_copy process. After 
> beginning a new copy session from the leader to the newly added NON_VOTER 
> peer, leadership changed and the old leader aborted the uncommitted 
> CHANGE_CONFIG_OP operation. Finally the tablet_copy session ended, but the 
> new leader knew nothing about the new peer.
> The master didn't delete this newly added replica because it has a larger 
> opid_index than the latest reported committed config. See the comments in 
> CatalogManager::ProcessTabletReport:
> {code:java}
> // 5. Tombstone a replica that is no longer part of the Raft config (and
> // not already tombstoned or deleted outright).
> //
> // If the report includes a committed raft config, we only tombstone if
> // the opid_index is strictly less than the latest reported committed
> // config. This prevents us from spuriously deleting replicas that have
> // just been added to the committed config and are in the process of copying.
> {code}
> Maybe we shouldn't use opid_index to determine if replicas are in the process 
> of copying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-3098) leader change during 'add_peer' process for a tablet may lead to an isolate replica

2020-03-31 Thread YifanZhang (Jira)
YifanZhang created KUDU-3098:


 Summary: leader change during 'add_peer' process for a tablet may 
lead to an isolate replica
 Key: KUDU-3098
 URL: https://issues.apache.org/jira/browse/KUDU-3098
 Project: Kudu
  Issue Type: Bug
  Components: consensus, master
Affects Versions: 1.10.1
Reporter: YifanZhang


Lately we found some tablets in a cluster with a very large 
"time_since_last_leader_heartbeat" metric; they are LEARNER/NON_VOTER and it 
seems they couldn't become VOTER for a long time.

These replicas were created during the rebalance/tablet_copy process. After 
beginning a new copy session from the leader to the newly added NON_VOTER peer, 
leadership changed and the old leader aborted the uncommitted CHANGE_CONFIG_OP 
operation. Finally the tablet_copy session ended, but the new leader knew 
nothing about the new peer.

The master didn't delete this newly added replica because it has a larger 
opid_index than the latest reported committed config. See the comments in 
CatalogManager::ProcessTabletReport:
{code:java}
// 5. Tombstone a replica that is no longer part of the Raft config (and
// not already tombstoned or deleted outright).
//
// If the report includes a committed raft config, we only tombstone if
// the opid_index is strictly less than the latest reported committed
// config. This prevents us from spuriously deleting replicas that have
// just been added to the committed config and are in the process of copying.
{code}
Maybe we shouldn't use opid_index to determine if replicas are in the process 
of copying.
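
For illustration only, a minimal sketch of the guard described by the quoted 
comment; the function and parameter names are hypothetical, not the actual 
CatalogManager code:
{code}
#include <cstdint>

// Illustrative sketch only: the master tombstones a replica that is no longer
// in the Raft config, but skips any replica whose reported config opid_index
// is not strictly less than the latest reported committed config index, on the
// assumption that it is still in the middle of a tablet copy. A replica left
// over from an aborted CHANGE_CONFIG_OP reports a larger opid_index, so it
// slips through this guard and survives.
bool ShouldTombstone(bool in_committed_config,
                     int64_t reported_opid_index,
                     int64_t latest_committed_opid_index) {
  if (in_committed_config) {
    return false;  // Still part of the committed config; keep it.
  }
  return reported_opid_index < latest_committed_opid_index;
}
{code}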



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-28 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Attachment: master_leader.log

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
> Attachments: master_leader.log, ts25.info.gz, ts26.log.gz
>
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 

[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-28 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Attachment: ts25.info.gz

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
> Attachments: ts25.info.gz, ts26.log.gz
>
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 

[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-28 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Attachment: ts26.log.gz

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
> Attachments: ts25.info.gz, ts26.log.gz
>
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 

[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069389#comment-17069389
 ] 

YifanZhang commented on KUDU-3082:
--

Unfortunately, most logs were cleaned up due to expiration before I could 
analyze them. Now I have partial logs about tablet 
7404240f458f462d92b6588d07583a52 (full logs on ts26 and partial logs on ts25); 
I'll attach them in a moment. The logs on ts27 and on the leader master from 
before the ts27 restart are completely cleaned up :( I also kept some 
fragmented logs on the master, but I'm not sure whether they are helpful.

I think the state of ts27 was abnormal when the problem occurred, because some 
replicas couldn't communicate with their leader on ts27.
{code:java}
I0313 03:50:14.118202 99494 raft_consensus.cc:1149] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [term 2 
LEADER]: Rejecting Update request from peer 47af52df1adc47e1903eb097e9c88f2e 
for earlier term 1. Current term is 2. Ops: []
I0313 03:50:14.250483 56182 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" 
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 
55445, Time since last communication: 0.000s
I0313 03:50:14.327806 56430 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "d1952499f94a4e6087bee28466fcb09f" 
member_type: VOTER last_known_addr { host: "kudu-ts25" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 
54648, Time since last communication: 0.000s
I0313 03:50:14.330118 56430 consensus_queue.cc:689] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been 
garbage collected. The follower will never be able to catch up (Not found: 
Failed to read ops 54649..55444: Segment 157 which contained index 54649 has 
been GCed)
I0313 03:50:14.330137 56430 consensus_queue.cc:544] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been 
garbage collected. The replica will never be able to catch up
I0313 03:50:14.335949 99494 consensus_queue.cc:206] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Queue going to LEADER mode. State: All replicated index: 0, Majority replicated 
index: 55446, Committed index: 55446, Last appended: 2.55446, Last appended by 
leader: 55445, Current term: 2, Majority size: 2, State: 0, Mode: LEADER, 
active raft config: opid_index: 55447 OBSOLETE_local: false peers { 
permanent_uuid: "7380d797d2ea49e88d71091802fb1c81" member_type: VOTER 
last_known_addr { host: "kudu-ts26" port: 14100 } } peers { permanent_uuid: 
"47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: 
"kudu-ts27" port: 14100 } }
I0313 03:50:14.336225 56182 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" 
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55447, Last known committed idx: 
55446, Time since last communication: 0.000s
W0313 03:50:14.336508 98349 consensus_peers.cc:458] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 -> Peer 
47af52df1adc47e1903eb097e9c88f2e (kudu-ts27:14100): Couldn't send request to 
peer 47af52df1adc47e1903eb097e9c88f2e. Status: Illegal state: Rejecting Update 
request from peer 7380d797d2ea49e88d71091802fb1c81 for term 2. Could not 
prepare a single transaction due to: Illegal state: RaftConfig change currently 
pending. Only one is allowed at a time.
{code}
Judging from the above logs on ts26, it rejected the update request from peer 
47af52d, and its own update requests to that peer also failed. This may mean 
that a config change operation on replica 47af52d failed but the pending config 
was never cleared. This case may be similar to KUDU-1338.
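
For what it's worth, here is a minimal sketch of the failure mode I suspect 
(hypothetical names and simplified logic, not the actual raft_consensus.cc 
code): if the pending config is never cleared on the failure path, every later 
config change or promotion keeps getting rejected until the process restarts.
{code:cpp}
// Minimal, self-contained sketch (NOT Kudu code) of the "only one config
// change at a time" guard that the rejections above point at.
#include <cstdint>
#include <iostream>
#include <mutex>
#include <optional>
#include <string>

struct RaftConfig {
  int64_t opid_index;  // index of the config-change operation (peers omitted)
};

class ConsensusStateSketch {
 public:
  // Returns false while another change is pending, mirroring the
  // "RaftConfig change currently pending" rejection in the logs.
  bool TryStartConfigChange(const RaftConfig& new_config, std::string* reason) {
    std::lock_guard<std::mutex> l(lock_);
    if (pending_config_) {
      *reason = "RaftConfig change currently pending. Only one is allowed at a time.";
      return false;
    }
    pending_config_ = new_config;
    return true;
  }

  // Must run on both the commit and the abort path; the suspicion above is
  // that some abort path fails to clear pending_config_.
  void FinishConfigChange(bool committed) {
    std::lock_guard<std::mutex> l(lock_);
    if (committed && pending_config_) {
      committed_config_ = *pending_config_;
    }
    pending_config_.reset();
  }

 private:
  std::mutex lock_;
  std::optional<RaftConfig> pending_config_;
  RaftConfig committed_config_{0};
};

int main() {
  ConsensusStateSketch cs;
  std::string reason;
  cs.TryStartConfigChange({96176}, &reason);  // a change starts...
  // ...but the failure path forgets FinishConfigChange(false), so:
  if (!cs.TryStartConfigChange({96177}, &reason)) {
    std::cout << "stuck: " << reason << std::endl;  // reproduces the wedge
  }
  return 0;
}
{code}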

 

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck 
> output is like:
>  
> {code:java}
> Tablet Summary
> Tablet 

[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-26 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067541#comment-17067541
 ] 

YifanZhang commented on KUDU-3082:
--

[~aihai] It seems to be a different problem: what I encountered was not a 
checksum error but a consensus mismatch error.

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck 
> output is like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restart 
> kudu-ts27.
> I found so many duplicated logs in kudu-ts27 are like:
> {code:java}
> I0314 04:38:41.511279 

[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Component/s: consensus

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>  Components: consensus
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck 
> output is like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restart 
> kudu-ts27.
> I found so many duplicated logs in kudu-ts27 are like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEADER]: attempt to 

[jira] [Commented] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063045#comment-17063045
 ] 

YifanZhang commented on KUDU-3082:
--

Sorry, I forgot to mention earlier: the cluster version is 1.10.1.

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck 
> output is like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restart 
> kudu-ts27.
> I found so many duplicated logs in kudu-ts27 are like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEADER]: 

[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Affects Version/s: 1.10.1

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: YifanZhang
>Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck 
> output is like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 5| -1   | Yes
>  B | A   B   C| 5| -1   | Yes
>  C | A   B   C*  D~   | 5| 54649| No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B*  C|  |  | Yes
>  A | A   B*  C| 5| 5| Yes
>  B | A   B*  C   D~   | 5| 96176| No
>  C | A   B*  C| 5| 5| Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae84954d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A   B   C*   |  |  | Yes
>  A | A   B   C*   | 1| -1   | Yes
>  B | A   B   C*   | 1| -1   | Yes
>  C | A   B   C*  D~   | 1| 2| No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source | Replicas | Current term | Config index | Committed?
> ---+--+--+--+
>  master| A*  B   C|  |  | Yes
>  A | A*  B   C   D~   | 1| 1991 | No
>  B | A*  B   C| 1| 4| Yes
>  C | A*  B   C| 1| 4| Yes{code}
> These tablets couldn't recover for a couple of days until we restart 
> kudu-ts27.
> I found so many duplicated logs in kudu-ts27 are like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEADER]: attempt to promote peer 

[jira] [Updated] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3082:
-
Description: 
Lately we found that a few tablets in one of our clusters are unhealthy; the 
ksck output is like:

 
{code:java}
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 5| -1   | Yes
 B | A   B   C| 5| -1   | Yes
 C | A   B   C*  D~   | 5| 54649| No
Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' 
active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B*  C|  |  | Yes
 A | A   B*  C| 5| 5| Yes
 B | A   B*  C   D~   | 5| 96176| No
 C | A   B*  C| 5| 5| Yes
Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' 
active configs disagree with the leader master's
  a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae84954d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 1| -1   | Yes
 B | A   B   C*   | 1| -1   | Yes
 C | A   B   C*  D~   | 1| 2| No
Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A*  B   C|  |  | Yes
 A | A*  B   C   D~   | 1| 1991 | No
 B | A*  B   C| 1| 4| Yes
 C | A*  B   C| 1| 4| Yes{code}
These tablets couldn't recover for a couple of days, until we restarted kudu-ts27.

I found many duplicated log messages in kudu-ts27 like:
{code:java}
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 
6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 
LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.

{code}
There seems to be some RaftConfig change operations that somehow cannot 

[jira] [Created] (KUDU-3082) tablets in "CONSENSUS_MISMATCH" state for a long time

2020-03-19 Thread YifanZhang (Jira)
YifanZhang created KUDU-3082:


 Summary: tablets in "CONSENSUS_MISMATCH" state for a long time
 Key: KUDU-3082
 URL: https://issues.apache.org/jira/browse/KUDU-3082
 Project: Kudu
  Issue Type: Bug
Reporter: YifanZhang


Lately we found that a few tablets in one of our clusters are unhealthy; the 
ksck output is like:

 
{code:java}
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 5| -1   | Yes
 B | A   B   C| 5| -1   | Yes
 C | A   B   C*  D~   | 5| 54649| No
Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' 
active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B*  C|  |  | Yes
 A | A   B*  C| 5| 5| Yes
 B | A   B*  C   D~   | 5| 96176| No
 C | A   B*  C| 5| 5| Yes
Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' 
active configs disagree with the leader master's
  a9eaff3cf1ed483aae84954d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae84954d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A   B   C*   |  |  | Yes
 A | A   B   C*   | 1| -1   | Yes
 B | A   B   C*   | 1| -1   | Yes
 C | A   B   C*  D~   | 1| 2| No
Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' 
active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source | Replicas | Current term | Config index | Committed?
---+--+--+--+
 master| A*  B   C|  |  | Yes
 A | A*  B   C   D~   | 1| 1991 | No
 B | A*  B   C| 1| 4| Yes
 C | A*  B   C| 1| 4| Yes{code}
These tablets couldn't recover for a couple of days, until we restarted kudu-ts27.

I found many duplicated log messages in kudu-ts27 like:
{code:java}
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is 
already a config change operation in progress. Unable to promote follower until 
it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 
6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 
LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is 
already a config change operation in progress. Unable to promote 

[jira] [Created] (KUDU-3069) Support to alter the number of hash buckets for newly added range partitions

2020-03-05 Thread YifanZhang (Jira)
YifanZhang created KUDU-3069:


 Summary: Support to alter the number of hash buckets for newly 
added range partitions
 Key: KUDU-3069
 URL: https://issues.apache.org/jira/browse/KUDU-3069
 Project: Kudu
  Issue Type: Improvement
  Components: client, master
Reporter: YifanZhang


Currently a table in Kudu has an immutable HashBucketSchema once it is created. 
Sometimes we can't accurately predict data growth, and after a period of time 
the number of hash buckets becomes too small for the data in a given time range.

Would it be possible to support altering the number of hash buckets for newly 
added range partitions? It seems this would have no effect on old data.
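
To illustrate the ask, here is a self-contained sketch (not Kudu code and not a 
proposed API; all names are made up): the partitioning metadata would carry a 
per-range hash-bucket count, and only newly added ranges would use a different 
value, leaving existing ranges untouched.
{code:cpp}
// Sketch of per-range hash-bucket counts for illustration only.
#include <cstdint>
#include <iostream>
#include <vector>

struct RangePartitionSketch {
  int64_t lower_bound;   // e.g. an encoded date_hour value
  int64_t upper_bound;
  int num_hash_buckets;  // per-range, instead of one table-wide value
};

struct TablePartitioningSketch {
  int default_hash_buckets;                  // used when no override is given
  std::vector<RangePartitionSketch> ranges;

  // Adding a new range with a larger bucket count leaves old ranges as-is.
  void AddRange(int64_t lo, int64_t hi, int hash_buckets) {
    ranges.push_back({lo, hi, hash_buckets});
  }
};

int main() {
  TablePartitioningSketch t{8, {}};
  t.AddRange(2020030100, 2020030101, t.default_hash_buckets);  // old-style range
  t.AddRange(2020030500, 2020030501, 32);                      // grown data: more buckets
  for (const auto& r : t.ranges) {
    std::cout << r.lower_bound << ".." << r.upper_bound
              << " buckets=" << r.num_hash_buckets << "\n";
  }
  return 0;
}
{code}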



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-2992) Limit concurrent alter request of a table

2019-12-05 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984778#comment-16984778
 ] 

YifanZhang edited comment on KUDU-2992 at 12/5/19 10:10 AM:


I tried to reproduce this case by deleting many tablets at the same time, e.g. 
by dropping a big table or some big partitions, and I found many logs in the 
master like:
{code:java}
$ grep "Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b" 
kudu_master.c3-hadoop-kudu-prc-ct02.bj.work.log.INFO.20191128-213038.14672
I1129 11:21:42.995760 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:43.501857 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:44.394129 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.001634 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.618881 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:46.610380 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.086390 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.972025 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:48.973754 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:49.514094 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:50.040673 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.057112 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.800305 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet

{code}
That means the master keeps sending delete tablet requests whenever it receives 
reports from 'deleted' tablets; maybe we could do something to prevent this 
kind of duplicate request.
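
One possible direction (a rough sketch with made-up names, not the actual 
catalog_manager.cc logic): have the master remember which deleted tablets 
already have an outstanding DeleteTablet request, so repeated tablet reports 
within a short window don't fan out into duplicate RPCs.
{code:cpp}
// Sketch of suppressing duplicate DeleteTablet RPCs; illustrative only.
#include <chrono>
#include <string>
#include <unordered_map>

class DeleteRequestDeduperSketch {
 public:
  using Clock = std::chrono::steady_clock;

  explicit DeleteRequestDeduperSketch(std::chrono::seconds retry_after)
      : retry_after_(retry_after) {}

  // Returns true if a DeleteTablet RPC should be sent for this report.
  bool ShouldSendDelete(const std::string& tablet_id) {
    const auto now = Clock::now();
    auto it = outstanding_.find(tablet_id);
    if (it != outstanding_.end() && now - it->second < retry_after_) {
      return false;  // a request is already in flight; drop the duplicate
    }
    outstanding_[tablet_id] = now;
    return true;
  }

  // Called when the tserver acks the deletion.
  void MarkDeleted(const std::string& tablet_id) { outstanding_.erase(tablet_id); }

 private:
  const std::chrono::seconds retry_after_;
  std::unordered_map<std::string, Clock::time_point> outstanding_;
};
{code}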


was (Author: zhangyifan27):
I tried to reproduced this case by deleting many tablets at the same time, such 
as dropping a big table or some big partitions, I found many logs in master are 
like:

[jira] [Assigned] (KUDU-2992) Limit concurrent alter request of a table

2019-12-04 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-2992:


Assignee: YifanZhang

> Limit concurrent alter request of a table
> -
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
>  Issue Type: Improvement
>  Components: master
>Reporter: Yingchun Lai
>Assignee: YifanZhang
>Priority: Major
>
> One of our production environment clusters cause an accident some days ago, 
> one user has a table, partition schema looks like:
> {code:java}
> HASH (uuid) PARTITIONS 80,RANGE (date_hour) (
> PARTITION 2019102900 <= VALUES < 2019102901,
> PARTITION 2019102901 <= VALUES < 2019102902,
> PARTITION 2019102902 <= VALUES < 2019102903,
> PARTITION 2019102903 <= VALUES < 2019102904,
> PARTITION 2019102904 <= VALUES < 2019102905,
> ...)
> {code}
> He try to remove many outdated partitions once by SparkSQL, but it returns an 
> timeout error at first, then he try again and again, and SparkSQL failed 
> again and again. Then the cluster became unstable, memory usage and CPU load 
> increasing.
>  
> I found many log like:
> {code:java}
> W1030 17:29:53.382287  7588 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714799 (+ 0us) service_pool.cc:162] Inserting onto call 
> queue
> 1030 17:26:19.714808 (+ 9us) service_pool.cc:221] Handling call
> 1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet 
> c52c5f43f7884d08b07fd0005e878fed
> 1030 17:29:53.382205 (+ 1us) ts_tablet_manager.cc:794] Acquired tablet 
> manager lock
> 1030 17:29:53.382208 (+ 3us) inbound_call.cc:162] Queueing success 
> response
> Metrics: {"tablet-delete.queue_time_us":213667360}
> W1030 17:29:53.382300  7586 rpcz_store.cc:253] Call 
> kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 
> (request call id 1820316) took 213667 ms (3.56 min). Client timeout 2 ms 
> (30 s)
> W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call 
> kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 
> (request call id 1820315) took 213667 ms (3.56 min). Client timeout 2 ms 
> (30 s)
> W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714825 (+ 0us) service_pool.cc:162] Inserting onto call 
> queue
> 1030 17:26:19.714833 (+ 8us) service_pool.cc:221] Handling call
> 1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 
> 479f8c592f16408c830637a0129359e1
> 1030 17:29:53.382241 (+ 2us) ts_tablet_manager.cc:794] Acquired tablet 
> manager lock
> 1030 17:29:53.382244 (+ 3us) inbound_call.cc:162] Queueing success 
> response
> Metrics: {"tablet-delete.queue_time_us":213667378}
> ...{code}
> That means 'Acquired tablet manager lock' cost much time, right?
> {code:java}
> Status TSTabletManager::BeginReplicaStateTransition(
> const string& tablet_id,
> const string& reason,
> scoped_refptr* replica,
> scoped_refptr* deleter,
> TabletServerErrorPB::Code* error_code) {
>   // Acquire the lock in exclusive mode as we'll add a entry to the
>   // transition_in_progress_ map.
>   std::lock_guard lock(lock_);
>   TRACE("Acquired tablet manager lock");
>   RETURN_NOT_OK(CheckRunningUnlocked(error_code));
>   ...
> }{code}
> But I think the root case is Kudu master send too many duplicate 'alter 
> table/delete tablet' request to tserver. I found more info in master's log:
> {code:java}
> $ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" 
> kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | 
> egrep "attempt = 1\)"
> I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for 
> TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1)
> I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for 
> TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1)
> I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for 
> TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1)
> I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for 
> TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1)
> I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for 
> TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1)
> I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 
> 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC 

[jira] [Commented] (KUDU-2992) Limit concurrent alter request of a table

2019-11-28 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984778#comment-16984778
 ] 

YifanZhang commented on KUDU-2992:
--

I tried to reproduce this case by deleting many tablets at the same time, e.g. 
by dropping a big table or some big partitions, and I found many logs in the 
master like:
{code:java}
$ grep "Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b" 
kudu_master.c3-hadoop-kudu-prc-ct02.bj.work.log.INFO.20191128-213038.14672
I1129 11:21:42.995760 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:43.501857 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:44.394129 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.001634 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.618881 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:46.610380 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.086390 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.972025 14817 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:48.973754 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:49.514094 14811 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:50.040673 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.057112 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.800305 14810 catalog_manager.cc:4013] Got report from deleted 
tablet 71c50a73cddf4562b9b85477e4c2ea7b (table 
default.loadgen_auto_8f39ac625a834b02aaf994887917a49a 
[id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 
11:21:07 CST): Sending delete request for this tablet

{code}
That means the master keeps sending delete tablet requests whenever it receives 
reports from 'deleted' tablets; maybe we could do something to prevent this 
kind of duplicate request.

> Limit concurrent alter request of a table
> -
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
>  Issue Type: 

[jira] [Assigned] (KUDU-3006) RebalanceIgnoredTserversTest.Basic is flaky

2019-11-21 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-3006:


Assignee: YifanZhang

> RebalanceIgnoredTserversTest.Basic is flaky
> ---
>
> Key: KUDU-3006
> URL: https://issues.apache.org/jira/browse/KUDU-3006
> Project: Kudu
>  Issue Type: Bug
>Reporter: Hao Hao
>Assignee: YifanZhang
>Priority: Minor
> Attachments: rebalancer_tool-test.1.txt
>
>
> RebalanceIgnoredTserversTest.Basic of the rebalancer_tool-test sometimes 
> fails with an error like below. I attached full test log.
> {noformat}
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/rebalancer_tool-test.cc:350:
>  Failure
> Value of: out
> Expected: has substring "2dd9365c71c54e5d83294b31046c5478 | 0"
>   Actual: "Per-server replica distribution summary for tservers_to_empty:\n   
> Server UUID| Replica 
> Count\n--+---\n 
> 2dd9365c71c54e5d83294b31046c5478 | 1\n\nPer-server replica distribution 
> summary:\n   Statistic   |  
> Value\n---+--\n Minimum Replica Count | 0\n 
> Maximum Replica Count | 1\n Average Replica Count | 0.50\n\nPer-table 
> replica distribution summary:\n Replica Skew |  
> Value\n--+--\n Minimum  | 1\n Maximum  | 1\n 
> Average  | 1.00\n\n\nrebalancing is complete: cluster is balanced 
> (moved 0 replicas)\n" (of type std::string)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-3006) RebalanceIgnoredTserversTest.Basic is flaky

2019-11-21 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979945#comment-16979945
 ] 

YifanZhang commented on KUDU-3006:
--

I reproduced this error and added some debug logging; some of the logs were like:
{code:java}
I1122 15:54:18.806080 28648 rebalancer_tool.cc:190] replacing replicas on 
healthy ignored tservers
I1122 15:54:18.825372 28648 rebalancer_tool.cc:1438] tablet 
6274e4d23add454d97ed7b2d7208a097: not considering replicas for movement since 
the tablet's status is 'CONSENSUS_MISMATCH'
{code}
That means a replica was not healthy, so the rebalancer tool would not move it. 
I'll try to fix this.
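
For context, this is roughly the decision the log line above describes (an 
illustrative sketch, not the actual rebalancer_tool.cc code): tablets whose 
health is not OK, e.g. CONSENSUS_MISMATCH, are skipped when picking replicas to 
move, so the test can observe "moved 0 replicas" if it runs before the cluster 
settles.
{code:cpp}
// Sketch of skipping unhealthy tablets during replica movement; names and
// the second tablet id are made up for illustration.
#include <iostream>
#include <string>
#include <vector>

enum class TabletHealth { HEALTHY, RECOVERING, CONSENSUS_MISMATCH, UNAVAILABLE };

struct TabletInfo {
  std::string id;
  TabletHealth health;
};

// Returns the tablets whose replicas are eligible to be moved off a tserver.
std::vector<std::string> MovableTablets(const std::vector<TabletInfo>& tablets) {
  std::vector<std::string> result;
  for (const auto& t : tablets) {
    if (t.health != TabletHealth::HEALTHY) {
      continue;  // the "not considering replicas for movement" case
    }
    result.push_back(t.id);
  }
  return result;
}

int main() {
  std::vector<TabletInfo> tablets = {
      {"6274e4d23add454d97ed7b2d7208a097", TabletHealth::CONSENSUS_MISMATCH},
      {"aaaabbbbccccddddeeeeffff00001111", TabletHealth::HEALTHY},
  };
  std::cout << "movable tablets: " << MovableTablets(tablets).size() << std::endl;
  return 0;
}
{code}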

 

 

> RebalanceIgnoredTserversTest.Basic is flaky
> ---
>
> Key: KUDU-3006
> URL: https://issues.apache.org/jira/browse/KUDU-3006
> Project: Kudu
>  Issue Type: Bug
>Reporter: Hao Hao
>Priority: Minor
> Attachments: rebalancer_tool-test.1.txt
>
>
> RebalanceIgnoredTserversTest.Basic of the rebalancer_tool-test sometimes 
> fails with an error like below. I attached full test log.
> {noformat}
> /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/tools/rebalancer_tool-test.cc:350:
>  Failure
> Value of: out
> Expected: has substring "2dd9365c71c54e5d83294b31046c5478 | 0"
>   Actual: "Per-server replica distribution summary for tservers_to_empty:\n   
> Server UUID| Replica 
> Count\n--+---\n 
> 2dd9365c71c54e5d83294b31046c5478 | 1\n\nPer-server replica distribution 
> summary:\n   Statistic   |  
> Value\n---+--\n Minimum Replica Count | 0\n 
> Maximum Replica Count | 1\n Average Replica Count | 0.50\n\nPer-table 
> replica distribution summary:\n Replica Skew |  
> Value\n--+--\n Minimum  | 1\n Maximum  | 1\n 
> Average  | 1.00\n\n\nrebalancing is complete: cluster is balanced 
> (moved 0 replicas)\n" (of type std::string)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KUDU-2986) Incorrect value for the 'live_row_count' metric with pre-1.11.0 tables

2019-10-28 Thread YifanZhang (Jira)
YifanZhang created KUDU-2986:


 Summary: Incorrect value for the 'live_row_count' metric with 
pre-1.11.0 tables
 Key: KUDU-2986
 URL: https://issues.apache.org/jira/browse/KUDU-2986
 Project: Kudu
  Issue Type: Bug
  Components: client, master, metrics
Affects Versions: 1.11.0
Reporter: YifanZhang


When we upgraded the cluster with pre-1.11.0 tables, we got inconsistent values 
for the 'live_row_count' metric of these tables:

When visiting masterURL:port/metrics, we got 0 for old tables, and a positive 
integer for an old table with a newly added partition, which is the count of 
rows in the newly added partition.

When getting table statistics via the `kudu table statistics` CLI tool, we got 
0 both for the old tables and for the old table with a new partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KUDU-2914) Rebalance tool support moving replicas from some specific tablet servers

2019-10-01 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941809#comment-16941809
 ] 

YifanZhang edited comment on KUDU-2914 at 10/1/19 1:09 PM:
---

Thanks for [~aserbin]'s suggestion, it's very useful.

I have two questions. If we let the master replace all replicas on the tablet 
server, how do we know when the whole replacement process ends? Do we need to 
keep checking whether all replicas have been removed? And is it possible to 
mark multiple tablet servers as 'decommissioned'? I mean marking two or more 
replicas of a tablet with the REPLACE attribute.


was (Author: zhangyifan27):
Thanks for [~aserbin]'s suggestions, it's very useful.

I have tow questions. If we let the master replace all replicas at the tablet 
server, how to know when the whole replacement process ends, do we need to keep 
checking whether all relicas have been removed? And if it is possible to mark 
multiple tablet servers as 'decomissioned'? I mean mark two or more replicas of 
a tablet with the REPLACE attribute.

> Rebalance tool support moving replicas from some specific tablet servers
> 
>
> Key: KUDU-2914
> URL: https://issues.apache.org/jira/browse/KUDU-2914
> Project: Kudu
>  Issue Type: Improvement
>  Components: CLI
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
>
> When we need to remove some tservers from a kudu cluster (maybe just for 
> saving resources or replacing these servers with new servers), it's better to 
> move all replicas on these tservers to other tservers in a cluster in 
> advance, instead of waiting for all replicas kicked out and evicting new 
> replicas. This can be achieved by rebalance tool supporting specifying 
> 'blacklist_tservers'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-2914) Rebalance tool support moving replicas from some specific tablet servers

2019-10-01 Thread YifanZhang (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941809#comment-16941809
 ] 

YifanZhang commented on KUDU-2914:
--

Thanks for [~aserbin]'s suggestions, they're very useful.

I have two questions. If we let the master replace all replicas on the tablet 
server, how do we know when the whole replacement process ends? Do we need to 
keep checking whether all replicas have been removed? And is it possible to 
mark multiple tablet servers as 'decommissioned'? I mean marking two or more 
replicas of a tablet with the REPLACE attribute.

> Rebalance tool support moving replicas from some specific tablet servers
> 
>
> Key: KUDU-2914
> URL: https://issues.apache.org/jira/browse/KUDU-2914
> Project: Kudu
>  Issue Type: Improvement
>  Components: CLI
>Reporter: YifanZhang
>Assignee: YifanZhang
>Priority: Minor
>
> When we need to remove some tservers from a kudu cluster (maybe just for 
> saving resources or replacing these servers with new servers), it's better to 
> move all replicas on these tservers to other tservers in a cluster in 
> advance, instead of waiting for all replicas kicked out and evicting new 
> replicas. This can be achieved by rebalance tool supporting specifying 
> 'blacklist_tservers'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KUDU-2934) Bad merge behavior for some metrics

2019-09-18 Thread YifanZhang (Jira)


 [ 
https://issues.apache.org/jira/browse/KUDU-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang reassigned KUDU-2934:


Assignee: YifanZhang

> Bad merge behavior for some metrics
> ---
>
> Key: KUDU-2934
> URL: https://issues.apache.org/jira/browse/KUDU-2934
> Project: Kudu
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 1.11.0
>Reporter: Yingchun Lai
>Assignee: YifanZhang
>Priority: Minor
>
> We added a feature to merge metrics by commit 
> fe6e5cc0c9c1573de174d1ce7838b449373ae36e ([metrics] Merge metrics by the same 
> attribute), for AtomicGauge type metrics, we sum up of merged metrics, this 
> work for almost all of metrics in Kudu.
> But I found a metric that could not be merged like this simply, i.e. 
> "average_diskrowset_height", because it's a "average" value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

