[ https://issues.apache.org/jira/browse/KUDU-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489970#comment-17489970 ]
ASF subversion and git services commented on KUDU-3349: ------------------------------------------------------- Commit 90895ce76590f10730ad7aac3613b69d89ff5422 in kudu's branch refs/heads/master from Hongjiang Zhang [ https://gitbox.apache.org/repos/asf?p=kudu.git;h=90895ce ] [java] KUDU-3349 Fix the failure to demote a leader KuduScanToken gets a wrong tserver's uuid whose format is something like: '<ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">', the expected uuid should be "fc07f681d3ea4bab9bc5ec8090ab9437". This issue caused RemoteTablet to fail to demote a leader, and the java client always sends write ops to the demoted leader. As a result, there are a lot of "PendingErrors overflowed. Failed to write at least 1000 rows to Kudu". After this fix, the write ops, especially the deleting, will be faster. Change-Id: I2974b6ec2cec2f0120b113d1bcf89fe3793a1ec5 Reviewed-on: http://gerrit.cloudera.org:8080/18166 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin <aser...@cloudera.com> > Kudu java client failed to demote leader and caused a lot of deleting rows > timeout > ---------------------------------------------------------------------------------- > > Key: KUDU-3349 > URL: https://issues.apache.org/jira/browse/KUDU-3349 > Project: Kudu > Issue Type: Bug > Reporter: Redriver > Priority: Major > > During deleting rows through Spark, I found a lot of PendingErrors which > caused timeout, the deleting takes very long time, and finally failed > sometimes. > {code:java} > java.lang.RuntimeException: PendingErrors overflowed. Failed to write at > least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before > timeout: Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" > [0x00000006, 0x00000007), ignoredErrors=[], rpc=KuduRpc(method=Write, > tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23, > TimeoutTracker(timeout=30000, elapsed=26852), Traces: [0ms] sending RPC to > server <ByteString@1d59316e size=32 > contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [2ms] received response from server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [10ms] sending RPC to server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, > [12ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [12ms] received response from server <ByteString@1d59316e > size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [30ms] sending RPC to server <ByteString@1d59316e size=32 > contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [32ms] received response from server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [51ms] sending RPC to server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, > [52ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [52ms] received response from server <ByteString@1d59316e > size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [70ms] sending RPC to server <ByteString@1d59316e size=32 > contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [72ms] received response from server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: > Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this > config: current role FOLLOWER (error 0), [90ms] sending RPC to server > <ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, > [92ms] delaying RPC due to: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [92ms] received response from server <ByteString@1d59316e > size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [170ms] sending RPC to server <ByteString@1d59316e > size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC > due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader > of this config: current role FOLLOWER (error 0), [172ms] received response > from server <ByteString@1d59316e size=32 > contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (error 0), [190ms] sending RPC to server <ByteString@1d59316e > size=32 contents="c51a7275257240b7a8c7e99d0895ae89">, [192ms] delaying RPC > due to: Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader > of this config: current role FOLLOWER (error 0), [192ms] received response > from server <ByteString@1d59316e size=32 > contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica > c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role > FOLLOWER (erro > ... > {code} > Anyway, I think I found the RCA. > After enabling DEBUG level logging, I found something interesting. The UUID > is in fact the same, but kudu-client thinks they are not. See the below. > {code:java} > 22/01/19 23:34:34,621 DEBUG [kudu-nio-3] client.RemoteTablet:170 : > <ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437"> > wasn't the leader for 44fa35c99e7042329bbfa0268c1cd4de, current leader is > <ByteString@5aadafd0 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437"> > {code} > This issue caused the kudu-client to fail to demote the leader. > {code:java} > void demoteLeader(String uuid) { > synchronized (tabletServers) { > if (leaderUuid == null) { > LOG.debug("{} couldn't be demoted as the leader for {}, there is no > known leader", > uuid, getTabletId()); > return; > } > if (leaderUuid.equals(uuid)) { > leaderUuid = null; > LOG.debug("{} was demoted as the leader for {}", uuid, getTabletId()); > } else { > LOG.debug("{} wasn't the leader for {}, current leader is {}", uuid, > getTabletId(), leaderUuid); > } > } > } > {code} > I take some time to debug, and finally found the tserver's uuid is generated > by the "serverMetadataPB.getUuid().toString()". > [https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L246] > The correct way to get Uuid() is "serverMetadataPB.getUuid().toStringUtf8()" > After fixing this bug, the deleting becomes faster than before because the > client will not send write to the wrong leader. > I'll submit a patch for this issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)