[ https://issues.apache.org/jira/browse/KUDU-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772786#comment-16772786 ]
Simon Zhang commented on KUDU-2702:
-----------------------------------

Very odd, the peer can't be found.

> Data lost by using Spark Kudu client during high-load writing
> -------------------------------------------------------------
>
>                 Key: KUDU-2702
>                 URL: https://issues.apache.org/jira/browse/KUDU-2702
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, consensus, spark
>    Affects Versions: 1.8.0
>           Reporter: Simon Zhang
>           Priority: Blocker
>        Attachments: 屏幕快照 2019-02-20 下午4.57.08.png
>
> 1. For our business, we need to write every tracking signal into Kudu, so the load is very high. We decided to use Spark Streaming with the Kudu client to fulfill this task. The code looks like this:
> {code:java}
> val kuduContext = new KuduContext("KuduMaster", trackRdd.sparkContext)
> ........
> kuduContext.upsertRows(trackRdd, saveTable){code}
> 2. Checking the Spark log:
> {code:java}
> 2019-01-30 16:09:31 WARN TaskSetManager:66 - Lost task 0.0 in stage 38855.0
> (TID 25499, 192.168.33.158, executor 2): java.lang.RuntimeException: failed
> to write 1000 rows from DataFrame to Kudu; sample errors: Timed out: can not
> complete before timeout: Batch{operations=58,
> tablet="41f47fabf6964719befd06ad01bc133b" [0x000000088000016804FE4800, 0x0000000880000168A4A36BFF), ignoreAllDuplicateRows=false,
> rpc=KuduRpc(method=Write, tablet=41f47fabf6964719befd06ad01bc133b,
> attempt=42, DeadlineTracker(timeout=30000, elapsed=29675), Traces:
> [0ms] querying master,
> [0ms] Sub rpc: GetTableLocations sending RPC to server master-192.168.33.152:7051,
> [3ms] Sub rpc: GetTableLocations received from server master-192.168.33.152:7051 response OK,
> [3ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [6ms] delaying RPC due to Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER.
> Consensus state: current_term: 1639 committed_config { opid_index: 135 OBSOLETE_local: false
> peers { permanent_uuid: "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: "cm07" port: 7050 } }
> peers { permanent_uuid: "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: "cm04" port: 7050 } }
> peers { permanent_uuid: "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: "cm02" port: 7050 } } } (error 0),
> [6ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER.
> Consensus state: current_term: 1639 committed_config { opid_index: 135 OBSOLETE_local: false
> peers { permanent_uuid: "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: "cm07" port: 7050 } }
> peers { permanent_uuid: "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: "cm04" port: 7050 } }
> peers { permanent_uuid: "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: "cm02" port: 7050 } } } (error 0),
> .....................
> [793ms] querying master,
> [793ms] Sub rpc: GetTableLocations sending RPC to server master-192.168.33.152:7051,
> [795ms] Sub rpc: GetTableLocations received from server master-192.168.33.152:7051 response OK,
> [796ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [798ms] delaying RPC due to Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER.
> Consensus state: current_term: 1639 committed_config { opid_index: 135 OBSOLETE_local: false
> peers { permanent_uuid: "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: "cm07" port: 7050 } }
> peers { permanent_uuid: "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: "cm04" port: 7050 } }
> peers { permanent_uuid: "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: "cm02" port: 7050 } } } (error 0),
> [799ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER.
> Consensus state: current_term: 1639 committed_config { opid_index: 135 OBSOLETE_local: false
> peers { permanent_uuid: "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: "cm07" port: 7050 } }
> peers { permanent_uuid: "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: "cm04" port: 7050 } }
> peers { permanent_uuid: "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: "cm02" port: 7050 } } } (error 0),
> [3552ms] querying master,
> [3552ms] Sub rpc: GetTableLocations sending RPC to server master-192.168.33.152:7051,
> [3553ms] Sub rpc: GetTableLocations received from server master-192.168.33.152:7051 response OK,
> [3553ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [3556ms] delaying RPC due to Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: FOLLOWER.
> Consensus state: current_term: 1639 committed_config { opid_index: 135 OBSOLETE_local: false
> peers { permanent_uuid: "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: "cm07" port: 7050 } }
> peers { permanent_uuid: "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: "cm04" port: 7050 } } p{code}
> We get the same issue as [KUDU-2329|https://jira.apache.org/jira/browse/KUDU-2329].
> 3. Then we used kudu cluster ksck to check the status and found some tablets unavailable:
> {noformat}
> Tablet bb7aff8f0d79458ebd263b57e7ed2848 of table 'impala::flyway.track_2018' is under-replicated: 1 replica(s) not RUNNING
>   a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
>   afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
>   4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
> All reported replicas are:
>   A = a9efdf1d4c5d4bfd933876c2c9681e83
>   B = afba5bc65a93472683cb613a7c693b0f
>   C = 4a00d2312d5042eeb41a1da0cc264213
> The consensus matrix is:
>  Config source |        Replicas        | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
>  master        | A   B*  C              |              |              | Yes
>  A             | A   B*  C              | 25           | -1           | Yes
>  B             | [config not available] |              |              |
>  C             | A   B*  C              | 25           | -1           | Yes
> Tablet 82e89518366840aaa3f8bd426818e001 of table 'impala::flyway.track_2017' is under-replicated: 1 replica(s) not RUNNING
>   afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
>   4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
>   a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
> All reported replicas are:
>   A = afba5bc65a93472683cb613a7c693b0f
>   B = 4a00d2312d5042eeb41a1da0cc264213
>   C = a9efdf1d4c5d4bfd933876c2c9681e83
> The consensus matrix is:
>  Config source |        Replicas        | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
>  master        | A*  B   C              |              |              | Yes
>  A             | [config not available] |              |              |
>  B             | A*  B   C              | 29           | -1           | Yes
>  C             | A   B   C              | 28           | -1           | Yes
> ..........................................
> The related tables are in CONSENSUS_MISMATCH state, like:
>  Name                      | RF | Status             | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
> ---------------------------+----+--------------------+---------------+---------+------------+------------------+-------------
>  impala::flyway.track_2017 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
>  impala::flyway.track_2018 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
> {noformat}
> Because no leader is available, the Spark job gets stuck for a long time, and then data is lost.
> We find that the status of some tables randomly becomes *CONSENSUS_MISMATCH* and then recovers to [*HEALTHY*] after a while; the leaders of some tablets are unavailable.
> All operations are within a LAN; the network and machines work fine. Each tablet server owns 1500 tablets, while the recommended value is 1000 tablets.
> By the way, I have a question about the voting process: each tablet seems to have an individual vote process, so are thousands of individual voting processes needed when there are thousands of tablets? Is there something that can be optimized, if so?
> Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
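The retry behavior visible in the Spark log above (attempt=42, DeadlineTracker(timeout=30000, elapsed=29675)) can be sketched as a retry-with-backoff loop bounded by an overall deadline: each attempt that hits a non-leader replica is delayed and retried until either a leader accepts the write or the time budget is exhausted. The following is a minimal, hypothetical illustration of that pattern in plain Java, not Kudu's actual client code; the class name, method, and backoff constants are made up, and the clock is simulated so the sketch runs instantly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class RetryWithDeadline {
    /**
     * Retries an operation until it succeeds or the overall deadline elapses,
     * mimicking the DeadlineTracker(timeout=30000) behavior in the trace.
     *
     * @param deadlineMs  total time budget for all attempts combined
     * @param baseDelayMs initial backoff delay between attempts
     * @param attemptOk   returns true when an attempt succeeds (e.g. the
     *                    replica we contacted is actually the leader)
     * @return the attempt numbers that were tried, on success
     */
    public static List<Integer> run(long deadlineMs, long baseDelayMs, IntPredicate attemptOk) {
        List<Integer> attempts = new ArrayList<>();
        long elapsed = 0;   // simulated clock instead of real sleeps
        int attempt = 1;
        while (true) {
            attempts.add(attempt);
            if (attemptOk.test(attempt)) {
                return attempts;  // a leader accepted the write
            }
            // Exponential backoff, capped so a single delay can't eat the budget.
            long delay = Math.min(baseDelayMs << Math.min(attempt - 1, 10), 2000);
            if (elapsed + delay >= deadlineMs) {
                // Out of budget: the client surfaces a timeout error, as in
                // "Timed out: can not complete before timeout" in the log.
                throw new RuntimeException("Timed out after " + attempts.size() + " attempts");
            }
            elapsed += delay;
            attempt++;
        }
    }

    public static void main(String[] args) {
        // A replica that becomes leader on the 5th attempt: the write succeeds.
        System.out.println(run(30_000, 10, a -> a == 5).size()); // prints 5
        // A replica that never becomes leader: the 30 s budget is exhausted,
        // matching the attempt=42, elapsed=29675 failure in the report.
        try {
            run(30_000, 10, a -> false);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The key point the sketch illustrates: when every replica keeps reporting "not leader of this config" (e.g. during a stuck election), no amount of retrying helps, and the batch eventually fails with a timeout on the Spark side.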