[ https://issues.apache.org/jira/browse/KUDU-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772786#comment-16772786 ]

Simon Zhang commented on KUDU-2702:
-----------------------------------

Very odd, the peer can't be found.

> Data lost when using Spark Kudu client during high-load writing  
> ---------------------------------------------------------------
>
>                 Key: KUDU-2702
>                 URL: https://issues.apache.org/jira/browse/KUDU-2702
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, consensus, spark
>    Affects Versions: 1.8.0
>            Reporter: Simon Zhang
>            Priority: Blocker
>         Attachments: 屏幕快照 2019-02-20 下午4.57.08.png
>
>
> 1. For our business, we need to write every incoming signal into Kudu, so the 
> load is very high. We decided to use Spark Streaming with the Kudu client to 
> fulfill this task. The code looks like the following (a fuller sketch follows 
> the snippet): 
> {code:java}
> val kuduContext = new KuduContext("KuduMaster", trackRdd.sparkContext)
> ........
> kuduContext.upsertRows(trackRdd, saveTable)
> {code}
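> A fuller sketch of that write path, for reference only: the names signalStream, 
> writeTrackStream, and the placeholder schema and table name below are 
> illustrative, not our real job; only the KuduContext/upsertRows calls are the 
> actual kudu-spark API. The point is that upsertRows throws a RuntimeException 
> when rows still fail after the client's internal retries, which is the error 
> shown in the log in step 2.
> {code:java}
> import org.apache.kudu.spark.kudu.KuduContext
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
> import org.apache.spark.streaming.dstream.DStream
> 
> val spark = SparkSession.builder().appName("track-ingest").getOrCreate()
> val kuduContext = new KuduContext("KuduMaster", spark.sparkContext)
> 
> // Placeholder schema and table name, for illustration only.
> val trackSchema = StructType(Seq(
>   StructField("id", LongType, nullable = false),
>   StructField("signal", StringType, nullable = true)))
> val saveTable = "impala::flyway.track_2018"
> 
> def writeTrackStream(signalStream: DStream[Row]): Unit = {
>   signalStream.foreachRDD { rdd =>
>     if (!rdd.isEmpty()) {
>       val trackDf = spark.createDataFrame(rdd, trackSchema)
>       // upsertRows throws a RuntimeException if any rows still fail after the
>       // client's own retries, so a failed micro-batch fails the Spark task
>       // instead of being dropped silently.
>       kuduContext.upsertRows(trackDf, saveTable)
>     }
>   }
> }
> {code}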
> 2. Check the Spark log: 
> {code:java}
> 2019-01-30 16:09:31 WARN TaskSetManager:66 - Lost task 0.0 in stage 38855.0 
> (TID 25499, 192.168.33.158, executor 2): java.lang.RuntimeException: failed 
> to write 1000 rows from DataFrame to Kudu; sample errors: Timed out: can not 
> complete before timeout: Batch{operations=58, 
> tablet="41f47fabf6964719befd06ad01bc133b" [0x000000088000016804FE4
> 800, 0x0000000880000168A4A36BFF), ignoreAllDuplicateRows=false, 
> rpc=KuduRpc(method=Write, tablet=41f47fabf6964719befd06ad01bc133b, 
> attempt=42, DeadlineTracker(timeout=3000
> 0, elapsed=29675), Traces: [0ms] querying master,
> [0ms] Sub rpc: GetTableLocations sending RPC to server 
> master-192.168.33.152:7051,
> [3ms] Sub rpc: GetTableLocations received from server 
> master-192.168.33.152:7051 response OK,
> [3ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [6ms] delaying RPC due to Illegal state: Replica 
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: 
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index: 
> 135 OBSOLETE_local: false peers { permanent_uuid: 
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
> "cm07" port: 7050 } } peers { permanent_uuid: 
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
> "cm04" port: 7050 } } peers { permanent_uuid: 
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
> "cm02" port: 7050 } } } (error 0),
> [6ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response Illegal 
> state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. 
> Role: FOLLOWER. Consensus state: current_term: 1639 committed_config { 
> opid_index: 135 OBSOLETE_local: false peers { permanent_uuid: 
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
> "cm07" port: 7050 } } peers { permanent_uuid: 
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
> "cm04" port: 7050 } } peers { permanent_uuid: 
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
> "cm02" port: 7050 } } } (error0),
> .....................
> [793ms] querying master,
> [793ms] Sub rpc: GetTableLocations sending RPC to server 
> master-192.168.33.152:7051,
> [795ms] Sub rpc: GetTableLocations received from server 
> master-192.168.33.152:7051 response OK,
> [796ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [798ms] delaying RPC due to Illegal state: Replica 
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: 
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index: 
> 135 OBSOLETE_local: false peers { permanent_uuid: 
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
> "cm07" port: 7050 } } peers { permanent_uuid: 
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
> "cm04" port: 7050 } } peers { permanent_uuid: 
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
> "cm02" port: 7050 } } } (error 0),
> [799ms] received from server a33504d2e2fc4447aa054f2589b9f9ae response 
> Illegal state: Replica a33504d2e2fc4447aa054f2589b9f9ae is not leader of this 
> config. Role: FOLLOWER. Consensus state: current_term: 1639 committed_config 
> { opid_index: 135 OBSOLETE_local: false peers { permanent_uuid: 
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
> "cm07" port: 7050 } } peers { permanent_uuid: 
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
> "cm04" port: 7050 } } peers { permanent_uuid: 
> "cbc554095f1f4ef5b6442da45a542ac3" member_type: VOTER last_known_addr { host: 
> "cm02" port: 7050 } } } (error 0),
> [3552ms] querying master,
> [3552ms] Sub rpc: GetTableLocations sending RPC to server 
> master-192.168.33.152:7051,
> [3553ms] Sub rpc: GetTableLocations received from server 
> master-192.168.33.152:7051 response OK,
> [3553ms] sending RPC to server a33504d2e2fc4447aa054f2589b9f9ae,
> [3556ms] delaying RPC due to Illegal state: Replica 
> a33504d2e2fc4447aa054f2589b9f9ae is not leader of this config. Role: 
> FOLLOWER. Consensus state: current_term: 1639 committed_config { opid_index: 
> 135 OBSOLETE_local: false peers { permanent_uuid: 
> "a33504d2e2fc4447aa054f2589b9f9ae" member_type: VOTER last_known_addr { host: 
> "cm07" port: 7050 } } peers { permanent_uuid: 
> "083ef4d758854ddd9f4d15a3c718fe4b" member_type: VOTER last_known_addr { host: 
> "cm04" port: 7050 } } p{code}
> We get the same issue as [KUDU-2329|https://jira.apache.org/jira/browse/KUDU-2329].
> 3. Then we used kudu cluster ksck to check the status and found some tablets 
> unavailable (an example invocation is shown after the output): 
> {noformat}
> Tablet bb7aff8f0d79458ebd263b57e7ed2848 of table 'impala::flyway.track_2018' 
> is under-replicated: 1 replica(s) not RUNNING
> a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
> afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
> 4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
> All reported replicas are:
> A = a9efdf1d4c5d4bfd933876c2c9681e83
> B = afba5bc65a93472683cb613a7c693b0f
> C = 4a00d2312d5042eeb41a1da0cc264213
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A B* C | | | Yes
> A | A B* C | 25 | -1 | Yes
> B | [config not available] | | | 
> C | A B* C | 25 | -1 | Yes
> Tablet 82e89518366840aaa3f8bd426818e001 of table 'impala::flyway.track_2017' 
> is under-replicated: 1 replica(s) not RUNNING
> afba5bc65a93472683cb613a7c693b0f (cm03:7050): TS unavailable [LEADER]
> 4a00d2312d5042eeb41a1da0cc264213 (cm02:7050): RUNNING
> a9efdf1d4c5d4bfd933876c2c9681e83 (cm01:7050): RUNNING
> All reported replicas are:
> A = afba5bc65a93472683cb613a7c693b0f
> B = 4a00d2312d5042eeb41a1da0cc264213
> C = a9efdf1d4c5d4bfd933876c2c9681e83
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------------+--------------+--------------+------------
> master | A* B C | | | Yes
> A | [config not available] | | | 
> B | A* B C | 29 | -1 | Yes
> C | A B C | 28 | -1 | Yes
> ..........................................
> The related tables are reported as CONSENSUS_MISMATCH, like:
> Name                      | RF | Status             | Total Tablets | Healthy | Recovering | Under-replicated | Unavailable
> --------------------------+----+--------------------+---------------+---------+------------+------------------+------------
> impala::flyway.track_2017 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
> impala::flyway.track_2018 | 3  | CONSENSUS_MISMATCH | 120           | 60      | 0          | 56               | 4
> {noformat}
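> For reference, the check above is produced by the ksck tool, run roughly as 
> follows (assuming 192.168.33.152:7051 is the master address seen in the log):
> {noformat}
> kudu cluster ksck 192.168.33.152:7051
> {noformat}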
> Because no leader is available, the Spark job gets stuck for a long time and 
> then data is lost. We find that the status of some tables randomly becomes 
> *CONSENSUS_MISMATCH* and then recovers to *HEALTHY* after a while, and the 
> leaders of some tablets are unavailable. All operations are within a LAN; the 
> network and machines work fine. Each tablet server owns 1500 tablets, while the 
> recommended value is 1000 tablets.
> By the way, I have a question about the voting process: each tablet seems to 
> have its own leader election, so does a server with thousands of tablets need 
> thousands of individual elections? If so, is there anything that can be 
> optimized? 
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
