Héctor Gutiérrez created KUDU-2329:
--------------------------------------

             Summary: Random RPC timeout errors when inserting rows in a Kudu table
                 Key: KUDU-2329
                 URL: https://issues.apache.org/jira/browse/KUDU-2329
             Project: Kudu
          Issue Type: Bug
          Components: rpc, server
    Affects Versions: 1.5.0
            Reporter: Héctor Gutiérrez


When inserting rows into a Kudu table, we see errors at random times. We first 
hit one of these errors during a bulk update of a Kudu table via Spark (in 
Scala):

{{kuduContext.updateRows(dataFrame, "table_name")}}

The error message in Spark was the following:

{{java.lang.RuntimeException: failed to write 579 rows from DataFrame to Kudu; 
sample errors: Timed out: can not complete before timeout: Batch{operations=6, 
tablet="cd1e33fce0114c9bbd9c14e2559e7591" [0x0000000F, 0x00000010), 
ignoreAllDuplicateRows=false, rpc=KuduRpc(method=Write, 
tablet=cd1e33fce0114c9bbd9c14e2559e7591, attempt=3, 
DeadlineTracker(timeout=30000, elapsed=30090), Traces: [0ms] sending RPC to 
server 6f273933b4d5498e87aadfb99b054a21, [10011ms] received from server 
6f273933b4d5498e87aadfb99b054a21 response Network error: [peer 
6f273933b4d5498e87aadfb99b054a21] encountered a read timeout; closing the 
channel, [10011ms] delaying RPC due to Network error: [peer 
6f273933b4d5498e87aadfb99b054a21] encountered a read timeout; closing the 
channel, [10033ms] sending RPC to server 6f273933b4d5498e87aadfb99b054a21, 
[20050ms] received from server 6f273933b4d5498e87aadfb99b054a21 response 
Network error: [peer 6f273933b4d5498e87aadfb99b054a21] encountered a read 
timeout; closing the channel, [20050ms] delaying RPC due to Network error: 
[peer 6f273933b4d5498e87aadfb99b054a21] encountered a read timeout; closing the 
channel, [20072ms] sending RPC to server 6f273933b4d5498e87aadfb99b054a21, 
[30090ms] received from server 6f273933b4d5498e87aadfb99b054a21 response 
Network error: [peer 6f273933b4d5498e87aadfb99b054a21] encountered a read 
timeout; closing the channel, [30090ms] delaying RPC due to Network error: 
[peer 6f273933b4d5498e87aadfb99b054a21] encountered a read timeout; closing the 
channel)}}}

(+ 4 more errors similar to this one in the error message)
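
The timing in the trace above can be checked mechanically. The following is a 
minimal Java sketch that pairs each "sending RPC" entry with the next "received 
from" entry and reports how long each attempt waited. The class and method 
names are hypothetical and are not part of the Kudu client; the sample string 
keeps only the timestamps and event keywords from the trace. On those 
timestamps it suggests three attempts that each ended in a read timeout after 
roughly 10 s, which together use up the 30000 ms deadline reported by 
DeadlineTracker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading a Kudu client trace; not a Kudu API.
final class RpcTraceTimings {
    // Matches entries such as "[10033ms] sending RPC" or "[20050ms] received from".
    // \s+ inside the keywords tolerates line wraps in a pasted trace.
    private static final Pattern EVENT =
        Pattern.compile("\\[(\\d+)ms\\]\\s+(sending\\s+RPC|received\\s+from)");

    /** Elapsed milliseconds between each "sending RPC" and the next "received from". */
    static List<Long> attemptDurations(String trace) {
        List<Long> durations = new ArrayList<>();
        long sentAt = -1;
        Matcher m = EVENT.matcher(trace);
        while (m.find()) {
            long t = Long.parseLong(m.group(1));
            if (m.group(2).startsWith("sending")) {
                sentAt = t;                 // start of an attempt
            } else if (sentAt >= 0) {
                durations.add(t - sentAt);  // response (here: a read timeout) arrived
                sentAt = -1;
            }
        }
        return durations;
    }

    public static void main(String[] args) {
        String server = "6f273933b4d5498e87aadfb99b054a21";
        // Timestamps copied from the error message above; the rest is shortened.
        String trace =
            "[0ms] sending RPC to server " + server
            + ", [10011ms] received from server " + server
            + ", [10011ms] delaying RPC due to Network error"
            + ", [10033ms] sending RPC to server " + server
            + ", [20050ms] received from server " + server
            + ", [20050ms] delaying RPC due to Network error"
            + ", [20072ms] sending RPC to server " + server
            + ", [30090ms] received from server " + server;
        System.out.println(attemptDurations(trace)); // prints [10011, 10017, 10018]
    }
}
```

The three waits add up to 30046 ms; the two 22 ms gaps between attempts account 
for the remainder of the 30090 ms elapsed time.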

We first thought it was a problem with our Spark code, but when we ran a simple 
"INSERT INTO" query from the Impala shell against a Kudu table, we got the 
following error:

{{[.............................] > insert into test_kudu values (282, 
'hola');}}
{{ Query: insert into test_kudu values (282, 'hola')}}
{{ Query submitted at: ......................}}
{{ Query progress can be monitored at: ........................}}
{{ WARNINGS: Kudu error(s) reported, first error: Timed out: Failed to write 
batch of 1 ops to tablet 9c295e90811e483a9550bfd75abcf666 after 1 attempt(s): 
Failed to write to server: 071bcafbb1644678a697c474662047b7 
(.........................:7050): Write RPC to ....................:7050 timed 
out after 179.949s (SENT)}}

{{Error in Kudu table 'impala:kudu_db.test_kudu': Timed out: Failed to write 
batch of 1 ops to tablet 9c295e90811e483a9550bfd75abcf666 after 1 attempt(s): 
Failed to write to server: 071bcafbb1644678a697c474662047b7 
(...........................:7050): Write RPC to ......................:7050 
timed out after 179.949s (SENT)}}

To make things even more confusing, despite this error in the Impala shell, the 
inserted rows did show up in the table after a while (not immediately), so 
somehow they were actually inserted.
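
As a general illustration, and not a diagnosis of this issue, a client-side 
timeout only bounds how long the caller waits; it does not by itself tell 
whether the server applied the write. The Java sketch below simulates that with 
an in-memory map standing in for the table. All names are hypothetical, nothing 
here is Kudu or Impala API, and the delays are arbitrary.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation of a write that outlives the client's timeout.
final class LateWriteDemo {
    // Stand-in for the table; an assumption for illustration only.
    static final Map<Integer, String> table = new ConcurrentHashMap<>();

    /** Simulated server: applies the write only after serverDelayMs. */
    static CompletableFuture<Void> slowInsert(int key, String value, long serverDelayMs) {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(serverDelayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            table.put(key, value);
        });
    }

    /** Returns true if the client gave up before the write was acknowledged. */
    static boolean insertTimedOut(int key, String value,
                                  long serverDelayMs, long clientTimeoutMs) {
        CompletableFuture<Void> pending = slowInsert(key, value, serverDelayMs);
        try {
            pending.get(clientTimeoutMs, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true; // the caller stops waiting; the write is still in flight
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Polls until the row is visible or maxWaitMs has passed. */
    static boolean rowVisibleWithin(int key, long maxWaitMs) {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (System.currentTimeMillis() < deadline) {
            if (table.containsKey(key)) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return table.containsKey(key);
    }

    public static void main(String[] args) {
        // Server needs 300 ms, client waits only 50 ms.
        boolean timedOut = insertTimedOut(282, "hola", 300, 50);
        System.out.println("client timed out: " + timedOut);
        System.out.println("row visible later: " + rowVisibleWithin(282, 2000));
    }
}
```

In this simulation the caller sees a timeout and the row still appears shortly 
afterwards, which matches the shape of the behaviour described above without 
explaining why the real server took so long.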

We also tried tweaking the Kudu timeout configuration values that we had 
previously set, but it didn't solve anything and the problem kept appearing.

Furthermore, we don't always get these errors; they only appear at random 
times. For example, right now we only get errors in the Spark update described 
above, and we see no issues when working from the Impala shell.

After everything we have tried, we are fairly certain that this is a bug in 
Kudu, although we find it strange that it is undocumented, and it is certainly 
hard to reproduce.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
