[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397618#comment-16397618
 ] 

Todd Lipcon commented on KUDU-2342:
---

{code}
if (s.ok() &&
    peer_pb &&
    peer_pb->member_type() == RaftPeerPB::NON_VOTER &&
    peer_pb->attrs().promote()) {
  // This peer is ready to promote.
  //
  // TODO(mpercy): Should we introduce a function SafeToPromote() that
  // does the same calculation as SafeToEvict() but for adding a VOTER?
  NotifyObserversOfPeerToPromote(peer->uuid());
}
{code}

I think Mike's TODO here is relevant. Basically we ended up proposing an 
uncommittable config change here.
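
One possible reading of that TODO is a guard that refuses to promote a peer the leader can no longer catch up from its own log. The following is only a rough sketch under that assumption: `PeerState`, `SafeToPromote`, and `earliest_retained_index` are hypothetical names for illustration, not Kudu's actual types or API, and the real check may well need more inputs.

```cpp
#include <cstdint>

// Hypothetical, simplified view of what a leader tracks for one peer.
struct PeerState {
  bool is_non_voter;   // stands in for member_type() == NON_VOTER
  bool promote;        // stands in for attrs().promote()
  int64_t next_index;  // first op index the leader still has to send
};

// Returns true only if the peer is a promotable non-voter AND the first op
// it needs is still present in the leader's retained log.
bool SafeToPromote(const PeerState& peer, int64_t earliest_retained_index) {
  if (!peer.is_non_voter || !peer.promote) {
    return false;
  }
  return peer.next_index >= earliest_retained_index;
}
```

With the numbers reported elsewhere in this thread (the peer needs op 1143, while the leader's earliest retained op is 1154), such a check would reject the promotion.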

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Blocker
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397617#comment-16397617
 ] 

Todd Lipcon commented on KUDU-2342:
---

I think being more conservative might be good in general -- e.g. after any 
tablet copy completes, include the newly-copied node in the log retention 
calculation for some number of seconds/minutes.

More directly, though, I think it's bad to promote a node that did not have a 
successful last communication.
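
Both suggestions can be sketched roughly as follows. This is an illustration only: `CopiedPeer`, `CountsTowardRetention`, `EligibleForPromotion`, and the grace-period parameter are made-up names, and no particular grace value is implied.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping for a peer that just received a tablet copy.
struct CopiedPeer {
  Clock::time_point copy_finished;  // when the tablet copy session ended
  bool last_exchange_ok;            // did the last request to it succeed?
};

// Keep counting the peer in log retention while the grace window is open,
// so the ops it needs are not GCed right after the copy completes.
bool CountsTowardRetention(const CopiedPeer& peer,
                           Clock::time_point now,
                           std::chrono::seconds grace) {
  return now - peer.copy_finished < grace;
}

// Never promote a peer whose last communication was unsuccessful.
bool EligibleForPromotion(const CopiedPeer& peer) {
  return peer.last_exchange_ok;
}
```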



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397597#comment-16397597
 ] 

David Alves commented on KUDU-2342:
---

Seems like we should be more conservative with the first rule (currently for 
voters only) and have it also cover the non-voter which we intend to promote.

Thoughts?



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397594#comment-16397594
 ] 

David Alves commented on KUDU-2342:
---

From what I read of the code, there are two main GC mechanisms:
 * one only for voters, that makes sure never to GC more than the committed 
index
 * one for all peers that is more conservative, as it only GCs after everyone 
has an index, but has an upper bound of 80 segments

 

In this case we GC'd logs after the tablet copy, treating the peer as a 
non-voter (second rule), meaning the non-voter can't catch up, but we then 
still promoted it to voter, pushing a config change that can never be 
committed.
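
To make the interaction of the two rules concrete, here is a simplified sketch of the calculation as described in this comment. It is not Kudu's actual code: `MinIndexToRetain` and `segment_cap_floor` (the first op index still allowed by the segment cap) are hypothetical names, and the real logic works on segments rather than bare indexes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the lowest op index the leader keeps; everything below it may be
// GCed. 'peer_next_index' holds the first op each peer still needs.
int64_t MinIndexToRetain(int64_t committed_index,
                         const std::vector<int64_t>& peer_next_index,
                         int64_t segment_cap_floor) {
  // Rule 1: never GC beyond the committed index.
  int64_t retain_from = committed_index;
  if (!peer_next_index.empty()) {
    // Rule 2: keep what the slowest peer needs, but only as far back as the
    // segment cap allows. A peer lagging past the cap loses its ops.
    int64_t slowest = *std::min_element(peer_next_index.begin(),
                                        peer_next_index.end());
    retain_from = std::min(retain_from,
                           std::max(slowest, segment_cap_floor));
  }
  return retain_from;
}
```

Plugging in numbers quoted elsewhere in this thread (committed index 1232, peers needing 1143 and 1055, earliest retained op 1154) yields 1154, so under this model neither lagging peer can be caught up.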

 



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397537#comment-16397537
 ] 

Todd Lipcon commented on KUDU-2342:
---

For reference, here's the ksck report on this tablet:
{code}
Tablet b8431200388d486995a4426c88bc06a2 of table 'impala::tpch_3_kudu.lineitem' is under-replicated: 1 replica(s) not RUNNING
  14b2404c50b540ae8957adff9a6c7548 (vd1336.halxg.cloudera.com:7050): RUNNING
  a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): RUNNING [LEADER]
  e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): TS unavailable
  f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): RUNNING [NONVOTER]

2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 14b2404c50b540ae8957adff9a6c7548
  B = a260dca5a9c846e99cb621881a7b86b8
  C = e3fdd8da21a643aba21b7acdd6b17499
  D = f7376c96c6b64e7fa6a7bfc84fd0cd64

The consensus matrix is:
 Config source |        Replicas        | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B*  C   D~         |              |              | Yes
 A             | A   B*  C   D          | 1            | 1233         | No
 B             | A   B*  C   D          | 1            | 1233         | No
 C             | [config not available] |              |              |
 D             | A   B*  C   D~         | 1            | 1141         | Yes
Table impala::tpch_3_kudu.lineitem has 1 under-replicated tablet(s)
{code}

It would be nice if ksck could report some info on opid indexes too, but that's 
a separate improvement.



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397522#comment-16397522
 ] 

Todd Lipcon commented on KUDU-2342:
---

Reconstructing the timeline a bit:

- 07:20:54.751998: peer e3fdd8 fell behind the retention and "can never be 
caught up"
- 07:20:54.766460: peer f7376c added as a NON_VOTER
- 07:20:55.268965: tablet copy starts to f7376c
- 07:21:34.559736: tablet copy ends
- 07:21:34.779841: logs held by the tablet copy session are GCed
- 07:21:34.790443: the new NON_VOTER peer is already unable to be caught up 
because the logs just got GCed (*hmm, interesting*)
- 07:21:34.790797: nevertheless, the leader issues a config change to promote 
f7376c to VOTER

Now we have 2/4 VOTER replicas which can never be caught up -- the original bad 
one, and the one we just promoted. Hence we can't make progress.
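
The arithmetic behind "can't make progress" is the usual Raft majority rule; a minimal sketch follows (`MajoritySize` and `CanCommit` are illustrative names, not taken from the Kudu code base).

```cpp
#include <cstddef>

// Smallest number of voters that constitutes a majority.
std::size_t MajoritySize(std::size_t num_voters) {
  return num_voters / 2 + 1;
}

// An op can commit only if a majority of voters (leader included) can
// receive and acknowledge it.
bool CanCommit(std::size_t num_voters, std::size_t reachable_voters) {
  return reachable_voters >= MajoritySize(num_voters);
}
```

With four voters the majority is three, so a config in which only two voters can still be replicated to (`CanCommit(4, 2)`) cannot commit anything, including the config change that would remove a bad replica.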

It seems there are two serious issues at play here:
- Why did we not retain the logs between the tablet copy session finishing and 
the peer catching up? Perhaps because the non-voter isn't included in the log 
retention calculations and was more than 80 segments behind?
- Why did we promote a non-voter that wasn't relatively up to date or in a 
"good" state?





> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexey Serbin
>Priority: Critical
>  Labels: scalability
> Attachments: Impala query profile.txt, tablet-info.html
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)





[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397509#comment-16397509
 ] 

Todd Lipcon commented on KUDU-2342:
---

The change config which is pending is:
{code}
1.1233@6229814865004195840  REPLICATE CHANGE_CONFIG_OP
id { term: 1 index: 1233 }
timestamp: 6229814865004195840
op_type: CHANGE_CONFIG_OP
change_config_record {
  tablet_id: "b8431200388d486995a4426c88bc06a2"
  old_config {
    opid_index: 1141
    OBSOLETE_local: false
    peers { permanent_uuid: "a260dca5a9c846e99cb621881a7b86b8" member_type: VOTER last_known_addr { host: "vc1515.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "e3fdd8da21a643aba21b7acdd6b17499" member_type: VOTER last_known_addr { host: "va1038.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "14b2404c50b540ae8957adff9a6c7548" member_type: VOTER last_known_addr { host: "vd1336.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "f7376c96c6b64e7fa6a7bfc84fd0cd64" member_type: NON_VOTER last_known_addr { host: "vc1534.halxg.cloudera.com" port: 7050 } attrs { promote: true } }
  }
  new_config {
    opid_index: 1233
    OBSOLETE_local: false
    peers { permanent_uuid: "a260dca5a9c846e99cb621881a7b86b8" member_type: VOTER last_known_addr { host: "vc1515.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "e3fdd8da21a643aba21b7acdd6b17499" member_type: VOTER last_known_addr { host: "va1038.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "14b2404c50b540ae8957adff9a6c7548" member_type: VOTER last_known_addr { host: "vd1336.halxg.cloudera.com" port: 7050 } }
    peers { permanent_uuid: "f7376c96c6b64e7fa6a7bfc84fd0cd64" member_type: VOTER last_known_addr { host: "vc1534.halxg.cloudera.com" port: 7050 } attrs { promote: false } }
  }
}
{code}

That is to say, it has a pending promotion of peer 
f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534) from NON_VOTER to VOTER.



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397505#comment-16397505
 ] 

Todd Lipcon commented on KUDU-2342:
---

It appears what happened is that the leader actually got 80 segments ahead of 
the two followers, and since our default log_max_segments_to_retain=80, it GCed 
the logs anyway. Then it couldn't replicate to either follower and the tablet 
got stuck. I checked the earliest WAL on that server (wal-01141) and its 
earliest op is 1.1154.

What's a bit odd here is that the leader watermark thinks that 1232 is both the 
committed index and the majority-replicated index, but it wants to send ops 
1143 and 1055 to the two peers. Also interesting is that it appears this tablet 
is currently in a configuration with four VOTER replicas.



[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

2018-03-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397479#comment-16397479
 ] 

Todd Lipcon commented on KUDU-2342:
---

The server vc1515 has the following spewing in its logs:

{code}
I0313 11:56:27.615651 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: Not found: Failed to read ops 1143..1221: Segment 1130 which contained index 1143 has been GCed
I0313 11:56:27.973654 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): Could not obtain request from queue for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: Not found: Failed to read ops 1055..1221: Segment 1043 which contained index 1055 has been GCed
{code}

In other words, it appears to have evicted the log segments necessary to catch 
up both of its followers. Thus it's unable to replicate and commit any writes, 
so the write here timed out. Instead of letting it time out we should of course 
respond more rapidly saying that the tablet is unavailable, but that's a 
separate issue.

I guess in this case we can't recover because it won't evict a follower either, 
since it knows that it wouldn't be able to commit the config change. So, how 
did it get into the state where it had GCed logs behind the majority_replicated 
watermark? [~aserbin] said he can take a look.

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with 
> "Failed to write batch "
> -
>
> Key: KUDU-2342
> URL: https://issues.apache.org/jira/browse/KUDU-2342
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: 1.7.0
>Reporter: Mostafa Mokhtar
>Priority: Major
>  Labels: scalability
> Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on 129 node cluster via Impala, write operation 
> failed with :
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to 
> write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 
> attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 
> (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 
> 180.000s (SENT)


