[ 
https://issues.apache.org/jira/browse/KUDU-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992838#comment-16992838
 ] 

Alexey Serbin commented on KUDU-3016:
-------------------------------------

[~adar] thank you for pointing to KUDU-1125.

Yes, I think we should take into account a couple of things there:
# Maximum size of an RPC for masters (let's assume all masters have the same 
setting for {{rpc_max_message_size}})
# Maximum size of a Raft batch for masters (let's assume all masters have the 
same setting for {{consensus_max_batch_size_bytes}}).  This is already applied 
by the consensus queue logic when pushing updates to followers, but the setting 
is effective only starting with the second batch to push (i.e. the first one 
might be as large as it gets: 
https://github.com/apache/kudu/blob/22d1f66ed1b9ae70a0118fdb6d645e1899878442/src/kudu/consensus/log_cache.cc#L309-L367).

For the second item, we might want to rethink the default setting of the 
flag.
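
To make the idea concrete, here is a minimal sketch of one way the updates 
from a single tablet report could be split into several size-bounded writes 
instead of one large one.  It is only an illustration under assumptions, not 
Kudu code: {{TabletUpdate}}, {{SplitIntoBatches}} and {{max_batch_bytes}} are 
hypothetical names, and the limit would presumably be derived from the smaller 
of the two settings listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one tablet's metadata update; not a Kudu type.
struct TabletUpdate {
  std::string tablet_id;
  size_t serialized_bytes;  // size of this update once serialized
};

// Greedily packs updates, in order, into batches whose total size stays
// within max_batch_bytes. An update that exceeds the limit on its own is
// still emitted, but in a batch by itself.
std::vector<std::vector<TabletUpdate>> SplitIntoBatches(
    const std::vector<TabletUpdate>& updates, size_t max_batch_bytes) {
  assert(max_batch_bytes > 0);
  std::vector<std::vector<TabletUpdate>> batches;
  std::vector<TabletUpdate> current;
  size_t current_bytes = 0;
  for (const auto& u : updates) {
    // Close the current batch if adding this update would overflow it.
    if (!current.empty() &&
        current_bytes + u.serialized_bytes > max_batch_bytes) {
      batches.push_back(std::move(current));
      current.clear();  // reset the moved-from vector to a known state
      current_bytes = 0;
    }
    current.push_back(u);
    current_bytes += u.serialized_bytes;
  }
  if (!current.empty()) {
    batches.push_back(std::move(current));
  }
  return batches;
}
```

With a limit of 100 bytes and updates of 40, 40, 40, 200 and 10 bytes, this 
produces four batches: the first two updates together, then each of the 
remaining three on its own.  The oversized 200-byte update still goes out as a 
single batch, so a per-update size check would be needed separately.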

> Catalog manager: don't lump together all updates from one tablet report
> -----------------------------------------------------------------------
>
>                 Key: KUDU-3016
>                 URL: https://issues.apache.org/jira/browse/KUDU-3016
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: scalability
>
> With the current structure of the system tablet rows that store tablet 
> metadata, the catalog manager can create a very large write operation on the 
> system tablet when processing full tablet reports sent from tablet servers.  
> At some point (depending on the {{\-\-rpc_max_message_size}} setting), a 
> tablet report received from a tablet server comes through, but its Raft 
> counterpart for the system tablet update doesn't because it might be almost 
> two times larger.  If that happens, the Kudu cluster becomes almost 
> non-functional because of a self-perpetuating 
> accepted-huge-tablet-report-but-cannot-push-Raft-update-to-follower-masters 
> pattern.
> The catalog manager should not lump together updates on all tablets received 
> from one tablet server:  
> https://github.com/apache/kudu/blob/3175c35c7d721aef0c4c6b358cc3b422089c1ba7/src/kudu/master/catalog_manager.cc#L4268-L4274



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
