[ 
https://issues.apache.org/jira/browse/KUDU-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo updated KUDU-2611:
-----------------------------
    Labels: newbie  (was: )

> Create table guardrail isn't enforced when NR=1
> -----------------------------------------------
>
>                 Key: KUDU-2611
>                 URL: https://issues.apache.org/jira/browse/KUDU-2611
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.8.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: newbie
>
> A Slack user reported a small constellation of issues that all turned out to 
> be interrelated. His master was spitting out the following error message:
> {noformat}
> catalog_manager.cc:509] Error processing pending assignments: Service 
> unavailable: error persisting updated tablet metadata: Transaction failed, 
> tablet 00000000000000000000000000000000 transaction memory consumption (0) 
> has exceeded its limit (67108864) or the limit of an ancestral tracker
> {noformat}
> This is odd; if the memory consumption is 0, how is the limit exceeded?
> Meanwhile, the tserver was running out of file descriptors even though 
> RLIMIT_NOFILE was 32k for the kudu user.
> It turns out that the user had issued the following DDL:
> {noformat}
> CREATE TABLE foo (
>   ...
>   PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20
> STORED AS KUDU
> TBLPROPERTIES (
>   'kudu.master_addresses' = ...,
>   'kudu.num_tablet_replicas' = '1'
> );
> {noformat}
> This is a table with 160,000 tablets (20 * 20 * 20 * 20), which is way too 
> many!
> The key is that the standard "max replicas at table creation time" guardrail 
> wasn't enforced because it is only applied when num_replicas > 1, so the 
> master allowed the table to be created. Here's what happened next:
> # The master managed to persist a mega-transaction with all of the tablets, 
> but wasn't able to update it to include the Raft configs, because that 
> exceeded 64M. So the catalog manager's background process was retrying this 
> (and failing), over and over.
> # The tserver had handled about 100k of the CreateTablet RPCs, and then ran 
> out of fds because it needs at least two fds per tablet (to read and write 
> from the latest WAL segment). It crashed again when restarted, as the 
> bootstrap process attempted to reopen all of these WAL segments.
> So, we should enforce some sort of guardrail when NR=1 too, if only to avoid 
> these hard-to-debug issues.
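
A minimal sketch of the guardrail described above (illustrative Python, not Kudu's actual C++ implementation; the function name and the limit value are hypothetical), together with the arithmetic behind the failure:

```python
# Hypothetical guardrail check; the name and limit are illustrative,
# not Kudu's actual implementation.
MAX_CREATE_TABLET_REPLICAS = 2000  # assumed limit, for illustration only

def check_create_table(num_tablets, num_replicas):
    # Enforce the cap on total replicas even when num_replicas == 1;
    # 160,000 single-replica tablets are just as harmful as replicated ones.
    total_replicas = num_tablets * num_replicas
    if total_replicas > MAX_CREATE_TABLET_REPLICAS:
        raise ValueError(
            f"requested {total_replicas} tablet replicas, which exceeds "
            f"the limit of {MAX_CREATE_TABLET_REPLICAS}")

# The reported table: four levels of 20-way hash partitioning, RF=1.
num_tablets = 20 ** 4          # 20 * 20 * 20 * 20 = 160,000 tablets
fds_needed = num_tablets * 2   # at least 2 fds per tablet for the latest
                               # WAL segment, far beyond the 32k RLIMIT_NOFILE
```

With the num_replicas > 1 condition removed, this table would have been rejected at creation time instead of failing later in the catalog manager and tserver.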



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
