[ https://issues.apache.org/jira/browse/KUDU-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adar Dembo updated KUDU-2611:
-----------------------------
    Labels: newbie  (was: )

> Create table guardrail isn't enforced when NR=1
> -----------------------------------------------
>
>                 Key: KUDU-2611
>                 URL: https://issues.apache.org/jira/browse/KUDU-2611
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.8.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: newbie
>
> A Slack user reported a small constellation of issues that all turned out to
> be interrelated.
> His master was spitting out the following error message:
> {noformat}
> catalog_manager.cc:509] Error processing pending assignments: Service
> unavailable: error persisting updated tablet metadata: Transaction failed,
> tablet 00000000000000000000000000000000 transaction memory consumption (0)
> has exceeded its limit (67108864) or the limit of an ancestral tracker
> {noformat}
> This is odd; if the memory consumption is 0, how is the limit exceeded?
> Meanwhile, the tserver was running out of file descriptors even though
> RLIMIT_NOFILE was 32k for the kudu user.
> It turns out that the user had issued the following DDL:
> {noformat}
> CREATE TABLE foo(
>   ...
>   PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20,
>              HASH (...) PARTITIONS 20
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses' = ...,
>                'kudu.num_tablet_replicas' = '1');
> {noformat}
> This is a table with 160,000 tablets (20 * 20 * 20 * 20), which is way too
> many!
> The key is that the standard "max replicas at table creation time" guardrail
> wasn't enforced because it is conditioned on num_replicas > 1, so the master
> allowed the table to be created. Then what happened?
> # The master managed to persist a mega-transaction with all of the tablets,
> but wasn't able to update it to include the Raft configs, because that
> exceeded 64M. So the catalog manager's background process was retrying this
> (and failing), over and over.
> # The tserver had handled about 100k of the CreateTablet RPCs, and then ran
> out of fds because it needs at least two fds per tablet (to read and write
> from the latest WAL segment). It crashed again when restarted, as the
> bootstrap process attempted to reopen all of these WAL segments.
> So, we should enforce some sort of guardrail when NR=1 too, if only to avoid
> these hard-to-debug issues.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)