[ https://issues.apache.org/jira/browse/KUDU-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644241#comment-17644241 ]
dengke commented on KUDU-3413:
------------------------------

[~aserbin] I'm sorry I missed the design purpose. I have opened a new Google document to discuss the design, as you suggested: https://docs.google.com/document/d/1hcZC9qsMB4OxlJc-yYQk3JISXTCiofshG5K31_h3zas/edit?usp=sharing

> Kudu multi-tenancy
> ------------------
>
>                 Key: KUDU-3413
>                 URL: https://issues.apache.org/jira/browse/KUDU-3413
>             Project: Kudu
>          Issue Type: New Feature
>            Reporter: dengke
>            Assignee: dengke
>            Priority: Major
>         Attachments: data_and_metadata.png, kudu table topology.png, metadata_record.png, new_fs_manager.png, tablet_rowsets.png, zonekey_update.png
>
>
> h1. 1. Definition
>  * Tenant: A cluster user can be called a tenant. Tenants may be divided by project or by actual application. Each tenant is equivalent to a resource pool, and all users under a tenant share all resources of that pool. Multiple tenants share one cluster's resources.
>  * User: A consumer of cluster resources.
>  * Multi-tenancy: access is controlled at the database level so that tenants cannot access each other's data, and resources are private and independent. (Note: Kudu has no concept of a database; here it can simply be understood as a group of tables.)
> h1. 2. Current situation
> The latest version of Kudu implements 'data at rest encryption': mainly cluster-level authentication and encryption, plus data storage encryption at the level of a single server. This meets the needs of basic encryption scenarios, but there is still a gap between it and the tenant-level encryption we are pursuing.
> h1. 3. Outline design
> In general, tenant-level encryption differs from cluster-level encryption in the following ways:
>  * Tenant-level encryption requires isolated data storage, which means data belonging to different tenants needs to be separated (a new namespace layer may be added to the storage topology, so that data of the same tenant is stored under the same namespace path, with minimal mutual impact);
>  * Generation and use of tenants' keys: in a multi-tenant scenario, we need to replace the cluster key with per-tenant keys.
> h1. 4. Design
> h2. 4.1 Namespace
> In the storage industry, a namespace is mainly used to maintain file attributes, the directory tree structure and other file system metadata, and is compatible with POSIX directory trees and file operations; it is a core concept in file storage. Taking the common HDFS as an example, its resource isolation is achieved by logically partitioning the disk, attaching the partition files to different directories, and finally modifying the directory owners' permissions.
> In the Kudu system, the current storage topology is relatively mature: a Kudu client's read/write requests must be processed by a tserver before the corresponding data can be obtained. Requests never manipulate the raw data directly; that is, the client does not perceive the data distribution inside the storage engine at all, so there is a natural degree of data isolation.
> However, the data of different tenants is intertwined inside the storage engine, and in some extreme cases interference is still possible. The most thorough solution would be to completely separate the read/write, compaction and other processing paths of different tenants, but that would require extensive changes and might destabilize the system. Instead, we can make minimal per-tenant changes to achieve physical isolation of the data.
> First, we need to analyze the current storage topology: a table in Kudu is divided into multiple tablet partitions. Each tablet includes tablet metadata and several RowSets. A RowSet consists of a 'MemRowSet' (the data in memory) and multiple 'DiskRowSets' (the data on disk). A 'DiskRowSet' contains a 'BloomFile', an 'Ad_hoc Index', 'BaseData', a 'DeltaMem', and several 'RedoFiles' and 'UndoFiles' (generally, there is only one 'UndoFile'). For more detailed distribution information, please refer to the following figure.
> !kudu table topology.png|width=1282,height=721!
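> To make this hierarchy concrete, here is a minimal illustrative sketch; the types below are simplified stand-ins mirroring the description above, not actual Kudu classes:
> {code:c++}
> // Illustrative only: simplified stand-ins for the storage hierarchy
> // described above, not actual Kudu classes.
> #include <string>
> #include <vector>
>
> struct MemRowSet {};  // the in-memory portion of a RowSet
>
> // Each DiskRowSet consists of six kinds of parts; as explained below,
> // each part is stored as its own CFile.
> struct DiskRowSet {
>   std::string bloom_file;
>   std::string ad_hoc_index;
>   std::string base_data;
>   std::string delta_mem;
>   std::vector<std::string> redo_files;
>   std::vector<std::string> undo_files;  // generally only one
> };
>
> struct RowSet {
>   MemRowSet mem_rowset;                  // data in memory
>   std::vector<DiskRowSet> disk_rowsets;  // data on disk
> };
>
> struct Tablet {
>   std::string tablet_id;
>   std::vector<RowSet> rowsets;  // a table is partitioned into multiple tablets
> };
> {code}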
> *The simplest way to achieve physical isolation is to store the data of different tenants under different paths.* Currently, we only need to consider the physical isolation of the 'DiskRowSet'.
> Kudu writes to disk through containers. Each container owns a large contiguous region of disk space and is used to write the data of one CFile (the actual storage form of a 'DiskRowSet') at a time. When a CFile has been fully written, the container is returned to the 'BlockManager', after which it can be used to write the next CFile. When no container is available in the BlockManager, a new one is created for the new CFile. Each container consists of a *.metadata file and a *.data file. Each DiskRowSet has several blocks, and all the blocks of one DiskRowSet are distributed across multiple containers; conversely, one container may contain data from multiple DiskRowSets.
> It can be roughly understood that one DiskRowSet corresponds to one CFile (in the single-column case; a multi-column DiskRowSet corresponds to multiple CFiles). The difference is that the DiskRowSet is our logical organization, while the CFile is the physical storage. For the six parts of a DiskRowSet (BloomFile, BaseData, UndoFile, RedoFile, DeltaMem and AdhocIndex, as shown in the figure above), it is neither the case that one CFile corresponds to one whole DiskRowSet, nor that one CFile contains all six parts. The six parts are kept in multiple CFiles, one separate CFile per part. As shown in the figure below, in an actual production environment we can only find *.data and *.metadata files; no standalone CFile file exists.
> !data_and_metadata.png|width=1298,height=395!
> This is because a large number of CFiles are merged by the container and written into a *.data file; a *.data file is actually a collection of CFiles. The CFile corresponding to each part of a DiskRowSet, together with the mapping relationship, is recorded in tablet-meta/<tablet_id>; in that file the mappings are saved separately per tablet_id.
> In the current storage topology, the *.metadata file corresponds to the metadata of the blocks (the final representation of CFiles in the fs layer) at the lowest fs layer. It is not on the same level as concepts such as CFile and BlockManager; rather, it records information about each block. The figure below shows one record from a *.metadata file.
> !metadata_record.png!
> According to the above description, we can draw the approximate correspondence shown in the figure below:
> !tablet_rowsets.png|width=1315,height=695!
> Based on the above, the *.data files are where tenant data is actually stored; to achieve data isolation, we need to isolate the *.data files. To achieve this goal, we can create a different BlockManager for each tenant, each maintaining its own *.data files, as sketched below.
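> The following rough sketch shows this idea, under the assumption that fs_manager keeps a map from tenant name to a dedicated block manager; the names used here (GetOrCreateBlockManagerForTenant, OpenBlockManagerAt, tenant_block_managers_) are hypothetical illustrations of the design, not existing Kudu APIs:
> {code:c++}
> // Hedged sketch, not actual Kudu code: route block I/O to a per-tenant
> // BlockManager so each tenant's *.data/*.metadata containers live apart.
> #include <memory>
> #include <mutex>
> #include <string>
> #include <unordered_map>
>
> class BlockManager {};  // stand-in for Kudu's block manager interface
>
> // Hypothetical helper: opens a block manager rooted at the given path.
> std::unique_ptr<BlockManager> OpenBlockManagerAt(const std::string& root) {
>   return std::make_unique<BlockManager>();
> }
>
> class FsManager {
>  public:
>   // With no tenant name, fall back to the default block manager; otherwise
>   // return (creating on first use) the block manager for `tenant_name`.
>   BlockManager* GetOrCreateBlockManagerForTenant(const std::string& tenant_name) {
>     if (tenant_name.empty()) {
>       return default_block_manager_.get();
>     }
>     std::lock_guard<std::mutex> l(lock_);
>     auto& bm = tenant_block_managers_[tenant_name];
>     if (!bm) {
>       // Each tenant gets its own directory, so its containers never mix
>       // with another tenant's.
>       bm = OpenBlockManagerAt(data_root_ + "/" + tenant_name);
>     }
>     return bm.get();
>   }
>
>  private:
>   std::string data_root_ = "/data/kudu";  // example root, for illustration
>   std::unique_ptr<BlockManager> default_block_manager_ =
>       OpenBlockManagerAt(data_root_ + "/default");
>   std::mutex lock_;
>   std::unordered_map<std::string, std::unique_ptr<BlockManager>> tenant_block_managers_;
> };
> {code}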
> *_In the default scenario (no tenant name is specified), the data is handled by a default block_manager. If multi-tenant encryption is enabled, fs_manager creates a new tenant_block_manager based on the tenant name, and the data of the specified tenant is stored through the tenant_block_manager corresponding to that tenant name, achieving physical data isolation._* The modified schematic diagram is as follows:
> !new_fs_manager.png|width=1306,height=552!
> The correspondence between a tenant and its block_manager is added to fs_manager and maintained in memory. The tenant information needs to be persisted; we can consider appending it to the existing metadata, or adding a new metadata file that is updated in real time.
> {code:java}
> message TenantMetadataPB {
>   message TenantMeta {
>     // The name of the tenant.
>     optional string tenant_name = 1;
>     // Encrypted tenant key used to encrypt/decrypt file keys for the tenant.
>     optional string tenant_key = 2;
>     // Initialization vector for the tenant key.
>     optional string server_key_iv = 3;
>   }
>   repeated TenantMeta tenant_meta = 1;
>   // Tenant key version.
>   optional string tenant_key_version = 2;
> }
> {code}
> h2. 4.2 Tenant Key
> There are currently two implementations of the key:
>  * When static encryption is enabled, the server_key is randomly generated by default;
>  * When the address and cluster name of the KMS are specified, the server_key is fetched from the KMS.
> The server_key is mainly used for the encryption and decryption of sensitive files. We should support work modes such as 'no encryption', 'default cluster static encryption', 'KMS cluster static encryption' and 'KMS multi-tenant encryption'. In the 'KMS multi-tenant encryption' mode, a new tenant name parameter needs to be added. The tenant name is used to distinguish tenants and to obtain the corresponding key. If the tenant name is not set, the behavior corresponds to the 'default cluster static encryption' mode, which means sharing the randomly generated server_key by default.
> In the previous cluster encryption scenario, kms_client obtains the cluster's zonekey information. However, the Ranger system only has zonekey information and no tenant information, so we need to maintain the correspondence between tenant names and zonekeys ourselves. To do this, we can add a configuration file (perhaps in JSON format) that records the mapping between tenant names and zonekeys. Every time the tenant name changes, we first add a zonekey in Ranger, then update the corresponding item in the configuration file, and finally have the end user use the new tenant name when creating tables.
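> As a hedged sketch of this tenant-name-to-zonekey resolution (the JSON layout, class and function names below are assumptions for illustration, not a settled format), before looking at the client interface that follows:
> {code:c++}
> // Hedged sketch: resolve a tenant name to its Ranger zonekey, checking the
> // in-memory map first and falling back to the configuration file. The config
> // layout and names here are illustrative assumptions, e.g.:
> //   {"tenant_to_zonekey": {"tenant_a": "zonekey_a", "tenant_b": "zonekey_b"}}
> #include <optional>
> #include <string>
> #include <unordered_map>
> #include <utility>
>
> class ZonekeyResolver {
>  public:
>   explicit ZonekeyResolver(std::string config_path)
>       : config_path_(std::move(config_path)) {}
>
>   // Returns the zonekey for `tenant_name`, or nullopt if it is unknown
>   // even after re-reading the configuration file.
>   std::optional<std::string> Resolve(const std::string& tenant_name) {
>     auto it = cache_.find(tenant_name);
>     if (it != cache_.end()) {
>       return it->second;
>     }
>     // Miss: reload the config file and refresh the in-memory map.
>     ReloadConfig();
>     it = cache_.find(tenant_name);
>     if (it != cache_.end()) {
>       return it->second;
>     }
>     return std::nullopt;  // caller reports the error
>   }
>
>  private:
>   void ReloadConfig() {
>     // Parsing elided; a real implementation would parse the JSON mapping
>     // from config_path_ and replace `cache_` with its contents.
>   }
>
>   std::string config_path_;
>   std::unordered_map<std::string, std::string> cache_;
> };
> {code}
> The resolved zonekey would then be used by the KMS client below to obtain the tenant's key.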
> {code:c++}
> class RangerKMSClient {
>  public:
>   explicit RangerKMSClient(std::string kms_url)
>       : kms_url_(std::move(kms_url)) {}
>
>   // Decrypts an encrypted key via the Ranger KMS; the tenant name is used
>   // to look up the zonekey of the corresponding tenant.
>   Status DecryptKey(const std::string& tenant_name,
>                     const std::string& encrypted_key,
>                     const std::string& iv,
>                     const std::string& key_version,
>                     std::string* decrypted_key);
>
>   // Generates a new encrypted server key for the given tenant.
>   Status GenerateEncryptedServerKey(const std::string& tenant_name,
>                                     std::string* encrypted_key,
>                                     std::string* iv,
>                                     std::string* key_version);
>
>  private:
>   std::string kms_url_;
> };
>
> class DefaultKeyProvider : public KeyProvider {
>  public:
>   ~DefaultKeyProvider() override {}
>
>   Status DecryptServerKey(const std::string& encrypted_server_key,
>                           const std::string& /*iv*/,
>                           const std::string& /*key_version*/,
>                           std::string* server_key) override;
>
>   Status GenerateEncryptedServerKey(std::string* server_key,
>                                     std::string* iv,
>                                     std::string* key_version) override;
> };
> {code}
> The encryption and decryption API of the KMS client needs to take the tenant name, and the correspondence between tenant names and zonekeys is maintained in memory. Each time it is used, we first search the in-memory map; if that lookup fails, we search the configuration file and update the in-memory data at the same time; if that fails as well, we return an error. Otherwise, we use the resolved zonekey to obtain the key.
> !zonekey_update.png|width=1273,height=754!
> h1. 5. Follow-up work
>  * Add a tenant name parameter;
>  * Add a parameter controlling the multi-tenant encryption mode;
>  * Modify the use of block_manager to adapt to multi-tenant scenarios;
>  * Modify key acquisition;
>  * Add multi-tenant key acquisition and sensitive data encryption;
>  * Modify the key acquisition and sensitive data encryption behavior of the default scenario (no tenant specified).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)