errose28 opened a new pull request #1298:
URL: https://github.com/apache/hadoop-ozone/pull/1298


   ## What changes were proposed in this pull request?
   
   This pull request adds column family support to datanode RocksDB instances. Originally, datanodes placed all of their data under the default column family in RocksDB. This differs from OM and SCM, which organize their data into different column families by type. This change moves the datanode code off of the database utilities in the hadoop.hdds.utils package (which have no column family support) and onto the newer utilities used by OM and SCM in the hadoop.hdds.utils.db package (which do). It then divides the data for new datanode containers into three column families. This is implemented in a backwards compatible way, so containers written in the old format can still be used. Since this is a rather large pull request, a breakdown of the different components is provided below.
   
   ### Migration From Old to New Database Interfaces
   
   In order to divide container data into different column families, the datanode code first had to be moved off of the old database utilities in the hadoop.hdds.utils package (which have no column family support) and onto the newer utilities used by OM and SCM in the hadoop.hdds.utils.db package (which do). Besides enabling column families, the new utilities also add a strong typing layer over the tables. The majority of the file changes simply replace calls made through the old interface with their equivalents in the new interface. This allowed removing the byte stream conversions that previously had to be done around every database interaction, since the new interface provides codecs to handle them.
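
   Purely for illustration, here is a minimal before/after sketch of that difference. The two tiny interfaces below are simplified stand-ins for the old (hadoop.hdds.utils) and new (hadoop.hdds.utils.db) utilities, not the actual Ozone signatures, and the helper methods are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * Rough before/after sketch of the migration described above; the interfaces
 * here are simplified stand-ins, not the real Ozone APIs.
 */
public class OldVsNewInterfaceSketch {

  /** Stand-in for the old, untyped store: everything is raw bytes. */
  interface OldMetadataStore {
    byte[] get(byte[] key) throws IOException;
  }

  /** Stand-in for a new, typed table whose codecs handle byte conversion. */
  interface Table<K, V> {
    V get(K key) throws IOException;
  }

  /** Old style: callers encode keys and decode values by hand. */
  static String readWithOldInterface(OldMetadataStore store, String blockId)
      throws IOException {
    byte[] rawKey = blockId.getBytes(StandardCharsets.UTF_8); // manual encoding
    byte[] rawValue = store.get(rawKey);                      // untyped bytes back
    return rawValue == null ? null
        : new String(rawValue, StandardCharsets.UTF_8);       // manual decoding
  }

  /** New style: the byte stream conversion happens inside the table. */
  static String readWithNewInterface(Table<String, String> blockDataTable,
      String blockId) throws IOException {
    return blockDataTable.get(blockId);
  }
}
```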
   
   Code using the new interface should exhibit behavior identical to the code that used the old interface. In cases where the new interface lacked functionality present in the old one, that functionality was added to the new interface. Apart from these minor extensions, this pull request does not modify the new interface or its existing implementations; it only adds the new implementations that the datanode needs.
   
   ### New Container Layout
   
   Any new containers created with this code will divide their data into three column families (a short usage sketch follows the list):
   
   1. **block_data**
       - Key type: `String`
           - Block ID with optional prefix.
       - Value type: `BlockData`
   
   2. **metadata**
       - Key type: `String`
           - Name of metadata field.
       - Value type: `Long`
           - The value of the field.
   
   3. **deleted_blocks**
       - Key type: `String`
           - Block ID with optional prefix.
       - Value type: `ChunkInfoList`
           - A new type that allows encoding/decoding and saving the chunk information associated with blocks that have been deleted.
           - This value is not currently used by the code (except in tests), 
but is included for potential future use.
           - The underlying chunks are still deleted, only the information 
about them is retained.
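
   The sketch below shows one plausible way these three tables could be used together when a block is deleted, purely to illustrate the layout. The `Table` interface is a simplified stand-in, and the `deleteBlock` helper and the `pendingDeleteBlockCount` metadata key are hypothetical; this is not the actual block deleting flow in the datanode.

```java
import java.io.IOException;

/**
 * Illustrative sketch of the schema version 2 layout in use. The Table
 * interface, the deleteBlock helper, and the pendingDeleteBlockCount metadata
 * key are hypothetical stand-ins, not the real datanode code or constants.
 */
public class SchemaTwoUsageSketch {

  /** Simplified stand-in for a typed table in hadoop.hdds.utils.db. */
  interface Table<K, V> {
    V get(K key) throws IOException;
    void put(K key, V value) throws IOException;
    void delete(K key) throws IOException;
  }

  /**
   * One plausible interaction with the three tables when a block is deleted:
   * the entry leaves block_data, its chunk list is retained in deleted_blocks,
   * and a counter in the metadata table is updated.
   */
  static <B, C> void deleteBlock(Table<String, B> blockDataTable,
      Table<String, C> deletedBlocksTable,
      Table<String, Long> metadataTable,
      String blockId, C chunkInfoList) throws IOException {
    if (blockDataTable.get(blockId) == null) {
      return; // unknown block, nothing to do
    }
    blockDataTable.delete(blockId);
    // The chunk information is retained for potential future use even though
    // the chunks themselves will be removed from disk.
    deletedBlocksTable.put(blockId, chunkInfoList);
    Long count = metadataTable.get("pendingDeleteBlockCount"); // hypothetical key
    metadataTable.put("pendingDeleteBlockCount",
        count == null ? 1L : count + 1);
  }
}
```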
   
   ### Backwards Compatibility
   
   To distinguish containers created in the original layout (only a default column family) from containers created in the new layout (three column families), a new field called *schemaVersion* was added to the .container files. A *schemaVersion* of 1, or a missing *schemaVersion* value, indicates that the container was created in the old layout. A *schemaVersion* of 2 indicates that the container was created in the new layout. The code uses the proper `DatanodeStore` implementation for the corresponding schema version, so callers do not need to change their operations based on the schema version.
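
   A minimal sketch of this dispatch is shown below, using hypothetical stand-in types rather than the real `DatanodeStore` classes and their constructors.

```java
/**
 * Hedged sketch of schema-version dispatch. The store types and the shape of
 * openStore are hypothetical stand-ins; the real code wires up
 * DatanodeStoreSchemaOneImpl / DatanodeStoreSchemaTwoImpl with more context.
 */
public class SchemaVersionDispatchSketch {

  interface DatanodeStore { }                                // stand-in
  static class SchemaOneStore implements DatanodeStore { }   // old layout
  static class SchemaTwoStore implements DatanodeStore { }   // new layout

  static DatanodeStore openStore(String schemaVersion) {
    // A missing schemaVersion (pre-upgrade .container file) or "1" means the
    // container was written with the original single-column-family layout.
    if (schemaVersion == null || "1".equals(schemaVersion)) {
      return new SchemaOneStore();
    }
    if ("2".equals(schemaVersion)) {
      return new SchemaTwoStore();
    }
    throw new IllegalArgumentException(
        "Unrecognized container schema version: " + schemaVersion);
  }
}
```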
   
   Unfortunately, the existing design of the interface requires callers to specify which column families (called `Table`s by the interface) they interact with, which makes it difficult to abstract this piece of the implementation away. To allow the same set of calls to work with both schema versions, callers always specify tables as if they are using schema version 2. If they are in fact using schema version 2, database interactions proceed as expected. If they are using schema version 1, however, calls into the *block_data*, *metadata*, or *deleted_blocks* tables are redirected to the single *default* table. This comes with a few issues that had to be resolved:
   
   1. Since deleted blocks are in their own table in schema version 2, they do 
not need a prefix to separate them from regular blocks.
   
       - In schema version 1, however, the prefix is still necessary; otherwise, block ID keys for deleted and regular blocks would have identical formats and be indistinguishable within the default table.
   
       - This is solved by the `SchemaOneDeletedBlocksTable` class, which is returned as the `Table` implementation when callers ask for the deleted blocks table but are using schema version 1 (a minimal sketch of this wrapping appears after this list).
           - This class automatically adds the `#deleted#` prefix to caller data, so callers can read and write as if their data were actually in a separate table.
           - This class will not preprocess and modify iterator results, 
however, so scanning the deleted blocks table will return the keys with the 
`#deleted#` prefix.
               - This is documented above the appropriate methods, and the 
current code never reads iterator values, only their size.
           - The definition of the `#deleted#` prefix was moved out of the 
`OzoneConsts` and `MetadataKeyFilters` classes and into 
`SchemaOneDeletedBlocksTable`, since this field is now specific only to this 
case.
   
   2. Databases that had blocks deleted from them before this pull request will 
have the deleted block ID key mapped to the block ID value, instead of having 
the chunk information for that block saved as the value.
       - This is solved by the `SchemaOneChunkInfoListCodec` class, which chains the exception thrown when trying to decode a block ID as a `ChunkInfoList` to another `IOException` with a more detailed error message explaining why the chunk information may be invalid (a sketch of this chaining also appears after the list).
       - Callers are already required by the `Codec` interface to handle checked `IOException`s from this method, so this does not introduce any surprising behavior.
   
   3. Calls to `Table#iterator` will return an iterator over the whole 
underlying table structure.
       - On schema version 1, this means calls like `blockDataTable.iterator()` 
will iterate all data in the default table, not just that pertaining to block 
information.
           - Without an explicit list of key formats defined for each table (which would be fragile if prefixes or metadata are added or changed), this method cannot function properly for schema version 1.
       - Since this method is not used by datanode code, it was left unaltered.
           - An alternate design choice would be to always return a table 
implementation that throws `UnsupportedOperationException` for this call.
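
   As referenced in point 1, here is a minimal sketch of the key-prefixing idea behind `SchemaOneDeletedBlocksTable`. The `Table` interface is a simplified stand-in; the real class wraps the full table interface from hadoop.hdds.utils.db and covers many more operations (and, as noted above, it does not rewrite iterator results).

```java
import java.io.IOException;

/**
 * Minimal sketch of the prefixing idea behind SchemaOneDeletedBlocksTable.
 * The Table interface here is a simplified stand-in, not the real one.
 */
public class DeletedBlocksPrefixSketch {

  static final String DELETED_PREFIX = "#deleted#";

  /** Simplified stand-in for the typed table interface. */
  interface Table<K, V> {
    V get(K key) throws IOException;
    void put(K key, V value) throws IOException;
    void delete(K key) throws IOException;
  }

  /**
   * Presents the single default table as if it were a dedicated deleted-blocks
   * table: keys are transparently prefixed on the way in, so deleted block IDs
   * never collide with regular block IDs.
   */
  static class PrefixedDeletedBlocksTable<V> implements Table<String, V> {

    private final Table<String, V> defaultTable;

    PrefixedDeletedBlocksTable(Table<String, V> defaultTable) {
      this.defaultTable = defaultTable;
    }

    private String prefixed(String blockId) {
      return DELETED_PREFIX + blockId;
    }

    @Override
    public V get(String blockId) throws IOException {
      return defaultTable.get(prefixed(blockId));
    }

    @Override
    public void put(String blockId, V value) throws IOException {
      defaultTable.put(prefixed(blockId), value);
    }

    @Override
    public void delete(String blockId) throws IOException {
      defaultTable.delete(prefixed(blockId));
    }
  }
}
```

   And a rough sketch of the exception chaining from point 2, again with a minimal stand-in `Codec` interface rather than the real one.

```java
import java.io.IOException;

/**
 * Rough sketch of the exception chaining idea behind
 * SchemaOneChunkInfoListCodec. The Codec interface below is a minimal
 * stand-in, not the real hadoop.hdds.utils.db Codec.
 */
public class ChunkInfoListDecodeSketch {

  /** Stand-in for a decoder from persisted bytes to an in-memory type. */
  interface Codec<T> {
    T fromPersistedFormat(byte[] raw) throws IOException;
  }

  /** Decorates a codec so schema one decode failures carry a clearer message. */
  static <T> Codec<T> withSchemaOneContext(Codec<T> delegate) {
    return raw -> {
      try {
        return delegate.fromPersistedFormat(raw);
      } catch (IOException cause) {
        throw new IOException("Could not decode chunk information for this"
            + " deleted block: blocks deleted before this change stored the"
            + " block ID, not the chunk list, as the value.", cause);
      }
    };
  }
}
```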
   
   ### Guide to File Changes
   
   - The following *new* files were created to supply the datanode with 
necessary functionality:
       - For defining the database layout:
           - AbstractDatanodeDBDefinition.java
           - DatanodeSchemaOneDBDefinition.java
           - DatanodeSchemaTwoDBDefinition.java
   
       - For interacting with the database:
           - DatanodeStore.java
           - AbstractDatanodeStore.java
           - DatanodeStoreSchemaOneImpl.java
           - DatanodeStoreSchemaTwoImpl.java
   
       - For encoding/decoding byte streams to/from the database:
           - ChunkInfoList.java
           - ChunkInfoListCodec.java
           - BlockDataCodec.java
   
       - Wrappers for backwards compatibility:
           - SchemaOneDeletedBlocksTable.java
           - SchemaOneChunkInfoListCodec.java
   
       - Unit testing:
           - TestSchemaOneBackwardsCompatibility.java
   
   - Additionally, a RocksDB database and associated .container file were 
created using the original code to test backwards compatibility.
       - These files are internal to RocksDB and can be ignored, even though 
they show up in the diff.
       - Their contents are documented in 
`TestSchemaOneBackwardsCompatibility#TestDB`.
   
   - Most other changes are switching existing code over to the new interface, 
or adding equivalent functionality that was present in the old interface to the 
new interface.
   
   
   ## What is the link to the Apache JIRA
   
   HDDS-3869
   
   ## How was this patch tested?
   
   - To ensure equivalent functionality between the old and new interfaces and 
their implementations, the code was run through the existing unit test 
framework.
       - Since this creates all containers from scratch, these tests were all run using databases in schema version 2.
   
   - To ensure backwards compatibility, a new set of tests was created in `TestSchemaOneBackwardsCompatibility.java`.
       - These tests use a RocksDB database and container file that were 
generated using the original code before this pull request.
       - These tests cover reading all data in the database and writing everything except new block data.
           - Since all containers will be closed before upgrade, no new blocks 
will be added to containers encountered with schema version 1.
   
   - `TestBlockDeletingService#testDeletedChunkInfo` was added to verify that, when using a database in schema version 2, chunk information from deleted blocks is saved correctly.
   
   - In `TestKeyValueBlockIterator`, use of the `#deleted#` blocks prefix was 
replaced with the `#BCSID#` prefix, since deleted blocks were moved to their 
own column family outside of the block data table.
       - The specific prefixes used in these tests are arbitrary, as long as they are two different prefixes present in the block data table.
       - The new unit test `testKeyValueBlockIteratorWithAdvancedFilter` was 
also added to this class.
           - This makes sure that the iterator still functions properly when 
keys matching a given filter may occur in the middle of the key set.
               - This condition was not guaranteed to occur in any of the 
existing unit tests, but it could happen when using the schema version 2 layout.
   
   ## Notes
   
   Leaving as draft until HDDS-4061 is merged, since this pull request 
incorporates changes from that issue.

