Ivan Bessonov created IGNITE-16933: -------------------------------------- Summary: PageMemory-based MV storage implementation Key: IGNITE-16933 URL: https://issues.apache.org/jira/browse/IGNITE-16933 Project: Ignite Issue Type: New Feature Reporter: Ivan Bessonov
Similar to IGNITE-16611, we need an MV-storage implementation for page memory storage engine. Currently, I expect only row storage implementation, without primary or secondary indexes. Here I'm going to describe a data format. Each row is stored as a versioned chain. It will be represented by a number of data entries that will have references to each other. {code:java} [ Timestamp | NextLink | PayloadSize | Payload ]{code} * Timestamp is a 16 bytes value derived from {{org.apache.ignite.internal.tx.Timestamp}} instance. * NextLink is a link to the next element in the chain or a NULL_LINK (or any other convenient name). It's a long value in standard format for Page Memory links (itemId, flag, partitionId, pageIdx). Technically, partition id is not needed here, because it's always the same. Removing it could allow us to save 2 bytes per chain element. * PayloadSize is a 4-bytes integer value that gives us the size of actual data in arbitrary format. * Payload - I expect it to be a serialized BinaryRow data. This is how it's implemented in RocksDB right now. For uncommitted (pending) entries I propose using maximal possible timestamp - {{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note that we never store tx id in chain itself. Overall, every chain element will have a (16 + 6 + 4 = 26) bytes header. It should be used as a header size in corresponding FreeList. There's a requirement to have an immutable RowId for every versioned chain. One could argue that we should just make chain head immutable, but it would result in lots of complications. It's better to have a separate structure with immutable link, that will point to an actual head of the versioned chain. {code:java} [ TransactionId | HeadLink | NextLink ]{code} * TransactionId is a UUID. Can only be applied to pending entries. For committed head I propose storing 16 zeroes. * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already mentioned, I'd prefer 6. * NextLink is a "NextLink" value from the head chain element. It's a cheap shortcut for read-only transactions, you can skip uncommitted entry without even trying to read it, if there's a non-null transaction id. Debatable, I know, but looks cheap enough. In total, RowId is a 8 bytes link, pointing to a structure that has (16 + 6 + 6 = 28) bytes of data. There must e a separate FreeList for every partition even in In-Memory mode for reasons that I'll give later. "Header" size in that list must be equal to these 28 bytes. I wander how effective FreeList will be for this case, where every chunk has the same size. We'll see. Maybe we should adjust a number of buckets somehow. Now, the fun part. There's no mention of B+Tree here. That's because we can probably just avoid it. If it existed, it would just point RowId to a described RowId structure in partition, but RowId is already a pointer itself. The only other problem that is usually solved by a tree-like structure is a full-scan of all rows in partition. This is useful when you need to rebuild indexes, for example. We should keep in mind that there's no code yet for rebuilding indexes. On the other hand, there's a method for partition scan in the API. It could be used to implement a Primary Index imitation until we have a real implementation. There's not FreeList full-scan currently in the code, it needs to be implemented. And, this particular full-scan is the reason why every partition should have its own list of row ids. There's also a chance that introducing new flag for row ids might be convenient. I don't know yet, let's not do it for now. Finally, we need an adequate protection from assertions if we, for some reason, have invalid row id. Things that can be checked be a normal code, not assertion: * data page type * number of items in the page -- This message was sent by Atlassian Jira (v8.20.7#820007)