[ https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov updated IGNITE-17571:
-----------------------------------
Description: 

h3. Basics

The current implementation can only work with infinite storage space, because the storage works in append-only mode (except for transaction rollbacks).

There's a value in the system called the "low watermark" (LW). It is guaranteed that no new RO transaction will be started at a time earlier than the LW. The existence of such a value allows us to safely delete all versions before it. We will not discuss the mechanics of acquiring this value, assuming that it is simply given.

It's expected that deletion will be performed by background workers, without holding any transaction locks. The process runs independently on each RAFT group member. GC shouldn't affect transactions in any way, if possible.

h3. API

The originally intended design looks like the following:
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem with it is the existence of bounds:
* First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost all methods with bounds in existence?
* Second, I believe that this API was proposed under the assumption that row ids have a uniform distribution and that every "range" cleanup will result in a roughly equal amount of work. This is simply not true in the current implementation.

RowId is a timestamp-based value that exists in a very narrow time slice, making most ranges empty and meaningless. Besides, the way row ids are stored is actually up to the storage: there are no restrictions on byte order when physically storing RowId objects.

h3. Problems

There's one thing that worries me: indexes.

Because storages are unaware of table schemas, it's impossible to clean up indexes from within the storage. This gets me thinking: the API should be flexible enough that index cleanup can be performed as an external operation over the storage. This will also reduce the amount of work that we need to do for the implementation.

To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows, BiConsumer<RowId, BinaryRow> indexCleaner);{code}
An explanation is required:
* the timestamp (threshold) represents the same thing as before - the low watermark.
* startId - the row that should be iterated first in the current batch.
* numRows - the number of rows that should be cleaned in the current batch. By this I mean individual versions, not chains.
* the cleaner closure must be used to clean indexes after every individual version removal. Right now this doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
* the next rowId is returned, which should be used as startId in the next call; "null" means that the cleanup is complete. In this case it can be started from the beginning or simply postponed until a new low watermark value is available.

How to operate it (see the sketch below):
* numRows has two strategic purposes:
** to control the size of batches in "runConsistently" closures;
** to be able to run many cleanups in parallel, avoiding pool starvation; every task is split into small chunks.
* cleanup should start from the smallest possible row id.
* the low watermark value can be changed in-between calls. This is fine and even preferable.
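For illustration, here is a minimal sketch of how a background worker could drive this method in batches. It is written against a hypothetical MvGcStorage slice of the storage interface; the smallestRowId() accessor, the Supplier-based runConsistently() signature and the batch size are assumptions made for the sketch (Ignite types RowId, HybridTimestamp, BinaryRow are referenced without imports for brevity), not part of the proposal:
{code:java}
import java.util.function.BiConsumer;
import java.util.function.Supplier;

/** Hypothetical slice of the partition storage API - only what the GC driver below needs. */
interface MvGcStorage {
    /** The batched cleanup proposed above; returns the next RowId, or null when the pass is complete. */
    RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows,
            BiConsumer<RowId, BinaryRow> indexCleaner);

    /** Smallest possible row id to start a pass from (assumed accessor). */
    RowId smallestRowId();

    /** Runs the closure under the storage's consistency guarantees (assumed signature). */
    <V> V runConsistently(Supplier<V> closure);
}

/** Background GC worker that processes one partition in small chunks. */
class GcWorker {
    private static final int BATCH_SIZE = 100; // arbitrary; controls how much work one closure does

    void runGcPass(MvGcStorage storage, HybridTimestamp lowWatermark,
            BiConsumer<RowId, BinaryRow> indexCleaner) {
        RowId next = storage.smallestRowId();

        while (next != null) {
            RowId startId = next;

            // One small batch per "runConsistently" closure, so many partitions can be
            // cleaned in parallel without starving the worker pool.
            next = storage.runConsistently(
                    () -> storage.cleanup(lowWatermark, startId, BATCH_SIZE, indexCleaner));

            // A fresher low watermark may be picked up between batches; that is fine and even preferable.
        }
    }
}
{code}
When runGcPass() returns, the pass can simply be re-armed once a newer low watermark value is published.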
h3. Alternative

Some might find the given API too complicated. There's another way that involves fewer actions from the "client":
{code:java}
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner);{code}
In a way, both of these methods are equivalent, and the difference is minuscule. One can be transformed into the other with relative ease. If anything, this one is even preferable as a first implementation, because it's smaller and there's less room for errors.
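To show why the two variants are interchangeable, here is a sketch of the unbounded variant expressed through the batched one, reusing the hypothetical MvGcStorage interface from the sketch above; the default method and its internal batch size are illustrative assumptions:
{code:java}
import java.util.function.BiConsumer;

/** Sketch: the simpler, unbounded variant derived from the batched one. */
interface MvGcStorageUnbounded extends MvGcStorage {
    /** Cleans everything below the threshold starting from startId, driving the batched method internally. */
    default void cleanup(HybridTimestamp threshold, RowId startId,
            BiConsumer<RowId, BinaryRow> indexCleaner) {
        int batchSize = 100; // arbitrary internal chunk size

        for (RowId next = startId; next != null; ) {
            next = cleanup(threshold, next, batchSize, indexCleaner);
        }
    }
}
{code}
Going the other way is just as easy: the batched variant is the unbounded one with an early exit after numRows removed versions, which is why the simpler signature looks like a reasonable first step.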
h3. Random notes

The passed closure should only be executed when the specific version has already been deleted. Otherwise the data in the row store would still match the indexes and the indexes would not be cleaned, which is wrong. There must be a test for it, of course.

Unsurprisingly, integration of this method into the raft state machine is out of scope. I don't think that LW propagation is implemented at this point. I may be wrong. Anyway, we'd better discuss it with the people implementing transactions right now.

was:
h3. Basics

The current implementation can only work with infinite storage space, because the storage works in append-only mode (except for transaction rollbacks).

There's a value in the system called the "low watermark" (LW). It is guaranteed that no new RO transaction will be started at a time earlier than the LW. The existence of such a value allows us to safely delete all versions before it. We will not discuss the mechanics of acquiring this value, assuming that it is simply given.

h3. API

The originally intended design looks like the following:
{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
Now, I don't think that this is necessarily a good design. The main problem with it is the existence of bounds:
* First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost all methods with bounds in existence?
* Second, I believe that this API was proposed under the assumption that row ids have a uniform distribution and that every "range" cleanup will result in a roughly equal amount of work. This is simply not true in the current implementation.

RowId is a timestamp-based value that exists in a very narrow time slice, making most ranges empty and meaningless. Besides, the way row ids are stored is actually up to the storage: there are no restrictions on byte order when physically storing RowId objects.

h3. Problems

There's one thing that worries me: indexes.

Because storages are unaware of table schemas, it's impossible to clean up indexes from within the storage. This gets me thinking: the API should be flexible enough that index cleanup can be performed as an external operation over the storage. This will also reduce the amount of work that we need to do for the implementation.

To be specific, it feels like the method would look like this:
{code:java}
RowId cleanup(Timestamp threshold, RowId startId, int numRows, BiConsumer<RowId, BinaryRow> indexCleaner);{code}
An explanation is required:
* the timestamp (threshold) represents the same thing as before - the low watermark.
* startId - the row that should be iterated first in the current batch.
* numRows - the number of rows that should be cleaned in the current batch. By this I mean individual versions, not chains.
* the cleaner closure must be used to clean indexes after every individual version removal. Right now this doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
* the next rowId is returned, which should be used as startId in the next call; "null" means that the cleanup is complete. In this case it can be started from the beginning or simply postponed until a new low watermark value is available.

How to operate it:
* numRows has two strategic purposes:
** to control the size of batches in "runConsistently" closures;
** to be able to run many cleanups in parallel, avoiding pool starvation; every task is split into small chunks.
* cleanup should start from the smallest possible row id. Unfortunately, we can't just use (0L, 0L), that's wrong. Maybe we should add something like "smallestRowId()" to the storage engine.
* the low watermark value can be changed in-between calls. This is fine and even preferable.

h3. Random notes

The passed closure should only be executed when the specific version has already been deleted. Otherwise the data in the row store would still match the indexes and the indexes would not be cleaned, which is wrong. There must be a test for it, of course.

Unsurprisingly, integration of this method into the raft state machine is out of scope. I don't think that LW propagation is implemented at this point. I may be wrong. Anyway, we'd better discuss it with the people implementing transactions right now.


> Implement GC for MV storages
> ----------------------------
>
>                 Key: IGNITE-17571
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17571
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)