[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB
[ https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17081: --- Description: Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for prerequisites. Please also familiarize yourself with https://issues.apache.org/jira/browse/IGNITE-17077 for a better understanding; the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has its own WAL. So, for testing purposes, it would be enough to just store the update index in the meta column family. But we immediately get a write amplification issue, on top of possible performance degradation. The obvious solution is inherently bad and needs to be improved.

h2. General idea & implementation

Obviously, the WAL needs to be disabled (WriteOptions#setDisableWAL). This effectively breaks the RocksDB recovery procedure, so we need to take measures to compensate. The only feasible way to do so is to use DBOptions#setAtomicFlush in conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save all column families consistently, as long as updates that span several CFs go through a single batch. Basically, {{acquireConsistencyLock()}} would create a thread-local write batch that is applied on lock release. Most of RocksDbMvPartitionStorage will be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now (gladly, there's WriteBatchInterface#count() to check). I don't see any practical value in them, or a proper way of implementing them, considering how spread out in time the scan process is.

h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed index is more challenging: I propose caching it and updating it only from the closure, which can also be used by RAFT to truncate the log. For the closure, there are several things to account for during the implementation: * DBOptions#setListeners.
We need two events: ON_FLUSH_BEGIN and ON_FLUSH_COMPLETED. In atomic flush mode, all "completed" events go after all "begin" events. And once you have your first "completed" event, you have a guarantee that *all* memtables are already persisted. This allows easy tracking of RocksDB flushes; all that's needed is to monitor the alternation of the two event types. * Unlike the PDS implementation, here we will be writing the updateIndex value into a memtable every time. This makes it harder to find persistedIndex values for partitions. Gladly, considering the events that we have, during the time between the first "completed" and the very next "begin", the state on disk is fully consistent. And there's a way to read data from storage avoiding the memtable completely: ReadOptions#setReadTier(PERSISTED_TIER). Summarizing everything from the above, we should implement the following protocol:

{code:java}
During table start: read the latest values of the update indexes. Store them
in an in-memory structure. Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN) return;
  waitForLastAsyncUpdateIndexesRead();
  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED) return;
  asyncReadUpdateIndexesFromDisk();
  lastEventType = ON_FLUSH_COMPLETED;
{code}

Reading values from disk must be performed asynchronously so as not to stall the flushing process; we don't control the locks that RocksDB holds while calling the listener's methods. That asynchronous process would invoke closures that provide persisted updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" as late as possible, just in case. But my implementation calls it during the first event, and this is fine. I noticed that column families are flushed in the order of their internal ids. These ids correspond to the creation sequence of the CFs, and the "default" CF is always created first. This is the exact CF that we use to store meta.
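The event-alternation logic in the pseudocode above can be modeled as a small runnable state machine. This is a sketch: the counters stand in for the wait/read calls, and a real listener would extend org.rocksdb.AbstractEventListener instead.

```java
// Runnable model of the flush-listener protocol. The counters stand in for
// waitForLastAsyncUpdateIndexesRead() and asyncReadUpdateIndexesFromDisk();
// names follow the pseudocode above.
class FlushEventTracker {
    enum EventType { ON_FLUSH_BEGIN, ON_FLUSH_COMPLETED }

    // Table start behaves as if a flush had just completed.
    private EventType lastEventType = EventType.ON_FLUSH_COMPLETED;

    int waitCalls = 0;      // stand-in for waitForLastAsyncUpdateIndexesRead()
    int asyncReadCalls = 0; // stand-in for asyncReadUpdateIndexesFromDisk()

    void onFlushBegin() {
        // All "begin" events after the first one in a flush are ignored.
        if (lastEventType == EventType.ON_FLUSH_BEGIN) return;
        waitCalls++;
        lastEventType = EventType.ON_FLUSH_BEGIN;
    }

    void onFlushCompleted() {
        // Only the first "completed" event triggers the async read.
        if (lastEventType == EventType.ON_FLUSH_COMPLETED) return;
        asyncReadCalls++;
        lastEventType = EventType.ON_FLUSH_COMPLETED;
    }
}
```

With atomic flush, a flush of two column families produces two "begin" events followed by two "completed" events, and the tracker reacts exactly once to each phase.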
Maybe we're going to change this and create a separate meta CF. Only then could we start optimizing this part, and only if we have actual proof that there's a stall in this exact place.

h3. Types of storages RocksDB is used for:
* tables
* cluster management
* meta-storage

All these types should use the same recovery procedure, but the code is located in different places. I hope that it won't be a big problem and we can do everything at once.
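The thread-local write-batch pattern from the "General idea & implementation" section above can be sketched as follows. This is a toy model: a plain Map stands in for RocksDB, and the real code would use org.rocksdb.WriteBatchWithIndex together with RocksDB#write and WriteOptions#setDisableWAL(true); all names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of acquireConsistencyLock()/release: writes are buffered in a
// thread-local batch and applied to the "database" in a single call on
// release, which is what keeps several CFs consistent under atomic flush.
class ConsistencyLockSketch {
    // Stand-in for the RocksDB instance (all column families).
    private final Map<String, String> db = new HashMap<>();

    // Batch accumulated between lock acquisition and release.
    private final ThreadLocal<Map<String, String>> batch = new ThreadLocal<>();

    void acquireConsistencyLock() {
        batch.set(new HashMap<>());
    }

    void put(String key, String value) {
        batch.get().put(key, value); // buffered, not yet visible in db
    }

    void releaseConsistencyLock() {
        // The whole batch is applied in one write call on release.
        db.putAll(batch.get());
        batch.remove();
    }

    String get(String key) {
        return db.get(key);
    }
}
```

Note that this model also shows why scans with an unapplied batch are awkward: until release, reads against the database simply don't see the buffered writes.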
[jira] [Created] (IGNITE-17310) Integrate IndexStorage into a TableStorage API
Ivan Bessonov created IGNITE-17310: -- Summary: Integrate IndexStorage into a TableStorage API Key: IGNITE-17310 URL: https://issues.apache.org/jira/browse/IGNITE-17310 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov

As an endpoint, we need an interface that represents a single index storage for a single partition. But creating/destroying these storages is not as obvious from an API standpoint. When an index is created, storages should be created for every existing partition. And when a partition is created, index storages should be created for it as well. This complicates things a little bit, but, generally speaking, something like this could be a solution:
* CompletableFuture createIndex(indexConfiguration);
* CompletableFuture dropIndex(indexId);
* IndexMvStorage getIndexStorage(indexId, partitionId);

The build / rebuild API will be figured out later in another issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
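The proposed API could be prototyped with a trivial in-memory implementation along these lines. This is a sketch under the stated assumptions: IndexStorage is a placeholder, per-partition storages are created eagerly for every existing partition, and none of the names are final.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Placeholder for the per-partition index storage interface.
interface IndexStorage {}

// Toy in-memory model of index lifecycle management in a table storage:
// creating an index creates a storage for every existing partition.
class InMemoryTableStorage {
    private final Map<UUID, Map<Integer, IndexStorage>> indexes = new ConcurrentHashMap<>();
    private final int partitions;

    InMemoryTableStorage(int partitions) {
        this.partitions = partitions;
    }

    CompletableFuture<Void> createIndex(UUID indexId) {
        Map<Integer, IndexStorage> perPartition = new ConcurrentHashMap<>();
        for (int p = 0; p < partitions; p++) {
            perPartition.put(p, new IndexStorage() {}); // one storage per partition
        }
        indexes.put(indexId, perPartition);
        return CompletableFuture.completedFuture(null);
    }

    CompletableFuture<Void> dropIndex(UUID indexId) {
        indexes.remove(indexId);
        return CompletableFuture.completedFuture(null);
    }

    IndexStorage getIndexStorage(UUID indexId, int partitionId) {
        Map<Integer, IndexStorage> perPartition = indexes.get(indexId);
        return perPartition == null ? null : perPartition.get(partitionId);
    }
}
```

A real implementation would also have to handle the symmetric case the description mentions: when a new partition appears, storages for all existing indexes must be created for it.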
[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface
[ https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17308: --- Description: Currently, SortedIndexMvStorage is a very weird mixture of many things. Its contract is far from obvious and it's only used in tests as a part of a "reference implementation". Originally, it was implemented when the vision of the MV store wasn't fully solidified.

h3. API changes
* {{IndexRowEx}} should disappear. It was a quick and dirty solution. It should be replaced with {{InternalTuple}}, with the requirement that every internal tuple can be converted into an IEP-92 format.
* {{scan}} should not return rows, but only indexed rows and RowId instances. An index scan should NOT by itself filter out invalid rows; this will be performed outside of the scan.
* TxId / Timestamp parameters are no longer applicable, given that the index does not perform row validation.
* The partition filter should be removed as well. To simplify things, every partition will be indexed {+}independently{+}.
* {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for now. The former can be brought back in the future, while the latter makes no sense considering that indexes are not multiversioned.
* New methods, like {{update}} and {{remove}}, should be added to the API.

h3. New API for removed functions
* There should be a new entity on top of the partition and index stores. It updates indexes and filters scan queries. There's no point in fully designing it right now; all we need is working tests for now. Porting the current tests to the new API is up to a developer.

h3. Other

I would say that effective InternalTuple comparison is out of scope. We could just adapt the current test code somehow.

was: Currently, SortedIndexMvStorage is a very weird mixture of many things. Its contract is far from obvious and it's only used in tests as a part of a "reference implementation".
Originally, it was implemented when the vision of the MV store wasn't fully solidified. h3. API changes * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It should be replaced with {{InternalTuple}}, with the requirement that every internal tuple can be converted into an IEP-92 format. * {{scan}} should not return rows, but only indexed rows and RowId instances. An index scan should NOT by itself filter out invalid rows; this will be performed outside of the scan. * TxId / Timestamp parameters are no longer applicable, given that the index does not perform row validation. * The partition filter should be removed as well. To simplify things, every partition will be indexed {+}independently{+}. * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for now. The former can be brought back in the future, while the latter makes no sense considering that indexes are not multiversioned. * New methods, like {{update}} and {{remove}}, should be added to the API. h3. New API for removed functions * There should be a new entity on top of the partition and index stores. It updates indexes and filters scan queries. There's no point in fully designing it right now; all we need is working tests for now.

> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
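The revised scan contract (return indexed tuples plus RowId instances, with no validity filtering inside the scan) can be illustrated with a toy in-memory model. String stands in for InternalTuple and UUID for RowId; all names are illustrative, not the actual Ignite interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.UUID;

// Toy model of a sorted index per the revised contract: it stores
// (indexedTuple, rowId) pairs, scan returns every pair in range without
// filtering out invalid rows - validation happens in a layer above.
class SortedIndexSketch {
    private final NavigableMap<String, NavigableSet<UUID>> index = new TreeMap<>();

    void put(String tuple, UUID rowId) {
        index.computeIfAbsent(tuple, k -> new TreeSet<>()).add(rowId);
    }

    void remove(String tuple, UUID rowId) {
        NavigableSet<UUID> rows = index.get(tuple);
        if (rows != null) {
            rows.remove(rowId);
            if (rows.isEmpty()) {
                index.remove(tuple);
            }
        }
    }

    // Returns every (tuple, rowId) pair in [low, up], valid or not.
    List<Map.Entry<String, UUID>> scan(String low, String up) {
        List<Map.Entry<String, UUID>> result = new ArrayList<>();
        for (Map.Entry<String, NavigableSet<UUID>> e
                : index.subMap(low, true, up, true).entrySet()) {
            for (UUID rowId : e.getValue()) {
                result.add(Map.entry(e.getKey(), rowId));
            }
        }
        return result;
    }
}
```

The "new entity on top of partition and index store" would then take these raw pairs, resolve each RowId against the partition, and drop the rows that are invalid for the reader.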
[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface
[ https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17308: --- Description: Currently, SortedIndexMvStorage is a very weird mixture of many things. Its contract is far from obvious and it's only used in tests as a part of a "reference implementation". Originally, it was implemented when the vision of the MV store wasn't fully solidified.

h3. API changes
* {{IndexRowEx}} should disappear. It was a quick and dirty solution. It should be replaced with {{InternalTuple}}, with the requirement that every internal tuple can be converted into an IEP-92 format.
* {{scan}} should not return rows, but only indexed rows and RowId instances. An index scan should NOT by itself filter out invalid rows; this will be performed outside of the scan.
* TxId / Timestamp parameters are no longer applicable, given that the index does not perform row validation.
* The partition filter should be removed as well. To simplify things, every partition will be indexed {+}independently{+}.
* {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for now. The former can be brought back in the future, while the latter makes no sense considering that indexes are not multiversioned.

h3. New API for removed functions
* There should be a new entity on top of the partition and index stores. It updates indexes and filters scan queries. There's no point in fully designing it right now; all we need is working tests for now.

was: Currently, SortedIndexMvStorage is a very weird mixture of many things. Its contract is far from obvious and it's only used in tests as a part of a "reference implementation". Originally, it was implemented when the vision of the MV store wasn't fully solidified. h3. API changes * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It should be replaced with {{InternalTuple}}, with the requirement that every internal tuple can be converted into an IEP-92 format.
* {{scan}} should not return rows, but only indexed rows and RowId instances. An index scan should NOT by itself filter out invalid rows; this will be performed outside of the scan. * TxId / Timestamp parameters are no longer applicable, given that the index does not perform row validation. * The partition filter should be removed as well. To simplify things, every partition will be indexed {+}independently{+}. * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for now. The former can be brought back in the future, while the latter makes no sense considering that indexes are not multiversioned.

> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-17308) Revisit SortedIndexMvStorage interface
Ivan Bessonov created IGNITE-17308: -- Summary: Revisit SortedIndexMvStorage interface Key: IGNITE-17308 URL: https://issues.apache.org/jira/browse/IGNITE-17308 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov

Currently, SortedIndexMvStorage is a very weird mixture of many things. Its contract is far from obvious and it's only used in tests as a part of a "reference implementation". Originally, it was implemented when the vision of the MV store wasn't fully solidified.

h3. API changes
* {{IndexRowEx}} should disappear. It was a quick and dirty solution. It should be replaced with {{InternalTuple}}, with the requirement that every internal tuple can be converted into an IEP-92 format.
* {{scan}} should not return rows, but only indexed rows and RowId instances. An index scan should NOT by itself filter out invalid rows; this will be performed outside of the scan.
* TxId / Timestamp parameters are no longer applicable, given that the index does not perform row validation.
* The partition filter should be removed as well. To simplify things, every partition will be indexed {+}independently{+}.
* {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for now. The former can be brought back in the future, while the latter makes no sense considering that indexes are not multiversioned.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IGNITE-16156) Byte ordered index keys.
[ https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-16156. Resolution: Won't Fix Other data format will be used

> Byte ordered index keys.
> --
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
> Issue Type: Task
> Components: sql
> Reporter: Alexander Belyak
> Assignee: Alexander Belyak
> Priority: Major
> Labels: ignite-3
>
> To improve the speed of operations with indexes, Ignite can store keys in a byte-ordered format, so that the natural byte[] comparator is enough to scan it. Required features:
> 1) Write (almost) any data type. Must have: boolean, byte, short, int, long, float, double, bigint, bigdecimal, String, Date, Time, DateTime. Nice to have: byte[], bitset. Unlikely to have: timestamp with timezone.
> 2) Support null values for any columns. Nice to have: support for nullFirst/nullLast.
> 3) Write asc/desc ordering (in any combination of columns, for indexes like "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space used and speed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IGNITE-16105) Replace sorted index binary storage protocol
[ https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-16105. Resolution: Won't Fix IGNITE-17192 will be used instead > Replace sorted index binary storage protocol > > > Key: IGNITE-16105 > URL: https://issues.apache.org/jira/browse/IGNITE-16105 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column > values into byte arrays. This approach is not optimal for the following > reasons: > # Data is stored in RocksDB and we can't use its native lexicographic > comparator; we rely on a custom Java-based comparator that needs to > de-serialize all columns in order to compare them. This is bad > performance-wise, because Java-based comparators are slower and we need to > extract all column values; > # Range scans can't use the prefix seek operation from RocksDB, because > {{BinaryRow}} serialization is not stable: a serialized prefix of column values > will not be a prefix of the whole serialized row, because the format depends > on the columns being serialized; > # {{BinaryRow}} serialization is designed to store versioned row data and is > overall badly suited to the Sorted Index purposes; its API usage looks > awkward in this context. > We need to find a new serialization protocol that will (ideally) satisfy the > following requirements: > # It should be comparable lexicographically; > # It should support null values; > # It should support variable length columns (though this requirement can > probably be dropped); > # It should support both ascending and descending order for individual > columns; > # It should support all data types that {{BinaryRow}} uses. -- This message was sent by Atlassian Jira (v8.20.10#820010)
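For the lexicographic-comparability requirement above, a common technique for fixed-width integers is to flip the sign bit so that unsigned byte-wise comparison matches signed integer order, and to invert all bytes for descending columns. A sketch of that technique, not necessarily the format that was eventually chosen:

```java
import java.util.Arrays;

// Lexicographically comparable encoding for int columns: after flipping the
// sign bit, unsigned byte-wise comparison of the big-endian bytes matches
// signed integer order. Complementing every byte reverses the order, which
// gives descending columns.
class LexEncoding {
    static byte[] encodeAsc(int v) {
        int flipped = v ^ 0x80000000; // moves negatives below positives
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8), (byte) flipped
        };
    }

    static byte[] encodeDesc(int v) {
        byte[] b = encodeAsc(v);
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) ~b[i]; // complement reverses lexicographic order
        }
        return b;
    }

    static int compare(byte[] a, byte[] b) {
        // Plain unsigned byte comparison - exactly what RocksDB's native
        // bytewise comparator does.
        return Arrays.compareUnsigned(a, b);
    }
}
```

Because the encoding is a fixed-width big-endian prefix-stable format, a serialized prefix of the key columns is also a byte prefix of the full key, which is what makes RocksDB prefix seeks possible.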
[jira] [Resolved] (IGNITE-16079) Rename search and data keys for the Partition Storage
[ https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-16079. Resolution: Won't Fix > Rename search and data keys for the Partition Storage > - > > Key: IGNITE-16079 > URL: https://issues.apache.org/jira/browse/IGNITE-16079 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > There are currently the following classes in the {{PartitionStorage}} that > act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the > {{SortedIndexStorage}} interface hard to understand, because it stores > {{SearchRows}} as values. It is proposed to rename these classes: > {{SearchRow}} -> {{PartitionKey}} > {{DataRow}} -> {{PartitionData}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage
[ https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-16059. Resolution: Won't Fix > Add options to the "range" method in SortedIndexStorage > --- > > Key: IGNITE-16059 > URL: https://issues.apache.org/jira/browse/IGNITE-16059 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage] > declares the following API for the {{SortedIndexStorage#range}} method: > {code:java} > /** Exclude lower bound. */ > byte GREATER = 0; > > /** Include lower bound. */ > byte GREATER_OR_EQUAL = 1; > > /** Exclude upper bound. */ > byte LESS = 0; > > /** Include upper bound. */ > byte LESS_OR_EQUAL = 1 << 1; > /** > * Returns rows between the lower and upper bounds. > * Fills result rows with the fields specified in the projection set. > * > * @param low Lower bound of the scan. > * @param up Upper bound of the scan. > * @param scanBoundMask Scan bound mask (specifies how to treat rows equal > to the bounds: include or exclude). > * @param proj Set of the column IDs to fill result rows with. > */ > Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj); > {code} > The {{scanBoundMask}} flags are currently not implemented. This API should be > revised and implemented, if needed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
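For reference, the two inclusion flags in the IEP-74 API above occupy separate bits of the mask (bit 0 for the lower bound, bit 1 for the upper bound), so a scan implementation could combine and test them like this. The helper names are illustrative.

```java
// Sketch of how the IEP-74 scan bound mask composes: GREATER/LESS are the
// zero (exclusive) defaults, and the *_OR_EQUAL flags set one bit each.
class ScanBoundMask {
    static final byte GREATER = 0;            // exclude lower bound
    static final byte GREATER_OR_EQUAL = 1;   // include lower bound (bit 0)
    static final byte LESS = 0;               // exclude upper bound
    static final byte LESS_OR_EQUAL = 1 << 1; // include upper bound (bit 1)

    static boolean includeLower(byte mask) {
        return (mask & GREATER_OR_EQUAL) != 0;
    }

    static boolean includeUpper(byte mask) {
        return (mask & LESS_OR_EQUAL) != 0;
    }
}
```

So `GREATER_OR_EQUAL | LESS` describes a `[low, up)` scan and `GREATER | LESS_OR_EQUAL` describes `(low, up]`.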
[jira] [Updated] (IGNITE-17306) Speedup runtime classes compilation speed for configuration
[ https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17306: --- Description: There are a few places in presto that are too slow; we can easily optimize them. (Nothing will be committed if there's no visible difference in test duration.) was: There are a few places in presto that are too slow; we can easily optimize them. > Speedup runtime classes compilation speed for configuration > --- > > Key: IGNITE-17306 > URL: https://issues.apache.org/jira/browse/IGNITE-17306 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > There are a few places in presto that are too slow; we can easily optimize > them > (Nothing will be committed if there's no visible difference in test duration) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (IGNITE-17306) Speedup runtime classes compilation speed for configuration
[ https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov reassigned IGNITE-17306: -- Assignee: Ivan Bessonov > Speedup runtime classes compilation speed for configuration > --- > > Key: IGNITE-17306 > URL: https://issues.apache.org/jira/browse/IGNITE-17306 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > There are a few places in presto that are too slow; we can easily optimize > them -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-17306) Speedup runtime classes compilation speed for configuration
Ivan Bessonov created IGNITE-17306: -- Summary: Speedup runtime classes compilation speed for configuration Key: IGNITE-17306 URL: https://issues.apache.org/jira/browse/IGNITE-17306 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov There are a few places in presto that are too slow; we can easily optimize them -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-15908) Investigate index binary structure compatibility
[ https://issues.apache.org/jira/browse/IGNITE-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-15908: --- Epic Link: IGNITE-17304 > Investigate index binary structure compatibility > > > Key: IGNITE-15908 > URL: https://issues.apache.org/jira/browse/IGNITE-15908 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > Sorted Index Storage has a binary storage format that is subject to change in > the future. Though the index schema is immutable and any change to it leads to > the index being rebuilt, it should be possible to update the storage format > without rebuilding. It means that there should be some kind of a versioning > mechanism, so that the {{IndexKey}} serialization format can be changed in a > backwards-compatible way. -- This message was sent by Atlassian Jira (v8.20.10#820010)
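One simple shape for such a versioning mechanism is a format-version byte prefixed to every serialized key, with the reader dispatching on it. This is a purely illustrative sketch, not the actual IndexKey format.

```java
import java.nio.ByteBuffer;

// Version-prefixed key serialization: every serialized key starts with a
// format-version byte, so the format can evolve (new versions added to the
// switch) without rebuilding indexes written in an older format.
class VersionedKeySerializer {
    static final byte FORMAT_V1 = 1;

    static byte[] serializeV1(long key) {
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put(FORMAT_V1) // version prefix
                .putLong(key)   // payload in the v1 layout
                .array();
    }

    static long deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte version = buf.get();
        switch (version) {
            case FORMAT_V1:
                return buf.getLong();
            default:
                throw new IllegalArgumentException("Unknown format version: " + version);
        }
    }
}
```

Note the trade-off for sorted keys: a leading version byte participates in lexicographic ordering, so keys of different versions would sort into separate ranges; a real design would have to account for that.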
[jira] [Updated] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage
[ https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16059: --- Epic Link: IGNITE-17304 > Add options to the "range" method in SortedIndexStorage > --- > > Key: IGNITE-16059 > URL: https://issues.apache.org/jira/browse/IGNITE-16059 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage] > declares the following API for the {{SortedIndexStorage#range}} method: > {code:java} > /** Exclude lower bound. */ > byte GREATER = 0; > > /** Include lower bound. */ > byte GREATER_OR_EQUAL = 1; > > /** Exclude upper bound. */ > byte LESS = 0; > > /** Include upper bound. */ > byte LESS_OR_EQUAL = 1 << 1; > /** > * Returns rows between the lower and upper bounds. > * Fills result rows with the fields specified in the projection set. > * > * @param low Lower bound of the scan. > * @param up Upper bound of the scan. > * @param scanBoundMask Scan bound mask (specifies how to treat rows equal > to the bounds: include or exclude). > * @param proj Set of the column IDs to fill result rows with. > */ > Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj); > {code} > The {{scanBoundMask}} flags are currently not implemented. This API should be > revised and implemented, if needed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16079) Rename search and data keys for the Partition Storage
[ https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16079: --- Epic Link: IGNITE-17304 > Rename search and data keys for the Partition Storage > - > > Key: IGNITE-16079 > URL: https://issues.apache.org/jira/browse/IGNITE-16079 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > There are currently the following classes in the {{PartitionStorage}} that > act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the > {{SortedIndexStorage}} interface hard to understand, because it stores > {{SearchRows}} as values. It is proposed to rename these classes: > {{SearchRow}} -> {{PartitionKey}} > {{DataRow}} -> {{PartitionData}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16105) Replace sorted index binary storage protocol
[ https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16105: --- Epic Link: IGNITE-17304 > Replace sorted index binary storage protocol > > > Key: IGNITE-16105 > URL: https://issues.apache.org/jira/browse/IGNITE-16105 > Project: Ignite > Issue Type: Task >Reporter: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > > Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column > values into byte arrays. This approach is not optimal for the following > reasons: > # Data is stored in RocksDB and we can't use its native lexicographic > comparator; we rely on a custom Java-based comparator that needs to > de-serialize all columns in order to compare them. This is bad > performance-wise, because Java-based comparators are slower and we need to > extract all column values; > # Range scans can't use the prefix seek operation from RocksDB, because > {{BinaryRow}} serialization is not stable: a serialized prefix of column values > will not be a prefix of the whole serialized row, because the format depends > on the columns being serialized; > # {{BinaryRow}} serialization is designed to store versioned row data and is > overall badly suited to the Sorted Index purposes; its API usage looks > awkward in this context. > We need to find a new serialization protocol that will (ideally) satisfy the > following requirements: > # It should be comparable lexicographically; > # It should support null values; > # It should support variable length columns (though this requirement can > probably be dropped); > # It should support both ascending and descending order for individual > columns; > # It should support all data types that {{BinaryRow}} uses. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16156) Byte ordered index keys.
[ https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16156: --- Epic Link: IGNITE-17304 > Byte ordered index keys. > > > Key: IGNITE-16156 > URL: https://issues.apache.org/jira/browse/IGNITE-16156 > Project: Ignite > Issue Type: Task > Components: sql >Reporter: Alexander Belyak >Assignee: Alexander Belyak >Priority: Major > Labels: ignite-3 > > To improve the speed of operations with indexes, Ignite can store keys in a > byte-ordered format, so that the natural byte[] comparator is enough to scan it. > Required features: > 1) Write (almost) any data type. > Must have: boolean, byte, short, int, long, float, double, bigint, > bigdecimal, String, Date, Time, DateTime. > Nice to have: byte[], bitset > Unlikely to have: timestamp with timezone > 2) Support null values for any columns. Nice to have: support for > nullFirst/nullLast > 3) Write asc/desc ordering (in any combination of columns, for indexes like > "col1 asc, col2 desc, col3 asc"). > Non-functional requirements: space used and speed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14937) Index schema & Index management integration
[ https://issues.apache.org/jira/browse/IGNITE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14937: --- Epic Link: IGNITE-17304 > Index schema & Index management integration > --- > > Key: IGNITE-14937 > URL: https://issues.apache.org/jira/browse/IGNITE-14937 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > > The public index schema (required indexes) and the current index state on the > cluster are different. > We have to track it, store it and provide the actual index schema state to any > component: select queries, DDL queries, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14936) Benchmark sorted index scan vs table's partitions scan
[ https://issues.apache.org/jira/browse/IGNITE-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14936: --- Epic Link: IGNITE-17304 > Benchmark sorted index scan vs table's partitions scan > -- > > Key: IGNITE-14936 > URL: https://issues.apache.org/jira/browse/IGNITE-14936 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > > We have to decide which data structures are used for the PK and table scan. > Possible cases: > - table partitions sorted by plain bytes/hash (in fact: unsorted); > - table partitions sorted by PK columns; > - PK sorted index (one store for all partitions on the node). > All cases have pros and cons. The choice should be based on benchmarks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14940) Investigation parallel index scan
[ https://issues.apache.org/jira/browse/IGNITE-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14940: --- Epic Link: IGNITE-17304 > Investigation parallel index scan > - > > Key: IGNITE-14940 > URL: https://issues.apache.org/jira/browse/IGNITE-14940 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > > Motivation: the 2.x version implements {{queryParallelism}} by creating index > segments. Each segment contains a subset of partitions. This approach has > several shortcomings: > - index scan parallelism cannot be changed / scaled at runtime; > - we always have to scan all segments (it looks like a virtual MapNode for the query); > - many index storages for one logical index. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14938) Introduce persistence store for the indexes states on cluster
[ https://issues.apache.org/jira/browse/IGNITE-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14938: --- Epic Link: IGNITE-17304 > Introduce persistence store for the indexes states on cluster > - > > Key: IGNITE-14938 > URL: https://issues.apache.org/jira/browse/IGNITE-14938 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > > Includes: > - building state progress; > - ready to scan / building; > - rebuild index; > - support node restart and index recovery. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14939) Tests coverage for index rebuild and recovery scenarios
[ https://issues.apache.org/jira/browse/IGNITE-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14939: --- Epic Link: IGNITE-17304 > Tests coverage for index rebuild and recovery scenarios > --- > > Key: IGNITE-14939 > URL: https://issues.apache.org/jira/browse/IGNITE-14939 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > > Test cases from version 2.x must be analyzed and ported to 3.0. > See in 2.x {{AbstractRebuildIndexTest}} and the children. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16199) Implements index build/rebuild
[ https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16199: --- Epic Link: IGNITE-17304 > Implements index build/rebuild > --- > > Key: IGNITE-16199 > URL: https://issues.apache.org/jira/browse/IGNITE-16199 > Project: Ignite > Issue Type: Improvement > Components: sql >Affects Versions: 3.0.0-alpha3 >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The index must be built on existing table data: scan the table's data and build the > index. > Currently, only updating the index on table updates is implemented. > The build and rebuild tasks may be split. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16196) Supports index rename
[ https://issues.apache.org/jira/browse/IGNITE-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16196: --- Epic Link: IGNITE-17304 > Supports index rename > - > > Key: IGNITE-16196 > URL: https://issues.apache.org/jira/browse/IGNITE-16196 > Project: Ignite > Issue Type: Improvement > Components: sql >Affects Versions: 3.0.0-alpha3 >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > Need to support index rename: > ALTER INDEX [ IF EXISTS ] RENAME TO -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage
[ https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16265: --- Epic Link: IGNITE-17304 > Integration SQL Index and data storage > -- > > Key: IGNITE-16265 > URL: https://issues.apache.org/jira/browse/IGNITE-16265 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Yury Gerzhedovich >Assignee: Konstantin Orlov >Priority: Major > Labels: ignite-3 > > Need to think about the point of integration between data modification > (put/remove/amend) and updating the data in SQL indexes. > As a first version of the integration, let's update the index on commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16202) Supports transactions by index
[ https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16202: --- Epic Link: IGNITE-17304 > Supports transactions by index > -- > > Key: IGNITE-16202 > URL: https://issues.apache.org/jira/browse/IGNITE-16202 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > Indexes must support transaction protocol. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-14925. Resolution: Duplicate > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: New Feature > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14925: --- Epic Link: IGNITE-17304 > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: New Feature > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562614#comment-17562614 ] Ivan Bessonov commented on IGNITE-14925: Replaced with EPIC > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: New Feature > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-17304) SQL indexes 3.0 epic
Ivan Bessonov created IGNITE-17304: -- Summary: SQL indexes 3.0 epic Key: IGNITE-17304 URL: https://issues.apache.org/jira/browse/IGNITE-17304 Project: Ignite Issue Type: Epic Reporter: Ivan Bessonov Ignite 3.x requires SQL indexes, just like any other database. This epic is a collection of issues related to index design and implementation. This includes: * index configuration * index lifecycle * index storage * index integration into SQL queries -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14925: --- Issue Type: New Feature (was: Epic) > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: New Feature > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage
[ https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16265: --- Epic Link: (was: IGNITE-14925) > Integration SQL Index and data storage > -- > > Key: IGNITE-16265 > URL: https://issues.apache.org/jira/browse/IGNITE-16265 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Yury Gerzhedovich >Assignee: Konstantin Orlov >Priority: Major > Labels: ignite-3 > > Need to think about the point of integration between data modification > (put/remove/amend) and updating the data in SQL indexes. > As a first version of the integration, let's update the index on commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16199) Implements index build/rebuild
[ https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16199: --- Epic Link: (was: IGNITE-14925) > Implements index build/rebuild > --- > > Key: IGNITE-16199 > URL: https://issues.apache.org/jira/browse/IGNITE-16199 > Project: Ignite > Issue Type: Improvement > Components: sql >Affects Versions: 3.0.0-alpha3 >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The index must be built on existing table data: scan the table's data and build the > index. > Currently, only updating the index on table updates is implemented. > The build and rebuild tasks may be split. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16202) Supports transactions by index
[ https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16202: --- Epic Link: (was: IGNITE-14925) > Supports transactions by index > -- > > Key: IGNITE-16202 > URL: https://issues.apache.org/jira/browse/IGNITE-16202 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > Indexes must support transaction protocol. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16199) Implements index build/rebuild
[ https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16199: --- Epic Link: IGNITE-14925 > Implements index build/rebuild > --- > > Key: IGNITE-16199 > URL: https://issues.apache.org/jira/browse/IGNITE-16199 > Project: Ignite > Issue Type: Improvement > Components: sql >Affects Versions: 3.0.0-alpha3 >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The index must be built on existing table data: scan the table's data and build the > index. > Currently, only updating the index on table updates is implemented. > The build and rebuild tasks may be split. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage
[ https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16265: --- Epic Link: IGNITE-14925 > Integration SQL Index and data storage > -- > > Key: IGNITE-16265 > URL: https://issues.apache.org/jira/browse/IGNITE-16265 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Yury Gerzhedovich >Assignee: Konstantin Orlov >Priority: Major > Labels: ignite-3 > > Need to think about the point of integration between data modification > (put/remove/amend) and updating the data in SQL indexes. > As a first version of the integration, let's update the index on commit. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16202) Supports transactions by index
[ https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16202: --- Epic Link: IGNITE-14925 > Supports transactions by index > -- > > Key: IGNITE-16202 > URL: https://issues.apache.org/jira/browse/IGNITE-16202 > Project: Ignite > Issue Type: Improvement > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > Indexes must support transaction protocol. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14925: --- Epic Name: Sorted SQL indexes > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: Epic > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-14925) Sorted indexes engine
[ https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14925: --- Issue Type: Epic (was: New Feature) > Sorted indexes engine > - > > Key: IGNITE-14925 > URL: https://issues.apache.org/jira/browse/IGNITE-14925 > Project: Ignite > Issue Type: Epic > Components: sql >Reporter: Taras Ledkov >Priority: Major > Labels: ignite-3 > > The umbrella ticket to track improvements and issues related to design and > development sorted index engine for Ignite 3.0. > Feature branch: > [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches
[ https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17272: --- Component/s: cache > Logical recovery works incorrectly for encrypted caches > --- > > Key: IGNITE-17272 > URL: https://issues.apache.org/jira/browse/IGNITE-17272 > Project: Ignite > Issue Type: Bug > Components: cache >Affects Versions: 2.13 >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Fix For: 2.14 > > Time Spent: 20m > Remaining Estimate: 0h > > When encryption is enabled for a particular cache, its WAL records get > encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type > is considered a {{PHYSICAL}} record, which leads to such records being > omitted during logical recovery regardless of the fact that it can contain > logical records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches
[ https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17272: --- Affects Version/s: 2.13 > Logical recovery works incorrectly for encrypted caches > --- > > Key: IGNITE-17272 > URL: https://issues.apache.org/jira/browse/IGNITE-17272 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.13 >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Fix For: 2.14 > > Time Spent: 20m > Remaining Estimate: 0h > > When encryption is enabled for a particular cache, its WAL records get > encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type > is considered a {{PHYSICAL}} record, which leads to such records being > omitted during logical recovery regardless of the fact that it can contain > logical records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches
[ https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561364#comment-17561364 ] Ivan Bessonov commented on IGNITE-17272: Looks good to me, thank you! I'll merge it to master > Logical recovery works incorrectly for encrypted caches > --- > > Key: IGNITE-17272 > URL: https://issues.apache.org/jira/browse/IGNITE-17272 > Project: Ignite > Issue Type: Bug >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When encryption is enabled for a particular cache, its WAL records get > encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type > is considered a {{PHYSICAL}} record, which leads to such records being > omitted during logical recovery regardless of the fact that it can contain > logical records. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel
[ https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17283: --- Ignite Flags: (was: Docs Required,Release Notes Required) > ItCmgRaftServiceTest should start Raft groups in parallel > - > > Key: IGNITE-17283 > URL: https://issues.apache.org/jira/browse/IGNITE-17283 > Project: Ignite > Issue Type: Improvement >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Minor > Labels: ignite-3 > Fix For: 3.0.0-alpha6 > > Time Spent: 20m > Remaining Estimate: 0h > > ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the > first group waits for other members to appear before it times out. This leads > to this test running for quite a long time. It is proposed to start these > groups in parallel. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel
[ https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561099#comment-17561099 ] Ivan Bessonov commented on IGNITE-17283: Looks good, thank you for the improvement! > ItCmgRaftServiceTest should start Raft groups in parallel > - > > Key: IGNITE-17283 > URL: https://issues.apache.org/jira/browse/IGNITE-17283 > Project: Ignite > Issue Type: Improvement >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Minor > Labels: ignite-3 > Time Spent: 10m > Remaining Estimate: 0h > > ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the > first group waits for other members to appear before it times out. This leads > to this test running for quite a long time. It is proposed to start these > groups in parallel. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IGNITE-17278) TableManager#directTableIds can't be implemented effectively
Ivan Bessonov created IGNITE-17278: -- Summary: TableManager#directTableIds can't be implemented effectively Key: IGNITE-17278 URL: https://issues.apache.org/jira/browse/IGNITE-17278 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Assignee: Ivan Bessonov I propose adding a special method "internalIds" to the direct proxy, so that we never need to read all tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer
Title: Message Title Ivan Bessonov updated an issue Ignite / IGNITE-16913 Provide effective way to write BinaryRow into byte buffer Change By: Ivan Bessonov Epic Link: IGNITE-16923 Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages
Title: Message Title Ivan Bessonov updated an issue Ignite / IGNITE-16655 Volatile RAFT log for pure in-memory storages Change By: Ivan Bessonov h3. Original issue description For in-memory storage, Raft logging can be optimized: we don't need it to be active while the topology is stable. Each write can go directly to the in-memory storage at a much lower cost than synchronizing it with disk, so it is possible to avoid writing the Raft log. As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - there is no need to keep a long Raft log for historical rebalancing purposes. So we need to implement an API for the Raft component that enables configuration of the Raft logging process. h3. More detailed description Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected: * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with an "ack" or fail; otherwise the log entry should be preserved. * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved. It feels like the second case is just a special case of the first one - we can't truncate the log until we receive all acks, and we can't receive an ack from the joining node until it finishes its rebalancing procedure. So it all comes down to aggressive log truncation to keep the log short. The preserved log can be quite big in reality, so a disk offloading mechanism must be available. The easiest way to achieve this is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on a stable topology. 
The absence of a WAL is not an issue - the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile. To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway. * Potentially, we could use a volatile page memory region for this purpose, since it already has good control over the amount of memory used. But memory overflow would have to be handled carefully; usually it's treated as an error and might even cause a node failure.
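The overflow scheme described above (a bounded volatile structure that spills to RocksDB only when full) can be sketched as a small Java class. This is a hypothetical illustration, not an Ignite API: the SpillTarget interface stands in for the RocksDB instance that would be opened with WAL disabled, and all names are made up:

```java
import java.util.ArrayDeque;

// Hypothetical sketch of the "ring buffer with disk overflow" idea: log
// entries stay in a bounded in-memory buffer, and only when it overflows is
// the oldest entry handed to a spill target (in the proposal, a RocksDB
// instance with the WAL disabled).
public class VolatileLogBuffer {
    /** Stand-in for the on-disk overflow store; not a real Ignite interface. */
    public interface SpillTarget {
        void append(long index, byte[] entry);
    }

    private final ArrayDeque<byte[]> buffer = new ArrayDeque<>();
    private final int capacity;
    private final SpillTarget spill;
    private long firstBufferedIndex; // raft index of the oldest buffered entry

    public VolatileLogBuffer(int capacity, SpillTarget spill) {
        this.capacity = capacity;
        this.spill = spill;
        this.firstBufferedIndex = 0;
    }

    /** Appends an entry; the oldest entry is spilled to disk only on overflow. */
    public void append(byte[] entry) {
        if (buffer.size() == capacity) {
            spill.append(firstBufferedIndex, buffer.pollFirst());
            firstBufferedIndex++;
        }
        buffer.addLast(entry);
    }

    public int buffered() {
        return buffer.size();
    }
}
```

On stable topology the buffer is truncated as acks arrive, so the spill target sees little or no traffic; a slow follower or a joining node simply keeps the truncation point back, which is what eventually forces entries onto disk.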
[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages
Title: Message Title Ivan Bessonov updated an issue Ignite / IGNITE-16655 Volatile RAFT log for pure in-memory storages Change By: Ivan Bessonov h3. Original issue description For in-memory storage, Raft logging can be optimized: we don't need it to be active while the topology is stable. Each write can go directly to the in-memory storage at a much lower cost than synchronizing it with disk, so it is possible to avoid writing the Raft log. As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - there is no need to keep a long Raft log for historical rebalancing purposes. So we need to implement an API for the Raft component that enables configuration of the Raft logging process. h3. More detailed description Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected: * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with an "ack" or fail; otherwise the log entry should be preserved. * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved. It feels like the second case is just a special case of the first one - we can't truncate the log until we receive all acks, and we can't receive an ack from the joining node until it finishes its rebalancing procedure. So it all comes down to aggressive log truncation to keep the log short. The preserved log can be quite big in reality, so a disk offloading mechanism must be available. The easiest way to achieve this is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on a stable topology. 
The absence of a WAL is not an issue - the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile. To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.
[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages
Title: Message Title Ivan Bessonov updated an issue Ignite / IGNITE-16655 Volatile RAFT log for pure in-memory storages Change By: Ivan Bessonov h3. Original issue description For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk so it is possible to avoid writing Raft log.As nodes don't have any state and always join cluster clean we always need to transfer full snapshot during rebalancing - no need to keep long Raft log for historical rebalancing purposes.So we need to implement API for Raft component enabling configuration of Raft logging process. h3. More detailed description Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages
Title: Message Title Ivan Bessonov updated an issue Ignite / IGNITE-16655 Volatile RAFT log for pure in-memory storages Change By: Ivan Bessonov Summary: Raft Volatile RAFT log improvements for pure in-memory storages Add Comment This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)
[jira] [Updated] (IGNITE-17230) Support split-file page store
[ https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17230: --- Description: *Notes* Description may not be complete. *Goal* To implement a new checkpoint (described in IGNITE-15818), we will introduce a new entity {*}DeltaFilePageStore{*}, which will be created for each partition at each checkpoint and removed after merging with the *FilePageStore* (the main partition file) using the compacter. *DeltaFilePageStore* will consist of: * Header (may be updated in the course of implementation): ** Allocation *pageIdx* - *pageIdx* of the last created page; * Sorted list of *pageIdx* - allows a binary search to find the file offset for a {*}pageId -> pageIdx{*}; * Page content - sorted by {*}pageIdx{*}. What will change for {*}FilePageStore{*}: * A list of *DeltaFilePageStore* instances will be added (from the newest to the oldest by the time of creation); * Allocation index (pageIdx of the last created page) - it will be logical and contained in the header of {*}FilePageStore{*}. At node start, it will be read from the header of *FilePageStore* or obtained from the first *DeltaFilePageStore* (the newest one). How pages will be read by {*}pageId -> pageIdx{*}: * Query the *DeltaFilePageStore* instances in order from the newest to the oldest; * If not found, then we read the page from the *FilePageStore* itself. *Some implementation notes* * Format of the file name for the *DeltaFilePageStore* is *part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit is the partition identifier, and the second is the serial number of the delta file for this partition; * Before creating {*}part-1-delta-3.bin{*}, a temporary file *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, then renamed to {*}part-1-delta-3.bin{*}. was: *Notes* Description may not be complete. 
*Goal* To implement a new checkpoint (described in IGNITE-15818), we will introduce a new entity {*}DeltaFilePageStore{*}, which will be created for each partition at each checkpoint and removed after merging with the *FilePageStore* (the main partition file) using the compacter. *DeltaFilePageStore* will consist of: * Header (may be updated in the course of implementation): ** Allocation *pageIdx* - *pageIdx* of the last created page; * Sorted list of *pageIds* - allows a binary search to find the file offset for a {*}pageId -> pageIdx{*}; * Page content - sorted by {*}pageIdx{*}. What will change for {*}FilePageStore{*}: * A list of *DeltaFilePageStore* instances will be added (from the newest to the oldest by the time of creation); * Allocation index (pageIdx of the last created page) - it will be logical and contained in the header of {*}FilePageStore{*}. At node start, it will be read from the header of *FilePageStore* or obtained from the first *DeltaFilePageStore* (the newest one). How pages will be read by {*}pageId -> pageIdx{*}: * Query the *DeltaFilePageStore* instances in order from the newest to the oldest; * If not found, then we read the page from the *FilePageStore* itself. *Some implementation notes* * Format of the file name for the *DeltaFilePageStore* is *part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit is the partition identifier, and the second is the serial number of the delta file for this partition; * Before creating {*}part-1-delta-3.bin{*}, a temporary file *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, then renamed to {*}part-1-delta-3.bin{*}. > Support split-file page store > > > Key: IGNITE-17230 > URL: https://issues.apache.org/jira/browse/IGNITE-17230 > Project: Ignite > Issue Type: Task >Reporter: Kirill Tkalenko >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha6 > > > *Notes* > Description may not be complete. 
> *Goal* > To implement a new checkpoint (described in IGNITE-15818), we will introduce > a new entity {*}DeltaFilePageStore{*}, which will be created for each > partition at each checkpoint and removed after merging with the > *FilePageStore* (the main partition file) using the compacter. > *DeltaFilePageStore* will consist of: > * Header (may be updated in the course of implementation): > ** Allocation *pageIdx* - *pageIdx* of the last created page; > * Sorted list of *pageIdx* - allows a binary search to find the file offset > for a {*}pageId -> pageIdx{*}; > * Page content - sorted by {*}pageIdx{*}. > What will change for {*}FilePageStore{*}: > * A list of *DeltaFilePageStore* instances will be added (from the newest to the > oldest by the time of creation); > * Allocation index (pageIdx of the last created page) - it will be logical > and contained in the header of {*}Fi
[jira] [Updated] (IGNITE-17230) Support splt-file page store
[ https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17230: --- Description: *Notes* Description may not be complete. *Goal* To implement a new checkpoint (described in IGNITE-15818), we will introduce a new entity {*}DelataFilePageStore{*}, which will be created for each partition at each checkpoint and removed after merging with the *FilePageStore* (the main partition file) using the compacter. *DelataFilePageStore* will consist of: * Header (maybe updated in the course of implementation): ** Allocation *pageIdx* - *pageIdx* of the last created page; * Sorted list of *pageIds* - allows a binary search to find the file offset for an {*}pageId -> pageIdx{*}; * Page content - sorted by {*}pageIdx{*}. What will change for {*}FilePageStore{*}: * List of class *DelataFilePageStore* will be added (from the newest to the oldest by the time of creation); * Allocation index (pageIdx of the last created page) - it will be logical and contained in the header of {*}FilePageStore{*}. At node start, it will be read from the header of *FilePageStore* or obtained from the first *DelataFilePageStore* (the newest one). How pages will be read by {*}pageId -> pageIdx{*}: * Interrogates the class *DelataFilePageStore* in order from the newest to the oldest; * If not found, then we read page from the *FilePageStore* itself. *Some implementation notes* * Format of the file name for the *DelataFilePageStore* is *part-%d-delta-%d.bin* for example *part-1-delta-3.bin* where the first digit is the partition identifier, and the second is the serial number of the delta file for this partition; * Before creating {*}part-1-delta-3.bin{*}, a temporary file *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, then renamed to {*}part-1-delta-3.bin{*}. was: *Notes* Description may not be complete. 
*Goal* To implement a new checkpoint (described in IGNITE-15818), we will introduce a new entity *DelataFilePageStore*, which will be created for each partition at each checkpoint and removed after merging with the *FilePageStore* (the main partition file) using the compacter. *DelataFilePageStore* will consist of: * Header (maybe updated in the course of implementation): ** Allocation *pageIdx* - *pageIdx* of the last created page; * Sorted list of *pageIdx* - allows a binary search to find the file offset for an *pageId -> pageIdx*; * Page content - sorted by *pageIdx*. What will change for *FilePageStore*: * List of class *DelataFilePageStore* will be added (from the newest to the oldest by the time of creation); * Allocation index (pageIdx of the last created page) - it will be logical and contained in the header of *FilePageStore*. At node start, it will be read from the header of *FilePageStore* or obtained from the first *DelataFilePageStore* (the newest one). How pages will be read by *pageId -> pageIdx*: * Interrogates the class *DelataFilePageStore* in order from the newest to the oldest; * If not found, then we read page from the *FilePageStore* itself. *Some implementation notes* * Format of the file name for the *DelataFilePageStore* is *part-%d-delta-%d.bin* for example *part-1-delta-3.bin* where the first digit is the partition identifier, and the second is the serial number of the delta file for this partition; * Before creating *part-1-delta-3.bin*, a temporary file *part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, then renamed to *part-1-delta-3.bin*. > Support splt-file page store > > > Key: IGNITE-17230 > URL: https://issues.apache.org/jira/browse/IGNITE-17230 > Project: Ignite > Issue Type: Task >Reporter: Kirill Tkalenko >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha6 > > > *Notes* > Description may not be complete. 
> *Goal* > To implement a new checkpoint (described in IGNITE-15818), we will introduce > a new entity {*}DelataFilePageStore{*}, which will be created for each > partition at each checkpoint and removed after merging with the > *FilePageStore* (the main partition file) using the compacter. > *DelataFilePageStore* will consist of: > * Header (maybe updated in the course of implementation): > ** Allocation *pageIdx* - *pageIdx* of the last created page; > * Sorted list of *pageIds* - allows a binary search to find the file offset > for an {*}pageId -> pageIdx{*}; > * Page content - sorted by {*}pageIdx{*}. > What will change for {*}FilePageStore{*}: > * List of class *DelataFilePageStore* will be added (from the newest to the > oldest by the time of creation); > * Allocation index (pageIdx of the last created page) - it will be logical > and contained in the header of {*}FilePageStore{*}. At node start, it will be
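The delta-file read path described above can be sketched as follows. This is a minimal, self-contained model with illustrative simplified types (the real FilePageStore/DeltaFilePageStore signatures may differ):

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of one delta file: a sorted list of page indexes plus page content. */
class DeltaFilePageStore {
    final int[] sortedPageIdxs; // sorted list of page indexes contained in this delta file

    DeltaFilePageStore(int[] sortedPageIdxs) {
        this.sortedPageIdxs = sortedPageIdxs;
    }

    /** Binary search: returns the file offset of the page, or -1 if the page is absent. */
    long offset(int pageIdx, int pageSize) {
        int pos = Arrays.binarySearch(sortedPageIdxs, pageIdx);
        return pos >= 0 ? (long) pos * pageSize : -1;
    }
}

/** Simplified model of the main partition file with its delta chain. */
class FilePageStore {
    // Delta stores ordered from the newest to the oldest by creation time.
    final List<DeltaFilePageStore> deltas;

    FilePageStore(List<DeltaFilePageStore> deltas) {
        this.deltas = deltas;
    }

    /** Resolve where a page must be read from: the delta index, or -1 for the main file. */
    int resolve(int pageIdx) {
        for (int i = 0; i < deltas.size(); i++) {
            if (deltas.get(i).offset(pageIdx, 4096) >= 0)
                return i; // found in the i-th delta file (newest first)
        }
        return -1; // fall back to the main partition file
    }
}
```

Note that the newest delta wins when several delta files contain the same page, which is why the list is walked from newest to oldest.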
[jira] [Commented] (IGNITE-17199) Improve the usability of the abstract configuration interface
[ https://issues.apache.org/jira/browse/IGNITE-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556722#comment-17556722 ] Ivan Bessonov commented on IGNITE-17199: [~ktkale...@gridgain.com] I don't think that improving something here is necessary. Wildcard types are an integral part of the Java type system; they're not a bad thing. Over-engineering everything because of several "<?>" occurrences in code won't make the product better IMO.

> Improve the usability of the abstract configuration interface
> Key: IGNITE-17199
> URL: https://issues.apache.org/jira/browse/IGNITE-17199
> Project: Ignite
> Issue Type: Improvement
> Reporter: Kirill Tkalenko
> Priority: Major
> Labels: iep-55, ignite-3
> Fix For: 3.0.0-alpha6
>
> *Problem*
> Consider an example of generating configuration interfaces (*Configuration*) for an abstract configuration.
> Configuration schemas:
> {code:java}
> @AbstractConfiguration
> public class BaseConfigurationSchema {
>     @Value
>     public int size;
> }
>
> @Config
> public class VolatileConfigurationSchema extends BaseConfigurationSchema {
>     @Value
>     public double evictionThreshold;
> }
> {code}
> Configuration interfaces:
> {code:java}
> public interface BaseConfiguration<VIEWT extends BaseView, CHANGET extends BaseChange> extends ConfigurationTree<VIEWT, CHANGET> {
>     ConfigurationValue<Integer> size();
> }
>
> public interface VolatileConfiguration extends BaseConfiguration<VolatileView, VolatileChange> {
>     ConfigurationValue<Double> evictionThreshold();
> }
> {code}
> This implementation allows us to work with the inheritors of the abstract configuration as with a regular configuration (as if *VolatileConfigurationSchema* did not extend *BaseConfigurationSchema*), but when working with the abstract configuration itself, it creates inconvenience.
> For example, to get a view of the abstract configuration, we will need to write the following code:
> {code:java}
> BaseConfiguration<?, ?> baseConfig0 = ...;
> BaseConfiguration<?, ?> baseConfig1 = ...;
>
> BaseView baseView0 = (BasePageMemoryDataRegionView) baseConfig0.value();
> BaseView baseView1 = baseConfig1.value();
> {code}
> This is not convenient; I would like us to be able to work with it in the same way as with *VolatileConfiguration*.
> *Possible implementations*
> * The simplest is to leave it as is;
> * Create an additional configuration interface similar to *BaseConfiguration*, for example *BaseConfigurationTree*, which would be extended by *BaseConfiguration* and all its inheritors like *VolatileConfiguration*. There may then be confusion about whether to use *BaseConfiguration* or *BaseConfigurationTree*, so we need to decide how to name such an interface:
> ** *BaseConfigurationTree*;
> ** *AbstractBaseConfigurationTree*;
> ** other.
-- This message was sent by Atlassian Jira (v8.20.7#820007)
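The second option above can be illustrated with a self-contained sketch. All type names here are illustrative stand-ins, not the real generated Ignite configuration API; the point is only that a common supertype lets callers read base properties through a wildcard without casts:

```java
// Illustrative stand-ins for the generated view types.
interface BaseView { int size(); }
interface VolatileView extends BaseView { double evictionThreshold(); }

// Hypothetical common supertype extended by BaseConfiguration and all inheritors.
interface BaseConfigurationTree<VIEWT extends BaseView> {
    VIEWT value();
}

interface VolatileConfiguration extends BaseConfigurationTree<VolatileView> { }

class Demo {
    /** No cast needed: a wildcard capture of VIEWT is still a BaseView. */
    static int readSize(BaseConfigurationTree<?> cfg) {
        return cfg.value().size();
    }
}
```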
[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS
[ https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17077: --- Description: Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for prerequisites.

h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and "getUpdateIndex" methods (names might be different).
* The first one is invoked at the end of every write command, with the RAFT commit index being passed as a parameter. This is done right before releasing the checkpoint read lock (or whatever name we come up with). More on that later.
* The second one is invoked at the beginning of every write command to validate that updates don't come out of order or with gaps. This is the way to guarantee that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely random.
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
Some nuances:
* The mismatch exception must be thrown before any data modifications. The storage content must remain intact, otherwise we'll just break it.
* The case above is the simplest one - there's a single "atomic" storage update. Generally speaking, we can't or sometimes don't want to work this way. Examples of operations where atomicity this strict is not required:
** Batch insert/update from a transaction.
** A transaction commit might have a huge number of row ids; we could exhaust memory while committing.
* If we split a write operation into several operations, we should externally guarantee their idempotence.
"setUpdateIndex" should be at the end of the last "atomic" operation, so that the last command can be safely reapplied.

h2. Implementation

The "set" method could write the value directly into the partition's meta page. This *will* work, but it's not quite optimal. The optimal solution is tightly coupled with the way checkpoints should work. This may not be the right place to describe the issue, but I'll do it nonetheless; it'll probably get split into another issue one day. There's a simple way to touch every meta page only once per checkpoint: do it while holding the checkpoint write lock. This way the data is consistent. But this solution is equally *bad*, because it forces us to perform page manipulations under the write lock. Flushing freelists is enough already. (NOTE: we should test the performance without the onheap cache; it would speed up the checkpoint start process, thus reducing latency spikes.) A better way is to not have meta pages in page memory whatsoever. Maybe during the start, but that's it. It's common practice to have a pageSize equal to 16Kb, while the effective payload of a partition meta page in Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0. Having a loaded page for every partition is just a waste of resources; all required data can be stored on-heap. Then, let's rely on two simple facts:
* If meta page data is cached on-heap, no one would need to read it from disk. I should also mention that it will mostly be immutable.
* We can write the partition meta page into every delta file even if the meta has not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the meta page from heap to disk at the beginning of writing the delta file. This page becomes a write-only page, which is basically what we need.

h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots. They would produce a lot of junk checkpoints, because a checkpoint is a *global operation*.
Imagine RAFT triggering snapshots for 100 partitions in a row. This would result in 100 minuscule checkpoints, which no one needs. So, I'd say, we need two operations:
* partition.getCheckpointedUpdateIndex();
* partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods could be used by RAFT to determine whether it needs to truncate its log and to define a specific commit index for the truncation. In the case of the PDS checkpointer, the implementation of both methods is trivial.
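The two operations could be sketched as follows. This is a minimal model; the method names are the placeholders used above, and the real storage API may differ:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongConsumer;

/** Tracks the last update index persisted by a checkpoint, for RAFT log truncation. */
class CheckpointIndexTracker {
    private volatile long checkpointedUpdateIndex;
    private final List<LongConsumer> listeners = new CopyOnWriteArrayList<>();

    /** Index that is guaranteed to be on disk; RAFT may truncate its log up to it. */
    long getCheckpointedUpdateIndex() {
        return checkpointedUpdateIndex;
    }

    /** RAFT registers a closure that is invoked when a checkpoint persists a new index. */
    void registerCheckpointedUpdateIndexListener(LongConsumer listener) {
        listeners.add(listener);
    }

    /** Called by the checkpointer after the partition state is durably written. */
    void onCheckpointFinished(long persistedIndex) {
        checkpointedUpdateIndex = persistedIndex;
        listeners.forEach(l -> l.accept(persistedIndex));
    }
}
```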
[jira] [Resolved] (IGNITE-17074) Create integer tableId identifier for tables
[ https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-17074. Resolution: Duplicate

> Create integer tableId identifier for tables
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component restrictions - having an entire UUID for table id is too much for a loaded pages list. Currently the implementation uses String hash, just like in Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized by design. This allows us to have atomic counters basically for free. We could add a {{int lastTableId}} configuration property to a {{TablesConfigurationSchema}}, for example, and increment it every time a new table is created. Then all we need is to read this value in all components that need it.
> Maybe we should even use it in thin clients, but that needs careful consideration. Originally, int tableId is intended to be used in storage implementations and maybe as a part of a unique RowId, associated with tables, but that's only a speculation.
[jira] [Updated] (IGNITE-17074) Create integer tableId identifier for tables
[ https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17074: --- Description: First of all, this requirement comes from the PageMemory component restrictions - having an entire UUID for table id is too much for a loaded pages list. Currently the implementation uses String hash, just like in Ignite 2.x. This is a bad solution. In the Ignite 3.x configuration model, every configuration update is serialized by design. This allows us to have atomic counters basically for free. We could add an {{int lastTableId}} configuration property to a {{TablesConfigurationSchema}}, for example, and increment it every time a new table is created. Then all we need is to read this value in all components that need it. Maybe we should even use it in thin clients, but that needs careful consideration. Originally, int tableId is intended to be used in storage implementations and maybe as a part of a unique RowId, associated with tables, but that's only a speculation.
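The "atomic counters basically for free" point can be sketched like this. The class and method names are illustrative, not the real Ignite configuration API; serialized configuration updates are modeled here with a simple single-writer guarantee:

```java
/** Illustrative model: all configuration updates are applied one at a time. */
class TablesConfiguration {
    private int lastTableId; // monotonically growing, part of the replicated configuration

    /** Invoked under the configuration's single-writer (serialized update) guarantee. */
    synchronized int createTable(String name) {
        int tableId = ++lastTableId; // atomic "for free" thanks to serialized updates
        // ... persist the new table entry together with this id ...
        return tableId;
    }

    synchronized int lastTableId() {
        return lastTableId;
    }
}
```

Components that need the id then simply read the configuration value instead of hashing the table name.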
[jira] [Created] (IGNITE-17087) Native rebalance for PDS partitions
Ivan Bessonov created IGNITE-17087: --- Summary: Native rebalance for PDS partitions Key: IGNITE-17087 URL: https://issues.apache.org/jira/browse/IGNITE-17087 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov

General idea of full rebalance is described in https://issues.apache.org/jira/browse/IGNITE-17083 For persistent storages, there's an option to avoid copy-on-write rebalance algorithms if desired. Intuitively, it's the preferable option. Each storage chooses its own format.

h2. General idea

In this case, PDS has a checkpointing feature that saves a consistent state on disk. I expect SQL indexes to be in the same partition file as other data. For every partition, its state on disk would look like this:
{code:java}
part-x.bin
part-x-1.bin
part-x-2.bin
...
part-x-n.bin{code}
part-x.bin is the baseline, and every other file is a delta that should be applied to the underlying layers to get consistent data. It can be viewed as full and incremental backups. When a rebalance snapshot is required, we could force a checkpoint and then *prohibit merging* of new deltas into the delta files from the snapshot until the rebalance is finished. We must guarantee that a consistent state can be read from disk. Now, there are several strategies for transferring the data:
* File-based. We can send the baseline and delta files as files. Two possible issues here:
** The files contain duplicated pages, so the volume of data will be bigger than necessary.
** The baseline file has to be truncated, because some delta pages go directly into the baseline file as an optimization.
* Page-based. The latest state of every required page is sent separately. Two strategies here:
** Iterate pages in order of page indexes. Overhead during reads, but writes are very effective.
** Iterate pages in order of delta files, skipping already read pages in the process (like snapshots in GridGain, for example). Little overhead on reads, but writes won't be append-only.

I would argue that slower reads are more appropriate than slower writes.
Generally speaking, any write should be slower than any read of the same size, right? Should we implement all strategies and give the user a choice? It's hard to predict which one is better for which scenario. In the future, I think it would be convenient to implement many options, but at first we should stick to the simplest one. There must be a common "infrastructure" or a framework to stream native rebalance snapshots. The data format should be as simple as possible. NOTE: of course, it has to be mentioned that this approach might lead to ineffective storage space usage. It can be a problem in theory, but in practice a full rebalance isn't expected to occur often, and even then we don't expect that users will rewrite the entire partition data in the span of a single rebalance.

h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes will be sent incomplete. Maybe we should also send a build state for each index so that the receiving side could continue from the right place, not from the beginning. This problem will be resolved in the future. Currently we don't have indexes implemented.
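The second page-based strategy (iterate delta files, skip pages already sent) can be sketched as a planning step. Types here are simplified stand-ins for page stores; the sketch only shows the deduplication logic:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plans which layer each page must be streamed from, walking layers newest-first. */
class SnapshotPageStreamer {
    /**
     * @param layers page-index sets per layer; index 0 = newest delta, last = baseline.
     * @return for each pageIdx, the layer it must be read from (the latest version wins).
     */
    static Map<Integer, Integer> plan(List<Set<Integer>> layers) {
        Map<Integer, Integer> latest = new HashMap<>();

        for (int layer = 0; layer < layers.size(); layer++) {
            for (int pageIdx : layers.get(layer)) {
                latest.putIfAbsent(pageIdx, layer); // already-seen pages are skipped
            }
        }

        return latest;
    }
}
```

Reads jump between layers (the overhead mentioned above), but each page is sent exactly once.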
[jira] [Updated] (IGNITE-17084) Native rebalance for RocksDB partitions
[ https://issues.apache.org/jira/browse/IGNITE-17084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17084: --- Description: General idea of full rebalance is described in https://issues.apache.org/jira/browse/IGNITE-17083 For persistent storages, there's an option to avoid copy-on-write rebalance algorithms if desired. Intuitively, it's the preferable option. Each storage chooses its own format. In this case, RocksDB allows consistent db iteration using its "Snapshot" feature. The idea is very simple:
* Take a RocksDB snapshot.
* Iterate through partition data.
* Iterate through indexes.
* Release the snapshot.

There must be a common "infrastructure" or a framework to stream native rebalance snapshots. The data format should be as simple as possible. NOTE: of course, it has to be mentioned that this approach might lead to ineffective storage space usage. What I mean is that "previous" versions of values, in terms of RocksDB, must be stored on the device as long as they're visible from any snapshot. It can be a problem in theory, but in practice a full rebalance isn't expected to occur often, and even then we don't expect that users will rewrite the entire partition data in the span of a single rebalance.

h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes will be sent incomplete. Maybe we should also send a build state for each index so that the receiving side could continue from the right place, not from the beginning. This problem will be resolved in the future. Currently we don't have indexes implemented.

> Native rebalance for RocksDB partitions
> Key: IGNITE-17084
> URL: https://issues.apache.org/jira/browse/IGNITE-17084
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
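The four steps above can be sketched against the rocksdbjni API. This is only a sketch (error handling and the actual network streaming are omitted, and `sendToReceiver` is a hypothetical hook):

```java
import org.rocksdb.*;

class RocksDbRebalanceSnapshot {
    /** Streams a consistent view of one partition using a RocksDB snapshot. */
    void stream(RocksDB db, ColumnFamilyHandle partitionCf) throws RocksDBException {
        Snapshot snapshot = db.getSnapshot(); // 1. take a RocksDB snapshot

        try (ReadOptions opts = new ReadOptions().setSnapshot(snapshot);
             RocksIterator it = db.newIterator(partitionCf, opts)) {
            // 2-3. iterate through partition data (and, likewise, index column families)
            for (it.seekToFirst(); it.isValid(); it.next()) {
                sendToReceiver(it.key(), it.value()); // hypothetical streaming hook
            }
        } finally {
            db.releaseSnapshot(snapshot); // 4. release: old versions become reclaimable
        }
    }

    void sendToReceiver(byte[] key, byte[] value) { /* network streaming is out of scope */ }
}
```

Releasing the snapshot promptly matters precisely because of the storage-space note above: compaction cannot reclaim "previous" versions while any snapshot still sees them.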
[jira] [Created] (IGNITE-17084) Native rebalance for RocksDB partitions
Ivan Bessonov created IGNITE-17084: --- Summary: Native rebalance for RocksDB partitions Key: IGNITE-17084 URL: https://issues.apache.org/jira/browse/IGNITE-17084 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov
[jira] [Updated] (IGNITE-17083) Universal full rebalance procedure for MV storage
[ https://issues.apache.org/jira/browse/IGNITE-17083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17083: --- Description: The canonical way to do a "full rebalance" in RAFT is to have persisted snapshots of data. This is not always a good idea. First of all, persistent data is already stored somewhere and can be read at any time. Second, for a volatile storage this requirement is just absurd. So, a "rebalance snapshot" should be streamed from one node to another instead of being written to a storage. What's good is that this approach can be implemented independently of the storage engine (with a few adjustments to the storage API, of course).

h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of cursor on the partition storage that is able to give us all versioned chains in {_}some fixed order{_}. Every time the next chain has been read, it's remembered as the last read (let's call it {{lastRowId}} for now). Then all versions for the specific row id are sent to the receiver node in "oldest to newest" order to simplify insertion. This works fine without concurrent load. To account for it, we need an additional collection of row ids associated with the snapshot. Let's call it {{overwrittenRowIds}}. With this in mind, every write command should look similar to this:
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
    try (var lock = rebalanceSnapshot.lock()) {
        if (rowId <= rebalanceSnapshot.lastRowId())
            continue;

        if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
            continue;

        rebalanceSnapshot.sendRowToReceiver(rowId);
    }
}

// Now the modification can be freely performed.
// The snapshot itself will skip everything from the "overwrittenRowIds" collection.{noformat}
NOTE: the rebalance snapshot scan must also return uncommitted write intents. Their commits will be replicated later from the RAFT log.
NOTE: the receiving side will have to rebuild indexes during the rebalancing, just like it works in Ignite 2.x.

NOTE: Technically, several nodes entering the cluster may require a full rebalance at the same time. So, while triggering a rebalance snapshot cursor, we could wait for other nodes that might want to read the same data and process all of them with a single scan. This is an optimization, obviously.

h2. Implementation

The implementation will have to be split into several parts, because we need:
* Support for snapshot streaming in the RAFT state machine.
* Storage API for this type of scan.
* Every storage must implement the new scan method.
* The streamer itself, along with the specific logic in write commands.
[jira] [Created] (IGNITE-17083) Universal full rebalance procedure for MV storage
Ivan Bessonov created IGNITE-17083: -- Summary: Universal full rebalance procedure for MV storage Key: IGNITE-17083 URL: https://issues.apache.org/jira/browse/IGNITE-17083 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov The canonical way to do a "full rebalance" in RAFT is to have persisted snapshots of data. This is not always a good idea. First of all, persistent data is already stored somewhere and can be read at any time. Second, for volatile storages this requirement is simply absurd. So, a "rebalance snapshot" should be streamed from one node to another instead of being written to storage. What's good is that this approach can be implemented independently of the storage engine (with a few adjustments to the storage API, of course). h2. General idea Once a "rebalance snapshot" operation is triggered, we open a special type of cursor on the partition storage that is able to give us all version chains in {_}some fixed order{_}. Every time the next chain has been read, it's remembered as the last one read (let's call it {{lastRowId}} for now). Then all versions for the specific row id should be sent to the receiver node in "Oldest to Newest" order to simplify insertion. This works fine without concurrent load. To account for concurrent load, we need an additional collection of row ids associated with a snapshot. Let's call it {{overwrittenRowIds}}. With this in mind, every write command should look similar to this:
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
    try (var lock = rebalanceSnapshot.lock()) {
        if (rowId <= rebalanceSnapshot.lastRowId())
            continue;

        if (!rebalanceSnapshot.overwrittenRowIds().add(rowId))
            continue;

        rebalanceSnapshot.sendRowToReceiver(rowId);
    }
}

// Now the modification can be performed freely.
// The snapshot itself will skip everything from the "overwrittenRowIds" collection.{noformat}
NOTE: the rebalance snapshot scan must also return uncommitted write intents. Their commits will be replicated later from the RAFT log. 
NOTE: the receiving side will have to rebuild indexes during the rebalancing, just like it works in Ignite 2.x. NOTE: Technically, it is possible to have several nodes entering the cluster that require a full rebalance. So, while triggering a rebalance snapshot cursor, we could wait for other nodes that might want to read the same data and process all of them with a single scan. This is an optimization, obviously. h2. Implementation The implementation will have to be split into several parts, because we need: * Support for snapshot streaming in the RAFT state machine. * Storage API for this type of scan. * Every storage must implement the new scan method. * The streamer itself should be implemented, along with specific logic in write commands. -- This message was sent by Atlassian Jira (v8.20.7#820007)
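The write-command interception described in the issue above can be sketched as plain Java. All names here ({{lastRowId}}, {{overwrittenRowIds}}, the snapshot object itself) are hypothetical, taken from the description; the storage, networking, and version-chain streaming are reduced to an in-memory toy so the ordering rules are visible:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of a single ongoing rebalance snapshot; names are hypothetical. */
class RebalanceSnapshot {
    private final ReentrantLock lock = new ReentrantLock();

    /** Last row id already streamed, in order, by the snapshot scan. */
    private volatile long lastRowId;

    /** Row ids pushed out of order because a write command touched them first. */
    private final Set<Long> overwrittenRowIds = new HashSet<>();

    /** Stand-in for the stream to the receiver node. */
    final List<Long> sentRows = new ArrayList<>();

    RebalanceSnapshot(long lastRowId) {
        this.lastRowId = lastRowId;
    }

    /** Called by every write command before it modifies {@code rowId}. */
    void interceptWrite(long rowId) {
        lock.lock();
        try {
            if (rowId <= lastRowId)
                return; // Already streamed in order; the change is covered by RAFT log replay.

            if (!overwrittenRowIds.add(rowId))
                return; // Already pushed out of order by an earlier write.

            sendRowToReceiver(rowId);
        } finally {
            lock.unlock();
        }
    }

    /** The snapshot scan skips rows that write commands already pushed. */
    boolean shouldScanSkip(long rowId) {
        lock.lock();
        try {
            return overwrittenRowIds.contains(rowId);
        } finally {
            lock.unlock();
        }
    }

    private void sendRowToReceiver(long rowId) {
        // A real implementation would stream all versions of the row, oldest to newest.
        sentRows.add(rowId);
    }
}
```

For example, while {{lastRowId}} is 5, a write to row 3 sends nothing (the row was already streamed in order), the first write to row 7 pushes it immediately and marks it for the scan to skip, and a second write to row 7 sends nothing.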
[jira] [Created] (IGNITE-17081) Implement checkpointIndex for RocksDB
Ivan Bessonov created IGNITE-17081: -- Summary: Implement checkpointIndex for RocksDB Key: IGNITE-17081 URL: https://issues.apache.org/jira/browse/IGNITE-17081 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for prerequisites. Please also familiarize yourself with https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; the description is continued from there. For RocksDB-based storage the recovery process is trivial, because RocksDB has its own WAL. So, for testing purposes, it would be enough to just store the update index in the meta column family. But immediately we get a write amplification issue, on top of possible performance degradation. The obvious solution is inherently bad and needs to be improved. h2. General idea & implementation Obviously, the WAL needs to be disabled (WriteOptions#setDisableWAL). This effectively breaks the RocksDB recovery procedure, so we need to take measures to compensate. The only feasible way to do so is to use DBOptions#setAtomicFlush in conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save all column families consistently, as long as batches cover several CFs. Basically, {{acquireConsistencyLock()}} would create a thread-local write batch that's applied on lock release. Most of RocksDbMvPartitionStorage will be affected by this change. NOTE: I believe that scans with unapplied batches should be prohibited for now (luckily, there's WriteBatchInterface#count() to check). I don't see any practical value in them, nor a proper way of implementing them, considering how spread out in time the scan process is. h2. Callbacks and RAFT snapshots Simply storing and reading the update index is easy. Reading the committed index is more challenging; I propose caching it and updating it only from the closure, which can also be used by RAFT to truncate the log. For the closure, there are several things to account for during the implementation: * DBOptions#setListeners. 
We need two events: ON_FLUSH_BEGIN and ON_FLUSH_COMPLETED. In atomic flush mode, all "completed" events go after all "begin" events. And once you have your first "completed" event, you have a guarantee that *all* memtables are already persisted. This allows easy tracking of RocksDB flushes; monitoring the alternation of events is all that's needed. * Unlike the PDS implementation, here we will be writing the updateIndex value into a memtable every time. This makes it harder to find persistedIndex values for partitions. Luckily, given the events that we have, during the time between the first "completed" and the very next "begin" the state on disk is fully consistent. And there's a way to read data from the storage while avoiding the memtable completely: ReadOptions#setReadTier(PERSISTED_TIER). Summarizing everything above, we should implement the following protocol:
{code:java}
During table start: read the latest values of update indexes. Store them in an
in-memory structure. Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
    if (lastEventType == ON_FLUSH_BEGIN)
        return;

    waitForLastAsyncUpdateIndexesRead();

    lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
    if (lastEventType == ON_FLUSH_COMPLETED)
        return;

    asyncReadUpdateIndexesFromDisk();

    lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall the flushing process. We don't control the locks that RocksDB holds while calling the listener's methods. That asynchronous process would invoke closures that provide persisted updateIndex values to other components. NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" as late as possible, just in case. But my implementation calls it during the first event, and this is fine. I noticed that column families are flushed in the order of their internal ids. These ids correspond to the creation sequence number of the CFs, and the "default" CF is always created first. This is the exact CF that we use to store meta. 
Maybe we're going to change this and create a separate meta CF. Only then could we start optimizing this part, and only if we have actual proof that there's a stall in this exact place. 
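The event-alternation protocol above can be modeled as a tiny state machine. This is a sketch under stated assumptions: the real code would subclass RocksDB's event listener and read indexes from disk with the persisted read tier, while here the asynchronous read is stubbed out as a counter so the alternation logic itself is testable; all names are hypothetical:

```java
import java.util.concurrent.CompletableFuture;

/** Sketch of the flush-tracking protocol described above; names are hypothetical. */
class FlushTracker {
    enum EventType { ON_FLUSH_BEGIN, ON_FLUSH_COMPLETED }

    /** Starts as COMPLETED so the first real "begin" is processed. */
    private EventType lastEventType = EventType.ON_FLUSH_COMPLETED;

    /** Future of the last asynchronous read of persisted update indexes. */
    CompletableFuture<Void> lastIndexRead = CompletableFuture.completedFuture(null);

    /** Stand-in for asyncReadUpdateIndexesFromDisk(): counts re-reads. */
    volatile int diskReads;

    synchronized void onFlushBegin() {
        if (lastEventType == EventType.ON_FLUSH_BEGIN)
            return; // In atomic flush mode, only the first "begin" of a burst matters.

        // waitForLastAsyncUpdateIndexesRead(): the previous read must finish
        // before memtables start changing the on-disk state again.
        lastIndexRead.join();

        lastEventType = EventType.ON_FLUSH_BEGIN;
    }

    synchronized void onFlushCompleted() {
        if (lastEventType == EventType.ON_FLUSH_COMPLETED)
            return; // Only the first "completed" of a burst triggers the read.

        // Asynchronous on purpose: must not stall the flush thread.
        lastIndexRead = CompletableFuture.runAsync(() -> diskReads++);

        lastEventType = EventType.ON_FLUSH_COMPLETED;
    }
}
```

With atomic flush over several CFs, a burst of three "begin" and three "completed" events results in exactly one disk read, exactly as the protocol intends.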
[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB
[ https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17081: --- Labels: ignite-3 (was: ) > Implement checkpointIndex for RocksDB > - > > Key: IGNITE-17081 > URL: https://issues.apache.org/jira/browse/IGNITE-17081 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Priority: Major > Labels: ignite-3 > > Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for > prerequisites. > Please also familiarize yourself with > https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, > the description is continued from there. > For RocksDB based storage the recovery process is trivial, because RocksDB > has its own WAL. So, for testing purposes, it would be enough to just store > update index in meta column family. > Immediately we have a write amplification issue, on top of possible > performance degradation. Obvious solution is inherently bad and needs to be > improved. > h2. General idea & implementation > Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda > breaks RocksDB recovery procedure, we need to take measures to avoid it. > The only feasible way to do so is to use DBOptions#setAtomicFlush in > conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save > all column families consistently, if you have batches that cover several CFs. > Basically, {{acquireConsistencyLock()}} would create a thread-local write > batch, that's applied on locks release. Most of RocksDbMvPartitionStorage > will be affected by this change. > NOTE: I believe that scans with unapplied batches should be prohibited for > now (gladly, there's a WriteBatchInterface#count() to check). I don't see > any practical value and a proper way of implementing it, considering how > spread-out in time the scan process is. > h2. Callbacks and RAFT snapshots > Simply storing and reading update index is easy. 
Reading committed index is > more challenging, I propose caching it and update only from the closure, that > can also be used by RAFT to truncate the log. > For a closure, there are several things to account for during the > implementation: > * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and > ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in > atomic flush mode. And, once you have your first "completed" event, you have > a guarantee that *all* memtables are already persisted. > This allows easy tracking of RocksDB flushes, monitoring events alternation is > all that's needed. > * Unlike PDS implementation, here we will be writing updateIndex value into > a memtable every time. This makes it harder to find persistedIndex values for > partitions. Luckily, considering the events that we have, during the time > between first "completed" and the very next "begin", the state on disk is > fully consistent. And there's a way to read data from storage avoiding > memtable completely - ReadOptions#setReadTier(PERSISTED_TIER). > Summarizing everything from the above, we should implement the following protocol: > > {code:java} > During table start: read latest values of update indexes. Store them in an > in-memory structure. > Set "lastEventType = ON_FLUSH_COMPLETED;". > onFlushBegin: > if (lastEventType == ON_FLUSH_BEGIN) > return; > waitForLastAsyncUpdateIndexesRead(); > lastEventType = ON_FLUSH_BEGIN; > onFlushCompleted: > if (lastEventType == ON_FLUSH_COMPLETED) > return; > asyncReadUpdateIndexesFromDisk(); > lastEventType = ON_FLUSH_COMPLETED;{code} > Reading values from disk must be performed asynchronously to not stall > flushing process. We don't control locks that RocksDB holds while calling > listener's methods. > > That asynchronous process would invoke closures that provide persisted > updateIndex values to other components. 
> NOTE: One might say that we should call > "waitForLastAsyncUpdateIndexesRead();" as late as possible just in case. But > my implementation calls it during the first event. This is fine. I > noticed that column families are flushed in order of their internal ids. > These ids correspond to a sequence number of CFs, and the "default" CF is > always created first. This is the exact CF that we use to store meta. Maybe > we're going to change this and create a separate meta CF. Only then could we > start optimizing this part, and only if we have actual proof that > there's a stall in this exact place. 
[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS
[ https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17077: --- Description: Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for prerequisites. h2. General idea The idea doesn't seem complicated. There will be "setUpdateIndex" and "getUpdateIndex" methods (names might be different). * The first one is invoked at the end of every write command, with the RAFT commit index being passed as a parameter. This is done right before releasing the checkpoint read lock (or whatever name we come up with). More on that later. * The second one is invoked at the beginning of every write command to validate that updates don't come out of order or with gaps. This is the way to guarantee that IndexMismatchException can be thrown at the right time. So, the write command flow will look like this. All names here are completely random.
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
Some nuances: * The mismatch exception must be thrown before any data modifications. Storage content must remain intact, otherwise we'll just break it. * The case above is the simplest one: there's a single "atomic" storage update. Generally speaking, we can't, or sometimes don't want to, work this way. Examples of operations where such strict atomicity is not required: ** Batch insert/update from a transaction. ** A transaction commit might have a huge number of row ids; we could exhaust memory while committing. * If we split a write operation into several operations, we should externally guarantee their idempotence. 
"setUpdateIndex" should be at the end of the last "atomic" operation, so that the last command can be safely reapplied. h2. Implementation The "set" method could write the value directly into the partition's meta page. This *will* work, but it's not quite optimal. The optimal solution is tightly coupled with the way checkpoints should work. This may not be the right place to describe the issue, but I'll do it nonetheless; it'll probably get split into another issue one day. There's a simple way to touch every meta page only once per checkpoint: we just do it while holding the checkpoint write lock. This way the data is consistent. But this solution is equally {*}bad{*}, because it forces us to perform page manipulation under the write lock. Flushing freelists there is enough already. (NOTE: we should test the performance without the onheap cache; it'll speed up the checkpoint start process, thus reducing latency spikes.) A better way to do this is to not have meta pages in page memory whatsoever. Maybe during the start, but that's it. It's common practice to have a page size of 16 KiB. The effective payload of a partition meta page in Ignite 2.x is just above 100 bytes, and I expect it to be way lower in Ignite 3.0. Having a loaded page for every partition is just a waste of resources; all required data can be stored on-heap. Then, let's rely on two simple facts: * If meta page data is cached on-heap, no one needs to read it from disk. I should also mention that it will mostly be immutable. * We can write the partition meta page into every delta file even if the meta has not changed. In actuality, an unchanged meta will be a very rare situation. Considering both of these facts, the checkpointer may unconditionally write the meta page from heap to disk at the beginning of writing the delta file. The meta page becomes a write-only page, which is basically what we need. h2. Callbacks and RAFT snapshots I argue against scheduled RAFT snapshots. They would produce a lot of junk checkpoints, because a checkpoint is a {*}global operation{*}. 
Imagine RAFT triggering snapshots for 100 partitions in a row. This would result in 100 minuscule checkpoints, which no one needs. So, I'd say, we need two operations: * partition.getCheckpointedUpdateIndex(); * partition.registerCheckpointedUpdateIndexListener(closure); Both of these methods could be used by RAFT to determine whether it needs to truncate its log and to define a specific commit index for truncation. In the case of the PDS checkpointer, the implementation of both methods is trivial. 
[jira] [Created] (IGNITE-17077) Implement checkpointIndex for PDS
Ivan Bessonov created IGNITE-17077: -- Summary: Implement checkpointIndex for PDS Key: IGNITE-17077 URL: https://issues.apache.org/jira/browse/IGNITE-17077 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for prerequisites. h2. General idea The idea doesn't seem complicated. There will be "setUpdateIndex" and "getUpdateIndex" methods (names might be different). * The first one is invoked at the end of every write command, with the RAFT commit index being passed as a parameter. This is done right before releasing the checkpoint read lock (or whatever name we come up with). More on that later. * The second one is invoked at the beginning of every write command to validate that updates don't come out of order or with gaps. This is the way to guarantee that IndexMismatchException can be thrown at the right time. So, the write command flow will look like this. All names here are completely random.
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
Some nuances: * The mismatch exception must be thrown before any data modifications. Storage content must remain intact, otherwise we'll just break it. * The case above is the simplest one: there's a single "atomic" storage update. Generally speaking, we can't, or sometimes don't want to, work this way. Examples of operations where such strict atomicity is not required: ** Batch insert/update from a transaction. ** A transaction commit might have a huge number of row ids; we could exhaust memory while committing. 
* If we split a write operation into several operations, we should externally guarantee their idempotence. "setUpdateIndex" should be at the end of the last "atomic" operation, so that the last command can be safely reapplied. h2. Implementation The "set" method could write the value directly into the partition's meta page. This *will* work, but it's not quite optimal. The optimal solution is tightly coupled with the way checkpoints should work. This may not be the right place to describe the issue, but I'll do it nonetheless; it'll probably get split into another issue one day. There's a simple way to touch every meta page only once per checkpoint: we just do it while holding the checkpoint write lock. This way the data is consistent. But this solution is equally {*}bad{*}, because it forces us to perform page manipulation under the write lock. Flushing freelists there is enough already. (NOTE: we should test the performance without the onheap cache; it'll speed up the checkpoint start process, thus reducing latency spikes.) A better way to do this is to not have meta pages in page memory whatsoever. Maybe during the start, but that's it. It's common practice to have a page size of 16 KiB. The effective payload of a partition meta page in Ignite 2.x is just above 100 bytes, and I expect it to be way lower in Ignite 3.0. Having a loaded page for every partition is just a waste of resources; all required data can be stored on-heap. Then, let's rely on two simple facts: * If meta page data is cached on-heap, no one needs to read it from disk. I should also mention that it will mostly be immutable. * We can write the partition meta page into every delta file even if the meta has not changed. In actuality, an unchanged meta will be a very rare situation. Considering both of these facts, the checkpointer may unconditionally write the meta page from heap to disk at the beginning of writing the delta file. The meta page becomes a write-only page, which is basically what we need. -- This message was sent by Atlassian Jira (v8.20.7#820007)
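The write-command flow above (index check, write, index bump) can be sketched as self-contained Java. The partition class, IndexMismatchException, and the ordering rules are stand-ins from the description, reduced to an in-memory toy; ConsistencyLock and secondary indexes are omitted for brevity:

```java
import java.util.ArrayList;
import java.util.List;

/** Thrown when a write command's RAFT index doesn't follow the stored update index. */
class IndexMismatchException extends RuntimeException {
    final long updateIndex;

    IndexMismatchException(long updateIndex) {
        super("Expected index " + (updateIndex + 1));
        this.updateIndex = updateIndex;
    }
}

/** Toy partition storage illustrating the updateIndex protocol (names from the description). */
class Partition {
    /** Last applied RAFT index; persisted together with the data in a real storage. */
    private long updateIndex;

    final List<String> rows = new ArrayList<>();

    long getUpdateIndex() {
        return updateIndex;
    }

    /** Applies a write command; the index must be validated BEFORE any modification. */
    void applyWriteCommand(long raftIndex, String row) {
        if (raftIndex != updateIndex + 1) {
            throw new IndexMismatchException(updateIndex); // Storage content stays intact.
        }

        rows.add(row); // partition.write(...) and index updates would go here.

        updateIndex = raftIndex; // setUpdateIndex: the last step, so replay is safe.
    }
}
```

A command with a gap in the index sequence fails before touching the data, which is exactly the "storage content must remain intact" nuance from the description.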
[jira] [Updated] (IGNITE-17076) Unify RowId format for different storages
[ https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-17076: --- Labels: ignite-3 (was: ) > Unify RowId format for different storages > - > > Key: IGNITE-17076 > URL: https://issues.apache.org/jira/browse/IGNITE-17076 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Priority: Major > Labels: ignite-3 > > The current MV store bridge API has a fatal flaw, born from a misunderstanding. > There's a method called "insert" that generates a RowId by itself. This is > wrong, because it can lead to a different id for the same row on the replica > storage. This completely breaks everything. > Every replicated write command that inserts a new value should produce the same > row ids. There are several ways to achieve this: > * Use timestamps as identifiers. This is not very convenient, because we > would have to attach the partition id on top of it. It's mandatory to know the > partition of the row. > * Use a more complicated structure, for example a tuple of (raftCommitIndex, > partitionId, batchCounter), where > ** raftCommitIndex is the index of the write command that performs the insertion. > ** partitionId is an integer identifier of the partition. It could be 4 bytes, > considering that there are plans to support more than 65000 partitions per > table. > ** batchCounter is used to differentiate insertions made in a single write > command. We can limit it to 2 bytes to save a little bit of space, if > necessary. > I prefer the second option, but maybe it could be revised during the > implementation. > Of course, the method "insert" should be removed from the bridge API. Tests have to > be updated. Given the lack of a RAFT group in storage tests, we can generate row > ids artificially, it's not a big deal. 
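The second option from the issue above, a (raftCommitIndex, partitionId, batchCounter) tuple, can be sketched as a fixed-width encoding. The field widths (8 + 4 + 2 bytes) follow the description; the class itself, its layout, and the serialization are a hypothetical illustration, not the actual Ignite 3 RowId:

```java
import java.nio.ByteBuffer;

/** Sketch of a composite RowId as described above; layout and names are illustrative. */
record RowId(long raftCommitIndex, int partitionId, short batchCounter) {
    static final int SIZE_BYTES = Long.BYTES + Integer.BYTES + Short.BYTES; // 14 bytes total.

    /**
     * Big-endian, index-first encoding: for non-negative components, byte-wise
     * comparison of the encoded form matches (index, partition, counter) ordering.
     */
    byte[] toBytes() {
        return ByteBuffer.allocate(SIZE_BYTES)
                .putLong(raftCommitIndex)
                .putInt(partitionId)
                .putShort(batchCounter)
                .array();
    }

    static RowId fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);

        return new RowId(buf.getLong(), buf.getInt(), buf.getShort());
    }
}
```

Because every replica applies the same write command at the same RAFT commit index, both replicas derive identical RowIds without any coordination, which is the whole point of the tuple.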
[jira] [Updated] (IGNITE-15818) [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and re-implementation
[ https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-15818: --- Description: h2. Goal Port and refactor the core classes implementing the page-based persistent store in Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, PageMemoryImpl, Checkpointer, FileWriteAheadLogManager. New checkpoint implementation to avoid excessive logging. Store lifecycle clarification to avoid the complicated and invasive code of a custom lifecycle managed mostly by DatabaseSharedManager. h2. Items to pay attention to New checkpoint implementation based on split-file storage, new page index structure to maintain the disk-memory page mapping. The file page store implementation should be extracted from GridCacheOffheapManager to a separate entity; the target implementation should support the new version of checkpoint (split-file store to enable an always-consistent store and to eliminate the binary recovery phase). Support of big pages (256+ kB). Support of throttling algorithms. h2. References New checkpoint design overview is available [here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md] h2. Thoughts Although there is a technical opportunity to have independent checkpoints for different data regions, managing them could be a nightmare; it's definitely in the realm of optimizations and out of scope right now. So, let's assume that there's one good old checkpoint process. There's still a requirement to have checkpoint markers, but they will not have a reference to the WAL, because there's no WAL. Instead, we will have to store a RAFT log revision per partition. Or not, I'm not that familiar with the recovery procedure that's currently in development. Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new version will have DO and UNDO. This drastically simplifies both the checkpoint itself and node recovery, but it complicates data access. 
There will be two processes that will share the storage resource: "checkpointer" and "compactor". Let's examine what the compactor should or shouldn't do: * it should not work in parallel with the checkpointer, except for cases when there are too many layers (more on that later) * it should merge later checkpoint delta files into main partition files * it should delete checkpoint markers once all merges are completed for them, thus markers are decoupled from the RAFT log About "cases when there are too many layers" - too many layers could compromise reading speed. The number of layers should not increase uncontrollably. So, when a threshold is exceeded, the compactor should start working no matter what. If anything, writing load can be throttled; reading matters more. Recovery procedure: * read the list of checkpoint markers on engine start * remove all data from the unfinished checkpoint, if it's there * trim main partition files to their proper size (should check if it's actually beneficial) Table start procedure: * read all layer file headers according to the list of checkpoints * construct a list of hash tables (pageId -> pageIndex) for all layers, make it as effective as possible * everything else is just like before Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x after all. The "restore partition states" procedure could be revisited; I don't know how this will work yet. How to store the hashmaps: regular maps might be too much, we should consider a roaring map implementation or something similar that'll occupy less space. This is only a concern for in-memory structures. Files on disk may have a list of pairs, that's fine. Generally speaking, checkpoints with a size of 100 thousand pages are close to the top limit for most users. Splitting that across 500 partitions, for example, gives us 200 pages per partition. The entire map should fit into a single page. The only exception to these calculations is index.bin. 
The amount of pages per checkpoint can be orders of magnitude higher there, so we should keep an eye on it; it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is enough to fit 512 integer pairs, scaling to 2048 for regular 16 kilobyte pages. The map won't be too big IMO. Another important point - we should enable direct IO; it's supported by Java natively since version 9 (I guess). There's a chance that not only regular disk operations will become somewhat faster, but fsync will become drastically faster as a result. Which is good: fsync can easily take half the time of a checkpoint, which is just unacceptable. h2. Thoughts 2.0 With high likelihood, we'll get rid of index.bin. This will remove the requirement of having checkpoint markers. All that we need is a consistently growing local counter that will be used to mark partition delta files. But it doesn't need to be global even at the level of the local node; it can be a local counter per partition that's persisted
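The page-capacity arithmetic above (512 integer pairs in 4Kb, 2048 in 16Kb) can be checked with a one-liner. The class name is made up; it assumes two 4-byte integers per (pageId -> pageIndex) pair, as the on-disk pair list described above suggests.

```java
// Capacity of an on-disk page holding (pageId -> pageIndex) integer pairs.
final class PairMapCapacity {
    static int pairsPerPage(int pageSizeBytes) {
        // Two 4-byte integers per pair -> 8 bytes per entry.
        return pageSizeBytes / (2 * Integer.BYTES);
    }
}
```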
[jira] [Created] (IGNITE-17076) Unify RowId format for different storages
Ivan Bessonov created IGNITE-17076: -- Summary: Unify RowId format for different storages Key: IGNITE-17076 URL: https://issues.apache.org/jira/browse/IGNITE-17076 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Current MV store bridge API has a fatal flaw, born from a misunderstanding. There's a method called "insert" that generates RowId by itself. This is wrong, because it can lead to different id for the same row on the replica storage. This completely breaks everything. Every replicated write command, that inserts new value, should produce same row ids. There are several ways to achieve this: * Use timestamps as identifiers. This is not very convenient, because we would have to attach partition id on top of it. It's mandatory to know the partition of the row. * Use more complicated structure, for example a tuple of (raftCommitIndex, partitionId, batchCounter), where ** raftCommitIndex is the index of write command that performs insertion. ** partitionId is an integer identifier of the partition. Could be 4 bytes, considering that there are plans to support more than 65000 partitions per table. ** batchCounter is used to differentiate insertions made in a single write command. We can limit it with 2 bytes to save a little bit of space, if it's necessary. I prefer the second option, but maybe it could be revised during the implementation. Of course, method "insert" should be removed from bridge API. Tests have to be updated. With the lack of RAFT group in storage tests, we can generate row ids artificially, it's not a big deal. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (IGNITE-17074) Create integer tableId identifier for tables
Ivan Bessonov created IGNITE-17074: -- Summary: Create integer tableId identifier for tables Key: IGNITE-17074 URL: https://issues.apache.org/jira/browse/IGNITE-17074 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov First of all, this requirement comes from the PageMemory component restrictions - having an entire UUID for a table id is too much for a loaded pages list. Currently the implementation uses a String hash, just like in Ignite 2.x. This is a bad solution. In the Ignite 3.x configuration model, every configuration update is serialized by design. This allows us to have atomic counters basically for free. We could add an {{int lastTableId}} configuration property to {{TablesConfigurationSchema}}, for example, and increment it every time a new table is created. Then all we need is to read this value in all components that need it. Maybe we should even use it in thin clients, but that needs careful consideration. Originally, int tableId is intended to be used in storage implementations and maybe as a part of a unique RowId associated with tables, but that's only a speculation. -- This message was sent by Atlassian Jira (v8.20.7#820007)
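The counter idea above can be sketched in a few lines. `TableIdGenerator` and `nextTableId` are hypothetical names; `AtomicInteger` here stands in for the proposed `lastTableId` configuration property, whose updates the configuration framework would already serialize.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a monotonically increasing table id, incremented once per
// table creation. Because configuration updates are serialized by
// design, the real property wouldn't even need CAS semantics.
final class TableIdGenerator {
    private final AtomicInteger lastTableId = new AtomicInteger(0);

    int nextTableId() {
        return lastTableId.incrementAndGet();
    }
}
```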
[jira] [Commented] (IGNITE-16306) snaptree-based in-memory storage
[ https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542064#comment-17542064 ] Ivan Bessonov commented on IGNITE-16306: [~sergeychugunov] sure, with great pleasure! > snaptree-based in-memory storage > > > Key: IGNITE-16306 > URL: https://issues.apache.org/jira/browse/IGNITE-16306 > Project: Ignite > Issue Type: Improvement >Affects Versions: 3.0.0-alpha3 >Reporter: Ivan Bessonov >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: iep-74, ignite-3 > > Until a full-fledged MV store is implemented we can implement in-memory > storage on a snaptree library [1] that represents a concurrent AVL tree with > support of snapshots. > In this ticket we need to integrate the library with our existing storage > APIs (refine API if necessary), integrate its snapshot API with Raft > snapshots and provide configuration if necessary. > [1] https://github.com/nbronson/snaptree -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (IGNITE-16306) snaptree-based in-memory storage
[ https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov resolved IGNITE-16306. Resolution: Won't Fix > snaptree-based in-memory storage > > > Key: IGNITE-16306 > URL: https://issues.apache.org/jira/browse/IGNITE-16306 > Project: Ignite > Issue Type: Improvement >Affects Versions: 3.0.0-alpha3 >Reporter: Ivan Bessonov >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: iep-74, ignite-3 > > Until a full-fledged MV store is implemented we can implement in-memory > storage on a snaptree library [1] that represents a concurrent AVL tree with > support of snapshots. > In this ticket we need to integrate the library with our existing storage > APIs (refine API if necessary), integrate its snapshot API with Raft > snapshots and provide configuration if necessary. > [1] https://github.com/nbronson/snaptree -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (IGNITE-16937) [Versioned Storage] A multi version TableStorage for MvPartitionStorage partitions
[ https://issues.apache.org/jira/browse/IGNITE-16937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov reassigned IGNITE-16937: -- Assignee: Ivan Bessonov > [Versioned Storage] A multi version TableStorage for MvPartitionStorage > partitions > -- > > Key: IGNITE-16937 > URL: https://issues.apache.org/jira/browse/IGNITE-16937 > Project: Ignite > Issue Type: Task > Components: persistence >Reporter: Sergey Uttsel >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > > Need to create a multi version table storage which aggregate > MvPartitionStorage partitions. > Need to think how to integrate the multi version table storage to Ignite. May > be it's need to create for example a multi version StorageEngine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (IGNITE-16926) Interrupted compute job may fail a node
[ https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16926: --- Fix Version/s: 2.14 > Interrupted compute job may fail a node > --- > > Key: IGNITE-16926 > URL: https://issues.apache.org/jira/browse/IGNITE-16926 > Project: Ignite > Issue Type: Bug > Components: persistence >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Fix For: 2.14 > > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Critical system error detected. Will be handled accordingly to configured > handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, > super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet > [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], > failureCtx=FailureContext [type=CRITICAL_ERROR, err=class > o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is > corrupted [groupId=1234619879, pageIds=[7290201467513], > cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: > Row@79570772[ key: 1168930235, val: Data hidden due to > IGNITE_SENSITIVE_DATA_LOGGING flag. 
][ data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden > ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: > B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], > cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: > Row@79570772[ key: 1168930235, val: Data hidden due to > IGNITE_SENSITIVE_DATA_LOGGING flag. 
][ data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden ]] at > org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432) > at > org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500) > at > org.apache.ignite.internal.processors
[jira] [Updated] (IGNITE-16933) PageMemory-based MV storage implementation
[ https://issues.apache.org/jira/browse/IGNITE-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16933: --- Description: Similar to IGNITE-16611, we need an MV-storage implementation for the page memory storage engine. Currently, I expect only a row storage implementation, without primary or secondary indexes. h2. Chain Structure Here I'm going to describe a data format. Each row is stored as a versioned chain. It will be represented by a number of data entries that have references to each other. {code:java} [ Timestamp | NextLink | PayloadSize | Payload ]{code} * Timestamp is a 16-byte value derived from an {{org.apache.ignite.internal.tx.Timestamp}} instance. It represents the commit time of the corresponding row. * NextLink is a link to the next element in the chain or a NULL_LINK (or any other convenient name). It's a long value in the standard format for Page Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition id is not needed here, because it's always the same. Removing it could allow us to save 2 bytes per chain element. * PayloadSize is a 4-byte integer value that gives us the size of the actual data in arbitrary format. * Payload - I expect it to be serialized BinaryRow data. This is how it's implemented in RocksDB right now. For uncommitted (pending) entries I propose using the maximal possible timestamp - {{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note that we never store the tx id in the chain itself. Overall, every chain element will have a (16 + 6 + 4 = 26) byte header. It should be used as the header size in the corresponding FreeList. h2. RowId pointer There's a requirement to have an immutable RowId for every versioned chain. One could argue that we should just make the chain head immutable, but it would result in lots of complications. It's better to have a separate structure with an immutable link that points to the actual head of the versioned chain. 
{code:java} [ TransactionId | HeadLink | NextLink ]{code} * TransactionId is a UUID. Can only be applied to pending entries. For a committed head I propose storing 16 zeroes. * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already mentioned, I'd prefer 6. * NextLink is the "NextLink" value from the head chain element. It's a cheap shortcut for read-only transactions: you can skip the uncommitted entry without even trying to read it, if there's a non-null transaction id. Debatable, I know, but looks cheap enough. In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 6 = 28) bytes of data. There must be a separate FreeList for every partition, even in in-memory mode, for reasons that I'll give later. The "header" size in that list must be equal to these 28 bytes. I wonder how effective FreeList will be for this case, where every chunk has the same size. We'll see. Maybe we should adjust the number of buckets somehow. h2. Data access and Full Scan Now, the fun part. There's no mention of a B+Tree here. That's because we can probably just avoid it. If it existed, it would just point RowId to the described RowId structure in the partition, but RowId is already a pointer itself. The only other problem that is usually solved by a tree-like structure is a full scan of all rows in a partition. This is useful when you need to rebuild indexes, for example. We should keep in mind that there's no code yet for rebuilding indexes. On the other hand, there's a method for partition scan in the API. This code could be used instead of the Primary Index until we have it implemented. There's no FreeList full-scan currently in the code; it needs to be implemented. And this particular full-scan is the reason why every partition should have its own list of row ids. There's also a chance that introducing a new flag for row ids might be convenient. I don't know yet, let's not do it for now. 
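The 26-byte chain element header described above can be sketched as plain buffer packing. The class and method names are hypothetical, and the exact bit packing of the 6-byte link is illustrative, not the real page layout.

```java
import java.nio.ByteBuffer;

// Sketch of the chain element header:
// 16-byte timestamp + 6-byte next link (partition id stripped) +
// 4-byte payload size = 26 bytes, the FreeList "header" size.
final class ChainElementHeader {
    static final int SIZE = 16 + 6 + 4; // 26 bytes

    static void write(ByteBuffer buf, long tsHigh, long tsLow,
                      long nextLink, int payloadSize) {
        buf.putLong(tsHigh).putLong(tsLow);          // Timestamp (16 bytes)
        // NextLink trimmed to 6 bytes, big-endian, as the description
        // suggests by dropping the 2-byte partition id.
        for (int shift = 40; shift >= 0; shift -= 8)
            buf.put((byte) (nextLink >>> shift));
        buf.putInt(payloadSize);                     // PayloadSize (4 bytes)
    }
}
```

A pending entry would be written with the maximal timestamp `(Long.MAX_VALUE, Long.MAX_VALUE)`, matching the convention proposed above.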
Finally, we need adequate protection against assertion failures if we, for some reason, have an invalid row id. Things that can be checked by normal code, not assertions: * data page type * number of items in the page was: Similar to IGNITE-16611, we need an MV-storage implementation for page memory storage engine. Currently, I expect only row storage implementation, without primary or secondary indexes. Here I'm going to describe a data format. Each row is stored as a versioned chain. It will be represented by a number of data entries that will have references to each other. {code:java} [ Timestamp | NextLink | PayloadSize | Payload ]{code} * Timestamp is a 16 bytes value derived from {{org.apache.ignite.internal.tx.Timestamp}} instance. * NextLink is a link to the next element in the chain or a NULL_LINK (or any other convenient name). It's a long value in standard format for Page Memory links (itemId, flag, partitionId, pageIdx). Technically, partition id is not needed here, because it's alway
[jira] [Created] (IGNITE-16933) PageMemory-based MV storage implementation
Ivan Bessonov created IGNITE-16933: -- Summary: PageMemory-based MV storage implementation Key: IGNITE-16933 URL: https://issues.apache.org/jira/browse/IGNITE-16933 Project: Ignite Issue Type: New Feature Reporter: Ivan Bessonov Similar to IGNITE-16611, we need an MV-storage implementation for the page memory storage engine. Currently, I expect only a row storage implementation, without primary or secondary indexes. Here I'm going to describe a data format. Each row is stored as a versioned chain. It will be represented by a number of data entries that have references to each other. {code:java} [ Timestamp | NextLink | PayloadSize | Payload ]{code} * Timestamp is a 16-byte value derived from an {{org.apache.ignite.internal.tx.Timestamp}} instance. * NextLink is a link to the next element in the chain or a NULL_LINK (or any other convenient name). It's a long value in the standard format for Page Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition id is not needed here, because it's always the same. Removing it could allow us to save 2 bytes per chain element. * PayloadSize is a 4-byte integer value that gives us the size of the actual data in arbitrary format. * Payload - I expect it to be serialized BinaryRow data. This is how it's implemented in RocksDB right now. For uncommitted (pending) entries I propose using the maximal possible timestamp - {{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note that we never store the tx id in the chain itself. Overall, every chain element will have a (16 + 6 + 4 = 26) byte header. It should be used as the header size in the corresponding FreeList. There's a requirement to have an immutable RowId for every versioned chain. One could argue that we should just make the chain head immutable, but it would result in lots of complications. It's better to have a separate structure with an immutable link that points to the actual head of the versioned chain. 
{code:java} [ TransactionId | HeadLink | NextLink ]{code} * TransactionId is a UUID. Can only be applied to pending entries. For a committed head I propose storing 16 zeroes. * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already mentioned, I'd prefer 6. * NextLink is the "NextLink" value from the head chain element. It's a cheap shortcut for read-only transactions: you can skip the uncommitted entry without even trying to read it, if there's a non-null transaction id. Debatable, I know, but looks cheap enough. In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 6 = 28) bytes of data. There must be a separate FreeList for every partition, even in in-memory mode, for reasons that I'll give later. The "header" size in that list must be equal to these 28 bytes. I wonder how effective FreeList will be for this case, where every chunk has the same size. We'll see. Maybe we should adjust the number of buckets somehow. Now, the fun part. There's no mention of a B+Tree here. That's because we can probably just avoid it. If it existed, it would just point RowId to the described RowId structure in the partition, but RowId is already a pointer itself. The only other problem that is usually solved by a tree-like structure is a full scan of all rows in a partition. This is useful when you need to rebuild indexes, for example. We should keep in mind that there's no code yet for rebuilding indexes. On the other hand, there's a method for partition scan in the API. It could be used to implement a Primary Index imitation until we have a real implementation. There's no FreeList full-scan currently in the code; it needs to be implemented. And this particular full-scan is the reason why every partition should have its own list of row ids. There's also a chance that introducing a new flag for row ids might be convenient. I don't know yet, let's not do it for now. Finally, we need adequate protection against assertion failures if we, for some reason, have an invalid row id. 
Things that can be checked by normal code, not assertions: * data page type * number of items in the page -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (IGNITE-16912) Revisit UUID generation for RowId
[ https://issues.apache.org/jira/browse/IGNITE-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16912: --- Epic Link: IGNITE-16923 > Revisit UUID generation for RowId > - > > Key: IGNITE-16912 > URL: https://issues.apache.org/jira/browse/IGNITE-16912 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Priority: Major > Labels: ignite-3 > > Current implementation uses UUID.randomUUID, which comes with a set of > problems: > * some people say that you can't avoid collisions this way. Technically it's > true, although I don't think that it's a real problem > * secure random is slow when you use it frequently. This can affect > insertion performance > * random uuids are randomly distributed, which can be a problem for RocksDB, > for example - if most insertions went to the tail instead, overall write > performance could improve > There are interesting approaches in this particular document, we should take > a look at it: > https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/ -- This message was sent by Atlassian Jira (v8.20.7#820007)
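One of the formats from the referenced draft (UUIDv7: a 48-bit millisecond timestamp in the top bits, random elsewhere) can be sketched in plain Java. This is an illustration of the idea, not the format the ticket will necessarily adopt; the class name is made up.

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a UUIDv7-style id: the leading timestamp makes ids roughly
// time-ordered, so RocksDB insertions cluster near the tail of the
// keyspace instead of being scattered, and no SecureRandom is involved.
final class OrderedUuids {
    static UUID next() {
        long ts = System.currentTimeMillis() & 0xFFFFFFFFFFFFL; // 48-bit ms
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long msb = (ts << 16)                     // timestamp in top 48 bits
                 | 0x7000L                        // version nibble = 7
                 | (rnd.nextLong() & 0x0FFFL);    // 12 random bits
        long lsb = 0x8000000000000000L            // RFC 4122 variant bits
                 | (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL); // 62 random bits
        return new UUID(msb, lsb);
    }
}
```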
[jira] [Assigned] (IGNITE-16926) Interrupted compute job may fail a node
[ https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov reassigned IGNITE-16926: -- Assignee: Ivan Bessonov > Interrupted compute job may fail a node > --- > > Key: IGNITE-16926 > URL: https://issues.apache.org/jira/browse/IGNITE-16926 > Project: Ignite > Issue Type: Bug > Components: persistence >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > {code:java} > Critical system error detected. Will be handled accordingly to configured > handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, > super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet > [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], > failureCtx=FailureContext [type=CRITICAL_ERROR, err=class > o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is > corrupted [groupId=1234619879, pageIds=[7290201467513], > cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: > Row@79570772[ key: 1168930235, val: Data hidden due to > IGNITE_SENSITIVE_DATA_LOGGING flag. 
][ data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden > ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: > B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], > cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: > Row@79570772[ key: 1168930235, val: Data hidden due to > IGNITE_SENSITIVE_DATA_LOGGING flag. 
][ data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, > data hidden ]] at > org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432) > at > org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500) > at > org.apache.ignite.internal.processors.query.h2.opt.GridH
[jira] [Created] (IGNITE-16926) Interrupted compute job may fail a node
Ivan Bessonov created IGNITE-16926: -- Summary: Interrupted compute job may fail a node Key: IGNITE-16926 URL: https://issues.apache.org/jira/browse/IGNITE-16926 Project: Ignite Issue Type: Bug Components: persistence Reporter: Ivan Bessonov {code:java} Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: Row@79570772[ key: 1168930235, val: Data hidden due to IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, ... (repeated ~100 times) ] ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [... same message as above ...] at
org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432) at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500) at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.addToIndex(GridH2Table.java:880) at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:794) at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.store(IgniteH2Indexing.java:411) at org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:2546) at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridC
[jira] [Updated] (IGNITE-16915) ItClusterManagerTest#testNodeLeave is flaky
[ https://issues.apache.org/jira/browse/IGNITE-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16915: --- Description: https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds > ItClusterManagerTest#testNodeLeave is flaky > --- > > Key: IGNITE-16915 > URL: https://issues.apache.org/jira/browse/IGNITE-16915 > Project: Ignite > Issue Type: Bug >Reporter: Aleksandr Polovtcev >Assignee: Aleksandr Polovtcev >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha5 > > Time Spent: 20m > Remaining Estimate: 0h > > https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (IGNITE-16914) [Versioned Storage] Test and optimize prefixes in RocksDB
Ivan Bessonov created IGNITE-16914: -- Summary: [Versioned Storage] Test and optimize prefixes in RocksDB Key: IGNITE-16914 URL: https://issues.apache.org/jira/browse/IGNITE-16914 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov The main MV storage doesn't require any specific order of elements, so partition scans don't have to be totally ordered. If I understand correctly, this allows us to use the prefix functionality of RocksDB, extending prefixes to row ids, not only partition ids. In theory, this should noticeably increase the performance of single reads and, I guess, improve scan performance as well. Bloom filters and similar topics should be investigated here too. -- This message was sent by Atlassian Jira (v8.20.7#820007)
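One way to realize the idea above is to give every key a fixed-length layout of partition id followed by row id, so that a fixed-length prefix extractor and prefix bloom filters can serve point lookups. A minimal sketch (the class and method names are illustrative, not actual Ignite code):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical composite RocksDB key: a 2-byte partition id followed by a 16-byte row id. */
final class PartitionKeys {
    static final int KEY_SIZE = Short.BYTES + 2 * Long.BYTES;

    /** Encodes (partitionId, rowId) into a fixed-length key suitable for a fixed-length prefix extractor. */
    static byte[] encode(short partitionId, UUID rowId) {
        return ByteBuffer.allocate(KEY_SIZE)
                .putShort(partitionId)
                .putLong(rowId.getMostSignificantBits())
                .putLong(rowId.getLeastSignificantBits())
                .array();
    }

    /** Reads the partition id back from the first two bytes of the key. */
    static short partitionId(byte[] key) {
        return ByteBuffer.wrap(key).getShort();
    }
}
```

With such fixed-length keys, RocksJava's `ColumnFamilyOptions#useFixedLengthPrefixExtractor` could presumably be set to `KEY_SIZE` so that single reads benefit from prefix bloom filters over the whole (partition id, row id) pair, not just the partition id.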
[jira] [Created] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer
Ivan Bessonov created IGNITE-16913: -- Summary: Provide effective way to write BinaryRow into byte buffer Key: IGNITE-16913 URL: https://issues.apache.org/jira/browse/IGNITE-16913 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov The current API only allows writing a row into an OutputStream, which is not always convenient. For example, the RocksDB implementation required writing into a byte buffer. Creating an output stream on top of the buffer is not the best idea. -- This message was sent by Atlassian Jira (v8.20.7#820007)
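A minimal illustration of the proposed addition (the class and method names are hypothetical; the real BinaryRow interface is richer): the same row bytes exposed through both a stream write and a direct ByteBuffer write, so buffer-based callers avoid an intermediate stream:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical row holder demonstrating the two write paths. */
final class SimpleRow {
    private final byte[] bytes;

    SimpleRow(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Existing-style API: copies the row into an output stream. */
    void writeTo(OutputStream out) throws IOException {
        out.write(bytes);
    }

    /** Proposed-style API: copies the row directly into a byte buffer, no stream wrapper needed. */
    void writeTo(ByteBuffer buf) {
        buf.put(bytes);
    }
}
```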
[jira] [Created] (IGNITE-16912) Revisit UUID generation for RowId
Ivan Bessonov created IGNITE-16912: -- Summary: Revisit UUID generation for RowId Key: IGNITE-16912 URL: https://issues.apache.org/jira/browse/IGNITE-16912 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov The current implementation uses UUID.randomUUID, which comes with a set of problems: * some people say that you can't avoid collisions this way. Technically it's true, although I don't think it's a real problem * SecureRandom is slow when used frequently. This can affect insertion performance * random UUIDs are uniformly distributed, which can be a problem for RocksDB; if, for example, most insertions go to the tail of the keyspace instead, overall write performance can improve There are interesting approaches in this document, we should take a look at it: https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/ -- This message was sent by Atlassian Jira (v8.20.7#820007)
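The draft referenced above introduced the time-ordered UUIDv7 layout, which addresses both concerns: the high bits are a timestamp (so keys are mostly sequential) and the random bits can come from a fast non-secure source. A rough sketch in that spirit (an illustration only, not a full RFC-conformant implementation; assumes ThreadLocalRandom's collision risk is acceptable, as the description argues):

```java
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of a UUIDv7-style generator: millisecond timestamp in the high bits, fast randomness elsewhere. */
final class TimeOrderedUuid {
    static UUID next() {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long millis = System.currentTimeMillis();
        // 48 bits of timestamp, then the 4-bit version (0b0111 = 7), then 12 random bits.
        long msb = (millis << 16) | 0x7000L | (rnd.nextLong() & 0x0FFFL);
        // IETF variant bits (0b10) followed by 62 random bits.
        long lsb = 0x8000000000000000L | (rnd.nextLong() & 0x3FFFFFFFFFFFFFFFL);
        return new UUID(msb, lsb);
    }
}
```

UUIDs produced this way sort roughly by creation time, so inserts land near the tail of the RocksDB keyspace instead of being scattered uniformly.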
[jira] [Commented] (IGNITE-15734) Erroneous string formatting while changing cluster tag.
[ https://issues.apache.org/jira/browse/IGNITE-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524157#comment-17524157 ] Ivan Bessonov commented on IGNITE-15734: [~zstan] done, thank you for the fix! > Erroneous string formatting while changing cluster tag. > --- > > Key: IGNITE-15734 > URL: https://issues.apache.org/jira/browse/IGNITE-15734 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.11 >Reporter: Evgeny Stanilovsky >Assignee: Evgeny Stanilovsky >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > {noformat} > org.apache.ignite.internal.processors.cluster.ClusterProcessor#onReadyForRead > ... > log.info( > "Cluster tag will be set to new value: " + > newVal != null ? newVal.tag() : "null" + > ", previous value was: " + > oldVal != null ? oldVal.tag() : "null"); > {noformat} > without braces > {noformat} > "Cluster tag will be set to new value: " + newVal > {noformat} > always not null; -- This message was sent by Atlassian Jira (v8.20.1#820001)
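The quoted bug comes down to operator precedence: `+` binds tighter than `!=`, so the null check applies to the already-concatenated string rather than the value. A standalone demonstration (method names invented for illustration; only the first half of the original message is reproduced in the buggy variant):

```java
/** Demonstrates the precedence bug from IGNITE-15734 and its fix. */
final class TagMessage {
    /** Buggy formatting: parses as (("..." + newTag) != null) ? newTag : "null",
     *  so the condition is always true and the message text is silently discarded. */
    static String buggy(String newTag) {
        return "Cluster tag will be set to new value: " + newTag != null ? newTag : "null";
    }

    /** Fixed formatting: parentheses make the null checks apply to the values themselves. */
    static String fixed(String newTag, String oldTag) {
        return "Cluster tag will be set to new value: "
                + (newTag != null ? newTag : "null")
                + ", previous value was: "
                + (oldTag != null ? oldTag : "null");
    }
}
```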
[jira] [Assigned] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples
[ https://issues.apache.org/jira/browse/IGNITE-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov reassigned IGNITE-16848: -- Assignee: Ivan Bessonov > [Versioned Storage] Provide common interface for abstract internal tuples > - > > Key: IGNITE-16848 > URL: https://issues.apache.org/jira/browse/IGNITE-16848 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: iep-74, ignite-3 > Fix For: 3.0.0-alpha5 > > Time Spent: 10m > Remaining Estimate: 0h > > Methods from class "Row" should be extracted to provide a generic tuple API to > components like SQL indexes or MV storage. > The tuple is NOT schema-aware and should NOT have methods like "Object value(int col)", > because it represents a basic blob with little to no meta information -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples
Ivan Bessonov created IGNITE-16848: -- Summary: [Versioned Storage] Provide common interface for abstract internal tuples Key: IGNITE-16848 URL: https://issues.apache.org/jira/browse/IGNITE-16848 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Fix For: 3.0.0-alpha5 Methods from class "Row" should be extracted to provide a generic tuple API to components like SQL indexes or MV storage. The tuple is NOT schema-aware and should NOT have methods like "Object value(int col)", because it represents a basic blob with little to no meta information -- This message was sent by Atlassian Jira (v8.20.1#820001)
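The shape of such a schema-agnostic tuple could look like the following sketch (all names are hypothetical, not the actual Ignite interface): raw binary accessors only, with offsets derived from an external schema rather than schema-aware column getters:

```java
import java.nio.ByteBuffer;

/** Hypothetical schema-agnostic tuple interface extracted from Row: no "Object value(int col)". */
interface InternalTupleSketch {
    /** The underlying binary blob of the tuple. */
    ByteBuffer byteBuffer();

    /** Reads a long at the given offset; the caller computes offsets from an external schema. */
    long longValue(int off);
}

/** Trivial heap-backed implementation for illustration. */
final class HeapTuple implements InternalTupleSketch {
    private final ByteBuffer buf;

    HeapTuple(byte[] bytes) {
        this.buf = ByteBuffer.wrap(bytes);
    }

    @Override
    public ByteBuffer byteBuffer() {
        return buf;
    }

    @Override
    public long longValue(int off) {
        return buf.getLong(off);
    }
}
```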
[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage
[ https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16611: --- Labels: iep-74 ignite-3 (was: ignite-3) > [Versioned Storage] Version chain data structure for RocksDB-based storage > --- > > Key: IGNITE-16611 > URL: https://issues.apache.org/jira/browse/IGNITE-16611 > Project: Ignite > Issue Type: Task > Components: persistence >Reporter: Sergey Chugunov >Assignee: Ivan Bessonov >Priority: Major > Labels: iep-74, ignite-3 > > To support Concurrency Control and implement effective transactions, the > capability to store multiple values for the same key is needed in the existing > storage. > h3. Version chain > The key component here is a special data structure called a version chain: it is a > list of all versions of a particular key, with the most recent version at the > beginning (HEAD). > Each entry in the chain contains a value, a reference to the next entry in the > list, begin and end timestamps, and the id of the active transaction that created this > version. > There are at least two approaches to implementing this structure on top of > RocksDB: > * Combine the original key and version into a new key which is put into a RocksDB > tree. In that case, to restore the version chain we need to iterate over the tree > using the original key as a prefix. > * Use the original key as-is, but make it point not to the value directly but > to an array containing version and other meta information (ts, id, etc.) and > keys in some secondary tree. > h3. New API to manage versions > The following new API should be implemented to provide access to the version > chain: > * Methods to manipulate versions: add a new version to the chain, commit an > uncommitted version, abort an uncommitted version. > * A method to clean up old versions from the chain. > * A method to scan over keys up to a provided timestamp. -- This message was sent by Atlassian Jira (v8.20.1#820001)
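The first approach above (combining the original key and the version into one RocksDB key) can be sketched as follows; inverting the timestamp bits makes newer versions sort first under RocksDB's unsigned lexicographic byte order, so the HEAD of a chain is the first key found when iterating with the row-key prefix (all names here are illustrative, not actual Ignite code):

```java
import java.nio.ByteBuffer;

/** Hypothetical composite key for a version chain: row key followed by an inverted timestamp. */
final class VersionChainKeys {
    /** Appends the bit-inverted timestamp so larger timestamps yield lexicographically smaller keys. */
    static byte[] encode(byte[] rowKey, long timestamp) {
        return ByteBuffer.allocate(rowKey.length + Long.BYTES)
                .put(rowKey)
                .putLong(~timestamp) // inversion: newest version sorts first within the prefix
                .array();
    }

    /** Recovers the timestamp from the last 8 bytes of the composite key. */
    static long timestamp(byte[] key) {
        return ~ByteBuffer.wrap(key, key.length - Long.BYTES, Long.BYTES).getLong();
    }
}
```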
[jira] [Commented] (IGNITE-16792) Configuration for Default Storage Engine
[ https://issues.apache.org/jira/browse/IGNITE-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520506#comment-17520506 ] Ivan Bessonov commented on IGNITE-16792: [~ktkale...@gridgain.com] looks good to me, I'll merge it to main. Thank you! > Configuration for Default Storage Engine > > > Key: IGNITE-16792 > URL: https://issues.apache.org/jira/browse/IGNITE-16792 > Project: Ignite > Issue Type: Task > Components: persistence >Reporter: Sergey Chugunov >Assignee: Kirill Tkalenko >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha5 > > Time Spent: 3.5h > Remaining Estimate: 0h > > The pluggable storage concept enables users to set up different storage engines > (SE) on the same node, e.g. for performance reasons; each table can be hosted > by only one storage. > From the DDL point of view, an SE is specified as part of the CREATE TABLE command. But in the case of > only one SE, and in some other cases, specifying it for each table > creates a lot of unnecessary boilerplate code. > To address this and free the user from writing exactly the same code, a > cluster-wide setting *defaultStorageEngine* should be introduced. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-16796) Rename is broken in configuration & other minor issues
[ https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16796: --- Summary: Rename is broken in configuration & other minor issues (was: Rename is broken in configuration) > Rename is broken in configuration & other minor issues > -- > > Key: IGNITE-16796 > URL: https://issues.apache.org/jira/browse/IGNITE-16796 > Project: Ignite > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha5 > > > Rename changes "name" field in an immutable object, this shouldn't happen. > > There are also few more issues that I'd like to address: > * configuration values serialization wouldn't work for string with non-ascii > characters because of wrong "size" calculation > * signatures of ConfigurationNotificationEvent#config and > ConfigurationNotificationEvent#name are flawed and need to be refined a bit > * InjectName is not used where it needs to be used -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-16796) Rename is broken in configuration
[ https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16796: --- Description: Rename changes "name" field in an immutable object, this shouldn't happen. There are also few more issues that I'd like to address: * configuration values serialization wouldn't work for string with non-ascii characters because of wrong "size" calculation * signatures of ConfigurationNotificationEvent#config and ConfigurationNotificationEvent#name are flawed and need to be refined a bit * InjectName is not used where it needs to be used was: Rename changes "name" field in an immutable object, this shouldn't happen. There are also few more issues that I'd like to address: * configuration values serialization wouldn't work for string with non-ascii characters because of wrong "size" calculation * signatures of ConfigurationNotificationEvent#config and ConfigurationNotificationEvent#name are flawed and need to be refined a bit > Rename is broken in configuration > - > > Key: IGNITE-16796 > URL: https://issues.apache.org/jira/browse/IGNITE-16796 > Project: Ignite > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha5 > > > Rename changes "name" field in an immutable object, this shouldn't happen. > > There are also few more issues that I'd like to address: > * configuration values serialization wouldn't work for string with non-ascii > characters because of wrong "size" calculation > * signatures of ConfigurationNotificationEvent#config and > ConfigurationNotificationEvent#name are flawed and need to be refined a bit > * InjectName is not used where it needs to be used -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-16796) Rename is broken in configuration
[ https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16796: --- Description: Rename changes "name" field in an immutable object, this shouldn't happen. There are also few more issues that I'd like to address: * configuration values serialization wouldn't work for string with non-ascii characters because of wrong "size" calculation * signatures of ConfigurationNotificationEvent#config and ConfigurationNotificationEvent#name are flawed and need to be refined a bit > Rename is broken in configuration > - > > Key: IGNITE-16796 > URL: https://issues.apache.org/jira/browse/IGNITE-16796 > Project: Ignite > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-alpha5 > > > Rename changes "name" field in an immutable object, this shouldn't happen. > > There are also few more issues that I'd like to address: > * configuration values serialization wouldn't work for string with non-ascii > characters because of wrong "size" calculation > * signatures of ConfigurationNotificationEvent#config and > ConfigurationNotificationEvent#name are flawed and need to be refined a bit -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (IGNITE-16796) Rename is broken in configuration
Ivan Bessonov created IGNITE-16796: -- Summary: Rename is broken in configuration Key: IGNITE-16796 URL: https://issues.apache.org/jira/browse/IGNITE-16796 Project: Ignite Issue Type: Bug Affects Versions: 3.0.0-alpha4 Reporter: Ivan Bessonov Assignee: Ivan Bessonov Fix For: 3.0.0-alpha5 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-14931) Define common error scopes and prefix
[ https://issues.apache.org/jira/browse/IGNITE-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14931: --- Labels: iep-84 ignite-3 (was: ignite-3) > Define common error scopes and prefix > - > > Key: IGNITE-14931 > URL: https://issues.apache.org/jira/browse/IGNITE-14931 > Project: Ignite > Issue Type: Sub-task >Reporter: Vyacheslav Koptilin >Priority: Major > Labels: iep-84, ignite-3 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface
Ivan Bessonov created IGNITE-16704: -- Summary: Remove unnecessary methods from BinaryRow interface Key: IGNITE-16704 URL: https://issues.apache.org/jira/browse/IGNITE-16704 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Assignee: Ivan Bessonov Fix For: 3.0.0-alpha5 The current interface has several read* methods that are only used in the implementation. I propose deleting them; this will simplify writing new implementations of the interface. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface
[ https://issues.apache.org/jira/browse/IGNITE-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16704: --- Labels: iep-54 ignite-3 (was: ignite-3) > Remove unnecessary methods from BinaryRow interface > --- > > Key: IGNITE-16704 > URL: https://issues.apache.org/jira/browse/IGNITE-16704 > Project: Ignite > Issue Type: Improvement >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: iep-54, ignite-3 > Fix For: 3.0.0-alpha5 > > > Current interface has several read* methods that are only used in > implementation. I propose deleting them, this will simplify making new > implementations of the interface. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (IGNITE-16697) [Versioned Storage] POC - add methods for versioned data storage
Ivan Bessonov created IGNITE-16697: -- Summary: [Versioned Storage] POC - add methods for versioned data storage Key: IGNITE-16697 URL: https://issues.apache.org/jira/browse/IGNITE-16697 Project: Ignite Issue Type: Improvement Reporter: Ivan Bessonov Assignee: Ivan Bessonov Fix For: 3.0.0-alpha5 As a first step towards the MV storage in Ignite 3.0, it's required to have specific methods on the partition storage and index storage interfaces. These will replace the currently available VersionedRowStore, which was a prototype in itself and doesn't correspond to the desired functionality. Partition storage needs: * addWrite(k, v, txId) * commitWrite(k, ts) * abortWrite(k) * read(k, ts) * scan(ts, {_}tbd{_}) * cleanup({_}tbd{_}) Sorted index storage needs: * scan(lower, upper, bounds_options, projection, partition_filter, ts) Index updates will be hidden inside the {*}addWrite{*}, *abortWrite* and *cleanup* methods. No external "update" and "remove" are required. This particular issue is a precursor for https://issues.apache.org/jira/browse/IGNITE-16611. A reference implementation is also required; it'll provide an example of what's expected from the storage and a set of tests to fix the methods' contracts. -- This message was sent by Atlassian Jira (v8.20.1#820001)
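The proposed contract can be illustrated with a toy in-memory version-chain storage (purely a sketch of the intended semantics; all names are hypothetical, and scan/cleanup are omitted). An uncommitted write sits at the HEAD of the chain, invisible to timestamp reads until commitWrite stamps it:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy in-memory sketch of the proposed MV partition storage contract. */
final class MvPartitionSketch {
    /** One entry of a version chain: uncommitted (txId set) or committed (timestamp set). */
    private static final class Version {
        final byte[] value;   // null would encode a tombstone
        UUID txId;            // non-null while the write is uncommitted
        long timestamp = -1;  // commit timestamp, assigned by commitWrite

        Version(byte[] value, UUID txId) {
            this.value = value;
            this.txId = txId;
        }
    }

    /** The HEAD of each chain is the first element of the deque. */
    private final Map<String, Deque<Version>> chains = new HashMap<>();

    /** Adds an uncommitted write on top of the chain. */
    void addWrite(String key, byte[] value, UUID txId) {
        chains.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(new Version(value, txId));
    }

    /** Commits the pending write with the given timestamp. */
    void commitWrite(String key, long timestamp) {
        Version head = chains.get(key).peekFirst();
        head.txId = null;
        head.timestamp = timestamp;
    }

    /** Discards the pending write. */
    void abortWrite(String key) {
        chains.get(key).pollFirst();
    }

    /** Reads the newest version committed at or before the given timestamp. */
    byte[] read(String key, long timestamp) {
        for (Version v : chains.getOrDefault(key, new ArrayDeque<>())) {
            if (v.txId == null && v.timestamp <= timestamp) {
                return v.value;
            }
        }
        return null;
    }
}
```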
[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage
[ https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-16611: --- Summary: [Versioned Storage] Version chain data structure for RocksDB-based storage (was: [Versioned Storage] POC - Version chain data structure for RocksDB-based storage) > [Versioned Storage] Version chain data structure for RocksDB-based storage > --- > > Key: IGNITE-16611 > URL: https://issues.apache.org/jira/browse/IGNITE-16611 > Project: Ignite > Issue Type: Task > Components: persistence >Reporter: Sergey Chugunov >Assignee: Ivan Bessonov >Priority: Major > Labels: ignite-3 > > To support Concurrency Control and implement effective transactions, the > capability to store multiple values for the same key is needed in the existing > storage. > h3. Version chain > The key component here is a special data structure called a version chain: it is a > list of all versions of a particular key, with the most recent version at the > beginning (HEAD). > Each entry in the chain contains a value, a reference to the next entry in the > list, begin and end timestamps, and the id of the active transaction that created this > version. > There are at least two approaches to implementing this structure on top of > RocksDB: > * Combine the original key and version into a new key which is put into a RocksDB > tree. In that case, to restore the version chain we need to iterate over the tree > using the original key as a prefix. > * Use the original key as-is, but make it point not to the value directly but > to an array containing version and other meta information (ts, id, etc.) and > keys in some secondary tree. > h3. New API to manage versions > The following new API should be implemented to provide access to the version > chain: > * Methods to manipulate versions: add a new version to the chain, commit an > uncommitted version, abort an uncommitted version. > * A method to clean up old versions from the chain. > * A method to scan over keys up to a provided timestamp. 
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (IGNITE-14611) Implement error handling for public API based on error codes
[ https://issues.apache.org/jira/browse/IGNITE-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Bessonov updated IGNITE-14611: --- Labels: iep-84 ignite-3 (was: ignite-3) > Implement error handling for public API based on error codes > > > Key: IGNITE-14611 > URL: https://issues.apache.org/jira/browse/IGNITE-14611 > Project: Ignite > Issue Type: Task >Reporter: Alexey Scherbakov >Priority: Major > Labels: iep-84, ignite-3 > Fix For: 3.0 > > > Dev list discussion [1] > [1] > http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Error-handling-in-Ignite-3-td52269.html -- This message was sent by Atlassian Jira (v8.20.1#820001)