[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-07-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

This immediately creates a write amplification issue, on top of possible 
performance degradation. The obvious solution is inherently bad and needs to be 
improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This 
effectively breaks the RocksDB recovery procedure, so we need to take measures 
to compensate.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch that's applied on lock release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in them or a proper way of implementing them, considering how 
spread out in time the scan process is.
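
A minimal sketch of this scheme, assuming hypothetical method and class names 
({{acquireConsistencyLock()}}, {{releaseConsistencyLock()}}, the storage class 
itself); the RocksDB calls (setDisableWAL, setAtomicFlush, WriteBatchWithIndex, 
WriteBatchInterface#count()) are real API:
{code:java}
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatchWithIndex;
import org.rocksdb.WriteOptions;

class ConsistencyLockSketch {
    /** Atomic flush keeps all column families consistent without a WAL. */
    private final DBOptions dbOptions = new DBOptions()
            .setCreateIfMissing(true)
            .setAtomicFlush(true);

    /** WAL is disabled for every write; recovery relies on flushed SSTs only. */
    private final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);

    /** Pending updates, one batch per thread. */
    private final ThreadLocal<WriteBatchWithIndex> threadBatch = new ThreadLocal<>();

    private RocksDB db; // Initialized elsewhere.

    void acquireConsistencyLock() {
        threadBatch.set(new WriteBatchWithIndex());
    }

    void releaseConsistencyLock() throws RocksDBException {
        WriteBatchWithIndex batch = threadBatch.get();

        try {
            db.write(writeOptions, batch); // Applies updates to all CFs atomically.
        } finally {
            threadBatch.remove();
            batch.close();
        }
    }

    void ensureNoUnappliedBatch() {
        WriteBatchWithIndex batch = threadBatch.get();

        // Scans with unapplied batches are prohibited for now.
        if (batch != null && batch.count() > 0)
            throw new IllegalStateException("Scan with a pending write batch");
    }
}
{code}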
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed 
index is more challenging: I propose caching it and updating it only from the 
closure that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex value 
into a memtable every time. This makes it harder to find persistedIndex values 
for partitions. Gladly, considering the events that we have, during the time 
between the first "completed" and the very next "begin" the state on disk is 
fully consistent. And there's a way to read data from the storage avoiding the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER), as shown in the 
sketch below.
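
A hedged sketch of such a read; the meta column family handle and the key name 
are assumptions, while ReadTier.PERSISTED_TIER is real RocksDB API:
{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ReadOptions;
import org.rocksdb.ReadTier;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class PersistedIndexReader {
    /** Hypothetical meta key under which the update index is stored. */
    private static final byte[] UPDATE_INDEX_KEY = "updateIndex".getBytes(StandardCharsets.UTF_8);

    /** Reads the update index from SST files only, ignoring all memtables. */
    static long readPersistedUpdateIndex(RocksDB db, ColumnFamilyHandle metaCf) throws RocksDBException {
        try (ReadOptions opts = new ReadOptions().setReadTier(ReadTier.PERSISTED_TIER)) {
            byte[] value = db.get(metaCf, opts, UPDATE_INDEX_KEY);

            return value == null ? 0L : ByteBuffer.wrap(value).getLong();
        }
    }
}
{code}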

Summarizing all of the above, we should implement the following protocol:
{code:java}
During table start: read the latest values of update indexes. Store them in an
in-memory structure.
Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously, so as not to stall 
the flushing process. We don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.
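
A hedged sketch of this protocol as a RocksDB event listener. 
AbstractEventListener, FlushJobInfo and DBOptions#setListeners are real API; 
the single-threaded async read and the empty readUpdateIndexesFromDisk() body 
are placeholders for the actual components:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.rocksdb.AbstractEventListener;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.RocksDB;

class FlushTrackingListener extends AbstractEventListener {
    private enum EventType { ON_FLUSH_BEGIN, ON_FLUSH_COMPLETED }

    /** Alternation of events, updated only from RocksDB flush threads. */
    private volatile EventType lastEventType = EventType.ON_FLUSH_COMPLETED;

    /** Future of the last asynchronous "read indexes from disk" task. */
    private volatile CompletableFuture<Void> lastIndexRead = CompletableFuture.completedFuture(null);

    private final ExecutorService asyncReader = Executors.newSingleThreadExecutor();

    @Override
    public void onFlushBegin(RocksDB db, FlushJobInfo flushJobInfo) {
        if (lastEventType == EventType.ON_FLUSH_BEGIN)
            return;

        // Don't let a new flush start until the previous persisted state is read.
        lastIndexRead.join();

        lastEventType = EventType.ON_FLUSH_BEGIN;
    }

    @Override
    public void onFlushCompleted(RocksDB db, FlushJobInfo flushJobInfo) {
        if (lastEventType == EventType.ON_FLUSH_COMPLETED)
            return;

        // Read asynchronously: RocksDB holds internal locks in this callback.
        lastIndexRead = CompletableFuture.runAsync(() -> readUpdateIndexesFromDisk(db), asyncReader);

        lastEventType = EventType.ON_FLUSH_COMPLETED;
    }

    private void readUpdateIndexesFromDisk(RocksDB db) {
        // Read with ReadTier.PERSISTED_TIER and invoke the persistedIndex closures.
    }
}
{code}
The listener would be registered via {{DBOptions#setListeners}} on the same 
options object that has atomic flush enabled.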

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to the sequence numbers 
of the CFs, and the "default" CF is always created first. This is the exact CF 
that we use to store meta. Maybe we're going to change this and create a 
separate meta CF. Only then could we start optimizing this part, and only if 
we have actual proof that there's a stall in this exact place.
h3. Types of storages

RocksDB is used for:
 * tables
 * cluster management
 * meta-storage

All these types should use the same recovery procedure, but the code is located 
in different places. I hope that it won't be a big problem and we can do 
everything at once.

  was:
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

Immediately we have a write amplification issue, on top 

[jira] [Created] (IGNITE-17310) Integrate IndexStorage into a TableStorage API

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17310:
--

 Summary: Integrate IndexStorage into a TableStorage API
 Key: IGNITE-17310
 URL: https://issues.apache.org/jira/browse/IGNITE-17310
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


As an endpoint, we need an interface that represents a single index storage for 
a single partition. But creating/destroying these storages is not as obvious 
from an API standpoint.

When an index is created, storages should be created for every existing 
partition. And when a partition is created, index storages should be created 
for it as well. This complicates things a little bit, but, generally speaking, 
something like this could be a solution (see the sketch below):
 * CompletableFuture createIndex(indexConfiguration);
 * CompletableFuture dropIndex(indexId);
 * IndexMvStorage getIndexStorage(indexId, partitionId);

Build / rebuild API will be figured out later in another issue.
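
A hedged sketch of how these methods might look in the TableStorage API; the 
parameter types and placeholder interfaces are assumptions based on the list 
above, not the final contract:
{code:java}
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

/** Placeholder types, assumed for the sketch. */
interface IndexConfiguration { }
interface IndexMvStorage { }

/** Index-related part of a TableStorage API. */
interface TableStorageIndexes {
    /** Creates index storages for every existing partition of the table. */
    CompletableFuture<Void> createIndex(IndexConfiguration indexConfiguration);

    /** Destroys the index storages in all partitions. */
    CompletableFuture<Void> dropIndex(UUID indexId);

    /** Returns the index storage of a single partition. */
    IndexMvStorage getIndexStorage(UUID indexId, int partitionId);
}
{code}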



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan (see the sketch after this list).
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.
 * New methods, like {{update}} and {{remove}}, should be added to the API.
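
A hedged sketch of the revised contract; all type names here are local 
stand-ins, and the exact method shapes are assumptions drawn from the bullets 
above:
{code:java}
import java.util.Iterator;

/** Local stand-ins so the sketch is self-contained. */
interface InternalTuple { }
interface RowId { }
interface Cursor<T> extends Iterator<T>, AutoCloseable { }

/** A pair of an indexed tuple and the RowId it points to; no validation here. */
interface IndexRow {
    InternalTuple tuple();

    RowId rowId();
}

/** Revised contract: no row validation, no TxId/Timestamp, no partition filter. */
interface SortedIndexMvStorageSketch {
    /** Returns indexed tuples and RowIds between the bounds; rows are filtered outside. */
    Cursor<IndexRow> scan(InternalTuple lowerBound, InternalTuple upperBound);

    void update(IndexRow oldRow, IndexRow newRow);

    void remove(IndexRow row);
}
{code}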

h3. New API for removed functions
 * There should be a new entity on top of the partition and index stores. It 
updates indexes and filters scan queries. There's no point in fully designing 
it right now; all we need is working tests. Porting the current tests to the 
new API is up to the developer.

h3. Other

I would say that efficient InternalTuple comparison is out of scope. We could 
just adapt the current test code somehow.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{InternalTuple}}, with the requirement that every 
> internal tuple can be converted into an IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. An index scan should NOT by itself filter out invalid rows; this 
> will be performed outside of the scan.
>  * TxId / Timestamp parameters are no longer applicable, given that the 
> index does not perform row validation.
>  * The partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. The former can be brought back in the future, while the latter makes no 
> sense considering that indexes are not multiversioned.
>  * New methods, like {{update}} and {{remove}}, should be added to the API.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now. Porting current tests 

[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{InternalTuple}}, with the requirement that every 
> internal tuple can be converted into an IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. An index scan should NOT by itself filter out invalid rows; this 
> will be performed outside of the scan.
>  * TxId / Timestamp parameters are no longer applicable, given that the 
> index does not perform row validation.
>  * The partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. The former can be brought back in the future, while the latter makes no 
> sense considering that indexes are not multiversioned.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17308:
--

 Summary: Revisit SortedIndexMvStorage interface
 Key: IGNITE-17308
 URL: https://issues.apache.org/jira/browse/IGNITE-17308
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16156) Byte ordered index keys.

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16156.

Resolution: Won't Fix

A different data format will be used.

> Byte ordered index keys.
> 
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Reporter: Alexander Belyak
>Assignee: Alexander Belyak
>Priority: Major
>  Labels: ignite-3
>
> To improve the speed of operations with indexes, Ignite can store keys in a 
> byte-ordered format, so a natural byte[] comparator will be enough to scan 
> them.
> Required features:
> 1) Write (almost) any data types.
> Must have: boolean, byte, short, int, long, float, double, bigint, 
> bigdecimal, String, Date, Time, DateTime.
> Like to have: byte[], bitset
> Unlikely to have: timestamp with timezone
> 2) Support null values for any columns. Like to have: support 
> nullFirst/nullLast
> 3) Write asc/desc ordering (in any combination for columns, for indexes like 
> "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space used and speed.
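
As an illustration of the byte-ordered idea (a hedged sketch, not code from 
the ticket): flipping the sign bit makes two's-complement ints compare 
correctly under an unsigned lexicographic byte comparator, and inverting all 
bits covers "desc" columns:
{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ByteOrderedKeys {
    /** Flipping the sign bit maps int order onto unsigned byte order. */
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value ^ Integer.MIN_VALUE).array();
    }

    /** Inverting all bits reverses the order, which covers "desc" columns. */
    static byte[] descending(byte[] key) {
        byte[] inverted = key.clone();

        for (int i = 0; i < inverted.length; i++)
            inverted[i] = (byte) ~inverted[i];

        return inverted;
    }

    public static void main(String[] args) {
        // -5 < 3 numerically, and so are the encoded keys byte-wise.
        System.out.println(Arrays.compareUnsigned(encodeInt(-5), encodeInt(3)) < 0); // true
    }
}
{code}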



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16105) Replace sorted index binary storage protocol

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16105.

Resolution: Won't Fix

IGNITE-17192 will be used instead

> Replace sorted index binary storage protocol
> 
>
> Key: IGNITE-16105
> URL: https://issues.apache.org/jira/browse/IGNITE-16105
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column 
> values into byte arrays. This approach is not optimal for the following 
> reasons:
> # Data is stored in RocksDB and we can't use its native lexicographic 
> comparator; we rely on a custom Java-based comparator that needs to 
> de-serialize all columns in order to compare them. This is bad 
> performance-wise, because Java-based comparators are slower and we need to 
> extract all column values;
> # Range scans can't use the prefix seek operation from RocksDB, because 
> {{BinaryRow}} serialization is not stable: a serialized prefix of column 
> values will not be a prefix of the whole serialized row, because the format 
> depends on the columns being serialized;
> # {{BinaryRow}} serialization is designed to store versioned row data and is 
> overall badly suited to the Sorted Index purposes; its API usage looks 
> awkward in this context.
> We need to find a new serialization protocol that will (ideally) satisfy the 
> following requirements:
> # It should be comparable lexicographically;
> # It should support null values;
> # It should support variable length columns (though this requirement can 
> probably be dropped);
> # It should support both ascending and descending order for individual 
> columns;
> # It should support all data types that {{BinaryRow}} uses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16079) Rename search and data keys for the Partition Storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16079.

Resolution: Won't Fix

> Rename search and data keys for the Partition Storage
> -
>
> Key: IGNITE-16079
> URL: https://issues.apache.org/jira/browse/IGNITE-16079
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> There are currently the following classes in the {{PartitionStorage}} that 
> act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the 
> {{SortedIndexStorage}} interface hard to understand, because it stores 
> {{SearchRows}} as values. It is proposed to rename these classes:
>  {{SearchRow}} -> {{PartitionKey}}
>  {{DataRow}} -> {{PartitionData}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16059.

Resolution: Won't Fix

> Add options to the "range" method in SortedIndexStorage
> ---
>
> Key: IGNITE-16059
> URL: https://issues.apache.org/jira/browse/IGNITE-16059
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage]
>  declares the following API for the {{SortedIndexStorage#range}} method:
> {code:java}
> /** Exclude lower bound. */
> byte GREATER = 0;
>  
> /** Include lower bound. */
> byte GREATER_OR_EQUAL = 1;
>  
> /** Exclude upper bound. */
> byte LESS = 0;
>  
> /** Include upper bound. */
> byte LESS_OR_EQUAL = 1 << 1;
> /**
>  * Return rows between lower and upper bounds.
>  * Fill result rows with the fields specified in the projection set.
>  *
>  * @param low Lower bound of the scan.
>  * @param up Upper bound of the scan.
>  * @param scanBoundMask Scan bound mask (specify how to work with rows 
> equals to the bounds: include or exclude).
>  * @param proj Set of the columns IDs to fill results rows.
>  */
> Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj);
> {code}
> The {{scanBoundMask}} flags are currently not implemented. This API should be 
> revised and implemented, if needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17306:
---
Description: 
There are a few places in presto that are too slow; we can easily optimize them

(Nothing will be committed if there's no visible difference in test duration)

  was:There are a few places in presto that are too slow; we can easily 
optimize them


> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them
> (Nothing will be committed if there's no visible difference in test duration)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17306:
--

Assignee: Ivan Bessonov

> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17306:
--

 Summary: Speedup runtime classes compilation speed for 
configuration
 Key: IGNITE-17306
 URL: https://issues.apache.org/jira/browse/IGNITE-17306
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


There are a few places in presto that are too slow; we can easily optimize them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-15908) Investigate index binary structure compatibility

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-15908:
---
Epic Link: IGNITE-17304

> Investigate index binary structure compatibility
> 
>
> Key: IGNITE-15908
> URL: https://issues.apache.org/jira/browse/IGNITE-15908
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage has a binary storage format that is subject to change in 
> the future. Though the index schema is immutable and any change to it leads to 
> the index being rebuilt, it should be possible to update the storage format 
> without rebuilding. This means that there should be some kind of versioning 
> mechanism, so that the {{IndexKey}} serialization format can be changed in a 
> backwards-compatible way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16059:
---
Epic Link: IGNITE-17304

> Add options to the "range" method in SortedIndexStorage
> ---
>
> Key: IGNITE-16059
> URL: https://issues.apache.org/jira/browse/IGNITE-16059
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage]
>  declares the following API for the {{SortedIndexStorage#range}} method:
> {code:java}
> /** Exclude lower bound. */
> byte GREATER = 0;
>  
> /** Include lower bound. */
> byte GREATER_OR_EQUAL = 1;
>  
> /** Exclude upper bound. */
> byte LESS = 0;
>  
> /** Include upper bound. */
> byte LESS_OR_EQUAL = 1 << 1;
> /**
>  * Return rows between lower and upper bounds.
>  * Fill result rows with the fields specified in the projection set.
>  *
>  * @param low Lower bound of the scan.
>  * @param up Upper bound of the scan.
>  * @param scanBoundMask Scan bound mask (specify how to work with rows 
> equals to the bounds: include or exclude).
>  * @param proj Set of the columns IDs to fill results rows.
>  */
> Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj);
> {code}
> The {{scanBoundMask}} flags are currently not implemented. This API should be 
> revised and implemented, if needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16079) Rename search and data keys for the Partition Storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16079:
---
Epic Link: IGNITE-17304

> Rename search and data keys for the Partition Storage
> -
>
> Key: IGNITE-16079
> URL: https://issues.apache.org/jira/browse/IGNITE-16079
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> There are currently the following classes in the {{PartitionStorage}} that 
> act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the 
> {{SortedIndexStorage}} interface hard to understand, because it stores 
> {{SearchRows}} as values. It is proposed to rename these classes:
>  {{SearchRow}} -> {{PartitionKey}}
>  {{DataRow}} -> {{PartitionData}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16105) Replace sorted index binary storage protocol

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16105:
---
Epic Link: IGNITE-17304

> Replace sorted index binary storage protocol
> 
>
> Key: IGNITE-16105
> URL: https://issues.apache.org/jira/browse/IGNITE-16105
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column 
> values into byte arrays. This approach is not optimal for the following 
> reasons:
> # Data is stored in RocksDB and we can't use its native lexicographic 
> comparator; we rely on a custom Java-based comparator that needs to 
> de-serialize all columns in order to compare them. This is bad 
> performance-wise, because Java-based comparators are slower and we need to 
> extract all column values;
> # Range scans can't use the prefix seek operation from RocksDB, because 
> {{BinaryRow}} serialization is not stable: a serialized prefix of column 
> values will not be a prefix of the whole serialized row, because the format 
> depends on the columns being serialized;
> # {{BinaryRow}} serialization is designed to store versioned row data and is 
> overall badly suited to the Sorted Index purposes; its API usage looks 
> awkward in this context.
> We need to find a new serialization protocol that will (ideally) satisfy the 
> following requirements:
> # It should be comparable lexicographically;
> # It should support null values;
> # It should support variable length columns (though this requirement can 
> probably be dropped);
> # It should support both ascending and descending order for individual 
> columns;
> # It should support all data types that {{BinaryRow}} uses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16156) Byte ordered index keys.

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16156:
---
Epic Link: IGNITE-17304

> Byte ordered index keys.
> 
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Reporter: Alexander Belyak
>Assignee: Alexander Belyak
>Priority: Major
>  Labels: ignite-3
>
> To improve the speed of operations with indexes, Ignite can store keys in a 
> byte-ordered format, so a natural byte[] comparator will be enough to scan 
> them.
> Required features:
> 1) Write (almost) any data types.
> Must have: boolean, byte, short, int, long, float, double, bigint, 
> bigdecimal, String, Date, Time, DateTime.
> Like to have: byte[], bitset
> Unlikely to have: timestamp with timezone
> 2) Support null values for any columns. Like to have: support 
> nullFirst/nullLast
> 3) Write asc/desc ordering (in any combination for columns, for indexes like 
> "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space used and speed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14937) Index schema & Index management integration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14937:
---
Epic Link: IGNITE-17304

> Index schema & Index management integration
> ---
>
> Key: IGNITE-14937
> URL: https://issues.apache.org/jira/browse/IGNITE-14937
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> The public index schema (required indexes) and the current index state on the 
> cluster are different.
> We have to track it, store it, and provide the actual index schema state to 
> any component: select queries, DDL queries, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14936) Benchmark sorted index scan vs table's partitions scan

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14936:
---
Epic Link: IGNITE-17304

> Benchmark sorted index scan vs table's partitions scan
> --
>
> Key: IGNITE-14936
> URL: https://issues.apache.org/jira/browse/IGNITE-14936
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> We have to decide which data structures are used for the PK and table scans.
> Possible cases:
> - table partitions sorted by plain bytes/hash (in fact: unsorted);
> - table partitions sorted by PK columns;
> - PK sorted index (one store for all partitions on the node).
> All cases have pros and cons. The choice should be based on benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14940) Investigation parallel index scan

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14940:
---
Epic Link: IGNITE-17304

> Investigation parallel index scan
> -
>
> Key: IGNITE-14940
> URL: https://issues.apache.org/jira/browse/IGNITE-14940
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Motivation: the 2.x version implements {{queryParallelism}} by creating index 
> segments. Each segment contains a subset of partitions. This approach has 
> several shortcomings:
> - index scan parallelism cannot be changed / scaled at runtime;
> - we always have to scan all segments (looks like a virtual MapNode for the 
> query);
> - many index storages for one logical index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14938) Introduce persistence store for the indexes states on cluster

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14938:
---
Epic Link: IGNITE-17304

> Introduce persistence store for the indexes states on cluster
> -
>
> Key: IGNITE-14938
> URL: https://issues.apache.org/jira/browse/IGNITE-14938
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Includes:
> - building state progress;
> - ready to scan / building;
> - rebuild index;
> - support node restart and index recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14939) Tests coverage for index rebuild and recovery scenarios

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14939:
---
Epic Link: IGNITE-17304

> Tests coverage for index rebuild and recovery scenarios
> ---
>
> Key: IGNITE-14939
> URL: https://issues.apache.org/jira/browse/IGNITE-14939
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Test cases from version 2.x must be analyzed and ported to 3.0.
> See {{AbstractRebuildIndexTest}} and its children in 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: IGNITE-17304

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16196) Supports index rename

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16196:
---
Epic Link: IGNITE-17304

> Supports index rename
> -
>
> Key: IGNITE-16196
> URL: https://issues.apache.org/jira/browse/IGNITE-16196
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Need to support index rename.
> ALTER INDEX [ IF EXISTS ]  RENAME TO 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: IGNITE-17304

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: IGNITE-17304

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-14925.

Resolution: Duplicate

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Epic Link: IGNITE-17304

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562614#comment-17562614
 ] 

Ivan Bessonov commented on IGNITE-14925:


Replaced with EPIC

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17304) SQL indexes 3.0 epic

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17304:
--

 Summary: SQL indexes 3.0 epic
 Key: IGNITE-17304
 URL: https://issues.apache.org/jira/browse/IGNITE-17304
 Project: Ignite
  Issue Type: Epic
Reporter: Ivan Bessonov


Ignite 3.x requires SQL indexes, just like any other database. This epic is 
the collection of issues related to index design and implementation.

This includes:
 * index configuration
 * index lifecycle
 * index storage
 * index integration into SQL queries



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Issue Type: New Feature  (was: Epic)

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: (was: IGNITE-14925)

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: (was: IGNITE-14925)

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: (was: IGNITE-14925)

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: IGNITE-14925

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: IGNITE-14925

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: IGNITE-14925

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Epic Name: Sorted SQL indexes

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: Epic
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Issue Type: Epic  (was: New Feature)

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: Epic
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17272:
---
Component/s: cache

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.13
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17272:
---
Affects Version/s: 2.13

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.13
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561364#comment-17561364
 ] 

Ivan Bessonov commented on IGNITE-17272:


Looks good to me, thank you! I'll merge it to master

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel

2022-06-30 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17283:
---
Ignite Flags:   (was: Docs Required,Release Notes Required)

> ItCmgRaftServiceTest should start Raft groups in parallel
> -
>
> Key: IGNITE-17283
> URL: https://issues.apache.org/jira/browse/IGNITE-17283
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Minor
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the 
> first group waits for other members to appear before it times out. This leads 
> to this test running for quite a long time. It is proposed to start these 
> groups in parallel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel

2022-06-30 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561099#comment-17561099
 ] 

Ivan Bessonov commented on IGNITE-17283:


Looks good, thank you for the improvement!

> ItCmgRaftServiceTest should start Raft groups in parallel
> -
>
> Key: IGNITE-17283
> URL: https://issues.apache.org/jira/browse/IGNITE-17283
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Minor
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the 
> first group waits for other members to appear before it times out. This leads 
> to this test running for quite a long time. It is proposed to start these 
> groups in parallel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17278) TableManager#directTableIds can't be implemented effectively

2022-06-30 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17278:
--

 Summary: TableManager#directTableIds can't be implemented 
effectively
 Key: IGNITE-17278
 URL: https://issues.apache.org/jira/browse/IGNITE-17278
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov


I propose adding a special method "internalIds" to the direct proxy, so that 
there won't be a need to read all tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated IGNITE-16913:
---
Epic Link: IGNITE-16923

--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)


[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected:
 * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with "ack" or fail; otherwise the log entry should be preserved.
 * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved.

It feels like the second option is just a special case of the first one - we can't truncate the log until we receive all acks. And we can't receive an ack from the joining node until it finishes its rebalancing procedure.

So, it all comes down to aggressive log truncation to keep the log short.

The preserved log can be quite big in reality, so a disk offloading operation must be available.

The easiest way to achieve it is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile. (See the sketch below.)

To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.
 * Potentially, we could use a volatile page memory region for this purpose, since it already has good control over the amount of memory used. But memory overflow should be carefully processed; usually it's treated as an error and might even cause node failure.
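
A minimal sketch of that offloading idea, assuming a disposable RocksDB instance with WAL disabled; the class and method names are illustrative:
{code:java}
import java.nio.ByteBuffer;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

class VolatileLogStore implements AutoCloseable {
    private final RocksDB db;
    private final WriteOptions noWal;

    VolatileLogStore(String path) throws RocksDBException {
        // The directory is wiped on restart, so losing memtable contents is fine.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
        this.noWal = new WriteOptions().setDisableWAL(true);
    }

    void append(long index, byte[] entry) throws RocksDBException {
        // No WAL write happens here; data stays in the memtable until a flush.
        db.put(noWal, ByteBuffer.allocate(Long.BYTES).putLong(index).array(), entry);
    }

    @Override
    public void close() {
        noWal.close();
        db.close();
    }
}
{code}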

[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected:
 * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with "ack" or fail; otherwise the log entry should be preserved.
 * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved.

It feels like the second option is just a special case of the first one - we can't truncate the log until we receive all acks. And we can't receive an ack from the joining node until it finishes its rebalancing procedure.

So, it all comes down to aggressive log truncation to keep the log short.

The preserved log can be quite big in reality, so a disk offloading operation must be available.

The easiest way to achieve it is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile.

To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.

[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-28 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)



[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-28 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov
Summary: Volatile RAFT log for pure in-memory storages  (was: Raft log improvements for pure in-memory storages)

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)



[jira] [Updated] (IGNITE-17230) Support split-file page store

2022-06-27 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17230:
---
Description: 
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIdx* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*} (see the sketch below):
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.
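
A hedged sketch of that read path (binary search over the header's sorted pageIdx list, newest delta first); the field names and offset math are assumptions:
{code:java}
import java.util.Arrays;
import java.util.List;

class DeltaFilePageStoreIndex {
    private final int[] sortedPageIdxs; // from the delta file header, sorted ascending
    private final long pagesOffset;     // file offset where page content starts
    private final int pageSize;

    DeltaFilePageStoreIndex(int[] sortedPageIdxs, long pagesOffset, int pageSize) {
        this.sortedPageIdxs = sortedPageIdxs;
        this.pagesOffset = pagesOffset;
        this.pageSize = pageSize;
    }

    /** Returns the file offset of the page, or -1 if this delta file doesn't contain it. */
    long pageOffset(int pageIdx) {
        int pos = Arrays.binarySearch(sortedPageIdxs, pageIdx);

        return pos < 0 ? -1 : pagesOffset + (long) pos * pageSize;
    }

    /** Read path: interrogate deltas from newest to oldest, fall back to the main file on miss. */
    static long resolve(List<DeltaFilePageStoreIndex> newestToOldest, int pageIdx) {
        for (DeltaFilePageStoreIndex delta : newestToOldest) {
            long offset = delta.pageOffset(pageIdx);

            if (offset >= 0)
                return offset;
        }

        return -1; // not in any delta; read the page from the FilePageStore itself
    }
}
{code}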

  was:
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIds* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*}:
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.


> Support split-file page store
> 
>
> Key: IGNITE-17230
> URL: https://issues.apache.org/jira/browse/IGNITE-17230
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Notes*
> Description may not be complete.
> *Goal*
> To implement a new checkpoint (described in IGNITE-15818), we will introduce 
> a new entity {*}DeltaFilePageStore{*}, which will be created for each 
> partition at each checkpoint and removed after merging with the 
> *FilePageStore* (the main partition file) using the compacter.
> *DeltaFilePageStore* will consist of:
>  * Header (may be updated in the course of implementation):
>  ** Allocation *pageIdx* - *pageIdx* of the last created page;
>  * Sorted list of *pageIdx* - allows a binary search to find the file offset 
> for a {*}pageId -> pageIdx{*} lookup;
>  * Page content - sorted by {*}pageIdx{*}.
> What will change for {*}FilePageStore{*}:
>  * A list of *DeltaFilePageStore* instances will be added (from the newest to 
> the oldest by the time of creation);
>  * Allocation index (pageIdx of the last created page) - it will be logical 
> and contained in the header of {*}Fi

[jira] [Updated] (IGNITE-17230) Support split-file page store

2022-06-27 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17230:
---
Description: 
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIds* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*}:
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.

  was:
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity *DeltaFilePageStore*, which will be created for each partition at 
each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
* Header (may be updated in the course of implementation):
** Allocation *pageIdx* - *pageIdx* of the last created page;
* Sorted list of *pageIdx* - allows a binary search to find the file offset for 
a *pageId -> pageIdx* lookup;
* Page content - sorted by *pageIdx*.

What will change for *FilePageStore*:
* A list of *DeltaFilePageStore* instances will be added (from the newest to the 
oldest by the time of creation);
* Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of *FilePageStore*. At node start, it will be read from 
the header of *FilePageStore* or obtained from the first *DeltaFilePageStore* 
(the newest one).

How pages will be read by *pageId -> pageIdx*:
* Interrogate the *DeltaFilePageStore* instances in order from the newest to the 
oldest;
* If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
* The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
* Before creating *part-1-delta-3.bin*, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to *part-1-delta-3.bin*.


> Support split-file page store
> 
>
> Key: IGNITE-17230
> URL: https://issues.apache.org/jira/browse/IGNITE-17230
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Notes*
> Description may not be complete.
> *Goal*
> To implement a new checkpoint (described in IGNITE-15818), we will introduce 
> a new entity {*}DeltaFilePageStore{*}, which will be created for each 
> partition at each checkpoint and removed after merging with the 
> *FilePageStore* (the main partition file) using the compacter.
> *DeltaFilePageStore* will consist of:
>  * Header (may be updated in the course of implementation):
>  ** Allocation *pageIdx* - *pageIdx* of the last created page;
>  * Sorted list of *pageIds* - allows a binary search to find the file offset 
> for a {*}pageId -> pageIdx{*} lookup;
>  * Page content - sorted by {*}pageIdx{*}.
> What will change for {*}FilePageStore{*}:
>  * A list of *DeltaFilePageStore* instances will be added (from the newest to 
> the oldest by the time of creation);
>  * Allocation index (pageIdx of the last created page) - it will be logical 
> and contained in the header of {*}FilePageStore{*}. At node start, it will be

[jira] [Commented] (IGNITE-17199) Improve the usability of the abstract configuration interface

2022-06-21 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556722#comment-17556722
 ] 

Ivan Bessonov commented on IGNITE-17199:


[~ktkale...@gridgain.com] I don't think that improving something here is 
necessary. Wildcard types are an integral part of the Java type system; they're 
not a bad thing. Over-engineering everything because of several "" occurrences 
in code won't make the product better, IMO.

> Improve the usability of the abstract configuration interface
> -
>
> Key: IGNITE-17199
> URL: https://issues.apache.org/jira/browse/IGNITE-17199
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: iep-55, ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Problem*
> Consider an example of generating configuration interfaces (**Configuration*) 
> for an abstract configuration.
> Configuration schemas:
> {code:java}
> @AbstractConfiguration
> public class BaseConfigurationSchema {
> @Value
> public int size;
> }
> @Config
> public class VolatileConfigurationSchema extends BaseConfigurationSchema {
> @Value
> public double evictionThreshold;
> }
> {code}
> Configuration interfaces:
> {code:java}
> public interface BaseConfiguration<VIEWT extends BaseView, CHANGET extends BaseChange> extends ConfigurationTree<VIEWT, CHANGET> {
> ConfigurationValue<Integer> size();
> }
> public interface VolatileConfiguration extends 
> BaseConfiguration<VolatileView, VolatileChange> {
> ConfigurationValue<Integer> size();
> }
> {code}
> This implementation allows us to work with the inheritors of the abstract 
> configuration as with a regular configuration (as if 
> *VolatileConfigurationSchema* did not extend *BaseConfigurationSchema*), but 
> when working with the abstract configuration itself, it creates 
> inconvenience. 
> For example, to get a view of the abstract configuration, we will need to 
> write the following code:
> {code:java}
> BaseConfiguration baseConfig0 = ...;
> BaseConfiguration baseConfig1 = ...;
> 
> BaseView baseView0 = (BasePageMemoryDataRegionView) baseConfig0.value();
> BaseView baseView1 = baseConfig1.value();
> {code}
> Which is not convenient and I would like us to be able to work in the same 
> way as with the *VolatileConfiguration*.
> *Possible implementations*
> * Simplest is to leave it as is;
> * Create an additional configuration interface that will be similar to 
> *BaseConfiguration*, for example *BaseConfigurationTree*, but it will be 
> extended by *BaseConfiguration* and all its inheritors like 
> *VolatileConfiguration*, then there may be confusion about whether to use 
> *BaseConfiguration* or *BaseConfigurationTree* in the end, so we need to 
> decide how to create a name for such an interface;
> ** *BaseConfigurationTree*;
> ** *AbstractBaseConfigurationTree*;
> ** other.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17077:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we will come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to validate 
that updates don't come out of order or with gaps. This is the way to guarantee 
that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
long updateIndex = partition.getUpdateIndex();
long raftIndex = writeCommand.raftIndex();

if (raftIndex != updateIndex + 1) {
throw new IndexMismatchException(updateIndex);
}

partition.write(writeCommand.row());

for (Index index : table.indexes(partition)) {
index.index(writeCommand.row());
}

partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * Mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage update. 
Generally speaking, we can't, or sometimes don't want to, work this way. Examples 
of operations where atomicity this strict is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids, we can exhaust the 
memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be called at the end of the 
last "atomic" operation, so that the last command can be safely reapplied.

h2. Implementation

The "set" method could write a value directly into the partition's meta page. 
This *will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way the checkpoint should work. 
This may not be the right place to describe the issue, but I'll do it nonetheless. 
It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint. We just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page manipulations 
under the write lock. Flushing freelists is enough already. (NOTE: we should test 
the performance without onheap-cache, it'll speed up the checkpoint start process, 
thus reducing latency spikes.)

A better way to do this is not having meta pages in page memory whatsoever. Maybe 
during the start, but that's it. It's a common practice to have a pageSize equal 
to 16Kb. The effective payload of a partition meta page in Ignite 2.x is just 
above 100 bytes. I expect it to be way lower in Ignite 3.0. Having a loaded page 
for every partition is just a waste of resources; all required data can be 
stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if the meta 
has not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page will become a write-only page, which is basically what we need. 
h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots. They will produce a lot of junk 
checkpoints. This is because a checkpoint is a {*}global operation{*}. Imagine 
RAFT triggering snapshots for 100 partitions in a row. This would result in 100 
minuscule checkpoints that no one needs. So, I'd say, we need two operations:
 * partition.getCheckpointerUpdateIndex();
 * partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods could be used by RAFT to determine whether it needs to 
truncate its log and to define a specific commit index for truncation.

In the case of the PDS checkpointer, the implementation of both of these methods 
is trivial.
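
A hedged sketch of how RAFT could consume these two operations; {{Partition}} and {{RaftLog}} are stand-ins for the real interfaces:
{code:java}
interface Partition {
    long getCheckpointerUpdateIndex();

    void registerCheckpointedUpdateIndexListener(java.util.function.LongConsumer closure);
}

interface RaftLog {
    void truncateTo(long commitIndex);
}

class LogTruncation {
    // Truncate once for whatever has already been checkpointed, then keep
    // truncating as the checkpointer reports newly persisted update indexes.
    static void wire(Partition partition, RaftLog raftLog) {
        raftLog.truncateTo(partition.getCheckpointerUpdateIndex());
        partition.registerCheckpointedUpdateIndexListener(raftLog::truncateTo);
    }
}
{code}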

  was:
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be a "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * First one is invoked at the end of every write command, with RAFT commit 
index being passed as a parameter. This is done right befo

[jira] [Resolved] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17074.

Resolution: Duplicate

> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{TablesConfigurationSchema}}, for example, and increment it every time a new 
> table is created. Then all we need is to read this value in all components 
> that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17074:
---
Description: 
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to a 
{{TablesConfigurationSchema}}, for example, and increment it every time a new 
table is created. Then all we need is to read this value in all components that 
need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.

  was:
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
a {{int lastTableId}} configuration property to a 
{{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every time 
new table is created. Then all we need is to read this value in all components 
that need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.


> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{TablesConfigurationSchema}}, for example, and increment it every time a new 
> table is created. Then all we need is to read this value in all components 
> that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.
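
As a sketch, assuming Ignite 3's configuration annotations are used for this; the root name and default value are illustrative:
{code:java}
import org.apache.ignite.configuration.annotation.ConfigurationRoot;
import org.apache.ignite.configuration.annotation.Value;

@ConfigurationRoot(rootName = "tables")
public class TablesConfigurationSchema {
    // Incremented under the (serialized) configuration update whenever a new table is created.
    @Value(hasDefault = true)
    public int lastTableId = 0;
}
{code}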



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17074:
---
Description: 
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to a 
{{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every time 
a new table is created. Then all we need is to read this value in all components 
that need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.

  was:
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
a {{int lastTableId }}configuration property to a 
{{{}TablesConfigurationSchema{}}}, for example, and increment it every time new 
table is created. Then all we need is to read this value in all components that 
need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.


> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every 
> time a new table is created. Then all we need is to read this value in all 
> components that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17087) Native rebalance for PDS partitions

2022-06-03 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17087:
--

 Summary: Native rebalance for PDS partitions
 Key: IGNITE-17087
 URL: https://issues.apache.org/jira/browse/IGNITE-17087
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.
h2. General idea

In this case, PDS has a checkpointing feature that saves a consistent state on 
disk. I expect SQL indexes to be in the same partition file as other data.

For every partition, its state on disk would look like this:
{code:java}
part-x.bin
part-x-1.bin
part-x-2.bin
...
part-x-n.bin{code}
part-x.bin is a baseline, and every other file is a delta that should be 
applied to the underlying layers to get consistent data. It can be viewed as 
full and incremental backups.

When a rebalance snapshot is required, we could force a checkpoint and then 
*prohibit merging* of new deltas into the delta files from the snapshot until 
the rebalance is finished. We must guarantee that a consistent state can be 
read from disk.

Now, there are several strategies of data transferring:
 * File-based. We can send baseline and delta files as files. Two possible 
issues here:
 ** Files contain duplicated pages, so the volume of data will be bigger than 
necessary.
 ** The baseline file has to be truncated, because some delta pages go directly 
into the baseline file as an optimization.
 * Page-based. The latest state of every required page is sent separately. Two 
strategies here:
 ** Iterate pages in order of page indexes. Overhead during reads, but writes 
are very efficient.
 ** Iterate pages in order of delta files, skipping already read pages in the 
process (like snapshots in GridGain, for example). Little overhead on read, but 
writes won't be append-only (see the sketch after this list).
I would argue that slower reads are more appropriate than slower writes. 
Generally speaking, any write should be slower than any read of the same size, 
right?
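
A hedged sketch of that last strategy (iterate delta files from newest to oldest, skip pages already sent); {{DeltaFile}} and {{PageSender}} are stand-ins:
{code:java}
import java.util.BitSet;
import java.util.List;

interface DeltaFile {
    int[] pageIdxs();

    byte[] readPage(int pageIdx);
}

interface PageSender {
    void send(int pageIdx, byte[] page);
}

class DeltaOrderStreamer {
    // Every page is sent at most once, in its newest version; baseline pages not
    // covered by any delta would follow the same dedup via the "sent" set.
    static void stream(List<DeltaFile> newestToOldest, PageSender sender) {
        BitSet sent = new BitSet();

        for (DeltaFile delta : newestToOldest) {
            for (int pageIdx : delta.pageIdxs()) {
                if (!sent.get(pageIdx)) {
                    sent.set(pageIdx);
                    sender.send(pageIdx, delta.readPage(pageIdx));
                }
            }
        }
    }
}
{code}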

Should we implement all strategies and give the user a choice? It's hard to predict 
which one is better for which scenario. In the future, I think it would be 
convenient to implement many options, but at first we should stick to the 
simplest one.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.

NOTE: of course, it has to be mentioned that this approach might lead to 
inefficient storage space usage. It can be a problem in theory, but in practice 
a full rebalance isn't expected to occur often, and even then we don't expect 
that users will rewrite the entire partition data in the span of a single 
rebalance.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17084) Native rebalance for RocksDB partitions

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17084:
---
Description: 
General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.
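
A minimal sketch of that sequence with the RocksDB Java API; column family handles and the actual transfer to the receiver are omitted:
{code:java}
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;
import org.rocksdb.Snapshot;

class SnapshotScan {
    static void streamPartition(RocksDB db) {
        Snapshot snapshot = db.getSnapshot();

        try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot);
             RocksIterator it = db.newIterator(readOptions)) {
            // The iterator sees the consistent state as of getSnapshot(),
            // regardless of concurrent writes.
            for (it.seekToFirst(); it.isValid(); it.next()) {
                sendToReceiver(it.key(), it.value());
            }
        } finally {
            db.releaseSnapshot(snapshot);
        }
    }

    private static void sendToReceiver(byte[] key, byte[] value) {
        // Hypothetical hand-off to the rebalance streaming infrastructure.
    }
}
{code}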

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.

NOTE: of course, it has to be mentioned that this approach might lead to 
inefficient storage space usage. What I mean is that "previous" versions of 
values, in terms of RocksDB, must be stored on the device if they're visible 
from any of the snapshots. It can be a problem in theory, but in practice a full 
rebalance isn't expected to occur often, and even then we don't expect that 
users will rewrite the entire partition data in the span of a single rebalance.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.

  was:
General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.


> Native rebalance for RocksDB partitions
> ---
>
> Key: IGNITE-17084
> URL: https://issues.apache.org/jira/browse/IGNITE-17084
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> General idea of full rebalance is described in 
> https://issues.apache.org/jira/browse/IGNITE-17083
> For persistent storages, there's an option to avoid copy-on-write rebalance 
> algorithms if it's desired. Intuitively, it's a preferable option. Each 
> storage chooses its own format.
> In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
> feature. The idea is very simple:
>  * Take a RocksDB snapshot.
>  * Iterate through partition data.
>  * Iterate through indexes.
>  * Release the snapshot.
> There must be a common "infrastructure" or a framework to stream native 
> rebalance snapshots. Data format should be as simple as possible.
> NOTE: of course, it has to be mentioned that this approach might lead to 
> inefficient storage space usage. What I mean is that "previous" versions of 
> values, in terms of RocksDB, must be stored on the device if they're visible 
> from any of the snapshots. It can be a problem in theory, but in practice a 
> full rebalance isn't expected to occur often, and even then we don't expect 
> that users will rewrite the entire partition data in the span of a single rebalance.
> h2. Possible problems
> Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
> will be sent incomplete. Maybe we should also send a build state for each 
> index so that the receiving side could continue from the right place, not 
> from the beginning.
> This problem will be resolved in the future. Currently we don't have indexes 
> implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17084) Native rebalance for RocksDB partitions

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17084:
--

 Summary: Native rebalance for RocksDB partitions
 Key: IGNITE-17084
 URL: https://issues.apache.org/jira/browse/IGNITE-17084
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17083) Universal full rebalance procedure for MV storage

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17083:
---
Description: 
The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with a few adjustments to 
the storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. This is an optimization, obviously.
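
The sender side could mirror that logic. A minimal sketch in the same pseudocode, with assumed names ({{setLastRowId}}, {{sendChainToReceiver}}):
{noformat}
while (cursor.hasNext()) {
  var chain = cursor.next();

  try (var lock = rebalanceSnapshot.lock()) {
    rebalanceSnapshot.setLastRowId(chain.rowId());

    // Skip chains that a concurrent write command has already sent.
    if (rebalanceSnapshot.overwrittenRowIds().contains(chain.rowId()))
      continue;
  }

  // All versions, oldest to newest, including uncommitted write intents.
  sendChainToReceiver(chain);
}{noformat}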
h2. Implementation

The implementation will have to be split into several parts, because we need:
 * Support for snapshot streaming in RAFT state machine.
 * Storage API for this type of scan.
 * Every storage must implement the new scan method.
 * Streamer itself should be implemented, along with a specific logic in write 
commands.

  was:
The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with few adjustments to 
storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:

 
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

 

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. Thi

[jira] [Created] (IGNITE-17083) Universal full rebalance procedure for MV storage

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17083:
--

 Summary: Universal full rebalance procedure for MV storage
 Key: IGNITE-17083
 URL: https://issues.apache.org/jira/browse/IGNITE-17083
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with a few adjustments to 
the storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:

 
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

 

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. This is an optimization, obviously.
h2. Implementation

The implementation will have to be split into several parts, because we need:
 * Support for snapshot streaming in RAFT state machine.
 * Storage API for this type of scan.
 * Every storage must implement the new scan method.
 * Streamer itself should be implemented, along with a specific logic in write 
commands.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17081:
--

 Summary: Implement checkpointIndex for RocksDB
 Key: IGNITE-17081
 URL: https://issues.apache.org/jira/browse/IGNITE-17081
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store update 
index in meta column family.

Immediately we have a write amplification issue, on top of possible performance 
degradation. The obvious solution is inherently bad and needs to be improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid it.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch, that's applied on locks release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in it or a proper way of implementing it, considering how 
spread-out in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed index 
is more challenging; I propose caching it and updating it only from the closure 
that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex value 
into a memtable every time. This makes it harder to find persistedIndex values 
for partitions. Gladly, considering the events that we have, during the time 
between the first "completed" and the very next "begin", the state on disk is 
fully consistent. And there's a way to read data from the storage avoiding the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
During table start: read latest values of update indexes. Store them in an 
in-memory structure.
Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall the 
flushing process. We don't control the locks that RocksDB holds while calling 
the listener's methods.

 

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event. This is fine. I noticed that column families are flushed in order 
of their internal ids. These ids correspond to the creation sequence of CFs, 
and the "default" CF is always created first. This is the exact CF that we use 
to store meta. Maybe we're going to change this and create a separate meta CF. 
Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.
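
A minimal sketch of that listener protocol with the RocksDB Java API, assuming atomic flush mode; the two async helpers are left abstract:
{code:java}
import org.rocksdb.AbstractEventListener;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.RocksDB;

class FlushIndexTracker extends AbstractEventListener {
    private volatile boolean lastWasBegin;

    @Override
    public void onFlushBegin(RocksDB db, FlushJobInfo info) {
        if (lastWasBegin)
            return; // still inside the same atomic flush

        waitForLastAsyncUpdateIndexesRead();
        lastWasBegin = true;
    }

    @Override
    public void onFlushCompleted(RocksDB db, FlushJobInfo info) {
        if (!lastWasBegin)
            return; // already handled the first "completed" of this flush

        asyncReadUpdateIndexesFromDisk();
        lastWasBegin = false;
    }

    private void waitForLastAsyncUpdateIndexesRead() {
        // Block until the previous asynchronous read has finished.
    }

    private void asyncReadUpdateIndexesFromDisk() {
        // Off-thread read with ReadOptions#setReadTier(ReadTier.PERSISTED_TIER),
        // then invoke the registered closures with the persisted update indexes.
    }
}
{code}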



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store update 
index in meta column family.

Immediately we have a write amplification issue, on top of possible performance 
degradation. The obvious solution is inherently bad and needs to be improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid it.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch, that's applied on locks release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in it or a proper way of implementing it, considering how 
spread-out in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed index 
is more challenging; I propose caching it and updating it only from the closure 
that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex 
value into a memtable every time. This makes it harder to find persistedIndex 
values for partitions. Fortunately, given the events that we have, the state on 
disk is fully consistent during the time between the first "completed" and the 
very next "begin". And there's a way to read data from storage bypassing the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
// During table start: read the latest values of update indexes and store them
// in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall 
the flushing process: we don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to a sequence number of 
CFs, and the "default" CF is always created first. This is the exact CF that we 
use to store meta. Maybe we're going to change this and create a separate meta 
CF. Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

This immediately creates a write amplification issue, on top of possible 
performance degradation. The obvious solution is inherently bad and needs to be 
improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid that.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch that's applied on lock release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(luckily, there's a WriteBatchInterface#count() to check). I don't see any 
practical value or a proper way of implementing it, considering how spread out 
in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed 
index is more challenging: I propose caching it and updating it only from the 
closure, which can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex 
value into a memtable every time. This makes it harder to find persistedIndex 
values for partitions. Fortunately, given the events that we have, the state on 
disk is fully consistent during the time between the first "completed" and the 
very next "begin". And there's a way to read data from storage bypassing the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
// During table start: read the latest values of update indexes and store them
// in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall 
the flushing process: we don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to a sequence number of 
CFs, and the "default" CF is always created first. This is the exact CF that we 
use to store meta. Maybe we're going to change this and create a separate meta 
CF. Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Labels: ignite-3  (was: )

> Implement checkpointIndex for RocksDB
> -
>
> Key: IGNITE-17081
> URL: https://issues.apache.org/jira/browse/IGNITE-17081
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> Please also familiarize yourself with 
> https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
> the description is continued from there.
> For RocksDB-based storage the recovery process is trivial, because RocksDB 
> has its own WAL. So, for testing purposes, it would be enough to just store 
> the update index in the meta column family.
> This immediately creates a write amplification issue, on top of possible 
> performance degradation. The obvious solution is inherently bad and needs to 
> be improved.
> h2. General idea & implementation
> Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
> breaks the RocksDB recovery procedure, so we need to take measures to avoid that.
> The only feasible way to do so is to use DBOptions#setAtomicFlush in 
> conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
> all column families consistently, if you have batches that cover several CFs. 
> Basically, {{acquireConsistencyLock()}} would create a thread-local write 
> batch that's applied on lock release. Most of RocksDbMvPartitionStorage 
> will be affected by this change.
> NOTE: I believe that scans with unapplied batches should be prohibited for 
> now (luckily, there's a WriteBatchInterface#count() to check). I don't see 
> any practical value or a proper way of implementing it, considering how 
> spread out in time the scan process is.
> h2. Callbacks and RAFT snapshots
> Simply storing and reading the update index is easy. Reading the committed 
> index is more challenging: I propose caching it and updating it only from the 
> closure, which can also be used by RAFT to truncate the log.
> For a closure, there are several things to account for during the 
> implementation:
> * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
> ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
> atomic flush mode. And, once you have your first "completed" event, you have 
> a guarantee that *all* memtables are already persisted.
> This allows easy tracking of RocksDB flushes; monitoring the alternation of 
> events is all that's needed.
> * Unlike the PDS implementation, here we will be writing the updateIndex 
> value into a memtable every time. This makes it harder to find persistedIndex 
> values for partitions. Fortunately, given the events that we have, the state 
> on disk is fully consistent during the time between the first "completed" and 
> the very next "begin". And there's a way to read data from storage bypassing 
> the memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).
> Summarizing everything from the above, we should implement the following protocol:
>  
> {code:java}
> // During table start: read the latest values of update indexes and store them
> // in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".
> onFlushBegin:
>   if (lastEventType == ON_FLUSH_BEGIN)
>     return;
>   waitForLastAsyncUpdateIndexesRead();
>   lastEventType = ON_FLUSH_BEGIN;
> onFlushCompleted:
>   if (lastEventType == ON_FLUSH_COMPLETED)
>     return;
>   asyncReadUpdateIndexesFromDisk();
>   lastEventType = ON_FLUSH_COMPLETED;{code}
> Reading values from disk must be performed asynchronously so as not to stall 
> the flushing process: we don't control the locks that RocksDB holds while 
> calling the listener's methods.
>  
> That asynchronous process would invoke closures that provide persisted 
> updateIndex values to other components.
> NOTE: One might say that we should call 
> "waitForLastAsyncUpdateIndexesRead();" as late as possible, just in case. But 
> my implementation calls it during the first event, and this is fine. I 
> noticed that column families are flushed in the order of their internal ids. 
> These ids correspond to a sequence number of CFs, and the "default" CF is 
> always created first. This is the exact CF that we use to store meta. Maybe 
> we're going to change this and create a separate meta CF. Only then could we 
> start optimizing this part, and only if we have actual proof that there's a 
> stall in this exact place.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17077:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to 
validate that updates don't come out of order or with gaps. This is the way to 
guarantee that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage 
update. Generally speaking, we can't or sometimes don't want to work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids; we could exhaust 
the memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be at the end of the last 
"atomic" operation, so that the last command can be safely reapplied, as shown 
in the sketch below.

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way checkpoints should work. 
This may not be the right place to describe the issue, but I'll do it 
nonetheless. It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint: we just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page 
manipulations under the write lock. Flushing freelists there is enough already. 
(NOTE: we should test the performance without onheap-cache; it'll speed up the 
checkpoint start process, thus reducing latency spikes.)

A better way to do this is to not have meta pages in page memory whatsoever. 
Maybe during the start, but that's it. It's a common practice to have a 
pageSize equal to 16Kb. The effective payload of a partition meta page in 
Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0. 
Having a loaded page for every partition is just a waste of resources; all 
required data can be stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if meta has 
not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page becomes a write-only page, which is basically what we need.
h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots. They would produce a lot of junk 
checkpoints, because a checkpoint is a {*}global operation{*}. Imagine RAFT 
triggering snapshots for 100 partitions in a row: this would result in 100 
minuscule checkpoints that no one needs. So, I'd say, we need two operations:
 * partition.getCheckpointerUpdateIndex();
 * partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods could be used by RAFT to determine whether it needs to 
truncate its log and to define a specific commit index for truncation.

In the case of the PDS checkpointer, the implementation of both of these 
methods is trivial.
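
For illustration, a minimal sketch of how RAFT could consume these two 
operations; all interfaces here are hypothetical, only mirroring the names 
above:
{code:java}
import java.util.function.LongConsumer;

// Hypothetical interfaces; none of this is actual Ignite API, it only
// illustrates the intended interaction.
interface Partition {
    long getCheckpointerUpdateIndex();

    void registerCheckpointedUpdateIndexListener(LongConsumer listener);
}

interface RaftLog {
    void truncateUpTo(long commitIndex);
}

class RaftLogTruncator {
    void attach(Partition partition, RaftLog raftLog) {
        // Pull once (e.g. on start): everything up to this index is persisted.
        raftLog.truncateUpTo(partition.getCheckpointerUpdateIndex());

        // Push afterwards: truncate as checkpoints complete, with no need for
        // scheduled RAFT snapshots.
        partition.registerCheckpointedUpdateIndexListener(raftLog::truncateUpTo);
    }
}{code}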


[jira] [Created] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17077:
--

 Summary: Implement checkpointIndex for PDS
 Key: IGNITE-17077
 URL: https://issues.apache.org/jira/browse/IGNITE-17077
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to 
validate that updates don't come out of order or with gaps. This is the way to 
guarantee that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage 
update. Generally speaking, we can't or sometimes don't want to work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids; we could exhaust 
the memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be at the end of the last 
"atomic" operation, so that the last command can be safely reapplied.

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way checkpoints should work. 
This may not be the right place to describe the issue, but I'll do it 
nonetheless. It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint: we just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page 
manipulations under the write lock. Flushing freelists there is enough already. 
(NOTE: we should test the performance without onheap-cache; it'll speed up the 
checkpoint start process, thus reducing latency spikes.)

A better way to do this is to not have meta pages in page memory whatsoever. 
Maybe during the start, but that's it. It's a common practice to have a 
pageSize equal to 16Kb. The effective payload of a partition meta page in 
Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0. 
Having a loaded page for every partition is just a waste of resources; all 
required data can be stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if meta has 
not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page becomes a write-only page, which is basically what we need.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17076) Unify RowId format for different storages

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17076:
---
Labels: ignite-3  (was: )

> Unify RowId format for different storages
> -
>
> Key: IGNITE-17076
> URL: https://issues.apache.org/jira/browse/IGNITE-17076
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
> There's a method called "insert" that generates RowId by itself. This is 
> wrong, because it can lead to different ids for the same row on the replica 
> storage. This completely breaks everything.
> Every replicated write command that inserts a new value should produce the 
> same row ids. There are several ways to achieve this:
>  * Use timestamps as identifiers. This is not very convenient, because we 
> would have to attach partition id on top of it. It's mandatory to know the 
> partition of the row.
>  * Use more complicated structure, for example a tuple of (raftCommitIndex, 
> partitionId, batchCounter), where
>  ** raftCommitIndex is the index of write command that performs insertion.
>  ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
> considering that there are plans to support more than 65000 partitions per 
> table.
>  ** batchCounter is used to differentiate insertions made in a single write 
> command. We can limit it to 2 bytes to save a little bit of space, if it's 
> necessary.
> I prefer the second option, but maybe it could be revised during the 
> implementation.
> Of course, the "insert" method should be removed from the bridge API. Tests 
> have to be updated. Since there's no RAFT group in storage tests, we can 
> generate row ids artificially; it's not a big deal.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-15818) [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and re-implementation

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-15818:
---
Description: 
h2. Goal

Port and refactor core classes implementing page-based persistent store in 
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, 
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.

New checkpoint implementation to avoid excessive logging.

Store lifecycle clarification to avoid complicated and invasive code of custom 
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to

New checkpoint implementation based on split-file storage, new page index 
structure to maintain disk-memory page mapping.

File page store implementation should be extracted from GridCacheOffheapManager 
to a separate entity, target implementation should support new version of 
checkpoint (split-file store to enable always-consistent store and to eliminate 
binary recovery phase).

Support of big pages (256+ kB).

Support of throttling algorithms.
h2. References

New checkpoint design overview is available 
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts

Although there is a technical opportunity to have independent checkpoints for 
different data regions, managing them could be a nightmare and it's definitely 
in the realm of optimizations and out of scope right now.

So, let's assume that there's one good old checkpoint process. There's still a 
requirement to have checkpoint markers, but they will not have a reference to 
the WAL, because there's no WAL. Instead, we will have to store the RAFT log 
revision per partition. Or not; I'm not that familiar with the recovery 
procedure that's currently in development.

Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new 
version will have DO and UNDO. This drastically simplifies both the checkpoint 
itself and node recovery. But it complicates data access.

There will be two processes that will share the storage resource: 
"checkpointer" and "compactor". Let's examine what the compactor should or 
shouldn't do:
 * it should not work in parallel with checkpointer, except for cases when 
there are too many layers (more on that later)
 * it should merge later checkpoint delta files into main partition files
 * it should delete checkpoint markers once all merges are completed for it, 
thus markers are decoupled from RAFT log

About "cases when there are too many layers" - too many layers could compromise 
reading speed. Number of layers should not increase uncontrollably. So, when a 
threshold is exceeded, compactor should start working no mater what. If 
anything, writing load can be throttled, reading matters more.

Recovery procedure:
 * read the list of checkpoint markers on engine start
 * remove all data from an unfinished checkpoint, if it's there
 * trim main partition files to their proper size (should check if it's 
actually beneficial)

Table start procedure:
 * read all layer files headers according to the list of checkpoints
 * construct a list of hash tables (pageId -> pageIndex) for all layers, make 
it as effective as possible
 * everything else is just like before

Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x 
after all. The "Restore partition states" procedure could be revisited; I don't 
know how this will work yet.

How to store hashmaps:

Regular maps might be too much; we should consider a roaring-map implementation 
or something similar that'll occupy less space. This is only a concern for 
in-memory structures. Files on disk may have a list of pairs, that's fine. 
Generally speaking, checkpoints with a size of 100 thousand pages are close to 
the top limit for most users. Splitting that across 500 partitions, for 
example, gives us 200 pages per partition. The entire map should fit into a 
single page.

The only exception to these calculations is index.bin. The amount of pages per 
checkpoint can be orders of magnitude higher there, so we should keep an eye on 
it; it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is 
enough to fit 512 integer pairs, scaling to 2048 for regular 16-kilobyte pages. 
The map won't be too big IMO.
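
The per-page numbers are easy to sanity-check (a sketch assuming plain int 
pairs with no per-entry overhead):
{code:java}
// Sanity-checking the numbers above: one (pageId -> pageIndex) pair is two ints.
int pairSize = 2 * Integer.BYTES;          // 8 bytes
int pairsIn4kPage = 4 * 1024 / pairSize;   // = 512
int pairsIn16kPage = 16 * 1024 / pairSize; // = 2048
{code}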

Another important moment: we should enable direct IO, which is supported by 
Java natively since version 9 (I guess). There's a chance that not only will 
regular disk operations become somewhat faster, but fsync will become 
drastically faster as a result. Which is good: fsync can easily take half the 
time of a checkpoint, which is just unacceptable.
h2. Thoughts 2.0

With high likelihood, we'll get rid of index.bin. This will remove the 
requirement of having checkpoint markers.

All that we need is a consistently growing local counter that will be used to 
mark partition delta files. But it doesn't need to be global even at the level 
of the local node; it can be a local counter per partition, that's persiste

[jira] [Created] (IGNITE-17076) Unify RowId format for different storages

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17076:
--

 Summary: Unify RowId format for different storages
 Key: IGNITE-17076
 URL: https://issues.apache.org/jira/browse/IGNITE-17076
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates a RowId by itself. This is 
wrong, because it can lead to different ids for the same row on the replica 
storage. This completely breaks everything.

Every replicated write command that inserts a new value should produce the 
same row ids. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach partition id on top of it. It's mandatory to know the partition 
of the row.
 * Use more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where

 ** raftCommitIndex is the index of write command that performs insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if it's 
necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.
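
For illustration, a minimal sketch of the second option; the class shape and 
the 8 + 4 + 2 byte widths are assumptions based on the list above, not a 
committed format:
{code:java}
import java.nio.ByteBuffer;

// A hypothetical RowId built from (raftCommitIndex, partitionId, batchCounter).
final class RowId {
    private final long raftCommitIndex; // Index of the write command that performs the insertion.
    private final int partitionId;      // 4 bytes: >65000 partitions per table are planned.
    private final short batchCounter;   // Differentiates insertions within one write command.

    RowId(long raftCommitIndex, int partitionId, short batchCounter) {
        this.raftCommitIndex = raftCommitIndex;
        this.partitionId = partitionId;
        this.batchCounter = batchCounter;
    }

    /** Serializes the tuple into 14 bytes: 8 + 4 + 2. */
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES + Short.BYTES)
                .putLong(raftCommitIndex)
                .putInt(partitionId)
                .putShort(batchCounter)
                .array();
    }
}{code}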

Of course, the "insert" method should be removed from the bridge API. Tests 
have to be updated. Since there's no RAFT group in storage tests, we can 
generate row ids artificially; it's not a big deal.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17074:
--

 Summary: Create integer tableId identifier for tables
 Key: IGNITE-17074
 URL: https://issues.apache.org/jira/browse/IGNITE-17074
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for a table id is too much for a loaded pages list. 
Currently the implementation uses a String hash, just like in Ignite 2.x. This 
is a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to {{TablesConfigurationSchema}}, 
for example, and increment it every time a new table is created. Then all we 
need is to read this value in all components that need it.
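
A hedged sketch of the idea; the names below only approximate the Ignite 3 
configuration API and are not the actual generated interfaces:
{code:java}
// Not the actual configuration API: since configuration updates are serialized
// by design, read-increment-write acts as an atomic counter here.
// "lastTableId()" and "changeLastTableId(...)" are hypothetical names.
int allocateTableId(TablesConfiguration tablesCfg) {
    int next = tablesCfg.lastTableId().value() + 1;

    tablesCfg.change(ch -> ch.changeLastTableId(next)).join();

    return next;
}{code}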

Maybe we should even use it in thin clients, but that needs careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations, and maybe as part of a unique RowId associated with tables, 
but that's only speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16306) snaptree-based in-memory storage

2022-05-25 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542064#comment-17542064
 ] 

Ivan Bessonov commented on IGNITE-16306:


[~sergeychugunov] sure, with great pleasure!

> snaptree-based in-memory storage
> 
>
> Key: IGNITE-16306
> URL: https://issues.apache.org/jira/browse/IGNITE-16306
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha3
>Reporter: Ivan Bessonov
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: iep-74, ignite-3
>
> Until a full-fledged MV store is implemented we can implement in-memory 
> storage on a snaptree library [1] that represents a concurrent AVL tree with 
> support of snapshots.
> In this ticket we need to integrate the library with our existing storage 
> APIs (refine API if necessary), integrate its snapshot API with Raft 
> snapshots and provide configuration if necessary.
> [1] https://github.com/nbronson/snaptree



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (IGNITE-16306) snaptree-based in-memory storage

2022-05-25 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16306.

Resolution: Won't Fix

> snaptree-based in-memory storage
> 
>
> Key: IGNITE-16306
> URL: https://issues.apache.org/jira/browse/IGNITE-16306
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha3
>Reporter: Ivan Bessonov
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: iep-74, ignite-3
>
> Until a full-fledged MV store is implemented we can implement in-memory 
> storage on a snaptree library [1] that represents a concurrent AVL tree with 
> support of snapshots.
> In this ticket we need to integrate the library with our existing storage 
> APIs (refine API if necessary), integrate its snapshot API with Raft 
> snapshots and provide configuration if necessary.
> [1] https://github.com/nbronson/snaptree



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (IGNITE-16937) [Versioned Storage] A multi version TableStorage for MvPartitionStorage partitions

2022-05-11 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16937:
--

Assignee: Ivan Bessonov

> [Versioned Storage] A multi version TableStorage for MvPartitionStorage 
> partitions
> --
>
> Key: IGNITE-16937
> URL: https://issues.apache.org/jira/browse/IGNITE-16937
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Uttsel
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Need to create a multi-version table storage which aggregates 
> MvPartitionStorage partitions.
> Need to think about how to integrate the multi-version table storage into 
> Ignite. Maybe we need to create, for example, a multi-version StorageEngine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16926:
---
Fix Version/s: 2.14

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 
> org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
>  at 
> org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
>  at 
> org.apache.ignite.internal.processors

[jira] [Updated] (IGNITE-16933) PageMemory-based MV storage implementation

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16933:
---
Description: 
Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
storage engine. Currently, I expect only row storage implementation, without 
primary or secondary indexes.
h2. Chain Structure

Here I'm going to describe a data format. Each row is stored as a versioned 
chain. It will be represented by a number of data entries that will have 
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
 * Timestamp is a 16-byte value derived from the 
{{org.apache.ignite.internal.tx.Timestamp}} instance. It represents the commit 
time of the corresponding row.
 * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
other convenient name). It's a long value in the standard format for Page 
Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition 
id is not needed here, because it's always the same. Removing it could allow us 
to save 2 bytes per chain element.
 * PayloadSize is a 4-byte integer value that gives us the size of the actual 
data in arbitrary format.
 * Payload - I expect it to be serialized BinaryRow data. This is how it's 
implemented in RocksDB right now.

For uncommitted (pending) entries I propose using the maximal possible 
timestamp - {{(Long.MAX_VALUE, Long.MAX_VALUE)}}. This will simplify things. 
Note that we never store the tx id in the chain itself.

Overall, every chain element will have a (16 + 6 + 4 = 26)-byte header. It 
should be used as the header size in the corresponding FreeList.
h2. RowId pointer

There's a requirement to have an immutable RowId for every versioned chain. 
One could argue that we should just make the chain head immutable, but it would 
result in lots of complications. It's better to have a separate structure with 
an immutable link that will point to the actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
 * TransactionId is a UUID. It can only be applied to pending entries; for a 
committed head I propose storing 16 zeroes.
 * HeadLink is a link to the chain's head. Either 8 or 6 bytes; as already 
mentioned, I'd prefer 6.
 * NextLink is the "NextLink" value from the head chain element. It's a cheap 
shortcut for read-only transactions: you can skip the uncommitted entry without 
even trying to read it, if there's a non-null transaction id. Debatable, I 
know, but it looks cheap enough.

In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 
6 = 28) bytes of data. There must be a separate FreeList for every partition, 
even in in-memory mode, for reasons that I'll give later. The "header" size in 
that list must be equal to these 28 bytes. I wonder how effective FreeList will 
be for this case, where every chunk has the same size. We'll see. Maybe we 
should adjust the number of buckets somehow.
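
To make both layouts concrete, a minimal sketch of the offsets; the constant 
names are hypothetical and assume the 6-byte link variant:
{code:java}
// Hypothetical offsets for the two on-page structures described above.
final class MvChainLayout {
    // Chain element: [ Timestamp | NextLink | PayloadSize | Payload ].
    static final int TIMESTAMP_OFF = 0;                           // 16 bytes.
    static final int NEXT_LINK_OFF = TIMESTAMP_OFF + 16;          // 6-byte link.
    static final int PAYLOAD_SIZE_OFF = NEXT_LINK_OFF + 6;        // 4-byte int.
    static final int CHAIN_HDR_SIZE = PAYLOAD_SIZE_OFF + 4;       // 16 + 6 + 4 = 26.

    // RowId pointer: [ TransactionId | HeadLink | NextLink ].
    static final int TX_ID_OFF = 0;                               // 16-byte UUID, zeroes if committed.
    static final int HEAD_LINK_OFF = TX_ID_OFF + 16;              // 6-byte link to the chain head.
    static final int HEAD_NEXT_LINK_OFF = HEAD_LINK_OFF + 6;      // Shortcut copy of the head's NextLink.
    static final int ROW_ID_STRUCT_SIZE = HEAD_NEXT_LINK_OFF + 6; // 16 + 6 + 6 = 28.
}{code}
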
h2. Data access and Full Scan

Now, the fun part. There's no mention of a B+Tree here. That's because we can 
probably just avoid it. If it existed, it would just map a RowId to the 
described RowId structure in the partition, but RowId is already a pointer 
itself. The only other problem that is usually solved by a tree-like structure 
is a full scan of all rows in a partition. This is useful when you need to 
rebuild indexes, for example.

We should keep in mind that there's no code yet for rebuilding indexes. On the 
other hand, there's a method for partition scan in the API. This code could be 
used instead of Primary Index until we have it implemented.

There's no FreeList full-scan currently in the code; it needs to be 
implemented. And this particular full-scan is the reason why every partition 
should have its own list of row ids.

There's also a chance that introducing a new flag for row ids might be 
convenient. I don't know yet, let's not do it for now.

Finally, we need adequate protection from assertions if we, for some reason, 
have an invalid row id. Things that can be checked by normal code, not 
assertions:
 * data page type
 * number of items in the page


[jira] [Created] (IGNITE-16933) PageMemory-based MV storage implementation

2022-05-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16933:
--

 Summary: PageMemory-based MV storage implementation
 Key: IGNITE-16933
 URL: https://issues.apache.org/jira/browse/IGNITE-16933
 Project: Ignite
  Issue Type: New Feature
Reporter: Ivan Bessonov


Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
storage engine. Currently, I expect only row storage implementation, without 
primary or secondary indexes.

Here I'm going to describe a data format. Each row is stored as a versioned 
chain. It will be represented by a number of data entries that will have 
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
 * Timestamp is a 16-byte value derived from the 
{{org.apache.ignite.internal.tx.Timestamp}} instance.
 * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
other convenient name). It's a long value in the standard format for Page 
Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition 
id is not needed here, because it's always the same. Removing it could allow us 
to save 2 bytes per chain element.
 * PayloadSize is a 4-byte integer value that gives us the size of the actual 
data in arbitrary format.
 * Payload - I expect it to be serialized BinaryRow data. This is how it's 
implemented in RocksDB right now.

For uncommitted (pending) entries I propose using the maximal possible 
timestamp - {{(Long.MAX_VALUE, Long.MAX_VALUE)}}. This will simplify things. 
Note that we never store the tx id in the chain itself.

Overall, every chain element will have a (16 + 6 + 4 = 26)-byte header. It 
should be used as the header size in the corresponding FreeList.

There's a requirement to have an immutable RowId for every versioned chain. 
One could argue that we should just make the chain head immutable, but it would 
result in lots of complications. It's better to have a separate structure with 
an immutable link that will point to the actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
 * TransactionId is a UUID. It can only be applied to pending entries; for a 
committed head I propose storing 16 zeroes.
 * HeadLink is a link to the chain's head. Either 8 or 6 bytes; as already 
mentioned, I'd prefer 6.
 * NextLink is the "NextLink" value from the head chain element. It's a cheap 
shortcut for read-only transactions: you can skip the uncommitted entry without 
even trying to read it, if there's a non-null transaction id. Debatable, I 
know, but it looks cheap enough.

In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 
6 = 28) bytes of data. There must be a separate FreeList for every partition, 
even in in-memory mode, for reasons that I'll give later. The "header" size in 
that list must be equal to these 28 bytes. I wonder how effective FreeList will 
be for this case, where every chunk has the same size. We'll see. Maybe we 
should adjust the number of buckets somehow.

Now, the fun part. There's no mention of a B+Tree here. That's because we can 
probably just avoid it. If it existed, it would just map a RowId to the 
described RowId structure in the partition, but RowId is already a pointer 
itself. The only other problem that is usually solved by a tree-like structure 
is a full scan of all rows in a partition. This is useful when you need to 
rebuild indexes, for example.

We should keep in mind that there's no code yet for rebuilding indexes. On the 
other hand, there's a method for partition scan in the API. It could be used to 
implement a Primary Index imitation until we have a real implementation.

There's no FreeList full-scan currently in the code; it needs to be 
implemented. And this particular full-scan is the reason why every partition 
should have its own list of row ids.

There's also a chance that introducing a new flag for row ids might be 
convenient. I don't know yet, let's not do it for now.

Finally, we need adequate protection from assertions if we, for some reason, 
have an invalid row id. Things that can be checked by normal code, not 
assertions:
 * data page type
 * number of items in the page



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16912) Revisit UUID generation for RowId

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16912:
---
Epic Link: IGNITE-16923

> Revisit UUID generation for RowId
> -
>
> Key: IGNITE-16912
> URL: https://issues.apache.org/jira/browse/IGNITE-16912
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current implementation uses UUID.randomUUID, which comes with a set of 
> problems:
>  * some people say that you can't avoid collisions this way. Technically it's 
> true, although I don't think that it's a real problem
>  * secure random is slow when you use it frequently. This can affect 
> insertion performance
> * random uuids are randomly distributed; this can be a problem for RocksDB. 
> For example, if most insertions go to the tail, overall write performance 
> can improve
> There are interesting approaches in this particular document, we should take 
> a look at it:
> https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16926:
--

Assignee: Ivan Bessonov

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 
> org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
>  at 
> org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
>  at 
> org.apache.ignite.internal.processors.query.h2.opt.GridH

[jira] [Created] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16926:
--

 Summary: Interrupted compute job may fail a node
 Key: IGNITE-16926
 URL: https://issues.apache.org/jira/browse/IGNITE-16926
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Reporter: Ivan Bessonov


{code:java}
Critical system error detected. Will be handled accordingly to configured 
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
corrupted [groupId=1234619879, pageIds=[7290201467513], cacheId=645096946, 
cacheName=*, indexName=*, msg=Runtime failure on row: Row@79570772[ key: 
1168930235, val: Data hidden due to IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data 
hidden, data hidden, [... long run of "data hidden" trimmed ...], data hidden 
","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
 B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
Row@79570772[ key: 1168930235, val: Data hidden due to 
IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
data hidden, data hidden, [... long run of "data hidden" trimmed ...], data hidden ]] at 
org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
 at 
org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.addToIndex(GridH2Table.java:880)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:794)
 at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.store(IgniteH2Indexing.java:411)
 at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:2546)
 at 
org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridC

[jira] [Updated] (IGNITE-16915) ItClusterManagerTest#testNodeLeave is flaky

2022-04-29 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16915:
---
Description: 
https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds

> ItClusterManagerTest#testNodeLeave is flaky
> ---
>
> Key: IGNITE-16915
> URL: https://issues.apache.org/jira/browse/IGNITE-16915
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16914) [Versioned Storage] Test and optimize prefixes in RocksDB

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16914:
--

 Summary: [Versioned Storage] Test and optimize prefixes in RocksDB
 Key: IGNITE-16914
 URL: https://issues.apache.org/jira/browse/IGNITE-16914
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The main MV-storage doesn't require any specific order of elements, so 
partition scans don't have to be totally ordered.

If I understand correctly, this allows us to use the prefix functionality of 
RocksDB, extending it to row ids, not only partition ids. In theory, this 
should noticeably increase the performance of single reads and may improve 
scan performance somewhat as well.

Bloom filters and similar topics should be investigated here as well.
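For illustration, a minimal sketch of what this could look like with the 
RocksDB Java API; the 2 + 16 byte prefix length and the class name are 
assumptions, not actual Ignite storage code:

{code:java}
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

class PrefixScanSketch {
    /** Assumed key layout: 2-byte partition id followed by a 16-byte row id. */
    static final int PREFIX_LEN = 2 + 16;

    /** Column family tuned for prefix seeks; also enables prefix bloom filters. */
    static ColumnFamilyOptions cfOptions() {
        return new ColumnFamilyOptions().useFixedLengthPrefixExtractor(PREFIX_LEN);
    }

    /** Reads all versions under one row id; total order across row ids is given up. */
    static void readRow(RocksDB db, byte[] rowPrefix) {
        try (ReadOptions opts = new ReadOptions().setPrefixSameAsStart(true);
             RocksIterator it = db.newIterator(opts)) {
            for (it.seek(rowPrefix); it.isValid(); it.next()) {
                // All visited keys share the same (partition id, row id) prefix.
            }
        }
    }
}
{code}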



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16913:
--

 Summary: Provide effective way to write BinaryRow into byte buffer
 Key: IGNITE-16913
 URL: https://issues.apache.org/jira/browse/IGNITE-16913
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current API only allows writing a row into an OutputStream, which is not 
always convenient. For example, the RocksDB implementation requires writing 
into a byte buffer.

Creating an output stream on top of the buffer is not the best idea.
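To illustrate the workaround being criticized - adapting a ByteBuffer to an 
OutputStream just to reuse the existing write path (a direct 
ByteBuffer-accepting method on BinaryRow, which this issue asks for, is 
hypothetical at this point):

{code:java}
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** The adapter a ByteBuffer-based caller has to create today. */
class ByteBufferOutputStream extends OutputStream {
    private final ByteBuffer buf;

    ByteBufferOutputStream(ByteBuffer buf) {
        this.buf = buf;
    }

    /** One virtual call per byte - exactly the overhead in question. */
    @Override public void write(int b) {
        buf.put((byte) b);
    }

    @Override public void write(byte[] bytes, int off, int len) {
        buf.put(bytes, off, len);
    }
}
{code}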



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16912) Revisit UUID generation for RowId

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16912:
--

 Summary: Revisit UUID generation for RowId
 Key: IGNITE-16912
 URL: https://issues.apache.org/jira/browse/IGNITE-16912
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current implementation uses UUID.randomUUID, which comes with a set of 
problems:
 * some people say that you can't avoid collisions this way. Technically it's 
true, although I don't think that it's a real problem
 * secure random is slow when used frequently. This can affect insertion 
performance
 * random UUIDs are uniformly distributed, which can be a problem for RocksDB 
- if, instead, most insertions go to the tail of the keyspace, overall write 
performance can improve

There are interesting approaches in the document below; we should take a look 
at it:

https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/
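For reference, a minimal sketch of a time-ordered generator in the spirit of 
the draft's UUIDv7 layout (illustrative only, not a proposed Ignite API):

{code:java}
import java.security.SecureRandom;
import java.util.UUID;

final class TimeOrderedUuid {
    private static final SecureRandom RND = new SecureRandom();

    static UUID next() {
        long ts = System.currentTimeMillis();   // 48-bit Unix timestamp, ms

        long msb = (ts << 16)                   // bits 63..16: timestamp
            | 0x7000L                           // bits 15..12: version 7
            | (RND.nextLong() & 0x0FFFL);       // bits 11..0: random

        long lsb = (RND.nextLong() & 0x3FFFFFFFFFFFFFFFL)
            | 0x8000000000000000L;              // top two bits: IETF variant

        return new UUID(msb, lsb);
    }
}
{code}

Keys generated this way are mostly monotonic, so insertions land near the tail 
of the keyspace.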



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-15734) Erroneous string formatting while changing cluster tag.

2022-04-19 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524157#comment-17524157
 ] 

Ivan Bessonov commented on IGNITE-15734:


[~zstan] done, thank you for the fix!

> Erroneous string formatting while changing cluster tag.
> ---
>
> Key: IGNITE-15734
> URL: https://issues.apache.org/jira/browse/IGNITE-15734
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11
>Reporter: Evgeny Stanilovsky
>Assignee: Evgeny Stanilovsky
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {noformat}
> org.apache.ignite.internal.processors.cluster.ClusterProcessor#onReadyForRead
> ...
> log.info(
> "Cluster tag will be set to new value: " +
> newVal != null ? newVal.tag() : "null" +
> ", previous value was: " +
> oldVal != null ? oldVal.tag() : "null");
> {noformat}
> Without braces, the expression
> {noformat}
> "Cluster tag will be set to new value: " + newVal
> {noformat}
> is evaluated first and is never null, so the ternary condition is always true;
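A sketch of the corrected call, with parentheses restoring the intended null 
checks (illustrative; not necessarily the exact committed fix):

{code:java}
log.info(
    "Cluster tag will be set to new value: " +
        (newVal != null ? newVal.tag() : "null") +
        ", previous value was: " +
        (oldVal != null ? oldVal.tag() : "null"));
{code}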



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16848:
--

Assignee: Ivan Bessonov

> [Versioned Storage] Provide common interface for abstract internal tuples
> -
>
> Key: IGNITE-16848
> URL: https://issues.apache.org/jira/browse/IGNITE-16848
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Methods from class "Row" should be extracted to provide a generic tuple API 
> to components like SQL indexes or MV storage.
> Tuple is NOT schema-aware and should NOT have methods like "Object value(int 
> col)", because it represents a basic blob with little to no meta information



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16848:
--

 Summary: [Versioned Storage] Provide common interface for abstract 
internal tuples
 Key: IGNITE-16848
 URL: https://issues.apache.org/jira/browse/IGNITE-16848
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
 Fix For: 3.0.0-alpha5


Methods from class "Row" should be extracted to provide a generic tuple API to 
components like SQL indexes or MV storage.

Tuple is NOT schema-aware and should NOT have methods like "Object value(int 
col)", because it represents a basic blob with little to no meta information



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16611:
---
Labels: iep-74 ignite-3  (was: ignite-3)

> [Versioned Storage]  Version chain data structure for RocksDB-based storage
> ---
>
> Key: IGNITE-16611
> URL: https://issues.apache.org/jira/browse/IGNITE-16611
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
>
> To support concurrency control and implement effective transactions, the 
> capability to store multiple values of the same key is needed in the 
> existing storage.
> h3. Version chain
> Key component here is a special data structure called version chain: it is a 
> list of all versions of a particular key, with the most recent version at the 
> beginning (HEAD).
> Each entry in the chain contains a value, a reference to the next entry in 
> the list, begin and end timestamps and the id of the active transaction that 
> created this version.
> There are at least two approaches to implement this structure on top of 
> RocksDB:
> * Combine the original key and version into a new key which is put into a 
> RocksDB tree. In that case, to restore the version chain we need to iterate 
> over the tree using the original key as a prefix.
> * Use the original key as-is, but make it point not to the value directly 
> but to an array containing the version and other meta information (ts, id, 
> etc.) and keys into some secondary tree.
> h3. New API to manage versions
> The following new API should be implemented to provide access to the version 
> chain:
> * Methods to manipulate versions: add a new version to the chain, commit an 
> uncommitted version, abort an uncommitted version.
> * Method to clean up old versions from the chain.
> * Method to scan over keys up to a provided timestamp.
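A minimal sketch of the first approach (original key and version combined into 
a single RocksDB key); the key layout and helper names are assumptions, not 
actual Ignite storage code:

{code:java}
import java.nio.ByteBuffer;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

class VersionChainSketch {
    /** Original key + inverted timestamp, so the newest version sorts first (HEAD). */
    static byte[] versionedKey(byte[] key, long beginTs) {
        return ByteBuffer.allocate(key.length + Long.BYTES)
            .put(key)
            .putLong(~beginTs)
            .array();
    }

    /** Restoring the chain = iterating with the original key as a prefix. */
    static void scanChain(RocksDB db, byte[] key) throws Exception {
        try (ReadOptions opts = new ReadOptions();
             RocksIterator it = db.newIterator(opts)) {
            for (it.seek(key); it.isValid() && startsWith(it.key(), key); it.next()) {
                // it.value() holds this version's payload and metadata.
            }
        }
    }

    private static boolean startsWith(byte[] arr, byte[] prefix) {
        if (arr.length < prefix.length)
            return false;
        for (int i = 0; i < prefix.length; i++)
            if (arr[i] != prefix[i])
                return false;
        return true;
    }
}
{code}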



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-16792) Configuration for Default Storage Engine

2022-04-11 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520506#comment-17520506
 ] 

Ivan Bessonov commented on IGNITE-16792:


[~ktkale...@gridgain.com] looks good to me, I'll merge it to main. Thank you!

> Configuration for Default Storage Engine
> 
>
> Key: IGNITE-16792
> URL: https://issues.apache.org/jira/browse/IGNITE-16792
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The pluggable storage concept enables users to set up different storage 
> engines (SE) on the same node, e.g. for performance reasons; each table can 
> be hosted by only one storage engine.
> From the DDL point of view, the SE is specified as part of the CREATE TABLE 
> command. But in the case of only one SE, and in some other cases, specifying 
> it for each table creates a lot of unnecessary boilerplate code.
> To address this and free the user from writing exactly the same code, a 
> cluster-wide setting *defaultStorageEngine* should be introduced.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration & other minor issues

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Summary: Rename is broken in configuration & other minor issues  (was: 
Rename is broken in configuration)

> Rename is broken in configuration & other minor issues
> --
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit
>  * InjectName is not used where it needs to be used



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Description: 
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit
 * InjectName is not used where it needs to be used

  was:
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit


> Rename is broken in configuration
> -
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit
>  * InjectName is not used where it needs to be used
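To illustrate the serialization bug mentioned above: sizing a string by its 
char count instead of its UTF-8 byte length breaks the layout for non-ASCII 
input (a minimal example, not the actual configuration code):

{code:java}
import java.nio.charset.StandardCharsets;

class StringSizeSketch {
    static int wrongSize(String s) {
        return s.length(); // chars, not bytes
    }

    static int correctSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length; // actual encoded length
    }

    public static void main(String[] args) {
        String s = "héllo";
        // wrongSize == 5, correctSize == 6: the one-byte gap shifts everything
        // serialized after the string.
        System.out.println(wrongSize(s) + " vs " + correctSize(s));
    }
}
{code}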



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Description: 
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit

> Rename is broken in configuration
> -
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16796:
--

 Summary: Rename is broken in configuration
 Key: IGNITE-16796
 URL: https://issues.apache.org/jira/browse/IGNITE-16796
 Project: Ignite
  Issue Type: Bug
Affects Versions: 3.0.0-alpha4
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-14931) Define common error scopes and prefix

2022-04-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14931:
---
Labels: iep-84 ignite-3  (was: ignite-3)

> Define common error scopes and prefix
> -
>
> Key: IGNITE-14931
> URL: https://issues.apache.org/jira/browse/IGNITE-14931
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vyacheslav Koptilin
>Priority: Major
>  Labels: iep-84, ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface

2022-03-17 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16704:
--

 Summary: Remove unnecessary methods from BinaryRow interface
 Key: IGNITE-16704
 URL: https://issues.apache.org/jira/browse/IGNITE-16704
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5


The current interface has several read* methods that are only used in the 
implementation. I propose deleting them; this will simplify writing new 
implementations of the interface.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface

2022-03-17 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16704:
---
Labels: iep-54 ignite-3  (was: ignite-3)

> Remove unnecessary methods from BinaryRow interface
> ---
>
> Key: IGNITE-16704
> URL: https://issues.apache.org/jira/browse/IGNITE-16704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-54, ignite-3
> Fix For: 3.0.0-alpha5
>
>
> The current interface has several read* methods that are only used in the 
> implementation. I propose deleting them; this will simplify writing new 
> implementations of the interface.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16697) [Versioned Storage] POC - add methods for versioned data storage

2022-03-16 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16697:
--

 Summary: [Versioned Storage] POC - add methods for versioned data 
storage
 Key: IGNITE-16697
 URL: https://issues.apache.org/jira/browse/IGNITE-16697
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5


As a first step towards the MV-storage in Ignite 3.0, it's required to have 
specific methods on the partition storage and index storage interfaces. These 
will replace the currently available VersionedRowStore, which was a prototype 
and doesn't correspond to the desired functionality.

Partition storage needs:
 * addWrite(k, v, txId)
 * commitWrite(k, ts)
 * abortWrite(k)
 * read(k, ts)
 * scan(ts, {_}tbd{_})
 * cleanup({_}tbd{_})

Sorted index storage needs:
 * scan(lower, upper, bounds_options, projection, partition_filter, ts)

Index updates will be hidden inside the {*}addWrite{*}, *abortWrite* and 
*cleanup* methods. No external "update" and "remove" methods are required; a 
sketch of the resulting interface is given below.
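A hedged sketch of how these methods could look as a Java interface; the type 
parameters stand in for types like BinaryRow and a timestamp, and the 
signatures are illustrative, not the final API:

{code:java}
import java.util.UUID;

/** Illustrative only: K = key row, V = value row, TS = timestamp. */
interface MvPartitionStorageSketch<K, V, TS> {
    /** Adds an uncommitted version; index updates happen inside. */
    void addWrite(K key, V value, UUID txId);

    /** Commits the pending version at the given timestamp. */
    void commitWrite(K key, TS ts);

    /** Aborts the pending version, rolling index updates back. */
    void abortWrite(K key);

    /** Reads the version visible at the given timestamp. */
    V read(K key, TS ts);
}
{code}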

This particular issue is a precursor to 
https://issues.apache.org/jira/browse/IGNITE-16611.

A reference implementation is also required; it'll provide an example of 
what's expected from the storage and a set of tests to fix the methods' 
contracts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage

2022-03-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16611:
---
Summary: [Versioned Storage]  Version chain data structure for 
RocksDB-based storage  (was: [Versioned Storage]  POC - Version chain data 
structure for RocksDB-based storage)

> [Versioned Storage]  Version chain data structure for RocksDB-based storage
> ---
>
> Key: IGNITE-16611
> URL: https://issues.apache.org/jira/browse/IGNITE-16611
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> To support concurrency control and implement effective transactions, the 
> capability to store multiple values of the same key is needed in the 
> existing storage.
> h3. Version chain
> Key component here is a special data structure called version chain: it is a 
> list of all versions of a particular key, with the most recent version at the 
> beginning (HEAD).
> Each entry in the chain contains a value, a reference to the next entry in 
> the list, begin and end timestamps and the id of the active transaction that 
> created this version.
> There are at least two approaches to implement this structure on top of 
> RocksDB:
> * Combine the original key and version into a new key which is put into a 
> RocksDB tree. In that case, to restore the version chain we need to iterate 
> over the tree using the original key as a prefix.
> * Use the original key as-is, but make it point not to the value directly 
> but to an array containing the version and other meta information (ts, id, 
> etc.) and keys into some secondary tree.
> h3. New API to manage versions
> The following new API should be implemented to provide access to the version 
> chain:
> * Methods to manipulate versions: add a new version to the chain, commit an 
> uncommitted version, abort an uncommitted version.
> * Method to clean up old versions from the chain.
> * Method to scan over keys up to a provided timestamp.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-14611) Implement error handling for public API based on error codes

2022-03-14 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14611:
---
Labels: iep-84 ignite-3  (was: ignite-3)

> Implement error handling for public API based on error codes
> 
>
> Key: IGNITE-14611
> URL: https://issues.apache.org/jira/browse/IGNITE-14611
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Scherbakov
>Priority: Major
>  Labels: iep-84, ignite-3
> Fix For: 3.0
>
>
> Dev list discussion [1]
> [1] 
> http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Error-handling-in-Ignite-3-td52269.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

