[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-07-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

This immediately creates a write amplification issue, on top of possible 
performance degradation. The obvious solution is inherently bad and needs to be 
improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This 
effectively breaks the RocksDB recovery procedure, so we need to take measures 
to compensate.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch that's applied on lock release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in them or a proper way of implementing them, considering how 
spread out in time the scan process is.
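
A minimal sketch of this scheme, assuming hypothetical method and class names 
({{acquireConsistencyLock()}}, {{releaseConsistencyLock()}}, the storage class 
itself); the RocksDB calls (setDisableWAL, setAtomicFlush, WriteBatchWithIndex, 
WriteBatchInterface#count()) are real API:
{code:java}
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatchWithIndex;
import org.rocksdb.WriteOptions;

class ConsistencyLockSketch {
    /** Atomic flush keeps all column families consistent without a WAL. */
    private final DBOptions dbOptions = new DBOptions()
            .setCreateIfMissing(true)
            .setAtomicFlush(true);

    /** WAL is disabled for every write; recovery relies on flushed SSTs only. */
    private final WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);

    /** Pending updates, one batch per thread. */
    private final ThreadLocal<WriteBatchWithIndex> threadBatch = new ThreadLocal<>();

    private RocksDB db; // Initialized elsewhere.

    void acquireConsistencyLock() {
        threadBatch.set(new WriteBatchWithIndex());
    }

    void releaseConsistencyLock() throws RocksDBException {
        WriteBatchWithIndex batch = threadBatch.get();

        try {
            db.write(writeOptions, batch); // Applies updates to all CFs atomically.
        } finally {
            threadBatch.remove();
            batch.close();
        }
    }

    void ensureNoUnappliedBatch() {
        WriteBatchWithIndex batch = threadBatch.get();

        // Scans with unapplied batches are prohibited for now.
        if (batch != null && batch.count() > 0)
            throw new IllegalStateException("Scan with a pending write batch");
    }
}
{code}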
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed 
index is more challenging: I propose caching it and updating it only from the 
closure that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex value 
into a memtable every time. This makes it harder to find persistedIndex values 
for partitions. Gladly, considering the events that we have, during the time 
between the first "completed" and the very next "begin" the state on disk is 
fully consistent. And there's a way to read data from the storage avoiding the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER), as shown in the 
sketch below.
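
A hedged sketch of such a read; the meta column family handle and the key name 
are assumptions, while ReadTier.PERSISTED_TIER is real RocksDB API:
{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ReadOptions;
import org.rocksdb.ReadTier;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class PersistedIndexReader {
    /** Hypothetical meta key under which the update index is stored. */
    private static final byte[] UPDATE_INDEX_KEY = "updateIndex".getBytes(StandardCharsets.UTF_8);

    /** Reads the update index from SST files only, ignoring all memtables. */
    static long readPersistedUpdateIndex(RocksDB db, ColumnFamilyHandle metaCf) throws RocksDBException {
        try (ReadOptions opts = new ReadOptions().setReadTier(ReadTier.PERSISTED_TIER)) {
            byte[] value = db.get(metaCf, opts, UPDATE_INDEX_KEY);

            return value == null ? 0L : ByteBuffer.wrap(value).getLong();
        }
    }
}
{code}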

Summarizing all of the above, we should implement the following protocol:
{code:java}
During table start: read the latest values of update indexes. Store them in an
in-memory structure.
Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously, so as not to stall 
the flushing process. We don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.
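
A hedged sketch of this protocol as a RocksDB event listener. 
AbstractEventListener, FlushJobInfo and DBOptions#setListeners are real API; 
the single-threaded async read and the empty readUpdateIndexesFromDisk() body 
are placeholders for the actual components:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.rocksdb.AbstractEventListener;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.RocksDB;

class FlushTrackingListener extends AbstractEventListener {
    private enum EventType { ON_FLUSH_BEGIN, ON_FLUSH_COMPLETED }

    /** Alternation of events, updated only from RocksDB flush threads. */
    private volatile EventType lastEventType = EventType.ON_FLUSH_COMPLETED;

    /** Future of the last asynchronous "read indexes from disk" task. */
    private volatile CompletableFuture<Void> lastIndexRead = CompletableFuture.completedFuture(null);

    private final ExecutorService asyncReader = Executors.newSingleThreadExecutor();

    @Override
    public void onFlushBegin(RocksDB db, FlushJobInfo flushJobInfo) {
        if (lastEventType == EventType.ON_FLUSH_BEGIN)
            return;

        // Don't let a new flush start until the previous persisted state is read.
        lastIndexRead.join();

        lastEventType = EventType.ON_FLUSH_BEGIN;
    }

    @Override
    public void onFlushCompleted(RocksDB db, FlushJobInfo flushJobInfo) {
        if (lastEventType == EventType.ON_FLUSH_COMPLETED)
            return;

        // Read asynchronously: RocksDB holds internal locks in this callback.
        lastIndexRead = CompletableFuture.runAsync(() -> readUpdateIndexesFromDisk(db), asyncReader);

        lastEventType = EventType.ON_FLUSH_COMPLETED;
    }

    private void readUpdateIndexesFromDisk(RocksDB db) {
        // Read with ReadTier.PERSISTED_TIER and invoke the persistedIndex closures.
    }
}
{code}
The listener would be registered via {{DBOptions#setListeners}} on the same 
options object that has atomic flush enabled.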

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to the sequence numbers 
of the CFs, and the "default" CF is always created first. This is the exact CF 
that we use to store meta. Maybe we're going to change this and create a 
separate meta CF. Only then could we start optimizing this part, and only if 
we have actual proof that there's a stall in this exact place.
h3. Types of storages

RocksDB is used for:
 * tables
 * cluster management
 * meta-storage

All these types should use the same recovery procedure, but the code is located 
in different places. I hope that it won't be a big problem and we can do 
everything at once.

  was:
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

Immediately we have a write amplification issue, on top 

[jira] [Created] (IGNITE-17310) Integrate IndexStorage into a TableStorage API

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17310:
--

 Summary: Integrate IndexStorage into a TableStorage API
 Key: IGNITE-17310
 URL: https://issues.apache.org/jira/browse/IGNITE-17310
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


As an endpoint, we need an interface that represents a single index storage for 
a single partition. But creating/destroying these storages is not as obvious 
from an API standpoint.

When an index is created, storages should be created for every existing 
partition. And when a partition is created, index storages should be created 
for it as well. This complicates things a little bit, but, generally speaking, 
something like this could be a solution (see the sketch below):
 * CompletableFuture createIndex(indexConfiguration);
 * CompletableFuture dropIndex(indexId);
 * IndexMvStorage getIndexStorage(indexId, partitionId);

Build / rebuild API will be figured out later in another issue.
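
A hedged sketch of how these methods might look in the TableStorage API; the 
parameter types and placeholder interfaces are assumptions based on the list 
above, not the final contract:
{code:java}
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

/** Placeholder types, assumed for the sketch. */
interface IndexConfiguration { }
interface IndexMvStorage { }

/** Index-related part of a TableStorage API. */
interface TableStorageIndexes {
    /** Creates index storages for every existing partition of the table. */
    CompletableFuture<Void> createIndex(IndexConfiguration indexConfiguration);

    /** Destroys the index storages in all partitions. */
    CompletableFuture<Void> dropIndex(UUID indexId);

    /** Returns the index storage of a single partition. */
    IndexMvStorage getIndexStorage(UUID indexId, int partitionId);
}
{code}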



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan (see the sketch after this list).
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.
 * New methods, like {{update}} and {{remove}}, should be added to the API.
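
A hedged sketch of the revised contract; all type names here are local 
stand-ins, and the exact method shapes are assumptions drawn from the bullets 
above:
{code:java}
import java.util.Iterator;

/** Local stand-ins so the sketch is self-contained. */
interface InternalTuple { }
interface RowId { }
interface Cursor<T> extends Iterator<T>, AutoCloseable { }

/** A pair of an indexed tuple and the RowId it points to; no validation here. */
interface IndexRow {
    InternalTuple tuple();

    RowId rowId();
}

/** Revised contract: no row validation, no TxId/Timestamp, no partition filter. */
interface SortedIndexMvStorageSketch {
    /** Returns indexed tuples and RowIds between the bounds; rows are filtered outside. */
    Cursor<IndexRow> scan(InternalTuple lowerBound, InternalTuple upperBound);

    void update(IndexRow oldRow, IndexRow newRow);

    void remove(IndexRow row);
}
{code}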

h3. New API for removed functions
 * There should be a new entity on top of the partition and index stores. It 
updates indexes and filters scan queries. There's no point in fully designing 
it right now; all we need is working tests. Porting the current tests to the 
new API is up to the developer.

h3. Other

I would say that efficient InternalTuple comparison is out of scope. We could 
just adapt the current test code somehow.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{InternalTuple}}, with the requirement that every 
> internal tuple can be converted into an IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. An index scan should NOT by itself filter out invalid rows; this 
> will be performed outside of the scan.
>  * TxId / Timestamp parameters are no longer applicable, given that the 
> index does not perform row validation.
>  * The partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. The former can be brought back in the future, while the latter makes no 
> sense considering that indexes are not multiversioned.
>  * New methods, like {{update}} and {{remove}}, should be added to the API.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now. Porting current tests 

[jira] [Updated] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17308:
---
Description: 
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.

h3. New API for removed functions
 * There should be a new entity on top of partition and index store. It updates 
indexes and filters scan queries. There's no point in fully designing it right 
now, all we need is working tests for now.

  was:
Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.


> Revisit SortedIndexMvStorage interface
> --
>
> Key: IGNITE-17308
> URL: https://issues.apache.org/jira/browse/IGNITE-17308
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
> contract is far from obvious and it's only used in tests as a part of 
> "reference implementation".
> Originally, it was implemented when the vision of MV store wasn't fully 
> solidified.
> h3. API changes
>  * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
> should be replaced with {{InternalTuple}}, with the requirement that every 
> internal tuple can be converted into an IEP-92 format.
>  * {{scan}} should not return rows, but only indexed rows and RowId 
> instances. An index scan should NOT by itself filter out invalid rows; this 
> will be performed outside of the scan.
>  * TxId / Timestamp parameters are no longer applicable, given that the 
> index does not perform row validation.
>  * The partition filter should be removed as well. To simplify things, every 
> partition will be indexed {+}independently{+}.
>  * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
> now. The former can be brought back in the future, while the latter makes no 
> sense considering that indexes are not multiversioned.
> h3. New API for removed functions
>  * There should be a new entity on top of partition and index store. It 
> updates indexes and filters scan queries. There's no point in fully designing 
> it right now, all we need is working tests for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17308) Revisit SortedIndexMvStorage interface

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17308:
--

 Summary: Revisit SortedIndexMvStorage interface
 Key: IGNITE-17308
 URL: https://issues.apache.org/jira/browse/IGNITE-17308
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Currently, SortedIndexMvStorage is a very weird mixture of many things. Its 
contract is far from obvious and it's only used in tests as a part of 
"reference implementation".

Originally, it was implemented when the vision of MV store wasn't fully 
solidified.
h3. API changes
 * {{IndexRowEx}} should disappear. It was a quick and dirty solution. It 
should be replaced with {{InternalTuple}}, with the requirement that every 
internal tuple can be converted into an IEP-92 format.
 * {{scan}} should not return rows, but only indexed rows and RowId instances. 
An index scan should NOT by itself filter out invalid rows; this will be 
performed outside of the scan.
 * TxId / Timestamp parameters are no longer applicable, given that the index 
does not perform row validation.
 * The partition filter should be removed as well. To simplify things, every 
partition will be indexed {+}independently{+}.
 * {{supportsBackwardsScan}} and {{supportsIndexOnlyScan}} can be removed for 
now. The former can be brought back in the future, while the latter makes no 
sense considering that indexes are not multiversioned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16156) Byte ordered index keys.

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16156.

Resolution: Won't Fix

A different data format will be used.

> Byte ordered index keys.
> 
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Reporter: Alexander Belyak
>Assignee: Alexander Belyak
>Priority: Major
>  Labels: ignite-3
>
> To improve the speed of operations with indexes, Ignite can store keys in a 
> byte-ordered format, so a natural byte[] comparator will be enough to scan 
> them.
> Required features:
> 1) Write (almost) any data types.
> Must have: boolean, byte, short, int, long, float, double, bigint, 
> bigdecimal, String, Date, Time, DateTime.
> Like to have: byte[], bitset
> Unlikely to have: timestamp with timezone
> 2) Support null values for any columns. Like to have: support 
> nullFirst/nullLast
> 3) Write asc/desc ordering (in any combination for columns, for indexes like 
> "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space used and speed.
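
As an illustration of the byte-ordered idea (a hedged sketch, not code from 
the ticket): flipping the sign bit makes two's-complement ints compare 
correctly under an unsigned lexicographic byte comparator, and inverting all 
bits covers "desc" columns:
{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ByteOrderedKeys {
    /** Flipping the sign bit maps int order onto unsigned byte order. */
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value ^ Integer.MIN_VALUE).array();
    }

    /** Inverting all bits reverses the order, which covers "desc" columns. */
    static byte[] descending(byte[] key) {
        byte[] inverted = key.clone();

        for (int i = 0; i < inverted.length; i++)
            inverted[i] = (byte) ~inverted[i];

        return inverted;
    }

    public static void main(String[] args) {
        // -5 < 3 numerically, and so are the encoded keys byte-wise.
        System.out.println(Arrays.compareUnsigned(encodeInt(-5), encodeInt(3)) < 0); // true
    }
}
{code}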



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16105) Replace sorted index binary storage protocol

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16105.

Resolution: Won't Fix

IGNITE-17192 will be used instead

> Replace sorted index binary storage protocol
> 
>
> Key: IGNITE-16105
> URL: https://issues.apache.org/jira/browse/IGNITE-16105
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column 
> values into byte arrays. This approach is not optimal for the following 
> reasons:
> # Data is stored in RocksDB and we can't use its native lexicographic 
> comparator; we rely on a custom Java-based comparator that needs to 
> de-serialize all columns in order to compare them. This is bad 
> performance-wise, because Java-based comparators are slower and we need to 
> extract all column values;
> # Range scans can't use the prefix seek operation from RocksDB, because 
> {{BinaryRow}} serialization is not stable: a serialized prefix of column 
> values will not be a prefix of the whole serialized row, because the format 
> depends on the columns being serialized;
> # {{BinaryRow}} serialization is designed to store versioned row data and is 
> overall badly suited to the Sorted Index purposes; its API usage looks 
> awkward in this context.
> We need to find a new serialization protocol that will (ideally) satisfy the 
> following requirements:
> # It should be comparable lexicographically;
> # It should support null values;
> # It should support variable length columns (though this requirement can 
> probably be dropped);
> # It should support both ascending and descending order for individual 
> columns;
> # It should support all data types that {{BinaryRow}} uses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16079) Rename search and data keys for the Partition Storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16079.

Resolution: Won't Fix

> Rename search and data keys for the Partition Storage
> -
>
> Key: IGNITE-16079
> URL: https://issues.apache.org/jira/browse/IGNITE-16079
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> There are currently the following classes in the {{PartitionStorage}} that 
> act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the 
> {{SortedIndexStorage}} interface hard to understand, because it stores 
> {{SearchRows}} as values. It is proposed to rename these classes:
>  {{SearchRow}} -> {{PartitionKey}}
>  {{DataRow}} -> {{PartitionData}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16059.

Resolution: Won't Fix

> Add options to the "range" method in SortedIndexStorage
> ---
>
> Key: IGNITE-16059
> URL: https://issues.apache.org/jira/browse/IGNITE-16059
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage]
>  declares the following API for the {{SortedIndexStorage#range}} method:
> {code:java}
> /** Exclude lower bound. */
> byte GREATER = 0;
>  
> /** Include lower bound. */
> byte GREATER_OR_EQUAL = 1;
>  
> /** Exclude upper bound. */
> byte LESS = 0;
>  
> /** Include upper bound. */
> byte LESS_OR_EQUAL = 1 << 1;
> /**
>  * Return rows between lower and upper bounds.
>  * Fill result rows with the fields specified in the projection set.
>  *
>  * @param low Lower bound of the scan.
>  * @param up Upper bound of the scan.
>  * @param scanBoundMask Scan bound mask (specify how to work with rows 
> equals to the bounds: include or exclude).
>  * @param proj Set of the columns IDs to fill results rows.
>  */
> Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj);
> {code}
> The {{scanBoundMask}} flags are currently not implemented. This API should be 
> revised and implemented, if needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17306:
---
Description: 
There are a few places in presto that are too slow; we can easily optimize them

(Nothing will be committed if there's no visible difference in test duration)

  was:There are a few places in presto that are too slow; we can easily 
optimize them


> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them
> (Nothing will be committed if there's no visible difference in test duration)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17306:
--

Assignee: Ivan Bessonov

> Speedup runtime classes compilation speed for configuration
> ---
>
> Key: IGNITE-17306
> URL: https://issues.apache.org/jira/browse/IGNITE-17306
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There are a few places in presto that are too slow; we can easily optimize 
> them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17306) Speedup runtime classes compilation speed for configuration

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17306:
--

 Summary: Speedup runtime classes compilation speed for 
configuration
 Key: IGNITE-17306
 URL: https://issues.apache.org/jira/browse/IGNITE-17306
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


There are a few places in presto that are too slow; we can easily optimize them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-15908) Investigate index binary structure compatibility

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-15908:
---
Epic Link: IGNITE-17304

> Investigate index binary structure compatibility
> 
>
> Key: IGNITE-15908
> URL: https://issues.apache.org/jira/browse/IGNITE-15908
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage has a binary storage format that is subject to change in 
> the future. Though the index schema is immutable and any change to it leads to 
> the index being rebuilt, it should be possible to update the storage format 
> without rebuilding. This means that there should be some kind of versioning 
> mechanism, so that the {{IndexKey}} serialization format can be changed in a 
> backwards-compatible way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16059) Add options to the "range" method in SortedIndexStorage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16059:
---
Epic Link: IGNITE-17304

> Add options to the "range" method in SortedIndexStorage
> ---
>
> Key: IGNITE-16059
> URL: https://issues.apache.org/jira/browse/IGNITE-16059
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> [IEP-74|https://cwiki.apache.org/confluence/display/IGNITE/IEP-74+Data+Storage]
>  declares the following API for the {{SortedIndexStorage#range}} method:
> {code:java}
> /** Exclude lower bound. */
> byte GREATER = 0;
>  
> /** Include lower bound. */
> byte GREATER_OR_EQUAL = 1;
>  
> /** Exclude upper bound. */
> byte LESS = 0;
>  
> /** Include upper bound. */
> byte LESS_OR_EQUAL = 1 << 1;
> /**
>  * Return rows between lower and upper bounds.
>  * Fill result rows with the fields specified in the projection set.
>  *
>  * @param low Lower bound of the scan.
>  * @param up Upper bound of the scan.
>  * @param scanBoundMask Scan bound mask (specify how to work with rows 
> equals to the bounds: include or exclude).
>  * @param proj Set of the columns IDs to fill results rows.
>  */
> Cursor scan(Row low, Row up, byte scanBoundMask, BitSet proj);
> {code}
> The {{scanBoundMask}} flags are currently not implemented. This API should be 
> revised and implemented, if needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16079) Rename search and data keys for the Partition Storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16079:
---
Epic Link: IGNITE-17304

> Rename search and data keys for the Partition Storage
> -
>
> Key: IGNITE-16079
> URL: https://issues.apache.org/jira/browse/IGNITE-16079
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> There are currently the following classes in the {{PartitionStorage}} that 
> act as data and search keys: {{SearchRow}} and {{DataRow}}. This makes the 
> {{SortedIndexStorage}} interface hard to understand, because it stores 
> {{SearchRows}} as values. It is proposed to rename these classes:
>  {{SearchRow}} -> {{PartitionKey}}
>  {{DataRow}} -> {{PartitionData}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16105) Replace sorted index binary storage protocol

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16105:
---
Epic Link: IGNITE-17304

> Replace sorted index binary storage protocol
> 
>
> Key: IGNITE-16105
> URL: https://issues.apache.org/jira/browse/IGNITE-16105
> Project: Ignite
>  Issue Type: Task
>Reporter: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
>
> Sorted Index Storage currently uses {{BinaryRow}} as a way to convert column 
> values into byte arrays. This approach is not optimal for the following 
> reasons:
> # Data is stored in RocksDB and we can't use its native lexicographic 
> comparator; we rely on a custom Java-based comparator that needs to 
> de-serialize all columns in order to compare them. This is bad 
> performance-wise, because Java-based comparators are slower and we need to 
> extract all column values;
> # Range scans can't use the prefix seek operation from RocksDB, because 
> {{BinaryRow}} serialization is not stable: a serialized prefix of column 
> values will not be a prefix of the whole serialized row, because the format 
> depends on the columns being serialized;
> # {{BinaryRow}} serialization is designed to store versioned row data and is 
> overall badly suited to the Sorted Index purposes; its API usage looks 
> awkward in this context.
> We need to find a new serialization protocol that will (ideally) satisfy the 
> following requirements:
> # It should be comparable lexicographically;
> # It should support null values;
> # It should support variable length columns (though this requirement can 
> probably be dropped);
> # It should support both ascending and descending order for individual 
> columns;
> # It should support all data types that {{BinaryRow}} uses.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16156) Byte ordered index keys.

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16156:
---
Epic Link: IGNITE-17304

> Byte ordered index keys.
> 
>
> Key: IGNITE-16156
> URL: https://issues.apache.org/jira/browse/IGNITE-16156
> Project: Ignite
>  Issue Type: Task
>  Components: sql
>Reporter: Alexander Belyak
>Assignee: Alexander Belyak
>Priority: Major
>  Labels: ignite-3
>
> To improve the speed of operations with indexes, Ignite can store keys in a 
> byte-ordered format, so a natural byte[] comparator will be enough to scan 
> them.
> Required features:
> 1) Write (almost) any data types.
> Must have: boolean, byte, short, int, long, float, double, bigint, 
> bigdecimal, String, Date, Time, DateTime.
> Like to have: byte[], bitset
> Unlikely to have: timestamp with timezone
> 2) Support null values for any columns. Like to have: support 
> nullFirst/nullLast
> 3) Write asc/desc ordering (in any combination for columns, for indexes like 
> "col1 asc, col2 desc, col3 asc").
> Non-functional requirements: space used and speed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14937) Index schema & Index management integration

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14937:
---
Epic Link: IGNITE-17304

> Index schema & Index management integration
> ---
>
> Key: IGNITE-14937
> URL: https://issues.apache.org/jira/browse/IGNITE-14937
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> The public index schema (required indexes) and the current index state on the 
> cluster are different.
> We have to track it, store it, and provide the actual index schema state to 
> any component: select queries, DDL queries, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14936) Benchmark sorted index scan vs table's partitions scan

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14936:
---
Epic Link: IGNITE-17304

> Benchmark sorted index scan vs table's partitions scan
> --
>
> Key: IGNITE-14936
> URL: https://issues.apache.org/jira/browse/IGNITE-14936
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> We have to decide which data structures are used for the PK and table scans.
> Possible cases:
> - table partitions sorted by plain bytes/hash (in fact: unsorted);
> - table partitions sorted by PK columns;
> - PK sorted index (one store for all partitions on the node).
> All cases have pros and cons. The choice should be based on benchmarks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14940) Investigation parallel index scan

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14940:
---
Epic Link: IGNITE-17304

> Investigation parallel index scan
> -
>
> Key: IGNITE-14940
> URL: https://issues.apache.org/jira/browse/IGNITE-14940
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Motivation: the 2.x version implements {{queryParallelism}} by creating index 
> segments. Each segment contains a subset of partitions. This approach has 
> several shortcomings:
> - index scan parallelism cannot be changed / scaled at runtime;
> - we always have to scan all segments (looks like a virtual MapNode for the 
> query);
> - many index storages for one logical index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14938) Introduce persistence store for the indexes states on cluster

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14938:
---
Epic Link: IGNITE-17304

> Introduce persistence store for the indexes states on cluster
> -
>
> Key: IGNITE-14938
> URL: https://issues.apache.org/jira/browse/IGNITE-14938
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Includes:
> - building state progress;
> - ready to scan / building;
> - rebuild index;
> - support node restart and index recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14939) Tests coverage for index rebuild and recovery scenarios

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14939:
---
Epic Link: IGNITE-17304

> Tests coverage for index rebuild and recovery scenarios
> ---
>
> Key: IGNITE-14939
> URL: https://issues.apache.org/jira/browse/IGNITE-14939
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>
> Test cases from version 2.x must be analyzed and ported to 3.0.
> See {{AbstractRebuildIndexTest}} and its children in 2.x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: IGNITE-17304

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16196) Supports index rename

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16196:
---
Epic Link: IGNITE-17304

> Supports index rename
> -
>
> Key: IGNITE-16196
> URL: https://issues.apache.org/jira/browse/IGNITE-16196
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Need to support index rename.
> ALTER INDEX [ IF EXISTS ]  RENAME TO 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: IGNITE-17304

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: IGNITE-17304

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-14925.

Resolution: Duplicate

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Epic Link: IGNITE-17304

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562614#comment-17562614
 ] 

Ivan Bessonov commented on IGNITE-14925:


Replaced with EPIC

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17304) SQL indexes 3.0 epic

2022-07-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17304:
--

 Summary: SQL indexes 3.0 epic
 Key: IGNITE-17304
 URL: https://issues.apache.org/jira/browse/IGNITE-17304
 Project: Ignite
  Issue Type: Epic
Reporter: Ivan Bessonov


Ignite 3.x requires SQL indexes, just like any other database. This epic is 
the collection of issues related to index design and implementation.

This includes:
 * index configuration
 * index lifecycle
 * index storage
 * index integration into SQL queries



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Issue Type: New Feature  (was: Epic)

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: New Feature
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: (was: IGNITE-14925)

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: (was: IGNITE-14925)

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: (was: IGNITE-14925)

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16199) Implements index build/rebuild

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16199:
---
Epic Link: IGNITE-14925

> Implements index build/rebuild 
> ---
>
> Key: IGNITE-16199
> URL: https://issues.apache.org/jira/browse/IGNITE-16199
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Affects Versions: 3.0.0-alpha3
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The index must be built on existing table data: scan the table's data and 
> build the index.
> Currently, only updating the index on table updates is implemented.
> Maybe the build and rebuild tasks should be split.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16265) Integration SQL Index and data storage

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16265:
---
Epic Link: IGNITE-14925

> Integration SQL Index and data storage
> --
>
> Key: IGNITE-16265
> URL: https://issues.apache.org/jira/browse/IGNITE-16265
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Yury Gerzhedovich
>Assignee: Konstantin Orlov
>Priority: Major
>  Labels: ignite-3
>
> Need to think about the point of integration of data modification 
> (put/remove/amend) with updating data in SQL indexes. 
> As a first version of the integration, let's update indexes on commit.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16202) Supports transactions by index

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16202:
---
Epic Link: IGNITE-14925

> Supports transactions by index
> --
>
> Key: IGNITE-16202
> URL: https://issues.apache.org/jira/browse/IGNITE-16202
> Project: Ignite
>  Issue Type: Improvement
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> Indexes must support transaction protocol.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Epic Name: Sorted SQL indexes

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: Epic
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-14925) Sorted indexes engine

2022-07-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14925:
---
Issue Type: Epic  (was: New Feature)

> Sorted indexes engine
> -
>
> Key: IGNITE-14925
> URL: https://issues.apache.org/jira/browse/IGNITE-14925
> Project: Ignite
>  Issue Type: Epic
>  Components: sql
>Reporter: Taras Ledkov
>Priority: Major
>  Labels: ignite-3
>
> The umbrella ticket to track improvements and issues related to the design 
> and development of the sorted index engine for Ignite 3.0.
> Feature branch: 
> [ignite-14925-sorted-indexes|https://github.com/apache/ignite-3/tree/ignite-14925-sorted-indexes]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17272:
---
Component/s: cache

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.13
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17272:
---
Affects Version/s: 2.13

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.13
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17272) Logical recovery works incorrectly for encrypted caches

2022-07-01 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561364#comment-17561364
 ] 

Ivan Bessonov commented on IGNITE-17272:


Looks good to me, thank you! I'll merge it to master

> Logical recovery works incorrectly for encrypted caches
> ---
>
> Key: IGNITE-17272
> URL: https://issues.apache.org/jira/browse/IGNITE-17272
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When encryption is enabled for a particular cache, its WAL records get 
> encrypted and wrapped in an {{EncryptedRecord}}. This encrypted record type 
> is considered a {{PHYSICAL}} record, which leads to such records being 
> omitted during logical recovery regardless of the fact that it can contain 
> logical records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel

2022-06-30 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17283:
---
Ignite Flags:   (was: Docs Required,Release Notes Required)

> ItCmgRaftServiceTest should start Raft groups in parallel
> -
>
> Key: IGNITE-17283
> URL: https://issues.apache.org/jira/browse/IGNITE-17283
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Minor
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the 
> first group waits for other members to appear before it times out. This leads 
> to this test running for quite a long time. It is proposed to start these 
> groups in parallel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-17283) ItCmgRaftServiceTest should start Raft groups in parallel

2022-06-30 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561099#comment-17561099
 ] 

Ivan Bessonov commented on IGNITE-17283:


Looks good, thank you for the improvement!

> ItCmgRaftServiceTest should start Raft groups in parallel
> -
>
> Key: IGNITE-17283
> URL: https://issues.apache.org/jira/browse/IGNITE-17283
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Minor
>  Labels: ignite-3
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ItCmgRaftServiceTest starts a couple of Raft groups sequentially, so the 
> first group waits for other members to appear before it times out. This leads 
> to this test running for quite a long time. It is proposed to start these 
> groups in parallel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IGNITE-17278) TableManager#directTableIds can't be implemented effectively

2022-06-30 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17278:
--

 Summary: TableManager#directTableIds can't be implemented 
effectively
 Key: IGNITE-17278
 URL: https://issues.apache.org/jira/browse/IGNITE-17278
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov


I propose adding a special method "internalIds" to the direct proxy, so that 
there won't be a need to read all tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated IGNITE-16913:
---
Epic Link: IGNITE-16923

--
This message was sent by Atlassian Jira
(v8.20.10#820010-sha1:ace47f9)


[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected:
 * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with "ack" or fail; otherwise the log entry should be preserved.
 * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved.

It feels like the second option is just a special case of the first one - we can't truncate the log until we receive all acks. And we can't receive an ack from the joining node until it finishes its rebalancing procedure.

So, it all comes down to aggressive log truncation to keep the log short.

The preserved log can be quite big in reality, so a disk offloading operation must be available.

The easiest way to achieve it is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile. (See the sketch below.)

To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.
 * Potentially, we could use a volatile page memory region for this purpose, since it already has good control over the amount of memory used. But memory overflow should be carefully processed; usually it's treated as an error and might even cause node failure.
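
A minimal sketch of that offloading idea, assuming a disposable RocksDB instance with WAL disabled; the class and method names are illustrative:
{code:java}
import java.nio.ByteBuffer;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

class VolatileLogStore implements AutoCloseable {
    private final RocksDB db;
    private final WriteOptions noWal;

    VolatileLogStore(String path) throws RocksDBException {
        // The directory is wiped on restart, so losing memtable contents is fine.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
        this.noWal = new WriteOptions().setDisableWAL(true);
    }

    void append(long index, byte[] entry) throws RocksDBException {
        // No WAL write happens here; data stays in the memtable until a flush.
        db.put(noWal, ByteBuffer.allocate(Long.BYTES).putLong(index).array(), entry);
    }

    @Override
    public void close() {
        noWal.close();
        db.close();
    }
}
{code}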

[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-29 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected:
 * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with "ack" or fail; otherwise the log entry should be preserved.
 * During a clean node join - the node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything, starting with the snapshot's applied index, will have to be preserved.

It feels like the second option is just a special case of the first one - we can't truncate the log until we receive all acks. And we can't receive an ack from the joining node until it finishes its rebalancing procedure.

So, it all comes down to aggressive log truncation to keep the log short.

The preserved log can be quite big in reality, so a disk offloading operation must be available.

The easiest way to achieve it is to write into a RocksDB instance with WAL disabled. It'll store everything in memory until the flush, and even then the amount of flushed data will be small on stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it's supposed to be volatile.

To avoid even the smallest flush, we can use an additional volatile structure, like a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only on structure overflow. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.

[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-28 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov

h3. Original issue description

For in-memory storage Raft logging can be optimized as we don't need to have it active when topology is stable.

Each write can directly go to in-memory storage at much lower cost than synchronizing it with disk, so it is possible to avoid writing Raft log.

As nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing - no need to keep a long Raft log for historical rebalancing purposes.

So we need to implement an API for the Raft component enabling configuration of the Raft logging process.

h3. More detailed description

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)



[jira] [Updated] (IGNITE-16655) Volatile RAFT log for pure in-memory storages

2022-06-28 Thread Ivan Bessonov (Jira)

Ivan Bessonov updated an issue

Ignite / IGNITE-16655
Volatile RAFT log for pure in-memory storages

Change By: Ivan Bessonov
Summary: Volatile RAFT log for pure in-memory storages  (was: Raft log improvements for pure in-memory storages)

This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)



[jira] [Updated] (IGNITE-17230) Support split-file page store

2022-06-27 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17230:
---
Description: 
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIdx* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*} (see the sketch below):
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.
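
A hedged sketch of that read path (binary search over the header's sorted pageIdx list, newest delta first); the field names and offset math are assumptions:
{code:java}
import java.util.Arrays;
import java.util.List;

class DeltaFilePageStoreIndex {
    private final int[] sortedPageIdxs; // from the delta file header, sorted ascending
    private final long pagesOffset;     // file offset where page content starts
    private final int pageSize;

    DeltaFilePageStoreIndex(int[] sortedPageIdxs, long pagesOffset, int pageSize) {
        this.sortedPageIdxs = sortedPageIdxs;
        this.pagesOffset = pagesOffset;
        this.pageSize = pageSize;
    }

    /** Returns the file offset of the page, or -1 if this delta file doesn't contain it. */
    long pageOffset(int pageIdx) {
        int pos = Arrays.binarySearch(sortedPageIdxs, pageIdx);

        return pos < 0 ? -1 : pagesOffset + (long) pos * pageSize;
    }

    /** Read path: interrogate deltas from newest to oldest, fall back to the main file on miss. */
    static long resolve(List<DeltaFilePageStoreIndex> newestToOldest, int pageIdx) {
        for (DeltaFilePageStoreIndex delta : newestToOldest) {
            long offset = delta.pageOffset(pageIdx);

            if (offset >= 0)
                return offset;
        }

        return -1; // not in any delta; read the page from the FilePageStore itself
    }
}
{code}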

  was:
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIds* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*}:
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.


> Support split-file page store
> 
>
> Key: IGNITE-17230
> URL: https://issues.apache.org/jira/browse/IGNITE-17230
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Notes*
> Description may not be complete.
> *Goal*
> To implement a new checkpoint (described in IGNITE-15818), we will introduce 
> a new entity {*}DeltaFilePageStore{*}, which will be created for each 
> partition at each checkpoint and removed after merging with the 
> *FilePageStore* (the main partition file) using the compacter.
> *DeltaFilePageStore* will consist of:
>  * Header (may be updated in the course of implementation):
>  ** Allocation *pageIdx* - *pageIdx* of the last created page;
>  * Sorted list of *pageIdx* - allows a binary search to find the file offset 
> for a {*}pageId -> pageIdx{*} lookup;
>  * Page content - sorted by {*}pageIdx{*}.
> What will change for {*}FilePageStore{*}:
>  * A list of *DeltaFilePageStore* instances will be added (from the newest to 
> the oldest by the time of creation);
>  * Allocation index (pageIdx of the last created page) - it will be logical 
> and contained in the header of {*}Fi

[jira] [Updated] (IGNITE-17230) Support split-file page store

2022-06-27 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17230:
---
Description: 
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity {*}DeltaFilePageStore{*}, which will be created for each partition 
at each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
 * Header (may be updated in the course of implementation):
 ** Allocation *pageIdx* - *pageIdx* of the last created page;
 * Sorted list of *pageIds* - allows a binary search to find the file offset 
for a {*}pageId -> pageIdx{*} lookup;
 * Page content - sorted by {*}pageIdx{*}.

What will change for {*}FilePageStore{*}:
 * A list of *DeltaFilePageStore* instances will be added (from the newest to 
the oldest by the time of creation);
 * Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of {*}FilePageStore{*}. At node start, it will be read 
from the header of *FilePageStore* or obtained from the first 
*DeltaFilePageStore* (the newest one).

How pages will be read by {*}pageId -> pageIdx{*}:
 * Interrogate the *DeltaFilePageStore* instances in order from the newest to 
the oldest;
 * If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
 * The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
 * Before creating {*}part-1-delta-3.bin{*}, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to {*}part-1-delta-3.bin{*}.

  was:
*Notes*
Description may not be complete.

*Goal*
To implement a new checkpoint (described in IGNITE-15818), we will introduce a 
new entity *DeltaFilePageStore*, which will be created for each partition at 
each checkpoint and removed after merging with the *FilePageStore* (the main 
partition file) using the compacter.

*DeltaFilePageStore* will consist of:
* Header (may be updated in the course of implementation):
** Allocation *pageIdx* - *pageIdx* of the last created page;
* Sorted list of *pageIdx* - allows a binary search to find the file offset for 
a *pageId -> pageIdx* lookup;
* Page content - sorted by *pageIdx*.

What will change for *FilePageStore*:
* A list of *DeltaFilePageStore* instances will be added (from the newest to the 
oldest by the time of creation);
* Allocation index (pageIdx of the last created page) - it will be logical and 
contained in the header of *FilePageStore*. At node start, it will be read from 
the header of *FilePageStore* or obtained from the first *DeltaFilePageStore* 
(the newest one).

How pages will be read by *pageId -> pageIdx*:
* Interrogate the *DeltaFilePageStore* instances in order from the newest to the 
oldest;
* If not found, then read the page from the *FilePageStore* itself.

*Some implementation notes*
* The format of the file name for the *DeltaFilePageStore* is 
*part-%d-delta-%d.bin*, for example *part-1-delta-3.bin*, where the first digit 
is the partition identifier, and the second is the serial number of the delta 
file for this partition;
* Before creating *part-1-delta-3.bin*, a temporary file 
*part-1-delta-3.bin.tmp* will be created at the checkpoint first, then filled, 
then renamed to *part-1-delta-3.bin*.


> Support split-file page store
> 
>
> Key: IGNITE-17230
> URL: https://issues.apache.org/jira/browse/IGNITE-17230
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Notes*
> Description may not be complete.
> *Goal*
> To implement a new checkpoint (described in IGNITE-15818), we will introduce 
> a new entity {*}DeltaFilePageStore{*}, which will be created for each 
> partition at each checkpoint and removed after merging with the 
> *FilePageStore* (the main partition file) using the compacter.
> *DeltaFilePageStore* will consist of:
>  * Header (may be updated in the course of implementation):
>  ** Allocation *pageIdx* - *pageIdx* of the last created page;
>  * Sorted list of *pageIds* - allows a binary search to find the file offset 
> for a {*}pageId -> pageIdx{*} lookup;
>  * Page content - sorted by {*}pageIdx{*}.
> What will change for {*}FilePageStore{*}:
>  * A list of *DeltaFilePageStore* instances will be added (from the newest to 
> the oldest by the time of creation);
>  * Allocation index (pageIdx of the last created page) - it will be logical 
> and contained in the header of {*}FilePageStore{*}. At node start, it will be

[jira] [Commented] (IGNITE-17199) Improve the usability of the abstract configuration interface

2022-06-21 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556722#comment-17556722
 ] 

Ivan Bessonov commented on IGNITE-17199:


[~ktkale...@gridgain.com] I don't think that improving something here is 
necessary. Wildcard types are an integral part of the Java type system; they're 
not a bad thing. Over-engineering everything because of several "" occurrences 
in code won't make the product better, IMO.

> Improve the usability of the abstract configuration interface
> -
>
> Key: IGNITE-17199
> URL: https://issues.apache.org/jira/browse/IGNITE-17199
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Kirill Tkalenko
>Priority: Major
>  Labels: iep-55, ignite-3
> Fix For: 3.0.0-alpha6
>
>
> *Problem*
> Consider an example of generating configuration interfaces (**Configuration*) 
> for an abstract configuration.
> Configuration schemas:
> {code:java}
> @AbstractConfiguration
> public class BaseConfigurationSchema {
> @Value
> public int size;
> }
> @Config
> public class VolatileConfigurationSchema extends BaseConfigurationSchema {
> @Value
> public double evictionThreshold;
> }
> {code}
> Configuration interfaces:
> {code:java}
> public interface BaseConfiguration<VIEWT extends BaseView, CHANGET extends BaseChange> extends ConfigurationTree<VIEWT, CHANGET> {
> ConfigurationValue<Integer> size();
> }
> public interface VolatileConfiguration extends 
> BaseConfiguration<VolatileView, VolatileChange> {
> ConfigurationValue<Integer> size();
> }
> {code}
> This implementation allows us to work with the inheritors of the abstract 
> configuration as with a regular configuration (as if 
> *VolatileConfigurationSchema* did not extend *BaseConfigurationSchema*), but 
> when working with the abstract configuration itself, it creates 
> inconvenience. 
> For example, to get a view of the abstract configuration, we will need to 
> write the following code:
> {code:java}
> BaseConfiguration baseConfig0 = ...;
> BaseConfiguration baseConfig1 = ...;
> 
> BaseView baseView0 = (BasePageMemoryDataRegionView) baseConfig0.value();
> BaseView baseView1 = baseConfig1.value();
> {code}
> Which is not convenient and I would like us to be able to work in the same 
> way as with the *VolatileConfiguration*.
> *Possible implementations*
> * Simplest is to leave it as is;
> * Create an additional configuration interface that will be similar to 
> *BaseConfiguration*, for example *BaseConfigurationTree*, but it will be 
> extended by *BaseConfiguration* and all its inheritors like 
> *VolatileConfiguration*, then there may be confusion about whether to use 
> *BaseConfiguration* or *BaseConfigurationTree* in the end, so we need to 
> decide how to create a name for such an interface;
> ** *BaseConfigurationTree*;
> ** *AbstractBaseConfigurationTree*;
> ** other.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17077:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we will come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to validate 
that updates don't come out of order or with gaps. This is the way to guarantee 
that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
long updateIndex = partition.getUpdateIndex();
long raftIndex = writeCommand.raftIndex();

if (raftIndex != updateIndex + 1) {
throw new IndexMismatchException(updateIndex);
}

partition.write(writeCommand.row());

for (Index index : table.indexes(partition)) {
index.index(writeCommand.row());
}

partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * Mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage update. 
Generally speaking, we can't, or sometimes don't want to, work this way. Examples 
of operations where atomicity this strict is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids, we can exhaust the 
memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be called at the end of the 
last "atomic" operation, so that the last command can be safely reapplied.

h2. Implementation

The "set" method could write a value directly into the partition's meta page. 
This *will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way the checkpoint should work. 
This may not be the right place to describe the issue, but I'll do it nonetheless. 
It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint. We just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page manipulations 
under the write lock. Flushing freelists is enough already. (NOTE: we should test 
the performance without onheap-cache, it'll speed up the checkpoint start process, 
thus reducing latency spikes.)

A better way to do this is not having meta pages in page memory whatsoever. Maybe 
during the start, but that's it. It's a common practice to have a pageSize equal 
to 16Kb. The effective payload of a partition meta page in Ignite 2.x is just 
above 100 bytes. I expect it to be way lower in Ignite 3.0. Having a loaded page 
for every partition is just a waste of resources; all required data can be 
stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if the meta 
has not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page will become a write-only page, which is basically what we need. 
h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots. They will produce a lot of junk 
checkpoints. This is because a checkpoint is a {*}global operation{*}. Imagine 
RAFT triggering snapshots for 100 partitions in a row. This would result in 100 
minuscule checkpoints that no one needs. So, I'd say, we need two operations:
 * partition.getCheckpointerUpdateIndex();
 * partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods could be used by RAFT to determine whether it needs to 
truncate its log and to define a specific commit index for truncation.

In the case of the PDS checkpointer, the implementation of both of these methods 
is trivial.
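
A hedged sketch of how RAFT could consume these two operations; {{Partition}} and {{RaftLog}} are stand-ins for the real interfaces:
{code:java}
interface Partition {
    long getCheckpointerUpdateIndex();

    void registerCheckpointedUpdateIndexListener(java.util.function.LongConsumer closure);
}

interface RaftLog {
    void truncateTo(long commitIndex);
}

class LogTruncation {
    // Truncate once for whatever has already been checkpointed, then keep
    // truncating as the checkpointer reports newly persisted update indexes.
    static void wire(Partition partition, RaftLog raftLog) {
        raftLog.truncateTo(partition.getCheckpointerUpdateIndex());
        partition.registerCheckpointedUpdateIndexListener(raftLog::truncateTo);
    }
}
{code}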

  was:
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be a "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * First one is invoked at the end of every write command, with RAFT commit 
index being passed as a parameter. This is done right befo

[jira] [Resolved] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-17074.

Resolution: Duplicate

> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{TablesConfigurationSchema}}, for example, and increment it every time a new 
> table is created. Then all we need is to read this value in all components 
> that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17074:
---
Description: 
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to a 
{{TablesConfigurationSchema}}, for example, and increment it every time a new 
table is created. Then all we need is to read this value in all components that 
need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.

  was:
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
a {{int lastTableId}} configuration property to a 
{{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every time 
new table is created. Then all we need is to read this value in all components 
that need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.


> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{TablesConfigurationSchema}}, for example, and increment it every time a new 
> table is created. Then all we need is to read this value in all components 
> that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.
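
As a sketch, assuming Ignite 3's configuration annotations are used for this; the root name and default value are illustrative:
{code:java}
import org.apache.ignite.configuration.annotation.ConfigurationRoot;
import org.apache.ignite.configuration.annotation.Value;

@ConfigurationRoot(rootName = "tables")
public class TablesConfigurationSchema {
    // Incremented under the (serialized) configuration update whenever a new table is created.
    @Value(hasDefault = true)
    public int lastTableId = 0;
}
{code}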



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17074:
---
Description: 
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to a 
{{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every time 
a new table is created. Then all we need is to read this value in all components 
that need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.

  was:
First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for table id is too much for a loaded pages list. 
Currently the implementation uses String hash, just like in Ignite 2.x. This is 
a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
a {{int lastTableId }}configuration property to a 
{{{}TablesConfigurationSchema{}}}, for example, and increment it every time new 
table is created. Then all we need is to read this value in all components that 
need it.

Maybe we should even use it in thin clients, but that needs a careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations and maybe as a part of unique RowId, associated with tables, 
but that's only a speculation.


> Create integer tableId identifier for tables
> 
>
> Key: IGNITE-17074
> URL: https://issues.apache.org/jira/browse/IGNITE-17074
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> First of all, this requirement comes from the PageMemory component 
> restrictions - having an entire UUID for table id is too much for a loaded 
> pages list. Currently the implementation uses String hash, just like in 
> Ignite 2.x. This is a bad solution.
> In Ignite 3.x configuration model, every configuration update is serialized 
> by design. This allows us to have atomic counters basically for free. We 
> could add an {{int lastTableId}} configuration property to a 
> {{{}{{TablesConfigurationSchema}}{}}}, for example, and increment it every 
> time a new table is created. Then all we need is to read this value in all 
> components that need it.
> Maybe we should even use it in thin clients, but that needs a careful 
> consideration. Originally, int tableId is intended to be used in storage 
> implementations and maybe as a part of unique RowId, associated with tables, 
> but that's only a speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17087) Native rebalance for PDS partitions

2022-06-03 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17087:
--

 Summary: Native rebalance for PDS partitions
 Key: IGNITE-17087
 URL: https://issues.apache.org/jira/browse/IGNITE-17087
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.
h2. General idea

In this case, PDS has a checkpointing feature that saves a consistent state on 
disk. I expect SQL indexes to be in the same partition file as other data.

For every partition, its state on disk would look like this:
{code:java}
part-x.bin
part-x-1.bin
part-x-2.bin
...
part-x-n.bin{code}
part-x.bin is a baseline, and every other file is a delta that should be 
applied to the underlying layers to get consistent data. It can be viewed as 
full and incremental backups.

When a rebalance snapshot is required, we could force a checkpoint and then 
*prohibit merging* of new deltas into the delta files from the snapshot until 
the rebalance is finished. We must guarantee that a consistent state can be 
read from disk.

Now, there are several strategies of data transferring:
 * File-based. We can send baseline and delta files as files. Two possible 
issues here:
 ** Files contain duplicated pages, so the volume of data will be bigger than 
necessary.
 ** The baseline file has to be truncated, because some delta pages go directly 
into the baseline file as an optimization.
 * Page-based. The latest state of every required page is sent separately. Two 
strategies here:
 ** Iterate pages in order of page indexes. Overhead during reads, but writes 
are very efficient.
 ** Iterate pages in order of delta files, skipping already read pages in the 
process (like snapshots in GridGain, for example). Little overhead on read, but 
writes won't be append-only (see the sketch after this list).
I would argue that slower reads are more appropriate than slower writes. 
Generally speaking, any write should be slower than any read of the same size, 
right?
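
A hedged sketch of that last strategy (iterate delta files from newest to oldest, skip pages already sent); {{DeltaFile}} and {{PageSender}} are stand-ins:
{code:java}
import java.util.BitSet;
import java.util.List;

interface DeltaFile {
    int[] pageIdxs();

    byte[] readPage(int pageIdx);
}

interface PageSender {
    void send(int pageIdx, byte[] page);
}

class DeltaOrderStreamer {
    // Every page is sent at most once, in its newest version; baseline pages not
    // covered by any delta would follow the same dedup via the "sent" set.
    static void stream(List<DeltaFile> newestToOldest, PageSender sender) {
        BitSet sent = new BitSet();

        for (DeltaFile delta : newestToOldest) {
            for (int pageIdx : delta.pageIdxs()) {
                if (!sent.get(pageIdx)) {
                    sent.set(pageIdx);
                    sender.send(pageIdx, delta.readPage(pageIdx));
                }
            }
        }
    }
}
{code}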

Should we implement all strategies and give the user a choice? It's hard to predict 
which one is better for which scenario. In the future, I think it would be 
convenient to implement many options, but at first we should stick to the 
simplest one.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.

NOTE: of course, it has to be mentioned that this approach might lead to 
inefficient storage space usage. It can be a problem in theory, but in practice 
a full rebalance isn't expected to occur often, and even then we don't expect 
that users will rewrite the entire partition data in the span of a single 
rebalance.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17084) Native rebalance for RocksDB partitions

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17084:
---
Description: 
General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.
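
A minimal sketch of that sequence with the RocksDB Java API; column family handles and the actual transfer to the receiver are omitted:
{code:java}
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;
import org.rocksdb.Snapshot;

class SnapshotScan {
    static void streamPartition(RocksDB db) {
        Snapshot snapshot = db.getSnapshot();

        try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot);
             RocksIterator it = db.newIterator(readOptions)) {
            // The iterator sees the consistent state as of getSnapshot(),
            // regardless of concurrent writes.
            for (it.seekToFirst(); it.isValid(); it.next()) {
                sendToReceiver(it.key(), it.value());
            }
        } finally {
            db.releaseSnapshot(snapshot);
        }
    }

    private static void sendToReceiver(byte[] key, byte[] value) {
        // Hypothetical hand-off to the rebalance streaming infrastructure.
    }
}
{code}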

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.

NOTE: of course, it has to be mentioned that this approach might lead to 
inefficient storage space usage. What I mean is that "previous" versions of 
values, in terms of RocksDB, must be stored on the device if they're visible 
from any of the snapshots. It can be a problem in theory, but in practice a full 
rebalance isn't expected to occur often, and even then we don't expect that 
users will rewrite the entire partition data in the span of a single rebalance.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.

  was:
General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.


> Native rebalance for RocksDB partitions
> ---
>
> Key: IGNITE-17084
> URL: https://issues.apache.org/jira/browse/IGNITE-17084
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> General idea of full rebalance is described in 
> https://issues.apache.org/jira/browse/IGNITE-17083
> For persistent storages, there's an option to avoid copy-on-write rebalance 
> algorithms if it's desired. Intuitively, it's a preferable option. Each 
> storage chooses its own format.
> In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
> feature. The idea is very simple:
>  * Take a RocksDB snapshot.
>  * Iterate through partition data.
>  * Iterate through indexes.
>  * Release the snapshot.
> There must be a common "infrastructure" or a framework to stream native 
> rebalance snapshots. Data format should be as simple as possible.
> NOTE: of course, it has to be mentioned that this approach might lead to 
> inefficient storage space usage. What I mean is that "previous" versions of 
> values, in terms of RocksDB, must be stored on the device if they're visible 
> from any of the snapshots. It can be a problem in theory, but in practice a 
> full rebalance isn't expected to occur often, and even then we don't expect 
> that users will rewrite the entire partition data in the span of a single rebalance.
> h2. Possible problems
> Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
> will be sent incomplete. Maybe we should also send a build state for each 
> index so that the receiving side could continue from the right place, not 
> from the beginning.
> This problem will be resolved in the future. Currently we don't have indexes 
> implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17084) Native rebalance for RocksDB partitions

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17084:
--

 Summary: Native rebalance for RocksDB partitions
 Key: IGNITE-17084
 URL: https://issues.apache.org/jira/browse/IGNITE-17084
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


General idea of full rebalance is described in 
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, there's an option to avoid copy-on-write rebalance 
algorithms if it's desired. Intuitively, it's a preferable option. Each storage 
chooses its own format.

In this case, RocksDB allows consistent DB iteration using a "Snapshot" 
feature. The idea is very simple:
 * Take a RocksDB snapshot.
 * Iterate through partition data.
 * Iterate through indexes.
 * Release the snapshot.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17083) Universal full rebalance procedure for MV storage

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17083:
---
Description: 
The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with a few adjustments to 
the storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. This is an optimization, obviously.
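
The sender side could mirror that logic. A minimal sketch in the same pseudocode, with assumed names ({{setLastRowId}}, {{sendChainToReceiver}}):
{noformat}
while (cursor.hasNext()) {
  var chain = cursor.next();

  try (var lock = rebalanceSnapshot.lock()) {
    rebalanceSnapshot.setLastRowId(chain.rowId());

    // Skip chains that a concurrent write command has already sent.
    if (rebalanceSnapshot.overwrittenRowIds().contains(chain.rowId()))
      continue;
  }

  // All versions, oldest to newest, including uncommitted write intents.
  sendChainToReceiver(chain);
}{noformat}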
h2. Implementation

The implementation will have to be split into several parts, because we need:
 * Support for snapshot streaming in RAFT state machine.
 * Storage API for this type of scan.
 * Every storage must implement the new scan method.
 * Streamer itself should be implemented, along with a specific logic in write 
commands.

  was:
The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with few adjustments to 
storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:

 
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

 

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. Thi

[jira] [Created] (IGNITE-17083) Universal full rebalance procedure for MV storage

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17083:
--

 Summary: Universal full rebalance procedure for MV storage
 Key: IGNITE-17083
 URL: https://issues.apache.org/jira/browse/IGNITE-17083
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The canonical way to make a "full rebalance" in RAFT is to have persisted 
snapshots of data. This is not always a good idea. First of all, persistent 
data is already stored somewhere and can be read at any time. Second, for 
volatile storages this requirement is just absurd.

So, a "rebalance snapshot" should be streamed from one node to another instead 
of being written to a storage. What's good is that this approach can be 
implemented independently from the storage engine (with a few adjustments to 
the storage API, of course).
h2. General idea

Once a "rebalance snapshot" operation is triggered, we open a special type of 
cursor from the partition storage that is able to give us all versioned chains 
in {_}some fixed order{_}. Every time the next chain has been read, it's 
remembered as the last read one (let's call it {{lastRowId}} for now). Then all 
versions for the specific row id should be sent to the receiver node in "Oldest 
to Newest" order to simplify insertion.

This works fine without concurrent load. To account for that, we need an 
additional collection of row ids associated with a snapshot. Let's call it 
{{overwrittenRowIds}}.

With this in mind, every write command should look similar to this:

 
{noformat}
for (var rebalanceSnapshot : ongoingRebalanceSnapshots) {
  try (var lock = rebalanceSnapshot.lock()) {
    if (rowId <= rebalanceSnapshot.lastRowId())
      continue;

    if (!rebalanceSnapshot.overwrittenRowIds().put(rowId))
      continue;

    rebalanceSnapshot.sendRowToReceiver(rowId);
  }
}

// Now modification can be freely performed.
// Snapshot itself will skip everything from the "overwrittenRowIds" 
collection.{noformat}
NOTE: rebalance snapshot scan must also return uncommitted write intentions. 
Their commit will be replicated later from the RAFT log.

 

NOTE: receiving side will have to rebuild indexes during the rebalancing. Just 
like it works in Ignite 2.x.

NOTE: Technically it is possible to have several nodes entering the cluster 
that require a full rebalance. So, while triggering a rebalance snapshot 
cursor, we could wait for other nodes that might want to read the same data and 
process all of them with a single scan. This is an optimization, obviously.
h2. Implementation

The implementation will have to be split into several parts, because we need:
 * Support for snapshot streaming in RAFT state machine.
 * Storage API for this type of scan.
 * Every storage must implement the new scan method.
 * Streamer itself should be implemented, along with a specific logic in write 
commands.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17081:
--

 Summary: Implement checkpointIndex for RocksDB
 Key: IGNITE-17081
 URL: https://issues.apache.org/jira/browse/IGNITE-17081
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store update 
index in meta column family.

Immediately we have a write amplification issue, on top of possible performance 
degradation. The obvious solution is inherently bad and needs to be improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid it.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch, that's applied on locks release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in it or a proper way of implementing it, considering how 
spread-out in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed index 
is more challenging; I propose caching it and updating it only from the closure 
that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex value 
into a memtable every time. This makes it harder to find persistedIndex values 
for partitions. Gladly, considering the events that we have, during the time 
between the first "completed" and the very next "begin", the state on disk is 
fully consistent. And there's a way to read data from the storage avoiding the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
During table start: read latest values of update indexes. Store them in an 
in-memory structure.
Set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall the 
flushing process. We don't control the locks that RocksDB holds while calling 
the listener's methods.

 

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event. This is fine. I noticed that column families are flushed in order 
of their internal ids. These ids correspond to the creation sequence of CFs, 
and the "default" CF is always created first. This is the exact CF that we use 
to store meta. Maybe we're going to change this and create a separate meta CF. 
Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.
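
A minimal sketch of that listener protocol with the RocksDB Java API, assuming atomic flush mode; the two async helpers are left abstract:
{code:java}
import org.rocksdb.AbstractEventListener;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.RocksDB;

class FlushIndexTracker extends AbstractEventListener {
    private volatile boolean lastWasBegin;

    @Override
    public void onFlushBegin(RocksDB db, FlushJobInfo info) {
        if (lastWasBegin)
            return; // still inside the same atomic flush

        waitForLastAsyncUpdateIndexesRead();
        lastWasBegin = true;
    }

    @Override
    public void onFlushCompleted(RocksDB db, FlushJobInfo info) {
        if (!lastWasBegin)
            return; // already handled the first "completed" of this flush

        asyncReadUpdateIndexesFromDisk();
        lastWasBegin = false;
    }

    private void waitForLastAsyncUpdateIndexesRead() {
        // Block until the previous asynchronous read has finished.
    }

    private void asyncReadUpdateIndexesFromDisk() {
        // Off-thread read with ReadOptions#setReadTier(ReadTier.PERSISTED_TIER),
        // then invoke the registered closures with the persisted update indexes.
    }
}
{code}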



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store update 
index in meta column family.

Immediately we have a write amplification issue, on top of possible performance 
degradation. The obvious solution is inherently bad and needs to be improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid it.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch, that's applied on locks release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(gladly, there's a WriteBatchInterface#count() to check). I don't see any 
practical value in it or a proper way of implementing it, considering how 
spread-out in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed index 
is more challenging; I propose caching it and updating it only from the closure 
that can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex 
value into a memtable every time. This makes it harder to find persistedIndex 
values for partitions. Fortunately, given the events that we have, the state on 
disk is fully consistent during the time between the first "completed" and the 
very next "begin". And there's a way to read data from storage bypassing the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
// During table start: read the latest values of update indexes and store them
// in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall 
the flushing process: we don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to a sequence number of 
CFs, and the "default" CF is always created first. This is the exact CF that we 
use to store meta. Maybe we're going to change this and create a separate meta 
CF. Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.

Please also familiarize yourself with 
https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
the description is continued from there.

For RocksDB-based storage the recovery process is trivial, because RocksDB has 
its own WAL. So, for testing purposes, it would be enough to just store the 
update index in the meta column family.

This immediately creates a write amplification issue, on top of possible 
performance degradation. The obvious solution is inherently bad and needs to be 
improved.
h2. General idea & implementation

Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
breaks the RocksDB recovery procedure, so we need to take measures to avoid that.

The only feasible way to do so is to use DBOptions#setAtomicFlush in 
conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
all column families consistently, if you have batches that cover several CFs. 
Basically, {{acquireConsistencyLock()}} would create a thread-local write 
batch that's applied on lock release. Most of RocksDbMvPartitionStorage will 
be affected by this change.

NOTE: I believe that scans with unapplied batches should be prohibited for now 
(luckily, there's a WriteBatchInterface#count() to check). I don't see any 
practical value or a proper way of implementing it, considering how spread out 
in time the scan process is.
h2. Callbacks and RAFT snapshots

Simply storing and reading the update index is easy. Reading the committed 
index is more challenging: I propose caching it and updating it only from the 
closure, which can also be used by RAFT to truncate the log.

For a closure, there are several things to account for during the 
implementation:
 * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
atomic flush mode. And, once you have your first "completed" event, you have a 
guarantee that *all* memtables are already persisted.
This allows easy tracking of RocksDB flushes; monitoring the alternation of 
events is all that's needed.
 * Unlike the PDS implementation, here we will be writing the updateIndex 
value into a memtable every time. This makes it harder to find persistedIndex 
values for partitions. Fortunately, given the events that we have, the state on 
disk is fully consistent during the time between the first "completed" and the 
very next "begin". And there's a way to read data from storage bypassing the 
memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).

Summarizing everything from the above, we should implement the following protocol:

 
{code:java}
// During table start: read the latest values of update indexes and store them
// in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".

onFlushBegin:
  if (lastEventType == ON_FLUSH_BEGIN)
    return;

  waitForLastAsyncUpdateIndexesRead();

  lastEventType = ON_FLUSH_BEGIN;

onFlushCompleted:
  if (lastEventType == ON_FLUSH_COMPLETED)
    return;

  asyncReadUpdateIndexesFromDisk();

  lastEventType = ON_FLUSH_COMPLETED;{code}
Reading values from disk must be performed asynchronously so as not to stall 
the flushing process: we don't control the locks that RocksDB holds while 
calling the listener's methods.

That asynchronous process would invoke closures that provide persisted 
updateIndex values to other components.

NOTE: One might say that we should call "waitForLastAsyncUpdateIndexesRead();" 
as late as possible, just in case. But my implementation calls it during the 
first event, and this is fine. I noticed that column families are flushed in 
the order of their internal ids. These ids correspond to a sequence number of 
CFs, and the "default" CF is always created first. This is the exact CF that we 
use to store meta. Maybe we're going to change this and create a separate meta 
CF. Only then could we start optimizing this part, and only if we have actual 
proof that there's a stall in this exact place.


[jira] [Updated] (IGNITE-17081) Implement checkpointIndex for RocksDB

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17081:
---
Labels: ignite-3  (was: )

> Implement checkpointIndex for RocksDB
> -
>
> Key: IGNITE-17081
> URL: https://issues.apache.org/jira/browse/IGNITE-17081
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> Please also familiarize yourself with 
> https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding, 
> the description is continued from there.
> For RocksDB-based storage the recovery process is trivial, because RocksDB 
> has its own WAL. So, for testing purposes, it would be enough to just store 
> the update index in the meta column family.
> This immediately creates a write amplification issue, on top of possible 
> performance degradation. The obvious solution is inherently bad and needs to 
> be improved.
> h2. General idea & implementation
> Obviously, WAL needs to be disabled (WriteOptions#setDisableWAL). This kinda 
> breaks the RocksDB recovery procedure, so we need to take measures to avoid that.
> The only feasible way to do so is to use DBOptions#setAtomicFlush in 
> conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to save 
> all column families consistently, if you have batches that cover several CFs. 
> Basically, {{acquireConsistencyLock()}} would create a thread-local write 
> batch that's applied on lock release. Most of RocksDbMvPartitionStorage 
> will be affected by this change.
> NOTE: I believe that scans with unapplied batches should be prohibited for 
> now (luckily, there's a WriteBatchInterface#count() to check). I don't see 
> any practical value or a proper way of implementing it, considering how 
> spread out in time the scan process is.
> h2. Callbacks and RAFT snapshots
> Simply storing and reading the update index is easy. Reading the committed 
> index is more challenging: I propose caching it and updating it only from the 
> closure, which can also be used by RAFT to truncate the log.
> For a closure, there are several things to account for during the 
> implementation:
> * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
> ON_FLUSH_COMPLETED. All "completed" events go after all "begin" events in 
> atomic flush mode. And, once you have your first "completed" event, you have 
> a guarantee that *all* memtables are already persisted.
> This allows easy tracking of RocksDB flushes; monitoring the alternation of 
> events is all that's needed.
> * Unlike the PDS implementation, here we will be writing the updateIndex 
> value into a memtable every time. This makes it harder to find persistedIndex 
> values for partitions. Fortunately, given the events that we have, the state 
> on disk is fully consistent during the time between the first "completed" and 
> the very next "begin". And there's a way to read data from storage bypassing 
> the memtable completely - ReadOptions#setReadTier(PERSISTED_TIER).
> Summarizing everything from the above, we should implement the following protocol:
>  
> {code:java}
> // During table start: read the latest values of update indexes and store them
> // in an in-memory structure. Then set "lastEventType = ON_FLUSH_COMPLETED;".
> onFlushBegin:
>   if (lastEventType == ON_FLUSH_BEGIN)
>     return;
>   waitForLastAsyncUpdateIndexesRead();
>   lastEventType = ON_FLUSH_BEGIN;
> onFlushCompleted:
>   if (lastEventType == ON_FLUSH_COMPLETED)
>     return;
>   asyncReadUpdateIndexesFromDisk();
>   lastEventType = ON_FLUSH_COMPLETED;{code}
> Reading values from disk must be performed asynchronously so as not to stall 
> the flushing process: we don't control the locks that RocksDB holds while 
> calling the listener's methods.
>  
> That asynchronous process would invoke closures that provide persisted 
> updateIndex values to other components.
> NOTE: One might say that we should call 
> "waitForLastAsyncUpdateIndexesRead();" as late as possible, just in case. But 
> my implementation calls it during the first event, and this is fine. I 
> noticed that column families are flushed in the order of their internal ids. 
> These ids correspond to a sequence number of CFs, and the "default" CF is 
> always created first. This is the exact CF that we use to store meta. Maybe 
> we're going to change this and create a separate meta CF. Only then could we 
> start optimizing this part, and only if we have actual proof that there's a 
> stall in this exact place.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17077:
---
Description: 
Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to 
validate that updates don't come out of order or with gaps. This is the way to 
guarantee that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage 
update. Generally speaking, we can't or sometimes don't want to work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids; we could exhaust 
the memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be at the end of the last 
"atomic" operation, so that the last command can be safely reapplied, as shown 
in the sketch below.

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way checkpoints should work. 
This may not be the right place to describe the issue, but I'll do it 
nonetheless. It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint: we just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page 
manipulations under the write lock. Flushing freelists there is enough already. 
(NOTE: we should test the performance without onheap-cache; it'll speed up the 
checkpoint start process, thus reducing latency spikes.)

A better way to do this is to not have meta pages in page memory whatsoever. 
Maybe during the start, but that's it. It's a common practice to have a 
pageSize equal to 16Kb. The effective payload of a partition meta page in 
Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0. 
Having a loaded page for every partition is just a waste of resources; all 
required data can be stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if meta has 
not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page becomes a write-only page, which is basically what we need.
h2. Callbacks and RAFT snapshots

I argue against scheduled RAFT snapshots. They would produce a lot of junk 
checkpoints, because a checkpoint is a {*}global operation{*}. Imagine RAFT 
triggering snapshots for 100 partitions in a row: this would result in 100 
minuscule checkpoints that no one needs. So, I'd say, we need two operations:
 * partition.getCheckpointerUpdateIndex();
 * partition.registerCheckpointedUpdateIndexListener(closure);

Both of these methods could be used by RAFT to determine whether it needs to 
truncate its log and to define a specific commit index for truncation.

In the case of the PDS checkpointer, the implementation of both of these 
methods is trivial.
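
For illustration, a minimal sketch of how RAFT could consume these two 
operations; all interfaces here are hypothetical, only mirroring the names 
above:
{code:java}
import java.util.function.LongConsumer;

// Hypothetical interfaces; none of this is actual Ignite API, it only
// illustrates the intended interaction.
interface Partition {
    long getCheckpointerUpdateIndex();

    void registerCheckpointedUpdateIndexListener(LongConsumer listener);
}

interface RaftLog {
    void truncateUpTo(long commitIndex);
}

class RaftLogTruncator {
    void attach(Partition partition, RaftLog raftLog) {
        // Pull once (e.g. on start): everything up to this index is persisted.
        raftLog.truncateUpTo(partition.getCheckpointerUpdateIndex());

        // Push afterwards: truncate as checkpoints complete, with no need for
        // scheduled RAFT snapshots.
        partition.registerCheckpointedUpdateIndexListener(raftLog::truncateUpTo);
    }
}{code}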


[jira] [Created] (IGNITE-17077) Implement checkpointIndex for PDS

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17077:
--

 Summary: Implement checkpointIndex for PDS
 Key: IGNITE-17077
 URL: https://issues.apache.org/jira/browse/IGNITE-17077
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
prerequisites.
h2. General idea

The idea doesn't seem complicated. There will be "setUpdateIndex" and 
"getUpdateIndex" methods (names might be different).
 * The first one is invoked at the end of every write command, with the RAFT 
commit index being passed as a parameter. This is done right before releasing 
the checkpoint read lock (or whatever name we come up with). More on that 
later.
 * The second one is invoked at the beginning of every write command to 
validate that updates don't come out of order or with gaps. This is the way to 
guarantee that IndexMismatchException can be thrown at the right time.

So, the write command flow will look like this. All names here are completely 
random.

 
{code:java}
try (ConsistencyLock lock = partition.acquireConsistencyLock()) {
    long updateIndex = partition.getUpdateIndex();
    long raftIndex = writeCommand.raftIndex();

    if (raftIndex != updateIndex + 1) {
        throw new IndexMismatchException(updateIndex);
    }

    partition.write(writeCommand.row());

    for (Index index : table.indexes(partition)) {
        index.index(writeCommand.row());
    }

    partition.setUpdateIndex(raftIndex);
}{code}
 

Some nuances:
 * The mismatch exception must be thrown before any data modifications. Storage 
content must be intact, otherwise we'll just break it.
 * The case above is the simplest one - there's a single "atomic" storage 
update. Generally speaking, we can't or sometimes don't want to work this way. 
Examples of operations where such strict atomicity is not required:
 ** Batch insert/update from the transaction.
 ** Transaction commit might have a huge number of row ids; we could exhaust 
the memory while committing.
 * If we split a write operation into several operations, we should externally 
guarantee their idempotence. "setUpdateIndex" should be at the end of the last 
"atomic" operation, so that the last command can be safely reapplied.

h2. Implementation

"set" method could write a value directly into partitions meta page. This 
*will* work. But it's not quite optimal.

The optimal solution is tightly coupled with the way checkpoints should work. 
This may not be the right place to describe the issue, but I'll do it 
nonetheless. It'll probably get split into another issue one day.

There's a simple way to touch every meta page only once per checkpoint: we just 
do it while holding the checkpoint write lock. This way the data is consistent. 
But this solution is equally {*}bad{*}: it forces us to perform page 
manipulations under the write lock. Flushing freelists there is enough already. 
(NOTE: we should test the performance without onheap-cache; it'll speed up the 
checkpoint start process, thus reducing latency spikes.)

A better way to do this is to not have meta pages in page memory whatsoever. 
Maybe during the start, but that's it. It's a common practice to have a 
pageSize equal to 16Kb. The effective payload of a partition meta page in 
Ignite 2.x is just above 100 bytes. I expect it to be way lower in Ignite 3.0. 
Having a loaded page for every partition is just a waste of resources; all 
required data can be stored on-heap.

Then, let's rely on two simple facts:
 * If meta page data is cached on-heap, no one would need to read it from disk. 
I should also mention that it will mostly be immutable.
 * We can write the partition meta page into every delta file even if meta has 
not changed. In actuality, this will be a very rare situation.

Considering both of these facts, the checkpointer may unconditionally write the 
meta page from heap to disk at the beginning of writing the delta file. This 
page becomes a write-only page, which is basically what we need.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-17076) Unify RowId format for different storages

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-17076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-17076:
---
Labels: ignite-3  (was: )

> Unify RowId format for different storages
> -
>
> Key: IGNITE-17076
> URL: https://issues.apache.org/jira/browse/IGNITE-17076
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current MV store bridge API has a fatal flaw, born from a misunderstanding. 
> There's a method called "insert" that generates RowId by itself. This is 
> wrong, because it can lead to different ids for the same row on the replica 
> storage. This completely breaks everything.
> Every replicated write command that inserts a new value should produce the 
> same row ids. There are several ways to achieve this:
>  * Use timestamps as identifiers. This is not very convenient, because we 
> would have to attach partition id on top of it. It's mandatory to know the 
> partition of the row.
>  * Use more complicated structure, for example a tuple of (raftCommitIndex, 
> partitionId, batchCounter), where
>  ** raftCommitIndex is the index of write command that performs insertion.
>  ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
> considering that there are plans to support more than 65000 partitions per 
> table.
>  ** batchCounter is used to differentiate insertions made in a single write 
> command. We can limit it to 2 bytes to save a little bit of space, if it's 
> necessary.
> I prefer the second option, but maybe it could be revised during the 
> implementation.
> Of course, the "insert" method should be removed from the bridge API. Tests 
> have to be updated. Since there's no RAFT group in storage tests, we can 
> generate row ids artificially; it's not a big deal.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-15818) [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and re-implementation

2022-06-02 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-15818:
---
Description: 
h2. Goal

Port and refactor core classes implementing page-based persistent store in 
Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, 
PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.

New checkpoint implementation to avoid excessive logging.

Store lifecycle clarification to avoid complicated and invasive code of custom 
lifecycle managed mostly by DatabaseSharedManager.
h2. Items to pay attention to

New checkpoint implementation based on split-file storage, new page index 
structure to maintain disk-memory page mapping.

File page store implementation should be extracted from GridCacheOffheapManager 
to a separate entity, target implementation should support new version of 
checkpoint (split-file store to enable always-consistent store and to eliminate 
binary recovery phase).

Support of big pages (256+ kB).

Support of throttling algorithms.
h2. References

New checkpoint design overview is available 
[here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md]
h2. Thoughts

Although there is a technical opportunity to have independent checkpoints for 
different data regions, managing them could be a nightmare and it's definitely 
in the realm of optimizations and out of scope right now.

So, let's assume that there's one good old checkpoint process. There's still a 
requirement to have checkpoint markers, but they will not have a reference to 
the WAL, because there's no WAL. Instead, we will have to store the RAFT log 
revision per partition. Or not; I'm not that familiar with the recovery 
procedure that's currently in development.

Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new 
version will have DO and UNDO. This drastically simplifies both the checkpoint 
itself and node recovery. But it complicates data access.

There will be two processes that will share the storage resource: 
"checkpointer" and "compactor". Let's examine what the compactor should or 
shouldn't do:
 * it should not work in parallel with checkpointer, except for cases when 
there are too many layers (more on that later)
 * it should merge later checkpoint delta files into main partition files
 * it should delete checkpoint markers once all merges are completed for it, 
thus markers are decoupled from RAFT log

About "cases when there are too many layers" - too many layers could compromise 
reading speed. Number of layers should not increase uncontrollably. So, when a 
threshold is exceeded, compactor should start working no mater what. If 
anything, writing load can be throttled, reading matters more.

Recovery procedure:
 * read the list of checkpoint markers on engine start
 * remove all data from an unfinished checkpoint, if it's there
 * trim main partition files to their proper size (should check if it's 
actually beneficial)

Table start procedure:
 * read all layer files headers according to the list of checkpoints
 * construct a list of hash tables (pageId -> pageIndex) for all layers, make 
it as effective as possible
 * everything else is just like before

Partition removal might be tricky, but we'll see. It's tricky in Ignite 2.x 
after all. The "Restore partition states" procedure could be revisited; I don't 
know how this will work yet.

How to store hashmaps:

Regular maps might be too much; we should consider a roaring-map implementation 
or something similar that'll occupy less space. This is only a concern for 
in-memory structures. Files on disk may have a list of pairs, that's fine. 
Generally speaking, checkpoints with a size of 100 thousand pages are close to 
the top limit for most users. Splitting that across 500 partitions, for 
example, gives us 200 pages per partition. The entire map should fit into a 
single page.

The only exception to these calculations is index.bin. The amount of pages per 
checkpoint can be orders of magnitude higher there, so we should keep an eye on 
it; it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is 
enough to fit 512 integer pairs, scaling to 2048 for regular 16-kilobyte pages. 
The map won't be too big IMO.
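
The per-page numbers are easy to sanity-check (a sketch assuming plain int 
pairs with no per-entry overhead):
{code:java}
// Sanity-checking the numbers above: one (pageId -> pageIndex) pair is two ints.
int pairSize = 2 * Integer.BYTES;          // 8 bytes
int pairsIn4kPage = 4 * 1024 / pairSize;   // = 512
int pairsIn16kPage = 16 * 1024 / pairSize; // = 2048
{code}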

Another important moment: we should enable direct IO, which is supported by 
Java natively since version 9 (I guess). There's a chance that not only will 
regular disk operations become somewhat faster, but fsync will become 
drastically faster as a result. Which is good: fsync can easily take half the 
time of a checkpoint, which is just unacceptable.
h2. Thoughts 2.0

With high likelihood, we'll get rid of index.bin. This will remove the 
requirement of having checkpoint markers.

All that we need is a consistently growing local counter that will be used to 
mark partition delta files. But it doesn't need to be global even at the level 
of the local node; it can be a local counter per partition, that's persiste

[jira] [Created] (IGNITE-17076) Unify RowId format for different storages

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17076:
--

 Summary: Unify RowId format for different storages
 Key: IGNITE-17076
 URL: https://issues.apache.org/jira/browse/IGNITE-17076
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current MV store bridge API has a fatal flaw, born from a misunderstanding. 
There's a method called "insert" that generates a RowId by itself. This is 
wrong, because it can lead to different ids for the same row on the replica 
storage. This completely breaks everything.

Every replicated write command that inserts a new value should produce the 
same row ids. There are several ways to achieve this:
 * Use timestamps as identifiers. This is not very convenient, because we would 
have to attach partition id on top of it. It's mandatory to know the partition 
of the row.
 * Use more complicated structure, for example a tuple of (raftCommitIndex, 
partitionId, batchCounter), where

 ** raftCommitIndex is the index of write command that performs insertion.
 ** partitionId is an integer identifier of the partition. Could be 4 bytes, 
considering that there are plans to support more than 65000 partitions per 
table.
 ** batchCounter is used to differentiate insertions made in a single write 
command. We can limit it to 2 bytes to save a little bit of space, if it's 
necessary.

I prefer the second option, but maybe it could be revised during the 
implementation.
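
For illustration, a minimal sketch of the second option; the class shape and 
the 8 + 4 + 2 byte widths are assumptions based on the list above, not a 
committed format:
{code:java}
import java.nio.ByteBuffer;

// A hypothetical RowId built from (raftCommitIndex, partitionId, batchCounter).
final class RowId {
    private final long raftCommitIndex; // Index of the write command that performs the insertion.
    private final int partitionId;      // 4 bytes: >65000 partitions per table are planned.
    private final short batchCounter;   // Differentiates insertions within one write command.

    RowId(long raftCommitIndex, int partitionId, short batchCounter) {
        this.raftCommitIndex = raftCommitIndex;
        this.partitionId = partitionId;
        this.batchCounter = batchCounter;
    }

    /** Serializes the tuple into 14 bytes: 8 + 4 + 2. */
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES + Integer.BYTES + Short.BYTES)
                .putLong(raftCommitIndex)
                .putInt(partitionId)
                .putShort(batchCounter)
                .array();
    }
}{code}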

Of course, the "insert" method should be removed from the bridge API. Tests 
have to be updated. Since there's no RAFT group in storage tests, we can 
generate row ids artificially; it's not a big deal.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-17074) Create integer tableId identifier for tables

2022-06-02 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-17074:
--

 Summary: Create integer tableId identifier for tables
 Key: IGNITE-17074
 URL: https://issues.apache.org/jira/browse/IGNITE-17074
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


First of all, this requirement comes from the PageMemory component restrictions 
- having an entire UUID for a table id is too much for a loaded pages list. 
Currently the implementation uses a String hash, just like in Ignite 2.x. This 
is a bad solution.

In Ignite 3.x configuration model, every configuration update is serialized by 
design. This allows us to have atomic counters basically for free. We could add 
an {{int lastTableId}} configuration property to {{TablesConfigurationSchema}}, 
for example, and increment it every time a new table is created. Then all we 
need is to read this value in all components that need it.
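
A hedged sketch of the idea; the names below only approximate the Ignite 3 
configuration API and are not the actual generated interfaces:
{code:java}
// Not the actual configuration API: since configuration updates are serialized
// by design, read-increment-write acts as an atomic counter here.
// "lastTableId()" and "changeLastTableId(...)" are hypothetical names.
int allocateTableId(TablesConfiguration tablesCfg) {
    int next = tablesCfg.lastTableId().value() + 1;

    tablesCfg.change(ch -> ch.changeLastTableId(next)).join();

    return next;
}{code}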

Maybe we should even use it in thin clients, but that needs careful 
consideration. Originally, int tableId is intended to be used in storage 
implementations, and maybe as part of a unique RowId associated with tables, 
but that's only speculation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16306) snaptree-based in-memory storage

2022-05-25 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542064#comment-17542064
 ] 

Ivan Bessonov commented on IGNITE-16306:


[~sergeychugunov] sure, with great pleasure!

> snaptree-based in-memory storage
> 
>
> Key: IGNITE-16306
> URL: https://issues.apache.org/jira/browse/IGNITE-16306
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha3
>Reporter: Ivan Bessonov
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: iep-74, ignite-3
>
> Until a full-fledged MV store is implemented we can implement in-memory 
> storage on a snaptree library [1] that represents a concurrent AVL tree with 
> support of snapshots.
> In this ticket we need to integrate the library with our existing storage 
> APIs (refine API if necessary), integrate its snapshot API with Raft 
> snapshots and provide configuration if necessary.
> [1] https://github.com/nbronson/snaptree



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (IGNITE-16306) snaptree-based in-memory storage

2022-05-25 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov resolved IGNITE-16306.

Resolution: Won't Fix

> snaptree-based in-memory storage
> 
>
> Key: IGNITE-16306
> URL: https://issues.apache.org/jira/browse/IGNITE-16306
> Project: Ignite
>  Issue Type: Improvement
>Affects Versions: 3.0.0-alpha3
>Reporter: Ivan Bessonov
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: iep-74, ignite-3
>
> Until a full-fledged MV store is implemented we can implement in-memory 
> storage on a snaptree library [1] that represents a concurrent AVL tree with 
> support of snapshots.
> In this ticket we need to integrate the library with our existing storage 
> APIs (refine API if necessary), integrate its snapshot API with Raft 
> snapshots and provide configuration if necessary.
> [1] https://github.com/nbronson/snaptree



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (IGNITE-16937) [Versioned Storage] A multi version TableStorage for MvPartitionStorage partitions

2022-05-11 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16937:
--

Assignee: Ivan Bessonov

> [Versioned Storage] A multi version TableStorage for MvPartitionStorage 
> partitions
> --
>
> Key: IGNITE-16937
> URL: https://issues.apache.org/jira/browse/IGNITE-16937
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Uttsel
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Need to create a multi-version table storage which aggregates 
> MvPartitionStorage partitions.
> Need to think about how to integrate the multi-version table storage into 
> Ignite. Maybe we need to create, for example, a multi-version StorageEngine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16926:
---
Fix Version/s: 2.14

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 
> org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
>  at 
> org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
>  at 
> org.apache.ignite.internal.processors

[jira] [Updated] (IGNITE-16933) PageMemory-based MV storage implementation

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16933:
---
Description: 
Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
storage engine. Currently, I expect only row storage implementation, without 
primary or secondary indexes.
h2. Chain Structure

Here I'm going to describe a data format. Each row is stored as a versioned 
chain. It will be represented by a number of data entries that will have 
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
 * Timestamp is a 16-byte value derived from the 
{{org.apache.ignite.internal.tx.Timestamp}} instance. It represents the commit 
time of the corresponding row.
 * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
other convenient name). It's a long value in the standard format for Page 
Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition 
id is not needed here, because it's always the same. Removing it could allow us 
to save 2 bytes per chain element.
 * PayloadSize is a 4-byte integer value that gives us the size of the actual 
data in arbitrary format.
 * Payload - I expect it to be serialized BinaryRow data. This is how it's 
implemented in RocksDB right now.

For uncommitted (pending) entries I propose using the maximal possible 
timestamp - {{(Long.MAX_VALUE, Long.MAX_VALUE)}}. This will simplify things. 
Note that we never store the tx id in the chain itself.

Overall, every chain element will have a (16 + 6 + 4 = 26)-byte header. It 
should be used as the header size in the corresponding FreeList.
h2. RowId pointer

There's a requirement to have an immutable RowId for every versioned chain. 
One could argue that we should just make the chain head immutable, but it would 
result in lots of complications. It's better to have a separate structure with 
an immutable link that will point to the actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
 * TransactionId is a UUID. It can only be applied to pending entries; for a 
committed head I propose storing 16 zeroes.
 * HeadLink is a link to the chain's head. Either 8 or 6 bytes; as already 
mentioned, I'd prefer 6.
 * NextLink is the "NextLink" value from the head chain element. It's a cheap 
shortcut for read-only transactions: you can skip the uncommitted entry without 
even trying to read it, if there's a non-null transaction id. Debatable, I 
know, but it looks cheap enough.

In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 
6 = 28) bytes of data. There must be a separate FreeList for every partition, 
even in in-memory mode, for reasons that I'll give later. The "header" size in 
that list must be equal to these 28 bytes. I wonder how effective FreeList will 
be for this case, where every chunk has the same size. We'll see. Maybe we 
should adjust the number of buckets somehow.
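
To make both layouts concrete, a minimal sketch of the offsets; the constant 
names are hypothetical and assume the 6-byte link variant:
{code:java}
// Hypothetical offsets for the two on-page structures described above.
final class MvChainLayout {
    // Chain element: [ Timestamp | NextLink | PayloadSize | Payload ].
    static final int TIMESTAMP_OFF = 0;                           // 16 bytes.
    static final int NEXT_LINK_OFF = TIMESTAMP_OFF + 16;          // 6-byte link.
    static final int PAYLOAD_SIZE_OFF = NEXT_LINK_OFF + 6;        // 4-byte int.
    static final int CHAIN_HDR_SIZE = PAYLOAD_SIZE_OFF + 4;       // 16 + 6 + 4 = 26.

    // RowId pointer: [ TransactionId | HeadLink | NextLink ].
    static final int TX_ID_OFF = 0;                               // 16-byte UUID, zeroes if committed.
    static final int HEAD_LINK_OFF = TX_ID_OFF + 16;              // 6-byte link to the chain head.
    static final int HEAD_NEXT_LINK_OFF = HEAD_LINK_OFF + 6;      // Shortcut copy of the head's NextLink.
    static final int ROW_ID_STRUCT_SIZE = HEAD_NEXT_LINK_OFF + 6; // 16 + 6 + 6 = 28.
}{code}
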
h2. Data access and Full Scan

Now, the fun part. There's no mention of a B+Tree here. That's because we can 
probably just avoid it. If it existed, it would just map a RowId to the 
described RowId structure in the partition, but RowId is already a pointer 
itself. The only other problem that is usually solved by a tree-like structure 
is a full scan of all rows in a partition. This is useful when you need to 
rebuild indexes, for example.

We should keep in mind that there's no code yet for rebuilding indexes. On the 
other hand, there's a method for partition scan in the API. This code could be 
used instead of Primary Index until we have it implemented.

There's no FreeList full-scan currently in the code; it needs to be 
implemented. And this particular full-scan is the reason why every partition 
should have its own list of row ids.

There's also a chance that introducing a new flag for row ids might be 
convenient. I don't know yet, let's not do it for now.

Finally, we need adequate protection from assertions if we, for some reason, 
have an invalid row id. Things that can be checked by normal code, not 
assertions:
 * data page type
 * number of items in the page


[jira] [Created] (IGNITE-16933) PageMemory-based MV storage implementation

2022-05-06 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16933:
--

 Summary: PageMemory-based MV storage implementation
 Key: IGNITE-16933
 URL: https://issues.apache.org/jira/browse/IGNITE-16933
 Project: Ignite
  Issue Type: New Feature
Reporter: Ivan Bessonov


Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
storage engine. Currently, I expect only row storage implementation, without 
primary or secondary indexes.

Here I'm going to describe a data format. Each row is stored as a versioned 
chain. It will be represented by a number of data entries that will have 
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
 * Timestamp is a 16-byte value derived from the 
{{org.apache.ignite.internal.tx.Timestamp}} instance.
 * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
other convenient name). It's a long value in the standard format for Page 
Memory links (itemId, flag, partitionId, pageIdx). Technically, the partition 
id is not needed here, because it's always the same. Removing it could allow us 
to save 2 bytes per chain element.
 * PayloadSize is a 4-byte integer value that gives us the size of the actual 
data in arbitrary format.
 * Payload - I expect it to be serialized BinaryRow data. This is how it's 
implemented in RocksDB right now.

For uncommitted (pending) entries I propose using the maximal possible 
timestamp - {{(Long.MAX_VALUE, Long.MAX_VALUE)}}. This will simplify things. 
Note that we never store the tx id in the chain itself.

Overall, every chain element will have a (16 + 6 + 4 = 26)-byte header. It 
should be used as the header size in the corresponding FreeList.

There's a requirement to have an immutable RowId for every versioned chain. 
One could argue that we should just make the chain head immutable, but it would 
result in lots of complications. It's better to have a separate structure with 
an immutable link that will point to the actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
 * TransactionId is a UUID. It can only be applied to pending entries; for a 
committed head I propose storing 16 zeroes.
 * HeadLink is a link to the chain's head. Either 8 or 6 bytes; as already 
mentioned, I'd prefer 6.
 * NextLink is the "NextLink" value from the head chain element. It's a cheap 
shortcut for read-only transactions: you can skip the uncommitted entry without 
even trying to read it, if there's a non-null transaction id. Debatable, I 
know, but it looks cheap enough.

In total, RowId is an 8-byte link, pointing to a structure that has (16 + 6 + 
6 = 28) bytes of data. There must be a separate FreeList for every partition, 
even in in-memory mode, for reasons that I'll give later. The "header" size in 
that list must be equal to these 28 bytes. I wonder how effective FreeList will 
be for this case, where every chunk has the same size. We'll see. Maybe we 
should adjust the number of buckets somehow.

Now, the fun part. There's no mention of a B+Tree here. That's because we can 
probably just avoid it. If it existed, it would just map a RowId to the 
described RowId structure in the partition, but RowId is already a pointer 
itself. The only other problem that is usually solved by a tree-like structure 
is a full scan of all rows in a partition. This is useful when you need to 
rebuild indexes, for example.

We should keep in mind that there's no code yet for rebuilding indexes. On the 
other hand, there's a method for partition scan in the API. It could be used to 
implement a Primary Index imitation until we have a real implementation.

There's no FreeList full-scan currently in the code; it needs to be 
implemented. And this particular full-scan is the reason why every partition 
should have its own list of row ids.

There's also a chance that introducing a new flag for row ids might be 
convenient. I don't know yet, let's not do it for now.

Finally, we need adequate protection from assertions if we, for some reason, 
have an invalid row id. Things that can be checked by normal code, not 
assertions:
 * data page type
 * number of items in the page



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16912) Revisit UUID generation for RowId

2022-05-06 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16912:
---
Epic Link: IGNITE-16923

> Revisit UUID generation for RowId
> -
>
> Key: IGNITE-16912
> URL: https://issues.apache.org/jira/browse/IGNITE-16912
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> Current implementation uses UUID.randomUUID, which comes with a set of 
> problems:
>  * some people say that you can't avoid collisions this way. Technically it's 
> true, although I don't think that it's a real problem
>  * secure random is slow when you use it frequently. This can affect 
> insertion performance
> * random uuids are randomly distributed; this can be a problem for RocksDB. 
> For example, if most insertions go to the tail, overall write performance 
> can improve
> There are interesting approaches in this particular document, we should take 
> a look at it:
> https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16926:
--

Assignee: Ivan Bessonov

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 
> org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
>  at 
> org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
>  at 
> org.apache.ignite.internal.processors.query.h2.opt.GridH

[jira] [Created] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16926:
--

 Summary: Interrupted compute job may fail a node
 Key: IGNITE-16926
 URL: https://issues.apache.org/jira/browse/IGNITE-16926
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Reporter: Ivan Bessonov


{code:java}
Critical system error detected. Will be handled accordingly to configured 
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
corrupted [groupId=1234619879, pageIds=[7290201467513], cacheId=645096946, 
cacheName=*, indexName=*, msg=Runtime failure on row: Row@79570772[ key: 
1168930235, val: Data hidden due to IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data 
hidden, data hidden, [... long run of "data hidden" trimmed ...], data hidden 
","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
 B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
Row@79570772[ key: 1168930235, val: Data hidden due to 
IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
data hidden, data hidden, [... long run of "data hidden" trimmed ...], data hidden ]] at 
org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
 at 
org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.addToIndex(GridH2Table.java:880)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:794)
 at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.store(IgniteH2Indexing.java:411)
 at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:2546)
 at 
org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridC

[jira] [Updated] (IGNITE-16915) ItClusterManagerTest#testNodeLeave is flaky

2022-04-29 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16915:
---
Description: 
https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds

> ItClusterManagerTest#testNodeLeave is flaky
> ---
>
> Key: IGNITE-16915
> URL: https://issues.apache.org/jira/browse/IGNITE-16915
> Project: Ignite
>  Issue Type: Bug
>Reporter: Aleksandr Polovtcev
>Assignee: Aleksandr Polovtcev
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://ci.ignite.apache.org/buildConfiguration/ignite3_Test_IntegrationTests_ModuleClusterManagement?branch=pull%2F787&buildTypeTab=overview&mode=builds



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16914) [Versioned Storage] Test and optimize prefixes in RocksDB

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16914:
--

 Summary: [Versioned Storage] Test and optimize prefixes in RocksDB
 Key: IGNITE-16914
 URL: https://issues.apache.org/jira/browse/IGNITE-16914
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The main MV-storage doesn't require any specific order of elements, so 
partition scans don't have to be totally ordered.

If I understand correctly, this allows us to use the prefix functionality of 
RocksDB, extending it to row ids, not only partition ids. In theory, this 
should noticeably increase the performance of single reads and may improve 
scan performance somewhat as well.

Bloom filters and similar topics should be investigated here as well.
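For illustration, a minimal sketch of what this could look like with the 
RocksDB Java API; the 2 + 16 byte prefix length and the class name are 
assumptions, not actual Ignite storage code:

{code:java}
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

class PrefixScanSketch {
    /** Assumed key layout: 2-byte partition id followed by a 16-byte row id. */
    static final int PREFIX_LEN = 2 + 16;

    /** Column family tuned for prefix seeks; also enables prefix bloom filters. */
    static ColumnFamilyOptions cfOptions() {
        return new ColumnFamilyOptions().useFixedLengthPrefixExtractor(PREFIX_LEN);
    }

    /** Reads all versions under one row id; total order across row ids is given up. */
    static void readRow(RocksDB db, byte[] rowPrefix) {
        try (ReadOptions opts = new ReadOptions().setPrefixSameAsStart(true);
             RocksIterator it = db.newIterator(opts)) {
            for (it.seek(rowPrefix); it.isValid(); it.next()) {
                // All visited keys share the same (partition id, row id) prefix.
            }
        }
    }
}
{code}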



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16913) Provide effective way to write BinaryRow into byte buffer

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16913:
--

 Summary: Provide effective way to write BinaryRow into byte buffer
 Key: IGNITE-16913
 URL: https://issues.apache.org/jira/browse/IGNITE-16913
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current API only allows writing a row into an OutputStream, which is not 
always convenient. For example, the RocksDB implementation requires writing 
into a byte buffer.

Creating an output stream on top of the buffer is not the best idea.
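To illustrate the workaround being criticized - adapting a ByteBuffer to an 
OutputStream just to reuse the existing write path (a direct 
ByteBuffer-accepting method on BinaryRow, which this issue asks for, is 
hypothetical at this point):

{code:java}
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** The adapter a ByteBuffer-based caller has to create today. */
class ByteBufferOutputStream extends OutputStream {
    private final ByteBuffer buf;

    ByteBufferOutputStream(ByteBuffer buf) {
        this.buf = buf;
    }

    /** One virtual call per byte - exactly the overhead in question. */
    @Override public void write(int b) {
        buf.put((byte) b);
    }

    @Override public void write(byte[] bytes, int off, int len) {
        buf.put(bytes, off, len);
    }
}
{code}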



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16912) Revisit UUID generation for RowId

2022-04-29 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16912:
--

 Summary: Revisit UUID generation for RowId
 Key: IGNITE-16912
 URL: https://issues.apache.org/jira/browse/IGNITE-16912
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov


The current implementation uses UUID.randomUUID, which comes with a set of 
problems:
 * some people say that you can't avoid collisions this way. Technically it's 
true, although I don't think that it's a real problem
 * secure random is slow when used frequently. This can affect insertion 
performance
 * random UUIDs are uniformly distributed, which can be a problem for RocksDB 
- if, instead, most insertions go to the tail of the keyspace, overall write 
performance can improve

There are interesting approaches in the document below; we should take a look 
at it:

https://datatracker.ietf.org/doc/draft-peabody-dispatch-new-uuid-format/
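For reference, a minimal sketch of a time-ordered generator in the spirit of 
the draft's UUIDv7 layout (illustrative only, not a proposed Ignite API):

{code:java}
import java.security.SecureRandom;
import java.util.UUID;

final class TimeOrderedUuid {
    private static final SecureRandom RND = new SecureRandom();

    static UUID next() {
        long ts = System.currentTimeMillis();   // 48-bit Unix timestamp, ms

        long msb = (ts << 16)                   // bits 63..16: timestamp
            | 0x7000L                           // bits 15..12: version 7
            | (RND.nextLong() & 0x0FFFL);       // bits 11..0: random

        long lsb = (RND.nextLong() & 0x3FFFFFFFFFFFFFFFL)
            | 0x8000000000000000L;              // top two bits: IETF variant

        return new UUID(msb, lsb);
    }
}
{code}

Keys generated this way are mostly monotonic, so insertions land near the tail 
of the keyspace.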



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-15734) Erroneous string formatting while changing cluster tag.

2022-04-19 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-15734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524157#comment-17524157
 ] 

Ivan Bessonov commented on IGNITE-15734:


[~zstan] done, thank you for the fix!

> Erroneous string formatting while changing cluster tag.
> ---
>
> Key: IGNITE-15734
> URL: https://issues.apache.org/jira/browse/IGNITE-15734
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.11
>Reporter: Evgeny Stanilovsky
>Assignee: Evgeny Stanilovsky
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {noformat}
> org.apache.ignite.internal.processors.cluster.ClusterProcessor#onReadyForRead
> ...
> log.info(
> "Cluster tag will be set to new value: " +
> newVal != null ? newVal.tag() : "null" +
> ", previous value was: " +
> oldVal != null ? oldVal.tag() : "null");
> {noformat}
> Without braces, the expression
> {noformat}
> "Cluster tag will be set to new value: " + newVal
> {noformat}
> is evaluated first and is never null, so the ternary condition is always true;
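A sketch of the corrected call, with parentheses restoring the intended null 
checks (illustrative; not necessarily the exact committed fix):

{code:java}
log.info(
    "Cluster tag will be set to new value: " +
        (newVal != null ? newVal.tag() : "null") +
        ", previous value was: " +
        (oldVal != null ? oldVal.tag() : "null"));
{code}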



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16848:
--

Assignee: Ivan Bessonov

> [Versioned Storage] Provide common interface for abstract internal tuples
> -
>
> Key: IGNITE-16848
> URL: https://issues.apache.org/jira/browse/IGNITE-16848
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Methods from class "Row" should be extracted to provide a generic tuple API 
> to components like SQL indexes or MV storage.
> Tuple is NOT schema-aware and should NOT have methods like "Object value(int 
> col)", because it represents a basic blob with little to no meta information



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16848) [Versioned Storage] Provide common interface for abstract internal tuples

2022-04-13 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16848:
--

 Summary: [Versioned Storage] Provide common interface for abstract 
internal tuples
 Key: IGNITE-16848
 URL: https://issues.apache.org/jira/browse/IGNITE-16848
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
 Fix For: 3.0.0-alpha5


Methods from class "Row" should be extracted to provide a generic tuple API to 
components like SQL indexes or MV storage.

Tuple is NOT schema-aware and should NOT have methods like "Object value(int 
col)", because it represents a basic blob with little to no meta information



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage

2022-04-13 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16611:
---
Labels: iep-74 ignite-3  (was: ignite-3)

> [Versioned Storage]  Version chain data structure for RocksDB-based storage
> ---
>
> Key: IGNITE-16611
> URL: https://issues.apache.org/jira/browse/IGNITE-16611
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-74, ignite-3
>
> To support concurrency control and implement effective transactions, the 
> capability to store multiple values of the same key is needed in the 
> existing storage.
> h3. Version chain
> Key component here is a special data structure called version chain: it is a 
> list of all versions of a particular key, with the most recent version at the 
> beginning (HEAD).
> Each entry in the chain contains a value, a reference to the next entry in 
> the list, begin and end timestamps and the id of the active transaction that 
> created this version.
> There are at least two approaches to implement this structure on top of 
> RocksDB:
> * Combine the original key and version into a new key which is put into a 
> RocksDB tree. In that case, to restore the version chain we need to iterate 
> over the tree using the original key as a prefix.
> * Use the original key as-is, but make it point not to the value directly 
> but to an array containing the version and other meta information (ts, id, 
> etc.) and keys into some secondary tree.
> h3. New API to manage versions
> The following new API should be implemented to provide access to the version 
> chain:
> * Methods to manipulate versions: add a new version to the chain, commit an 
> uncommitted version, abort an uncommitted version.
> * Method to clean up old versions from the chain.
> * Method to scan over keys up to a provided timestamp.
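A minimal sketch of the first approach (original key and version combined into 
a single RocksDB key); the key layout and helper names are assumptions, not 
actual Ignite storage code:

{code:java}
import java.nio.ByteBuffer;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

class VersionChainSketch {
    /** Original key + inverted timestamp, so the newest version sorts first (HEAD). */
    static byte[] versionedKey(byte[] key, long beginTs) {
        return ByteBuffer.allocate(key.length + Long.BYTES)
            .put(key)
            .putLong(~beginTs)
            .array();
    }

    /** Restoring the chain = iterating with the original key as a prefix. */
    static void scanChain(RocksDB db, byte[] key) throws Exception {
        try (ReadOptions opts = new ReadOptions();
             RocksIterator it = db.newIterator(opts)) {
            for (it.seek(key); it.isValid() && startsWith(it.key(), key); it.next()) {
                // it.value() holds this version's payload and metadata.
            }
        }
    }

    private static boolean startsWith(byte[] arr, byte[] prefix) {
        if (arr.length < prefix.length)
            return false;
        for (int i = 0; i < prefix.length; i++)
            if (arr[i] != prefix[i])
                return false;
        return true;
    }
}
{code}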



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (IGNITE-16792) Configuration for Default Storage Engine

2022-04-11 Thread Ivan Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520506#comment-17520506
 ] 

Ivan Bessonov commented on IGNITE-16792:


[~ktkale...@gridgain.com] looks good to me, I'll merge it to main. Thank you!

> Configuration for Default Storage Engine
> 
>
> Key: IGNITE-16792
> URL: https://issues.apache.org/jira/browse/IGNITE-16792
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Kirill Tkalenko
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The pluggable storage concept enables users to set up different storage 
> engines (SE) on the same node, e.g. for performance reasons; each table can 
> be hosted by only one storage engine.
> From the DDL point of view, the SE is specified as part of the CREATE TABLE 
> command. But in the case of only one SE, and in some other cases, specifying 
> it for each table creates a lot of unnecessary boilerplate code.
> To address this and free the user from writing exactly the same code, a 
> cluster-wide setting *defaultStorageEngine* should be introduced.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration & other minor issues

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Summary: Rename is broken in configuration & other minor issues  (was: 
Rename is broken in configuration)

> Rename is broken in configuration & other minor issues
> --
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit
>  * InjectName is not used where it needs to be used



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Description: 
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit
 * InjectName is not used where it needs to be used

  was:
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit


> Rename is broken in configuration
> -
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit
>  * InjectName is not used where it needs to be used
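To illustrate the serialization bug mentioned above: sizing a string by its 
char count instead of its UTF-8 byte length breaks the layout for non-ASCII 
input (a minimal example, not the actual configuration code):

{code:java}
import java.nio.charset.StandardCharsets;

class StringSizeSketch {
    static int wrongSize(String s) {
        return s.length(); // chars, not bytes
    }

    static int correctSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length; // actual encoded length
    }

    public static void main(String[] args) {
        String s = "héllo";
        // wrongSize == 5, correctSize == 6: the one-byte gap shifts everything
        // serialized after the string.
        System.out.println(wrongSize(s) + " vs " + correctSize(s));
    }
}
{code}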



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16796:
---
Description: 
Rename changes the "name" field in an immutable object; this shouldn't happen.

 

There are also a few more issues that I'd like to address:
 * serialization of configuration values wouldn't work for strings with 
non-ASCII characters because of a wrong "size" calculation
 * signatures of ConfigurationNotificationEvent#config and 
ConfigurationNotificationEvent#name are flawed and need to be refined a bit

> Rename is broken in configuration
> -
>
> Key: IGNITE-16796
> URL: https://issues.apache.org/jira/browse/IGNITE-16796
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Rename changes the "name" field in an immutable object; this shouldn't happen.
>  
> There are also a few more issues that I'd like to address:
>  * serialization of configuration values wouldn't work for strings with 
> non-ASCII characters because of a wrong "size" calculation
>  * signatures of ConfigurationNotificationEvent#config and 
> ConfigurationNotificationEvent#name are flawed and need to be refined a bit



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16796) Rename is broken in configuration

2022-04-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16796:
--

 Summary: Rename is broken in configuration
 Key: IGNITE-16796
 URL: https://issues.apache.org/jira/browse/IGNITE-16796
 Project: Ignite
  Issue Type: Bug
Affects Versions: 3.0.0-alpha4
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-14931) Define common error scopes and prefix

2022-04-03 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14931:
---
Labels: iep-84 ignite-3  (was: ignite-3)

> Define common error scopes and prefix
> -
>
> Key: IGNITE-14931
> URL: https://issues.apache.org/jira/browse/IGNITE-14931
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Vyacheslav Koptilin
>Priority: Major
>  Labels: iep-84, ignite-3
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface

2022-03-17 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16704:
--

 Summary: Remove unnecessary methods from BinaryRow interface
 Key: IGNITE-16704
 URL: https://issues.apache.org/jira/browse/IGNITE-16704
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5


The current interface has several read* methods that are only used in the 
implementation. I propose deleting them; this will simplify writing new 
implementations of the interface.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16704) Remove unnecessary methods from BinaryRow interface

2022-03-17 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16704:
---
Labels: iep-54 ignite-3  (was: ignite-3)

> Remove unnecessary methods from BinaryRow interface
> ---
>
> Key: IGNITE-16704
> URL: https://issues.apache.org/jira/browse/IGNITE-16704
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: iep-54, ignite-3
> Fix For: 3.0.0-alpha5
>
>
> The current interface has several read* methods that are only used in the 
> implementation. I propose deleting them; this will simplify writing new 
> implementations of the interface.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (IGNITE-16697) [Versioned Storage] POC - add methods for versioned data storage

2022-03-16 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16697:
--

 Summary: [Versioned Storage] POC - add methods for versioned data 
storage
 Key: IGNITE-16697
 URL: https://issues.apache.org/jira/browse/IGNITE-16697
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
 Fix For: 3.0.0-alpha5


As a first step towards the MV-storage in Ignite 3.0, it's required to have 
specific methods on the partition storage and index storage interfaces. These 
will replace the currently available VersionedRowStore, which was a prototype 
and doesn't correspond to the desired functionality.

Partition storage needs:
 * addWrite(k, v, txId)
 * commitWrite(k, ts)
 * abortWrite(k)
 * read(k, ts)
 * scan(ts, {_}tbd{_})
 * cleanup({_}tbd{_})

Sorted index storage needs:
 * scan(lower, upper, bounds_options, projection, partition_filter, ts)

Index updates will be hidden inside the {*}addWrite{*}, *abortWrite* and 
*cleanup* methods. No external "update" and "remove" methods are required; a 
sketch of the resulting interface is given below.
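A hedged sketch of how these methods could look as a Java interface; the type 
parameters stand in for types like BinaryRow and a timestamp, and the 
signatures are illustrative, not the final API:

{code:java}
import java.util.UUID;

/** Illustrative only: K = key row, V = value row, TS = timestamp. */
interface MvPartitionStorageSketch<K, V, TS> {
    /** Adds an uncommitted version; index updates happen inside. */
    void addWrite(K key, V value, UUID txId);

    /** Commits the pending version at the given timestamp. */
    void commitWrite(K key, TS ts);

    /** Aborts the pending version, rolling index updates back. */
    void abortWrite(K key);

    /** Reads the version visible at the given timestamp. */
    V read(K key, TS ts);
}
{code}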

This particular issue is a precursor to 
https://issues.apache.org/jira/browse/IGNITE-16611.

A reference implementation is also required; it'll provide an example of 
what's expected from the storage and a set of tests to fix the methods' 
contracts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-16611) [Versioned Storage] Version chain data structure for RocksDB-based storage

2022-03-16 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-16611:
---
Summary: [Versioned Storage]  Version chain data structure for 
RocksDB-based storage  (was: [Versioned Storage]  POC - Version chain data 
structure for RocksDB-based storage)

> [Versioned Storage]  Version chain data structure for RocksDB-based storage
> ---
>
> Key: IGNITE-16611
> URL: https://issues.apache.org/jira/browse/IGNITE-16611
> Project: Ignite
>  Issue Type: Task
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Ivan Bessonov
>Priority: Major
>  Labels: ignite-3
>
> To support concurrency control and implement effective transactions, the 
> capability to store multiple values of the same key is needed in the 
> existing storage.
> h3. Version chain
> Key component here is a special data structure called version chain: it is a 
> list of all versions of a particular key, with the most recent version at the 
> beginning (HEAD).
> Each entry in the chain contains a value, a reference to the next entry in 
> the list, begin and end timestamps and the id of the active transaction that 
> created this version.
> There are at least two approaches to implement this structure on top of 
> RocksDB:
> * Combine the original key and version into a new key which is put into a 
> RocksDB tree. In that case, to restore the version chain we need to iterate 
> over the tree using the original key as a prefix.
> * Use the original key as-is, but make it point not to the value directly 
> but to an array containing the version and other meta information (ts, id, 
> etc.) and keys into some secondary tree.
> h3. New API to manage versions
> The following new API should be implemented to provide access to the version 
> chain:
> * Methods to manipulate versions: add a new version to the chain, commit an 
> uncommitted version, abort an uncommitted version.
> * Method to clean up old versions from the chain.
> * Method to scan over keys up to a provided timestamp.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (IGNITE-14611) Implement error handling for public API based on error codes

2022-03-14 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-14611:
---
Labels: iep-84 ignite-3  (was: ignite-3)

> Implement error handling for public API based on error codes
> 
>
> Key: IGNITE-14611
> URL: https://issues.apache.org/jira/browse/IGNITE-14611
> Project: Ignite
>  Issue Type: Task
>Reporter: Alexey Scherbakov
>Priority: Major
>  Labels: iep-84, ignite-3
> Fix For: 3.0
>
>
> Dev list discussion [1]
> [1] 
> http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Error-handling-in-Ignite-3-td52269.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

