Lightweight version of partitions map exchange

2019-03-28 Thread Nikita Amelchev
Hello, Igniters!

I have implemented a lightweight version of partition map exchange (PME)
for the case when a baseline node leaves the topology. [1]

If partitions are assigned according to the baseline topology and a
server node leaves, there is no actual need to perform a distributed PME.
Every cluster node will recalculate the new affinity assignments and
partition states locally. There is no need to wait for partitions to be
released, and the PME will be started immediately.
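
The condition above can be sketched as a simple predicate. This is only
an illustration under stated assumptions, not Ignite's actual code:
CacheGroup, usesBaselineAffinity and canRecalculateLocally are
hypothetical names introduced here.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: when can a node-left event skip the distributed PME? */
final class LightweightPmeCheck {
    /** Illustrative stand-in for a cache group; not an Ignite class. */
    static final class CacheGroup {
        final String name;
        final boolean usesBaselineAffinity;

        CacheGroup(String name, boolean usesBaselineAffinity) {
            this.name = name;
            this.usesBaselineAffinity = usesBaselineAffinity;
        }
    }

    /**
     * Returns true if every node may recalculate affinity locally:
     * the departed node belongs to the baseline and every cache group
     * is assigned according to the baseline topology.
     */
    static boolean canRecalculateLocally(String leftNodeId,
                                         Set<String> baselineNodeIds,
                                         List<CacheGroup> groups) {
        if (!baselineNodeIds.contains(leftNodeId))
            return false;

        for (CacheGroup grp : groups) {
            if (!grp.usesBaselineAffinity)
                return false; // fall back to the distributed PME
        }

        return true;
    }
}
```

A single cache group whose affinity does not follow the baseline is
enough to make the predicate return false for the whole exchange.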

I have benchmarked the duration of PME under a Yardstick load. PME
duration decreased by up to 10 times, and the maximum transaction
latency decreased by up to 4-5 times. See the details in the Jira issue
comments. [1]

Could a PME expert take a look at my changes? [2]

1. https://issues.apache.org/jira/browse/IGNITE-9913
2. https://reviews.ignite.apache.org/ignite/review/IGNT-CR-1027

-- 
Best wishes,
Amelchev Nikita


Re: Lightweight version of partitions map exchange

2019-03-28 Thread Pavel Kovalenko
Hi Nikita,

Thank you for your work. This is a great improvement. I'll take a look at it in
the next couple of days. Could you please run TC and provide the MTCGA bot status
for this change?



Re: Lightweight version of partitions map exchange

2019-03-29 Thread Eduard Shangareev
Nikita,

It sounds cool, but I don't understand how this works for in-memory caches:
the baseline is not used for their affinity calculation.
So this improvement would be switched off for them, or switched off
completely when such caches are present, wouldn't it?



Re: Lightweight version of partitions map exchange

2019-03-29 Thread Nikita Amelchev
Pavel,
I have provided the MTCGA bot status in the Jira issue comments. [1]

Eduard,
Yes, in the current implementation a distributed PME will be performed
if in-memory caches are configured.

1. https://issues.apache.org/jira/browse/IGNITE-9913




-- 
Best wishes,
Amelchev Nikita


Re: Lightweight version of partitions map exchange

2019-05-24 Thread Nikita Amelchev
Hello, Igniters!

I am working on the implementation of lightweight PME for the case of
a baseline topology (BLT) node leaving. [1]

One question remains: should lightweight PME be allowed if the cluster
has MOVING partitions?

The problems that may occur if it is allowed:
 - Nodes can select different primary nodes from the current OWNING backups.
 - Some nodes can mark a partition as LOST while others mark it as OWNING.

We can take the partition states from the node2part map. The root
cause of those problems is that when rebalancing ends (the last
message is received), the local node updates its partition state to
OWNING (and schedules a partitions resend). This may lead to different
affinity recalculations on different nodes.
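
Both problems can be reproduced with a toy model. The sketch below is a
simplified assumption, not Ignite code: each node holds its own view of
the partition states, and the primary is taken to be the first surviving
node in affinity order that the view reports as OWNING. The names
StaleViewSketch and choosePrimary are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of how stale partition states make local recalculation diverge. */
final class StaleViewSketch {
    /** Simplified partition states used in this sketch. */
    enum State { MOVING, OWNING, LOST }

    /**
     * Picks the primary for one partition: the first surviving node in affinity
     * order that this view believes to be OWNING. Returns null if there is none,
     * meaning the node holding this view would mark the partition LOST.
     */
    static String choosePrimary(List<String> affinityOrder, String leftNode, Map<String, State> view) {
        for (String node : affinityOrder) {
            if (!node.equals(leftNode) && view.get(node) == State.OWNING)
                return node;
        }

        return null;
    }

    public static void main(String[] args) {
        List<String> order = List.of("A", "B", "C");

        // B has just finished rebalancing; the resend has not reached C yet.
        Map<String, State> viewOnB = Map.of("A", State.OWNING, "B", State.OWNING, "C", State.OWNING);
        Map<String, State> viewOnC = Map.of("A", State.OWNING, "B", State.MOVING, "C", State.OWNING);

        // Node A leaves and both nodes recalculate locally.
        System.out.println("Primary chosen on B: " + choosePrimary(order, "A", viewOnB));
        System.out.println("Primary chosen on C: " + choosePrimary(order, "A", viewOnC));
    }
}
```

With the same inputs but only nodes A and B in the affinity order, the
stale view returns null (LOST) while the fresh view returns B, which is
the second problem from the list.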

I see two solutions:

1. Nodes will store the “moving-owning” transition of the partition
state until the rebalancing ends. Each node will locally recalculate the
affinity on the node left event.
2. The coordinator will calculate the affinity and send the "full map"
to the nodes. In this case, nodes still have to wait for the topology
change event (to get the correct topology in discovery).

If lightweight PME is disallowed when the cluster has MOVING partitions,
there are no problems and it works fine.

Any thoughts?

1. https://issues.apache.org/jira/browse/IGNITE-9913



-- 
Best wishes,
Amelchev Nikita


Re: Lightweight version of partitions map exchange

2019-05-30 Thread Maxim Muzafarov
Igniters,


I've looked through Nikita's changes and I think that for the current
issue [1] we should not allow MOVING partitions in the cluster (it must
be stable) when running the lightweight PME on a BLT node leave event,
in order to achieve truly unblocked operations. Here are my thoughts on
why.

In general, as Nikita mentioned above, the existence of MOVING
partitions in the cluster means that the rebalance procedure is
currently running. The rebalancing node owns cache partitions locally
and sends the actual states of its local partitions to the coordinator
node in the background (with an additional timeout). So, we will always
have a lag between the local node's partition states and the partition
states known to all other cluster nodes. This lag can be very large,
since the previous #scheduleResendPartitions() is cancelled when a new
cache group rebalance finishes. Without fair synchronization of
partition states (that is, without a full PME), and with local affinity
recalculation on a BLT node leave event, other nodes will in most cases
mark such partitions LOST, although they are in fact present in the
cluster and saved on some node under checkpoint. As I see it, this
cannot be solved by saving the transition states of such partitions on
each node.
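
To illustrate why the lag can grow, here is a small simulation of a
resend that is cancelled and rescheduled each time another cache group
finishes rebalancing. It is a sketch under assumptions: the class
ResendLagSketch, the method firstResendTime and the logical time units
are hypothetical, and only the cancel-and-reschedule behaviour is taken
from the description above.

```java
/** Hypothetical simulation of a resend that is rescheduled on every rebalance finish. */
final class ResendLagSketch {
    /**
     * @param finishTimes ascending logical times at which cache group rebalances finish
     * @param delay timeout between scheduling a resend and actually sending it
     * @return logical time at which the first resend actually goes out
     */
    static long firstResendTime(long[] finishTimes, long delay) {
        if (finishTimes.length == 0)
            throw new IllegalArgumentException("no rebalance finished");

        long pending = finishTimes[0] + delay;

        for (int i = 1; i < finishTimes.length; i++) {
            if (finishTimes[i] >= pending)
                break; // the pending resend fired before the next group finished

            // The previous resend is cancelled and a new one is scheduled.
            pending = finishTimes[i] + delay;
        }

        return pending;
    }
}
```

With a delay of 10 and groups finishing at times 0, 5, 10 and 15, the
first resend goes out at time 25, so the OWNING state reached at time 0
stays invisible to other nodes for 25 time units instead of 10.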

As for the case when the coordinator calculates the affinity and sends
the "full map" to other nodes, I think it is better to focus on
designing a new lightweight PME for when the rebalancing process
finishes. Currently a full distributed PME will occur anyway, triggered
by the coordinator sending CacheAffinityChangeMessage, but I think we
can avoid it here, since no new MOVING or OWNING node partition states
are introduced and all the previous mappings are still valid. We don't
need a distributed PME if we leave partition primaries on the nodes
where they were and just set the correct partition states via a light
discovery message.

So, my plan here can be:
Phase 1. Lightweight PME on BLT node leave on a stable cluster (no
MOVING partitions);
Phase 2. Lightweight PME when a BLT node finishes its rebalance procedure.
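
The Phase 1 restriction amounts to a guard on the partition states
before choosing the exchange type. A minimal sketch, assuming a
simplified node2part structure (node id to partition id to state); the
class StableClusterCheck, its enums and its methods are hypothetical and
not part of Ignite.

```java
import java.util.Map;

/** Hypothetical sketch of the Phase 1 guard: lightweight PME only on a stable cluster. */
final class StableClusterCheck {
    /** Simplified partition states used in this sketch. */
    enum State { MOVING, OWNING, RENTING, EVICTED, LOST }

    /** Exchange kinds discussed in the thread. */
    enum ExchangeType { LIGHTWEIGHT, DISTRIBUTED }

    /** Returns true if any node reports at least one MOVING partition. */
    static boolean hasMovingPartitions(Map<String, Map<Integer, State>> node2part) {
        for (Map<Integer, State> parts : node2part.values()) {
            if (parts.containsValue(State.MOVING))
                return true;
        }

        return false;
    }

    /** Chooses the exchange type for a BLT node leave event. */
    static ExchangeType onBaselineNodeLeft(Map<String, Map<Integer, State>> node2part) {
        return hasMovingPartitions(node2part) ? ExchangeType.DISTRIBUTED : ExchangeType.LIGHTWEIGHT;
    }
}
```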

Folks, Nikita,
WDYT?

[1] https://issues.apache.org/jira/browse/IGNITE-9913


Re: Lightweight version of partitions map exchange

2019-06-05 Thread Nikita Amelchev

-- 
Best wishes,
Amelchev Nikita


Re: Lightweight version of partitions map exchange

2019-07-01 Thread Nikita Amelchev

-- 
Best wishes,
Amelchev Nikita


Re: Lightweight version of partitions map exchange

2019-07-09 Thread Ivan Rakov
after
rebalancing.


Re: Lightweight version of partitions map exchange

2019-07-09 Thread Ivan Rakov

Re: Lightweight version of partitions map exchange

2019-07-09 Thread Ivan Rakov
 on BLT node
leave event occurred to achieve truly unlocked operations and here are
my thoughts why.

In general, as Nikita mentioned above, the existence of MOVING
partitions in the cluster means that the rebalance procedure is
currently running. It owns cache partitions locally and sends in the
background (with additional timeout) the actual statuses of his local
partitions to the coordinator node. So, we will always have a lag
between local node partition states and all other cluster nodes
partitions states. This lag can be very huge since previous
#scheduleResendPartitions() is cancelled when a new cache group
rebalance finished. Without the fair partition states synchronization
(without full PME) and in case of local affinity recalculation on BLT
node leave event, other nodes will mark such partitions LOST in most
of the cases, which in fact are present in the cluster and saved on
some node under checkpoint. I see that it cannot be solved by saving
transition states of such partitions on each node.

As for the case when the coordinator will calculate affinity and send
"full map" to other nodes, I think it is better here to focus on
designing a new lightweight PME when the rebalancing process finishes.
Сurrently full distributed PME will occur anyway by the coordinator by
sending CacheAffinityChaneMessage, but I think we can avoid it here,
since no new MOVING or OWNING node partition states are introduced and
all the previous mappings are still valid. We don't need a distributed
PME if we will leave partition primaries on those nodes where they
were, just set correct partition statuses via a light discovery
message.

So, my plan here can be:
Phase 1. Lightweight PME on BLT node leave on a stable cluster (no
MOVING partitions);
Phase 2. Lightweight PME on BLT node finishes its rebalance procedure.

Folks, Nikita,
WDYT?

[1] https://issues.apache.org/jira/browse/IGNITE-9913

On Fri, 24 May 2019 at 13:31, Nikita Amelchev  wrote:

Hello, Igniters!

I am working on the implementation of lightweight PME for the case of
a BLT node leave. [1]

There is a question: whether to allow lightweight PME if the cluster
has MOVING partitions?

The problems that may happen if allow:
  - Nodes can differently select the primary node from current OWNING backups.
  - One part of nodes can mark a partition as LOST and another one as OWNING.

We can take the partition states from the node2part map. The root
cause of these problems is that when rebalancing ends (the last message
is received), the local node updates its partition state to OWNING (and
schedules a partitions resend). Until that resend arrives, other nodes
may still see the old state, which may lead to different affinity
recalculations on different nodes.
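
A small sketch of how the two problems above can arise. This is a toy
model, not Ignite code: the State enum, the selectPrimary helper and the
node names are all hypothetical, and the only assumption is that each node
picks the first OWNING copy in affinity order from its own local view of
the partition states.

```java
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical types, not Ignite classes. */
public class PrimarySelectionDivergence {
    public enum State { OWNING, MOVING, LOST }

    /**
     * Returns the first node in affinity order that the given view of the
     * partition states reports as OWNING, or null if the view has no owner
     * (in which case this node would treat the partition as LOST).
     */
    public static String selectPrimary(List<String> affinityOrder, Map<String, State> view) {
        for (String node : affinityOrder) {
            if (view.get(node) == State.OWNING)
                return node;
        }
        return null;
    }

    public static void main(String[] args) {
        // The previous primary has left; B and C hold the remaining copies.
        List<String> order = List.of("B", "C");

        // View on a node that has already learned B finished rebalancing.
        Map<String, State> fresh = Map.of("B", State.OWNING, "C", State.OWNING);
        // View on a node that has not yet received B's partitions resend.
        Map<String, State> stale = Map.of("B", State.MOVING, "C", State.OWNING);

        System.out.println(selectPrimary(order, fresh)); // B
        System.out.println(selectPrimary(order, stale)); // C

        // With B as the only remaining copy, the stale view finds no owner.
        System.out.println(selectPrimary(List.of("B"), stale)); // null
    }
}
```

The first two calls show different primaries being chosen from the same
affinity order; the last one shows a node that would mark the partition
LOST while a node with the fresh view considers it OWNING.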

I see two solutions:

1. Nodes store the "moving-owning" transition of partition states
until the rebalancing ends. Each node recalculates the affinity locally
on the node left event.
2. The coordinator calculates affinity and sends the "full map" to
nodes. In this case, nodes should still wait for the topology change
event (to get the correct topology in discovery).

If we disallow lightweight PME when the cluster has MOVING partitions,
there are no problems and it works fine.

Any thoughts?

1. https://issues.apache.org/jira/browse/IGNITE-9913

On Fri, 29 Mar 2019 at 15:00, Nikita Amelchev wrote:

Pavel,
I have provided MTCGA bot status in Jira issue comments. [1]

Eduard,
Yes, in the current implementation a distributed PME will be performed
if in-memory caches are configured.

1. https://issues.apache.org/jira/browse/IGNITE-9913

On Fri, 29 Mar 2019 at 14:49, Eduard Shangareev wrote:

Nikita,

It sounds cool. But I didn't get the part about in-memory caches. The
baseline is not used for their affinity calculation.
So, would this improvement be switched off only for them, or completely
(when such caches are present)?


Re: Lightweight version of partitions map exchange

2019-07-22 Thread Alexei Scherbakov

-- 

Best regards,
Alexei Scherbakov