Ivan,

I agree with a combined approach: a threshold for small partitions and an
update count for partitions that have outgrown it.
This helps to avoid historical rebalance for partitions that are updated
infrequently.

Reading a large piece of WAL (more than 100 GB) can only happen when a client
has intentionally configured the archive that large.
In that case there is no doubt we can read it; otherwise the WAL space would
not have been configured to be that large.

I don't see a connection between the iterator optimization and the issue in
the atomic protocol.
Reordering in WAL that happens within a checkpoint where the counter did not
change is an extremely rare case, and this issue will not be solved here for
the generic case; it should be fixed within the bounds of the protocol itself.

I think we can modify the heuristic as follows (a sketch is right after this
list):
1) Exclude small partitions by threshold (reduce
IGNITE_PDS_WAL_REBALANCE_THRESHOLD to 500).
2) Select a partition for historical rebalance only when the difference
between the counters is less than the partition size.
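
To make the proposal concrete, here is a minimal sketch in Java of what this
decision could look like. The class and method names are hypothetical (not the
actual Ignite API); it only assumes that the coordinator knows the partition
size and the update counters of the supplier and demander, as mentioned in the
original proposal below.

    /**
     * Sketch of the combined heuristic (hypothetical names, not the actual
     * Ignite API): full rebalance for small partitions, historical rebalance
     * only when the counter gap is smaller than the partition size.
     */
    public class HistoricalRebalanceHeuristic {
        /** Lowered threshold, e.g. IGNITE_PDS_WAL_REBALANCE_THRESHOLD reduced to 500. */
        private final long walRebalanceThreshold;

        public HistoricalRebalanceHeuristic(long walRebalanceThreshold) {
            this.walRebalanceThreshold = walRebalanceThreshold;
        }

        /**
         * @param partSize Number of rows in the partition on the supplier node.
         * @param supplierCntr Update counter on the supplier (owner) node.
         * @param demanderCntr Update counter on the demander (joining) node.
         * @param histAvailable Whether the WAL history still covers demanderCntr.
         * @return {@code true} if historical (WAL-based) rebalance should be chosen.
         */
        public boolean useHistorical(long partSize, long supplierCntr, long demanderCntr,
            boolean histAvailable) {
            if (!histAvailable)
                return false;

            // 1) Small partitions are always rebalanced fully.
            if (partSize < walRebalanceThreshold)
                return false;

            // 2) Historical rebalance only if the number of missed updates (N)
            //    is smaller than the partition size (M).
            long missedUpdates = supplierCntr - demanderCntr;

            return missedUpdates < partSize;
        }
    }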

We should also implement the mentioned optimization for the historical
iterator, which may reduce the time spent reading a large WAL interval; a
sketch of the idea follows.
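
A minimal sketch of that iterator optimization, again with hypothetical types
rather than the real WAL iterator API: checkpoint intervals in which the
partition's update counter did not move are skipped entirely. As discussed
above, reordering in atomic caches is out of scope here and has to be handled
at the protocol level.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    /**
     * Sketch of a historical iterator that skips checkpoint intervals in which
     * the update counter of the rebalanced partition did not change. Types and
     * fields are hypothetical; the real iterator works with WAL pointers.
     */
    class SkippingHistoricalIterator {
        /** Checkpoint marker with per-partition counters captured at checkpoint time. */
        static class Checkpoint {
            final long walPointer;                 // Position of the checkpoint record in WAL.
            final Map<Integer, Long> partCounters; // partId -> update counter at this checkpoint.

            Checkpoint(long walPointer, Map<Integer, Long> partCounters) {
                this.walPointer = walPointer;
                this.partCounters = partCounters;
            }
        }

        /**
         * Returns WAL intervals [from, to) that actually have to be read for the
         * given partition; intervals between two checkpoints with the same counter
         * are skipped. Caveat: with atomic caches, reordered updates could land in
         * an interval whose counter did not change; that case must be handled by
         * the protocol itself, as noted above.
         */
        static List<long[]> intervalsToRead(List<Checkpoint> checkpoints, int partId) {
            List<long[]> res = new ArrayList<>();

            for (int i = 0; i < checkpoints.size() - 1; i++) {
                Checkpoint cur = checkpoints.get(i);
                Checkpoint next = checkpoints.get(i + 1);

                long curCntr = cur.partCounters.getOrDefault(partId, 0L);
                long nextCntr = next.partCounters.getOrDefault(partId, 0L);

                // The counter moved between these checkpoints -> the interval contains updates.
                if (nextCntr > curCntr)
                    res.add(new long[] {cur.walPointer, next.walPointer});
            }

            return res;
        }
    }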

On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:

> Hi Vladislav,
>
> Thanks for raising this topic.
> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000)
> is controversial. Assuming that the default number of partitions is 1024,
> cache should contain a really huge amount of data in order to make WAL
> delta rebalancing possible. In fact, it's currently disabled for most
> production cases, which makes rebalancing of persistent caches unreasonably
> long.
>
> I think, your approach [1] makes much more sense than the current
> heuristic, let's move forward with the proposed solution.
>
> Though, there are some other corner cases, e.g. this one:
> - Configured size of WAL archive is big (>100 GB)
> - Cache has small partitions (e.g. 1000 entries)
> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
> - There is another cache with very frequent updates which allocates >99% of
> WAL
> In such scenario we may need to iterate over >100 GB of WAL in order to
> fetch <1% of needed updates. Even though the amount of network traffic is
> still optimized, it would be more effective to transfer partitions with
> ~1000 entries fully instead of reading >100 GB of WAL.
>
> I want to highlight that your heuristic definitely makes the situation
> better, but due to possible corner cases we should keep the fallback lever
> to restrict or limit historical rebalance as before. Probably, it would be
> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
> default value (1000, 500 or even 0) and apply your heuristic only for
> partitions with bigger size.
>
> Regarding case [2]: it looks like an improvement that can mitigate some
> corner cases (including the one that I have described). I'm ok with it as
> long as it takes data updates reordering on backup nodes into account. We
> don't track skipped updates for atomic caches. As a result, detection of
> the absence of updates between two checkpoint markers with the same
> partition counter can be false positive.
>
> --
> Best Regards,
> Ivan Rakov
>
> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov <vldpyat...@gmail.com>
> wrote:
>
> > Hi guys,
> >
> > I want to implement a more honest heuristic for historical rebalance.
> > Currently, a cluster decides whether to use historical rebalance based
> > only on the partition size. This threshold is better known by the name of
> > the property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
> > It can prevent historical rebalance when a partition is too small, but if
> > the WAL contains more updates than the size of the partition, historical
> > rebalance can still be chosen.
> > There is a ticket where a fairer heuristic needs to be implemented [1].
> >
> > My idea for the implementation is to estimate the amount of data that will
> > be transferred over the network. In other words, if rebalancing requires a
> > part of the WAL that contains N updates in order to recover a partition
> > that holds M rows in total on another node, historical rebalance should be
> > chosen only when N < M (and the WAL history must be present as well).
> >
> > This approach is easy to implement, because the coordinator node has the
> > partition sizes and the counter intervals. But in this case the cluster
> > can still end up reading a very long WAL history to find only a few
> > updates. I assume it is possible to work around this if the historical
> > rebalance iterator does not handle checkpoints that contain no updates of
> > the particular cache. A checkpoint can be skipped if the counters for the
> > cache (maybe even for specific partitions) did not change between it and
> > the next one.
> >
> > The ticket for improving the historical rebalance iterator is [2].
> >
> > I want to hear the community's view on the thoughts above.
> > Maybe someone has another opinion?
> >
> > [1]: https://issues.apache.org/jira/browse/IGNITE-13253
> > [2]: https://issues.apache.org/jira/browse/IGNITE-13254
> >
> > --
> > Vladislav Pyatkov
> >
>


-- 
Vladislav Pyatkov
