Alex, thanks for the summary and proposal. Anton, Ivan, and others who took part in this discussion, what are your thoughts?

I see this rolling-upgrades-based approach as a reasonable solution. Even though a node shutdown is expected, the procedure doesn't lead to a cluster outage, meaning it can be used in 24x7 production environments.
- Denis

On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> Created a ticket for the first stage of this improvement. This can be a
> first change towards the online mode suggested by Sergey and Anton.
> https://issues.apache.org/jira/browse/IGNITE-12263
>
> On Fri, 4 Oct 2019 at 19:38, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
>
> > Maxim,
> >
> > Having a cluster-wide lock for a cache does not improve the availability
> > of the solution. A user cannot defragment a cache if the cache is
> > involved in a mission-critical operation, so having a lock on such a
> > cache is equivalent to a whole-cluster shutdown.
> >
> > We should decide between either a single offline node or a more
> > complex, fully online solution.
> >
> > On Fri, 4 Oct 2019 at 11:55, Maxim Muzafarov <mmu...@apache.org> wrote:
> >
> >> Igniters,
> >>
> >> This thread seems to be endless, but what if we introduce some kind of
> >> distributed cache group write lock (exclusive to some of the internal
> >> Ignite processes)? I think it would help to solve a batch of problems,
> >> like:
> >>
> >> 1. defragmentation of all cache group partitions on the local node
> >> without concurrent updates;
> >> 2. improving data loading with the data streamer isolation mode [1] -
> >> it seems we should not allow concurrent updates to a cache while it is
> >> in the `fast data load` step;
> >> 3. recovery from a snapshot without cache stop/start actions.
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-11793
> >>
> >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skoz...@gridgain.com> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I'm not sure that taking a node offline is the best way to do that.
> >> > Cons:
> >> > - different caches may have different levels of fragmentation, but
> >> > we force the whole node to stop
> >> > - taking a node offline is a maintenance operation that will require
> >> > adding +1 backup to reduce the risk of data loss
> >> > - baseline auto-adjustment?
> >> > - impact on index rebuild?
> >> > - cache configuration changes (or cache destroy) while the node is
> >> > offline
> >> >
> >> > What about other ways that avoid a node stop? E.g. take a cache
> >> > group on a node offline? Add a *defrag <cache_group>* command to
> >> > control.sh to force-start rebalancing internally on the node, with
> >> > an expected impact on performance.
> >> >
> >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <a...@apache.org> wrote:
> >> >
> >> > > Alexey,
> >> > > As for me, it does not matter whether it will be an IEP, an
> >> > > umbrella ticket or a single issue.
> >> > > The most important thing is the Assignee :)
> >> > >
> >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk
> >> > > <alexey.goncha...@gmail.com> wrote:
> >> > >
> >> > > > Anton, do you think we should file a single ticket for this or
> >> > > > should we go with an IEP? As of now, the change does not look
> >> > > > big enough for an IEP to me.
> >> > > >
> >> > > > On Thu, 3 Oct 2019 at 11:18, Anton Vinogradov <a...@apache.org> wrote:
> >> > > >
> >> > > > > Alexey,
> >> > > > >
> >> > > > > Sounds good to me.
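As a rough illustration of the write-lock idea Maxim proposes above, a prototype could be built on Ignite's public IgniteLock API. This is only a sketch: the lock-name convention and the defragmentLocalPartitions() hook are made up for illustration and do not exist in Ignite.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteLock;
    import org.apache.ignite.Ignition;

    public class CacheGroupDefragSketch {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start("ignite-config.xml");

            // One cluster-wide reentrant lock per cache group. failoverSafe = true
            // releases the lock if the owning node leaves the cluster.
            IgniteLock groupLock = ignite.reentrantLock(
                "defrag-lock-myCacheGroup", /* failoverSafe */ true,
                /* fair */ false, /* create */ true);

            groupLock.lock();
            try {
                // Every internal process that honors the lock is blocked here,
                // so local partitions can be rewritten without concurrent updates.
                defragmentLocalPartitions("myCacheGroup"); // hypothetical hook
            }
            finally {
                groupLock.unlock();
            }
        }

        private static void defragmentLocalPartitions(String grpName) {
            // Placeholder for the actual defragmentation work.
        }
    }

The sketch also makes Alexey's objection visible: while the lock is held, the cache group is effectively unavailable, so for that cache the result is indistinguishable from a shutdown.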
> >> > > > >
> >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk
> >> > > > > <alexey.goncha...@gmail.com> wrote:
> >> > > > >
> >> > > > > > Anton,
> >> > > > > >
> >> > > > > > Switching a partition to and from the SHRINKING state will
> >> > > > > > require intricate synchronization in order to properly
> >> > > > > > determine the start position for historical rebalance
> >> > > > > > without a PME.
> >> > > > > >
> >> > > > > > I would still go with an offline-node approach, but instead
> >> > > > > > of cleaning the persistence, we can do effective
> >> > > > > > defragmentation while the node is offline, because we are
> >> > > > > > sure that there is no concurrent load. After the
> >> > > > > > defragmentation completes, we bring the node back into the
> >> > > > > > cluster, and historical rebalance will kick in
> >> > > > > > automatically. It will still require manual node restarts,
> >> > > > > > but since the data is not removed, there are no additional
> >> > > > > > risks. Also, this will be an excellent solution for those
> >> > > > > > who can afford downtime and execute the defragment command
> >> > > > > > on all nodes in the cluster simultaneously - this will be
> >> > > > > > the fastest way possible.
> >> > > > > >
> >> > > > > > --AG
> >> > > > > >
> >> > > > > > On Mon, 30 Sep 2019 at 09:29, Anton Vinogradov <a...@apache.org> wrote:
> >> > > > > >
> >> > > > > > > Alexei,
> >> > > > > > > >> stopping the fragmented node and removing partition
> >> > > > > > > >> data, then starting it again
> >> > > > > > >
> >> > > > > > > That's exactly what we're doing to solve the fragmentation
> >> > > > > > > issue.
> >> > > > > > > The problem here is that we have to perform N/B
> >> > > > > > > restart-rebalance operations (N - cluster size, B -
> >> > > > > > > backups count), and it takes a lot of time, with risks of
> >> > > > > > > losing the data.
> >> > > > > > >
> >> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov
> >> > > > > > > <alexey.scherbak...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Probably this should be allowed through a public API;
> >> > > > > > > > actually, this is the same as manual rebalancing.
> >> > > > > > > >
> >> > > > > > > > On Fri, 27 Sep 2019 at 17:40, Alexei Scherbakov
> >> > > > > > > > <alexey.scherbak...@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > The poor man's solution for the problem would be
> >> > > > > > > > > stopping the fragmented node and removing partition
> >> > > > > > > > > data, then starting it again, allowing full state
> >> > > > > > > > > transfer, this time without deletes.
> >> > > > > > > > > Rinse and repeat for all owners.
> >> > > > > > > > >
> >> > > > > > > > > Anton Vinogradov, would this work for you as a
> >> > > > > > > > > workaround?
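For reference, that per-node workaround might look roughly like the sketch below. It assumes the default persistence layout under $IGNITE_HOME/work/db with a placeholder consistent id, and it sketches the idea rather than a tested procedure.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.stream.Stream;

    import org.apache.ignite.Ignition;

    public class PoorMansDefrag {
        public static void main(String[] args) throws IOException {
            // 1. Gracefully stop the local node (cancel = false lets jobs finish).
            Ignition.stop(false);

            // 2. Remove the node's persistence files. With default settings the
            //    page store and WAL live under $IGNITE_HOME/work/db; the
            //    consistent id below is a placeholder.
            Path db = Paths.get(System.getenv("IGNITE_HOME"), "work", "db");
            deleteRecursively(db.resolve("node_consistent_id"));
            deleteRecursively(db.resolve("wal").resolve("node_consistent_id"));

            // 3. Start the node again. With its files gone, full state transfer
            //    streams the data back without the deleted entries, i.e. already
            //    compact. Rinse and repeat for every owner, waiting for rebalance
            //    to finish in between.
            Ignition.start("ignite-config.xml");
        }

        private static void deleteRecursively(Path dir) throws IOException {
            if (!Files.exists(dir))
                return;

            try (Stream<Path> files = Files.walk(dir)) {
                files.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
            }
        }
    }

The N/B figure Anton quotes presumably comes from the fact that with B backups at most B owners can be cycled per round without risking data loss, which is exactly why the whole loop is slow on large clusters.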
> >> > > > > > > > >
> >> > > > > > > > > On Thu, 19 Sep 2019 at 13:03, Anton Vinogradov <a...@apache.org> wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Alexey,
> >> > > > > > > > >>
> >> > > > > > > > >> Let's combine your and Ivan's proposals.
> >> > > > > > > > >>
> >> > > > > > > > >> >> vacuum command, which acquires an exclusive table
> >> > > > > > > > >> >> lock, so no concurrent activities on the table
> >> > > > > > > > >> >> are possible.
> >> > > > > > > > >> and
> >> > > > > > > > >> >> Could the problem be solved by stopping a node
> >> > > > > > > > >> >> which needs to be defragmented, clearing the
> >> > > > > > > > >> >> persistence files and restarting the node?
> >> > > > > > > > >> >> After rebalancing, the node will receive all the
> >> > > > > > > > >> >> data back without fragmentation.
> >> > > > > > > > >>
> >> > > > > > > > >> How about having a special partition state,
> >> > > > > > > > >> SHRINKING?
> >> > > > > > > > >> This state should mean that the partition is
> >> > > > > > > > >> unavailable for reads and updates, but keeps its
> >> > > > > > > > >> update-counters and is not marked as lost, renting
> >> > > > > > > > >> or evicted.
> >> > > > > > > > >> In this state we are able to iterate over the
> >> > > > > > > > >> partition and apply its entries to another file in a
> >> > > > > > > > >> compact way.
> >> > > > > > > > >> Indices should be updated during the copy-on-shrink
> >> > > > > > > > >> procedure or on shrink completion.
> >> > > > > > > > >> Once the shrunk file is ready, we should replace the
> >> > > > > > > > >> original partition file with it and mark the
> >> > > > > > > > >> partition as MOVING, which will start the historical
> >> > > > > > > > >> rebalance.
> >> > > > > > > > >> Shrinking should be performed during low-activity
> >> > > > > > > > >> periods, but even if we find that activity was high
> >> > > > > > > > >> and historical rebalance is not suitable, we may
> >> > > > > > > > >> just remove the file and use regular rebalance to
> >> > > > > > > > >> restore the partition (which will also result in a
> >> > > > > > > > >> shrink).
> >> > > > > > > > >>
> >> > > > > > > > >> BTW, it seems we can implement partition shrink in a
> >> > > > > > > > >> cheap way.
> >> > > > > > > > >> We may just use the rebalancing code to apply the
> >> > > > > > > > >> fat partition's entries to the new file.
> >> > > > > > > > >> So, there are 3 stages here: local rebalance, index
> >> > > > > > > > >> update and global historical rebalance.
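Anton's three stages map onto code fairly directly. A hypothetical sketch of the copy-on-shrink flow follows; every type and method name here is invented for illustration and stands in for Ignite's page-store and rebalance internals.

    // All of these types are hypothetical stand-ins for Ignite internals.
    enum PartitionState { OWNING, MOVING, SHRINKING, LOST }

    interface Entry { }

    interface PartitionFile {
        void append(Entry e); // writes the entry into the next free slot
    }

    interface Partition {
        void state(PartitionState s);
        PartitionFile createShadowFile();
        Iterable<Entry> entries();
        void rebuildIndexes(PartitionFile f);
        void replaceFile(PartitionFile f);
    }

    final class ShrinkWorker {
        /** Compacts one partition file, then hands off to historical rebalance. */
        void shrink(Partition part) {
            // Stage 0: freeze the partition. Reads/updates are rejected, but
            // update-counters are kept, so the partition is not considered lost.
            part.state(PartitionState.SHRINKING);

            // Stage 1, "local rebalance": stream live entries into a fresh file
            // in append order, leaving none of the holes old deletes created.
            PartitionFile compact = part.createShadowFile();
            for (Entry e : part.entries())
                compact.append(e);

            // Stage 2: update (or rebuild) indexes to point at the new pages.
            part.rebuildIndexes(compact);

            // Stage 3: swap the files and mark the partition MOVING. Its
            // update-counters now lag the cluster, so historical rebalance
            // replays only the updates that arrived while we were shrinking.
            part.replaceFile(compact);
            part.state(PartitionState.MOVING);
        }
    }

The point of keeping the update-counters through the freeze is that the partition can rejoin via cheap historical rebalance instead of full state transfer - which is exactly the synchronization Alexey warns is intricate to get right without a PME.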
> >> > > > > > > > >>
> >> > > > > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk
> >> > > > > > > > >> <alexey.goncha...@gmail.com> wrote:
> >> > > > > > > > >>
> >> > > > > > > > >> > Anton,
> >> > > > > > > > >> >
> >> > > > > > > > >> > >> The solution which Anton suggested does not
> >> > > > > > > > >> > >> look easy because it will most likely
> >> > > > > > > > >> > >> significantly hurt performance
> >> > > > > > > > >> > > Mostly agree here, but what drop do we expect?
> >> > > > > > > > >> > > What price are we ready to pay?
> >> > > > > > > > >> > > Not sure, but it seems some vendors are ready
> >> > > > > > > > >> > > to pay, for example, a 5% drop for this.
> >> > > > > > > > >> >
> >> > > > > > > > >> > 5% may be a big drop for some use cases, so I
> >> > > > > > > > >> > think we should look at how to improve
> >> > > > > > > > >> > performance, not how to make it worse.
> >> > > > > > > > >> >
> >> > > > > > > > >> > >> it is hard to maintain a data structure to
> >> > > > > > > > >> > >> choose a "page from the free-list with enough
> >> > > > > > > > >> > >> space closest to the beginning of the file".
> >> > > > > > > > >> > > We can just split each free-list bucket in two
> >> > > > > > > > >> > > and use the first for pages in the first half
> >> > > > > > > > >> > > of the file and the second for the rest.
> >> > > > > > > > >> > > Only two buckets are required here since,
> >> > > > > > > > >> > > during the file shrink, the first bucket's
> >> > > > > > > > >> > > window will shrink too.
> >> > > > > > > > >> > > It seems this gives us the same price on put:
> >> > > > > > > > >> > > just use the first bucket if it's not empty.
> >> > > > > > > > >> > > The remove price (with merge) will increase, of
> >> > > > > > > > >> > > course.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > The compromise solution is to have priority put
> >> > > > > > > > >> > > (into the first part of the file), keep removal
> >> > > > > > > > >> > > as is, and add schedulable per-page migration
> >> > > > > > > > >> > > for the rest of the data during low-activity
> >> > > > > > > > >> > > periods.
> >> > > > > > > > >> >
> >> > > > > > > > >> > Free lists are large and slow by themselves; it
> >> > > > > > > > >> > is expensive to checkpoint them and read them on
> >> > > > > > > > >> > start, so as a long-term solution I would look
> >> > > > > > > > >> > into removing them. Moreover, I am not sure that
> >> > > > > > > > >> > adding yet another background process will
> >> > > > > > > > >> > improve the codebase's reliability and
> >> > > > > > > > >> > simplicity.
> >> > > > > > > > >> >
> >> > > > > > > > >> > If we want to go the hard path, I would look at a
> >> > > > > > > > >> > free page tracking bitmap - a special bitmask
> >> > > > > > > > >> > page, where each page in an adjacent block is
> >> > > > > > > > >> > marked as 0 if it has more free space than a
> >> > > > > > > > >> > certain configurable threshold (say, 80%) - free,
> >> > > > > > > > >> > and 1 if less (full). Some vendors have
> >> > > > > > > > >> > successfully implemented this approach, which
> >> > > > > > > > >> > looks much more promising, but is harder to
> >> > > > > > > > >> > implement.
> >> > > > > > > > >> >
> >> > > > > > > > >> > --AG
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Alexei Scherbakov
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Best regards,
> >> > > > > > > > Alexei Scherbakov
> >> >
> >> > --
> >> > Sergey Kozlov
> >> > GridGain Systems
> >> > www.gridgain.com
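To make the free page tracking bitmap idea above concrete, here is a minimal, heap-based sketch in plain Java. A real implementation would pack the bits into a dedicated bitmask page per block of data pages, and the byte threshold below stands in for the percentage threshold Alexey mentions.

    import java.util.BitSet;

    // Sketch: one bit per data page in an adjacent block.
    // Clear bit (0) = the page has at least `threshold` bytes free.
    // Set bit (1) = the page is effectively full.
    final class FreePageBitmap {
        private final BitSet full;
        private final int pages;      // number of pages tracked by this block
        private final int threshold;  // min free bytes for a page to count as free

        FreePageBitmap(int pages, int threshold) {
            this.pages = pages;
            this.threshold = threshold;
            this.full = new BitSet(pages);
        }

        /** Called whenever a page's free space changes (insert, remove, merge). */
        void onFreeSpaceChanged(int pageIdx, int freeBytes) {
            full.set(pageIdx, freeBytes < threshold);
        }

        /**
         * Lowest-indexed page with enough free space, or -1 if none. Preferring
         * low indexes keeps new data near the start of the file, which is what
         * would later allow the tail of the file to be truncated.
         */
        int firstFreePage() {
            int idx = full.nextClearBit(0);
            return idx < pages ? idx : -1;
        }
    }

Compared to a free list, a lookup is a single nextClearBit() scan over a compact structure, and checkpointing it means writing a few bitmask pages instead of a large page list - plausibly why Alexey calls it more promising, if harder to wire into the page memory.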