Created a ticket for the first stage of this improvement. It can be a first step towards the online mode suggested by Sergey and Anton: https://issues.apache.org/jira/browse/IGNITE-12263
On Fri, 4 Oct 2019 at 19:38, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

Maxim,

Having a cluster-wide lock for a cache does not improve the availability of the solution. A user cannot defragment a cache if the cache is involved in a mission-critical operation, so holding a lock on such a cache is equivalent to a whole-cluster shutdown.

We should decide between either a single offline node or a more complex, fully online solution.

On Fri, 4 Oct 2019 at 11:55, Maxim Muzafarov <mmu...@apache.org> wrote:

Igniters,

This thread seems to be endless, but what if we introduce some kind of cache group distributed write lock (exclusive for some of the internal Ignite processes)? I think it would help to solve a batch of problems, like:

1. defragmentation of all cache group partitions on the local node without concurrent updates;
2. improving data loading with the data streamer isolation mode [1] - it seems we should not allow concurrent updates to a cache while we are on the `fast data load` step;
3. recovery from a snapshot without cache stop/start actions.

[1] https://issues.apache.org/jira/browse/IGNITE-11793

On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skoz...@gridgain.com> wrote:

Hi,

I'm not sure that taking the node offline is the best way to do that.
Cons:
- different caches may have different fragmentation, but we force the whole node to stop;
- an offline node is a maintenance operation that will require adding +1 backup to reduce the risk of data loss;
- what about baseline auto-adjustment?
- what is the impact on index rebuild?
- what about cache configuration changes (or destroy) while the node is offline?

What about other ways that do not require a node stop? E.g., take a cache group on a node offline? Or add a defrag <cache_group> command to control.sh that forces a rebalance internally on the node, with an expected impact on performance.

--
Sergey Kozlov
GridGain Systems
www.gridgain.com

On Thu, 3 Oct 2019 at 12:08, Anton Vinogradov <a...@apache.org> wrote:

Alexey,
As for me, it does not matter whether it will be an IEP, an umbrella, or a single issue. The most important thing is the Assignee :)

On Thu, 3 Oct 2019 at 11:59, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

Anton, do you think we should file a single ticket for this or should we go with an IEP? As of now, the change does not look big enough for an IEP to me.

On Thu, 3 Oct 2019 at 11:18, Anton Vinogradov <a...@apache.org> wrote:

Alexey,

Sounds good to me.

On Thu, 3 Oct 2019 at 10:51, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

Anton,

Switching a partition to and from the SHRINKING state will require intricate synchronization in order to properly determine the start position for historical rebalance without a PME.

I would still go with an offline-node approach, but instead of cleaning the persistence, we can do an effective defragmentation while the node is offline, because we are sure that there is no concurrent load. After the defragmentation completes, we bring the node back to the cluster and historical rebalance will kick in automatically. It will still require manual node restarts, but since the data is not removed, there are no additional risks. Also, this will be an excellent solution for those who can afford downtime and can execute the defragment command on all nodes in the cluster simultaneously - this will be the fastest way possible.

--AG
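To make the offline-defragmentation idea concrete, here is a minimal sketch assuming a hypothetical PartitionStore abstraction (the real Ignite page-store API is different). The point is only that, with no concurrent load, live entries can be copied densely into a fresh file and the files swapped:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;

/** Hypothetical stand-in for a partition file: iterates live entries, appends densely. */
interface PartitionStore extends AutoCloseable, Iterable<byte[]> {
    void append(byte[] entry) throws IOException;

    @Override void close() throws IOException;
}

final class OfflineDefragmenter {
    /** Runs while the node is stopped or in maintenance mode, i.e. with no concurrent load. */
    static void defragment(Path partFile, Function<Path, PartitionStore> open) throws Exception {
        Path compact = partFile.resolveSibling(partFile.getFileName() + ".defrag");

        try (PartitionStore src = open.apply(partFile);
             PartitionStore dst = open.apply(compact)) {
            for (byte[] entry : src) // only live entries are visited
                dst.append(entry);   // written back-to-back: no holes left by deletes
        }

        // Swap in the compact file. Update counters are untouched, so historical
        // rebalance can catch the node up once it rejoins the cluster.
        Files.move(compact, partFile, StandardCopyOption.ATOMIC_MOVE);
    }
}

Because the node is out of the topology for the duration, no synchronization with concurrent updates is needed - exactly the property the offline approach buys.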
On Mon, 30 Sep 2019 at 09:29, Anton Vinogradov <a...@apache.org> wrote:

Alexei,

> stopping fragmented node and removing partition data, then starting it again

That's exactly what we're doing to solve the fragmentation issue.
The problem here is that we have to perform N/B restart-rebalance operations (N - cluster size, B - backups count), and it takes a lot of time, with a risk of losing the data.

On Fri, 27 Sep 2019 at 17:49, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:

Probably this should be allowed through a public API; actually, this is the same as manual rebalancing.

On Fri, 27 Sep 2019 at 17:40, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:

The poor man's solution for the problem would be stopping the fragmented node and removing its partition data, then starting it again, allowing a full state transfer - this time without deletes.
Rinse and repeat for all owners.

Anton Vinogradov, would this work for you as a workaround?

--
Best regards,
Alexei Scherbakov

On Thu, 19 Sep 2019 at 13:03, Anton Vinogradov <a...@apache.org> wrote:

Alexey,

Let's combine your and Ivan's proposals:

> vacuum command, which acquires exclusive table lock, so no concurrent activities on the table are possible.

and

> Could the problem be solved by stopping a node which needs to be defragmented, clearing persistence files and restarting the node? After rebalancing the node will receive all data back without fragmentation.

How about a special partition state, SHRINKING?
This state should mean that the partition is unavailable for reads and updates, but keeps its update counters and is not marked as lost, renting, or evicted.
In this state we are able to iterate over the partition and apply its entries to another file in a compact way.
Indices should be updated during the copy-on-shrink procedure or at shrink completion.
Once the shrunken file is ready, we replace the original partition file with it and mark the partition as MOVING, which starts the historical rebalance.
Shrinking should be performed during low-activity periods, but even if we find that activity was high and historical rebalance is not suitable, we may just remove the file and use regular rebalance to restore the partition (this will also lead to a shrink).

BTW, it seems we can implement partition shrink in a cheap way.
We may just use the rebalancing code to apply the fat partition's entries to the new file.
So, 3 stages here: local rebalance, indices update, and global historical rebalance.
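To make the proposed lifecycle explicit, here is a sketch of the state machine. Note that SHRINKING does not exist among Ignite's partition states today; every name below is illustrative only:

// Illustrative: the real GridDhtPartitionState enum has no SHRINKING state.
enum ProposedPartState { OWNING, SHRINKING, MOVING, RENTING, EVICTED, LOST }

final class ShrinkingPartition {
    private volatile ProposedPartState state = ProposedPartState.OWNING;
    private final long updateCounter; // preserved, so historical rebalance knows where to resume

    ShrinkingPartition(long updateCounter) { this.updateCounter = updateCounter; }

    /** Per the proposal: no reads or updates while SHRINKING, yet never LOST/RENTING/EVICTED. */
    boolean available() { return state == ProposedPartState.OWNING; }

    void shrink(Runnable copyEntriesCompactly, Runnable updateIndices, Runnable swapFiles) {
        state = ProposedPartState.SHRINKING;
        copyEntriesCompactly.run(); // stage 1: "local rebalance" into a new compact file
        updateIndices.run();        // stage 2: during the copy or at shrink completion
        swapFiles.run();            // replace the fat partition file with the compact one
        state = ProposedPartState.MOVING; // stage 3: historical rebalance kicks in
    }
}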
On Thu, 19 Sep 2019 at 11:43, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

Anton,

> > The solution which Anton suggested does not look easy because it will most likely significantly hurt performance
>
> Mostly agree here, but what drop do we expect? What price are we ready to pay? Not sure, but it seems some vendors are ready to pay, for example, a 5% drop for this.

5% may be a big drop for some use cases, so I think we should look at how to improve performance, not how to make it worse.

> > it is hard to maintain a data structure to choose "a page from the free-list with enough space closest to the beginning of the file".
>
> We can just split each free-list bucket into a couple and use the first for pages in the first half of the file and the second for the last.
> Only two buckets are required here since, during the file shrink, the first bucket's window will shrink too.
> It seems this gives us the same price on put: just use the first bucket when it's not empty.
> The price of a remove (with merge) will be increased, of course.
>
> The compromise solution is to have a priority put (to the first part of the file), keep removal as is, and add schedulable per-page migration for the rest of the data during low-activity periods.
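For concreteness, the two-bucket split quoted above could look like the sketch below. Ignite's real free-list implementation is organized differently; the names and the midpoint heuristic are assumptions made for illustration:

import java.util.ArrayDeque;
import java.util.Deque;

final class TwoBucketFreeList {
    private final Deque<Long> lowerHalf = new ArrayDeque<>(); // free pages in the first half of the file
    private final Deque<Long> upperHalf = new ArrayDeque<>(); // free pages in the last half
    private volatile long fileMidpoint;

    TwoBucketFreeList(long fileSizeInPages) { fileMidpoint = fileSizeInPages / 2; }

    void onPageFreed(long pageIdx) {
        (pageIdx < fileMidpoint ? lowerHalf : upperHalf).push(pageIdx);
    }

    /** Same cost on put as a single bucket: prefer the lower half so data gravitates to the file start. */
    Long takeForPut() {
        Long page = lowerHalf.poll();
        return page != null ? page : upperHalf.poll();
    }

    /** After the file is truncated, the first bucket's window shrinks with it. */
    void onFileShrunk(long newSizeInPages) { fileMidpoint = newSizeInPages / 2; }
}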
Free lists are large and slow by themselves; it is expensive to checkpoint them and to read them on start, so as a long-term solution I would look into removing them. Moreover, I am not sure that adding yet another background process will improve the reliability and simplicity of the codebase.

If we want to go the hard path, I would look at a free-page-tracking bitmap: a special bitmask page where each page in an adjacent block is marked 0 (free) if it has more free space than a certain configurable threshold (say, 80%), and 1 (full) if less. Some vendors have successfully implemented this approach; it looks much more promising, but is harder to implement.

--AG
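For illustration, the free-page-tracking bitmap could look roughly like this (names, layout, and the threshold handling are assumptions, not Ignite's actual storage format). The payoff is that "a page with enough free space closest to the beginning of the file" becomes a single scan for the first clear bit:

import java.util.BitSet;

final class FreePageBitmap {
    private static final double FREE_THRESHOLD = 0.80; // from the "say, 80%" example above

    private final BitSet full; // one bit per data page in the adjacent block
    private final int pageSize;

    FreePageBitmap(int pagesInBlock, int pageSize) {
        this.full = new BitSet(pagesInBlock);
        this.pageSize = pageSize;
    }

    void onPageUsageChanged(int pageIdx, int usedBytes) {
        // Bit stays 0 (free) while free space exceeds the threshold, 1 (full) otherwise.
        boolean free = (pageSize - usedBytes) > pageSize * FREE_THRESHOLD;
        full.set(pageIdx, !free);
    }

    /** Page with enough free space closest to the beginning of the file. */
    int firstFreePage() {
        return full.nextClearBit(0);
    }
}

Unlike a free list, such a bitmap is tiny, cheap to checkpoint, and can be rebuilt by scanning the block, which is presumably why it is called out as the more promising (if harder) option.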