Alexey, I'm ok with the suggested way in [1]
1. https://issues.apache.org/jira/browse/IGNITE-12263

On Tue, Oct 8, 2019 at 9:59 PM Denis Magda <[email protected]> wrote:

Anton,

Seems like we have a name for the defragmentation mode with a downtime - Rolling Defrag )

-
Denis

On Mon, Oct 7, 2019 at 11:04 PM Anton Vinogradov <[email protected]> wrote:

Denis,

I like the idea that defragmentation is just an additional step on a node (re)start, like the PDS recovery we perform now.
We may just use a special key to specify that a node should defragment its persistence on (re)start.
Defragmentation can be part of Rolling Upgrade in this case :)
It seems to be not a problem to restart nodes one by one; this will "eat" only one backup guarantee.
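To make the "special key on (re)start" idea concrete, here is a minimal sketch in Java. The flag name, the defragmentAll() placeholder and the work-directory layout are assumptions for illustration, not an existing Ignite API:

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.ignite.Ignition;

// Hypothetical sketch: defragment persistence while the node is offline, before it joins the topology.
public class DefragmentOnStartup {
    public static void main(String[] args) throws Exception {
        // e.g. the node is restarted with -DIGNITE_DEFRAGMENT_ON_START=true during a rolling restart
        if (Boolean.getBoolean("IGNITE_DEFRAGMENT_ON_START"))
            defragmentAll(Paths.get(System.getProperty("IGNITE_HOME", "."), "work", "db"));

        // Normal start; historical rebalance catches the node up with the updates it missed.
        Ignition.start(args.length > 0 ? args[0] : null);
    }

    /** Placeholder: rewrite partition files compactly; no concurrent load exists at this point. */
    private static void defragmentAll(Path dbDir) {
        // Actual offline defragmentation would go here.
    }
}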
On Mon, Oct 7, 2019 at 8:28 PM Denis Magda <[email protected]> wrote:

Alex, thanks for the summary and proposal. Anton, Ivan and others who took part in this discussion, what are your thoughts? I see this rolling-upgrades-based approach as a reasonable solution. Even though a node shutdown is expected, the procedure doesn't lead to a cluster outage, meaning it can be utilized for 24x7 production environments.

-
Denis

On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk <[email protected]> wrote:

Created a ticket for the first stage of this improvement. This can be a first change towards the online mode suggested by Sergey and Anton.
https://issues.apache.org/jira/browse/IGNITE-12263

On Fri, Oct 4, 2019 at 19:38, Alexey Goncharuk <[email protected]> wrote:

Maxim,

Having a cluster-wide lock for a cache does not improve availability of the solution. A user cannot defragment a cache if the cache is involved in a mission-critical operation, so having a lock on such a cache is equivalent to a whole cluster shutdown.

We should decide between either a single offline node or a more complex, fully online solution.

On Fri, Oct 4, 2019 at 11:55, Maxim Muzafarov <[email protected]> wrote:

Igniters,

This thread seems to be endless, but what if some kind of cache group distributed write lock (exclusive for some of the internal Ignite processes) is introduced? I think it will help to solve a batch of problems, like:

1. defragmentation of all cache group partitions on the local node without concurrent updates;
2. improving data loading with the data streamer isolation mode [1] - it seems we should not allow concurrent updates to a cache while we are on the `fast data load` step;
3. recovery from a snapshot without cache stop/start actions.

[1] https://issues.apache.org/jira/browse/IGNITE-11793
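For illustration only: the proposal above is about an internal, exclusive per-cache-group lock that does not exist today. The closest public analogy is the existing distributed IgniteLock, roughly like this (the lock name and the exclusive section's contents are hypothetical):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteLock;
import org.apache.ignite.Ignition;

// Rough public-API analogy of a cluster-wide exclusive lock around an internal process.
public class CacheGroupLockAnalogy {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // failoverSafe = true: the lock is released if the owning node fails.
        IgniteLock lock = ignite.reentrantLock("defrag-group-myCacheGroup", true, true, true);

        lock.lock();
        try {
            // Exclusive section: no other holder of this lock runs concurrently.
            // The hypothetical defragmentation / fast-load / snapshot-recovery step
            // for the cache group would be executed here.
        }
        finally {
            lock.unlock();
        }
    }
}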
On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <[email protected]> wrote:

Hi,

I'm not sure that taking a node offline is the best way to do that.
Cons:
- different caches may have different fragmentation, but we force the whole node to stop
- an offline node is a maintenance operation that will require adding +1 backup to reduce the risk of data loss
- baseline auto adjustment?
- impact on index rebuild?
- cache configuration changes (or destroy) during node offline

What about other ways without a node stop? E.g. take a cache group on a node offline? Add a *defrag <cache_group>* command to control.sh to force-start rebalance internally in the node, with an expected impact on performance.

--
Sergey Kozlov
GridGain Systems
www.gridgain.com

On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <[email protected]> wrote:

Alexey,
As for me, it does not matter whether it will be an IEP, an umbrella ticket or a single issue.
The most important thing is the Assignee :)

On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <[email protected]> wrote:

Anton, do you think we should file a single ticket for this or should we go with an IEP? As of now, the change does not look big enough for an IEP to me.

On Thu, Oct 3, 2019 at 11:18, Anton Vinogradov <[email protected]> wrote:

Alexey,

Sounds good to me.

On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <[email protected]> wrote:

Anton,

Switching a partition to and from the SHRINKING state will require intricate synchronization in order to properly determine the start position for historical rebalance without PME.

I would still go with an offline-node approach, but instead of cleaning the persistence, we can do effective defragmentation while the node is offline because we are sure that there is no concurrent load. After the defragmentation completes, we bring the node back to the cluster and historical rebalance will kick in automatically. It will still require manual node restarts, but since the data is not removed, there are no additional risks. Also, this will be an excellent solution for those who can afford downtime and execute the defragment command on all nodes in the cluster simultaneously - this will be the fastest way possible.

--AG

On Mon, Sep 30, 2019 at 09:29, Anton Vinogradov <[email protected]> wrote:

Alexei,
>> stopping fragmented node and removing partition data, then starting it again

That's exactly what we're doing to solve the fragmentation issue.
The problem here is that we have to perform N/B restart-rebalance operations (N - cluster size, B - backups count), and it takes a lot of time, with a risk of losing data.

On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <[email protected]> wrote:

Probably this should be allowed via the public API; actually, this is the same as manual rebalancing.

--
Best regards,
Alexei Scherbakov

On Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <[email protected]> wrote:

The poor man's solution for the problem would be stopping the fragmented node and removing its partition data, then starting it again, allowing a full state transfer already without deletes.
Rinse and repeat for all owners.

Anton Vinogradov, would this work for you as a workaround?

--
Best regards,
Alexei Scherbakov

On Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <[email protected]> wrote:

Alexey,

Let's combine your and Ivan's proposals.

>> vacuum command, which acquires exclusive table lock, so no concurrent activities on the table are possible.
and
>> Could the problem be solved by stopping a node which needs to be defragmented, clearing persistence files and restarting the node?
>> After rebalancing the node will receive all data back without fragmentation.

How about having a special partition state, SHRINKING?
This state would mean that the partition is unavailable for reads and updates, but it keeps its update counters and is not marked as lost, renting or evicted.
In this state we are able to iterate over the partition and apply its entries to another file in a compact way.
Indices should be updated during the copy-on-shrink procedure or at shrink completion.
Once the shrunk file is ready, we should replace the original partition file with it and mark the partition as MOVING, which will start the historical rebalance.
Shrinking should be performed during low-activity periods, but even if we find that activity was high and historical rebalance is not suitable, we may just remove the file and use regular rebalance to restore the partition (this will also lead to a shrink).

BTW, it seems we are able to implement partition shrink in a cheap way.
We may just use the rebalancing code to apply the fat partition's entries to the new file.
So, 3 stages here: local rebalance, indices update and global historical rebalance.
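A minimal sketch of the copy-on-shrink flow described above. All types here (PartitionStore, Entry, PartitionState) are placeholders, not the actual Ignite storage interfaces:

import java.util.Iterator;

// Hypothetical sketch of the SHRINKING -> MOVING flow.
public class PartitionShrinker {

    enum PartitionState { OWNING, SHRINKING, MOVING }

    interface Entry { /* key, value, version, update counter, ... */ }

    interface PartitionStore {
        void state(PartitionState newState);
        Iterator<Entry> entries();                    // iterate the fat (fragmented) file
        void append(Entry e);                         // write compactly into a new file
        void replaceFileWith(PartitionStore compact); // swap partition files
    }

    /** Copy-on-shrink: rewrite the partition into a compact file, then let historical rebalance catch up. */
    static void shrink(PartitionStore fat, PartitionStore compact) {
        // 1. Freeze the partition: no reads/updates, but keep update counters.
        fat.state(PartitionState.SHRINKING);

        // 2. Local "rebalance": apply all entries to the new file in a compact way.
        for (Iterator<Entry> it = fat.entries(); it.hasNext(); )
            compact.append(it.next());

        // 3. (Indices would be updated here, or incrementally during the copy.)

        // 4. Swap files and mark the partition MOVING so historical rebalance
        //    brings in the updates missed while shrinking.
        fat.replaceFileWith(compact);
        fat.state(PartitionState.MOVING);
    }
}

If the update rate during the copy turns out to be too high for historical rebalance, the compact file can simply be dropped and a regular rebalance used instead, as described above.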
On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <[email protected]> wrote:

Anton,

>> The solution which Anton suggested does not look easy because it will most likely significantly hurt performance
> Mostly agree here, but what drop do we expect? What price are we ready to pay?
> Not sure, but it seems some vendors are ready to pay, for example, a 5% drop for this.

5% may be a big drop for some use-cases, so I think we should look at how to improve performance, not how to make it worse.

>> it is hard to maintain a data structure to choose "page from free-list with enough space closest to the beginning of the file".
> We can just split each free-list bucket into a pair and use the first one for pages in the first half of the file and the second for the last.
> Only two buckets are required here since, during the file shrink, the first bucket's window will shrink too.
> It seems this gives us the same price on put: just use the first bucket if it's not empty.
> The remove price (with merge) will increase, of course.
> The compromise solution is to have a priority put (to the first part of the file), keep removal as is, and add a schedulable per-page migration for the rest of the data during low-activity periods.

Free lists are large and slow by themselves; they are expensive to checkpoint and to read on start, so as a long-term solution I would look into removing them. Moreover, I am not sure that adding yet another background process will improve the codebase's reliability and simplicity.

If we want to go the hard path, I would look at a free-page-tracking bitmap - a special bitmask page where each page in an adjacent block is marked as 0 if it has free space above a certain configurable threshold (say, 80%) - free, and 1 if less (full). Some vendors have successfully implemented this approach, which looks much more promising, but is harder to implement.

--AG
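A minimal sketch of the two-bucket free-list split quoted above; the bucket layout and names are hypothetical, not the actual Ignite free-list structures:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: one size-class bucket split into a "first half of the file" part
// and a "second half" part, preferring the first half on put so that new data
// gravitates towards the beginning of the partition file.
public class SplitFreeListBucket {
    private final Deque<Long> firstHalf = new ArrayDeque<>();  // page ids below the midpoint
    private final Deque<Long> secondHalf = new ArrayDeque<>(); // page ids above the midpoint

    private final long midPageId; // e.g. currentFileSizeInPages / 2; shrinks together with the file

    public SplitFreeListBucket(long midPageId) {
        this.midPageId = midPageId;
    }

    /** Register a page that has enough free space for this bucket's size class. */
    public void put(long pageId) {
        (pageId < midPageId ? firstHalf : secondHalf).push(pageId);
    }

    /** Take a page for a new entry, preferring pages closer to the file start. */
    public Long take() {
        if (!firstHalf.isEmpty())
            return firstHalf.pop();
        return secondHalf.isEmpty() ? null : secondHalf.pop();
    }
}

The put cost stays essentially the same (check one extra bucket), while removals that merge or migrate pages become somewhat more expensive, as noted in the quoted text.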
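And a minimal sketch of the free-page-tracking bitmap Alexey describes; the block size, threshold handling and names are assumptions:

import java.util.BitSet;

// Hypothetical sketch: one bit per page of an adjacent block;
// 0 = the page has more than the threshold of free space, 1 = "full".
public class FreePageBitmap {
    private static final double FREE_SPACE_THRESHOLD = 0.8; // configurable, e.g. 80%

    private final BitSet fullPages;
    private final int pagesInBlock;

    public FreePageBitmap(int pagesInBlock) {
        this.pagesInBlock = pagesInBlock;
        this.fullPages = new BitSet(pagesInBlock);
    }

    /** Update the bit for a page after its free space has changed. */
    public void onPageUpdated(int pageIdx, double freeSpaceFraction) {
        fullPages.set(pageIdx, freeSpaceFraction < FREE_SPACE_THRESHOLD);
    }

    /** @return index of the first page with enough free space, or -1 if the block is full. */
    public int firstFreePage() {
        int idx = fullPages.nextClearBit(0);
        return idx < pagesInBlock ? idx : -1;
    }
}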
--
Sergey Kozlov
GridGain Systems
www.gridgain.com
