Created a ticket for the first stage of this improvement. It can serve as a
first step towards the online mode suggested by Sergey and Anton.
https://issues.apache.org/jira/browse/IGNITE-12263

On Fri, 4 Oct 2019 at 19:38, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> Maxim,
>
> Having a cluster-wide lock for a cache does not improve the availability of
> the solution. A user cannot defragment a cache if the cache is involved in
> a mission-critical operation, so taking a lock on such a cache is
> equivalent to shutting down the whole cluster.
>
> We should decide between a single offline node and a more complex,
> fully online solution.
>
> On Fri, 4 Oct 2019 at 11:55, Maxim Muzafarov <mmu...@apache.org> wrote:
>
>> Igniters,
>>
>> This thread seems to be endless, but what if we introduce some kind of
>> cache-group distributed write lock (exclusive, for some of the internal
>> Ignite processes)? I think it would help solve a number of problems, such
>> as the following (a rough sketch of how such a lock could be approximated
>> with the existing public API is shown after the reference below):
>>
>> 1. defragmentation of all cache group partitions on the local node
>> without concurrent updates.
>> 2. improved data loading with the data streamer isolation mode [1]: it
>> seems we should not allow concurrent cache updates during the `fast data
>> load` step.
>> 3. recovery from a snapshot without cache stop/start actions.
>>
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-11793
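
A minimal sketch of how cooperating internal processes could serialize on a cache group today, using only the existing public IgniteLock API (the lock name and cache group name below are illustrative assumptions). The proposed cache-group write lock would additionally have to reject regular cache updates inside the kernel, which a plain IgniteLock does not do:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteLock;
import org.apache.ignite.Ignition;

public class CacheGroupMaintenanceLockSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // One well-known lock name per cache group ("myCacheGroup" is an assumption).
            IgniteLock groupLock = ignite.reentrantLock(
                "maintenance-lock-myCacheGroup",
                true,   // failoverSafe: released automatically if the owner node fails
                false,  // fair
                true);  // create the lock if it does not exist yet

            groupLock.lock();
            try {
                // Run the maintenance step (defragmentation, fast data load,
                // snapshot recovery) while other cooperating processes wait here.
            }
            finally {
                groupLock.unlock();
            }
        }
    }
}
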
>>
>> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skoz...@gridgain.com> wrote:
>> >
>> > Hi
>> >
>> > I'm not sure that taking a node offline is the best way to do that.
>> > Cons:
>> >  - different caches may need defragmentation at different times, but we
>> > are forced to stop the whole node
>> >  - taking a node offline is a maintenance operation that will require
>> > adding +1 backup to reduce the risk of data loss
>> >  - what about baseline auto-adjustment?
>> >  - impact on index rebuild?
>> >  - cache configuration changes (or cache destroy) while the node is offline
>> >
>> > What about other ways that avoid a node stop? E.g. take a cache group on
>> > a node offline? Add a *defrag <cache_group>* command to control.sh that
>> > forces an internal rebalance on the node, with an expected impact on
>> > performance.
>> >
>> >
>> >
>> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <a...@apache.org> wrote:
>> >
>> > > Alexey,
>> > > As for me, it does not matter whether it will be an IEP, an umbrella
>> > > ticket, or a single issue.
>> > > The most important thing is the Assignee :)
>> > >
>> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <
>> > > alexey.goncha...@gmail.com>
>> > > wrote:
>> > >
>> > > > Anton, do you think we should file a single ticket for this or should
>> > > > we go with an IEP? As of now, the change does not look big enough to
>> > > > me to warrant an IEP.
>> > > >
>> > > > On Thu, 3 Oct 2019 at 11:18, Anton Vinogradov <a...@apache.org> wrote:
>> > > >
>> > > > > Alexey,
>> > > > >
>> > > > > Sounds good to me.
>> > > > >
>> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <
>> > > > > alexey.goncha...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Anton,
>> > > > > >
>> > > > > > Switching a partition to and from the SHRINKING state will require
>> > > > > > intricate synchronization in order to properly determine the start
>> > > > > > position for historical rebalance without PME.
>> > > > > >
>> > > > > > I would still go with an offline-node approach, but instead of
>> > > cleaning
>> > > > > the
>> > > > > > persistence, we can do effective defragmentation when the node
>> is
>> > > > offline
>> > > > > > because we are sure that there is no concurrent load. After the
>> > > > > > defragmentation completes, we bring the node back to the
>> cluster and
>> > > > > > historical rebalance will kick in automatically. It will still
>> > > require
>> > > > > > manual node restarts, but since the data is not removed, there
>> are no
>> > > > > > additional risks. Also, this will be an excellent solution for
>> those
>> > > > who
>> > > > > > can afford downtime and execute the defragment command on all
>> nodes
>> > > in
>> > > > > the
>> > > > > > cluster simultaneously - this will be the fastest way possible.
>> > > > > >
>> > > > > > --AG
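
A rough, self-contained illustration of the offline compaction idea only: copy the live records of a fragmented file into a new compact file while the node is down, then swap the files before the node rejoins the cluster. The flat record layout below (length, deleted flag, payload) is invented for the example; real Ignite partition files are page-based, so the actual implementation would work on pages and B+ tree indexes:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class OfflineDefragSketch {
    /** Copies live records of a fragmented file into a compact copy and swaps the files. */
    public static void defragment(Path partitionFile) throws IOException {
        Path compact = partitionFile.resolveSibling(partitionFile.getFileName() + ".defrag");

        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(Files.newInputStream(partitionFile)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(compact)))) {
            // Hypothetical record format: 4-byte payload length, 1-byte "deleted" flag, payload.
            while (in.available() > 0) {
                int len = in.readInt();
                boolean deleted = in.readBoolean();
                byte[] payload = new byte[len];
                in.readFully(payload);

                if (!deleted) {             // Keep only live records; the gaps disappear.
                    out.writeInt(len);
                    out.writeBoolean(false);
                    out.write(payload);
                }
            }
        }

        // Replace the fragmented file with the compact copy before the node restarts.
        Files.move(compact, partitionFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
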
>> > > > > >
>> > > > > > On Mon, 30 Sep 2019 at 09:29, Anton Vinogradov <a...@apache.org> wrote:
>> > > > > >
>> > > > > > > Alexei,
>> > > > > > > >> stopping fragmented node and removing partition data, then
>> > > > starting
>> > > > > it
>> > > > > > > again
>> > > > > > >
>> > > > > > > That's exactly what we're doing to solve the fragmentation issue.
>> > > > > > > The problem here is that we have to perform N/B restart-rebalance
>> > > > > > > operations (N - cluster size, B - backup count), and it takes a
>> > > > > > > lot of time, with a risk of losing data.
>> > > > > > >
>> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <
>> > > > > > > alexey.scherbak...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Probably this should be exposed through a public API; actually,
>> > > > > > > > this is the same as manual rebalancing.
>> > > > > > > >
>> > > > > > > > On Fri, 27 Sep 2019 at 17:40, Alexei Scherbakov <
>> > > > > > > > alexey.scherbak...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > The poor man's solution for the problem would be stopping the
>> > > > > > > > > fragmented node and removing its partition data, then starting
>> > > > > > > > > it again, allowing a full state transfer without the deleted
>> > > > > > > > > entries.
>> > > > > > > > > Rinse and repeat for all owners.
>> > > > > > > > >
>> > > > > > > > > Anton Vinogradov, would this work for you as a workaround?
>> > > > > > > > >
>> > > > > > > > > On Thu, 19 Sep 2019 at 13:03, Anton Vinogradov <a...@apache.org> wrote:
>> > > > > > > > >
>> > > > > > > > >> Alexey,
>> > > > > > > > >>
>> > > > > > > > >> Let's combine your and Ivan's proposals.
>> > > > > > > > >>
>> > > > > > > > >> >> vacuum command, which acquires exclusive table lock,
>> so no
>> > > > > > > concurrent
>> > > > > > > > >> activities on the table are possible.
>> > > > > > > > >> and
>> > > > > > > > >> >> Could the problem be solved by stopping a node which
>> needs
>> > > to
>> > > > > be
>> > > > > > > > >> defragmented, clearing persistence files and restarting
>> the
>> > > > node?
>> > > > > > > > >> >> After rebalancing the node will receive all data back
>> > > without
>> > > > > > > > >> fragmentation.
>> > > > > > > > >>
>> > > > > > > > >> How about having a special partition state, SHRINKING?
>> > > > > > > > >> This state would mean that the partition is unavailable for
>> > > > > > > > >> reads and updates, but it keeps its update counters and is
>> > > > > > > > >> not marked as lost, renting, or evicted.
>> > > > > > > > >> In this state we are able to iterate over the partition and
>> > > > > > > > >> apply its entries to another file in a compact way.
>> > > > > > > > >> Indices should be updated during the copy-on-shrink procedure
>> > > > > > > > >> or at shrink completion.
>> > > > > > > > >> Once the shrunk file is ready, we replace the original
>> > > > > > > > >> partition file with it and mark the partition as MOVING,
>> > > > > > > > >> which will start the historical rebalance.
>> > > > > > > > >> Shrinking should be performed during low-activity periods,
>> > > > > > > > >> but even if we find that activity was high and historical
>> > > > > > > > >> rebalance is not suitable, we can just remove the file and
>> > > > > > > > >> use regular rebalance to restore the partition (this will
>> > > > > > > > >> also result in a shrink).
>> > > > > > > > >>
>> > > > > > > > >> BTW, it seems we are able to implement partition shrink
>> > > > > > > > >> cheaply.
>> > > > > > > > >> We can just use the rebalancing code to apply the fat
>> > > > > > > > >> partition's entries to the new file.
>> > > > > > > > >> So, 3 stages here: local rebalance, index update, and global
>> > > > > > > > >> historical rebalance.
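
A hypothetical sketch of the proposed partition state machine. The state names mirror the discussion (OWNING, SHRINKING, MOVING), but this is not the actual GridDhtPartitionState enum or partition code from Ignite internals:

public class ShrinkingPartitionSketch {
    /** Hypothetical subset of partition states; not the real internal enum. */
    enum State { OWNING, SHRINKING, MOVING }

    private volatile State state = State.OWNING;

    /** Preserved across the shrink so historical rebalance can resume from it. */
    private long updateCounter;

    /** Reads and updates are only served while the partition is OWNING. */
    boolean readsAndWritesAllowed() {
        return state == State.OWNING;
    }

    /** Stage 1: local "rebalance" of live entries into a compact partition file. */
    void shrink() {
        state = State.SHRINKING;
        // ... iterate the fat partition file and append its entries to a new
        // compact file; update indices during the copy or once it completes ...

        // Stages 2 and 3: swap the files, switch to MOVING and let historical
        // rebalance catch the partition up starting from updateCounter.
        state = State.MOVING;
    }

    /** Counter the MOVING phase would hand to historical rebalance. */
    long rebalanceFrom() {
        return updateCounter;
    }
}

Keeping the update counter untouched while the partition is SHRINKING is what would let the final MOVING phase use historical rebalance instead of a full state transfer.
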
>> > > > > > > > >>
>> > > > > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
>> > > > > > > > >> alexey.goncha...@gmail.com> wrote:
>> > > > > > > > >>
>> > > > > > > > >> > Anton,
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > > >>  The solution which Anton suggested does not look
>> easy
>> > > > > > because
>> > > > > > > it
>> > > > > > > > >> will
>> > > > > > > > >> > > most likely significantly hurt performance
>> > > > > > > > >> > > Mostly agree here, but what drop do we expect? What
>> > > > > > > > >> > > price are we ready to pay?
>> > > > > > > > >> > > Not sure, but it seems some vendors are ready to pay,
>> > > > > > > > >> > > for example, a 5% drop for this.
>> > > > > > > > >> >
>> > > > > > > > >> > 5% may be a big drop for some use-cases, so I think we
>> > > should
>> > > > > look
>> > > > > > > at
>> > > > > > > > >> how
>> > > > > > > > >> > to improve performance, not how to make it worse.
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > >
>> > > > > > > > >> > > >> it is hard to maintain a data structure to choose
>> "page
>> > > > > from
>> > > > > > > > >> free-list
>> > > > > > > > >> > > with enough space closest to the beginning of the
>> file".
>> > > > > > > > >> > > We can just split each free-list bucket into a pair and
>> > > > > > > > >> > > use the first one for pages in the first half of the
>> > > > > > > > >> > > file and the second one for the rest.
>> > > > > > > > >> > > Only two buckets are required here since, during the
>> > > > > > > > >> > > file shrink, the first bucket's window will be shrunk too.
>> > > > > > > > >> > > It seems this gives us the same price on put: just use
>> > > > > > > > >> > > the first bucket when it's not empty.
>> > > > > > > > >> > > The remove price (with merge) will increase, of course.
>> > > > > > > > >> > >
>> > > > > > > > >> > > The compromise solution is to have a priority put (to
>> > > > > > > > >> > > the first part of the file), keep removal as is, and
>> > > > > > > > >> > > schedule per-page migration for the rest of the data
>> > > > > > > > >> > > during low-activity periods.
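
A toy sketch of the two-bucket idea under the stated assumptions: keep separate reuse queues for the two halves of the file and satisfy puts from the low half first, so new data gravitates to the head of the file and the tail can eventually be truncated. Real Ignite free lists are page lists keyed by the amount of free space, not simple in-memory deques:

import java.util.ArrayDeque;
import java.util.Deque;

public class TwoBucketFreeListSketch {
    private final long pagesInFile;

    private final Deque<Long> lowBucket = new ArrayDeque<>();   // reusable pages in the first half
    private final Deque<Long> highBucket = new ArrayDeque<>();  // reusable pages in the second half

    public TwoBucketFreeListSketch(long pagesInFile) {
        this.pagesInFile = pagesInFile;
    }

    /** Called when a page gains enough free space to be reused. */
    public void onPageFreed(long pageIdx) {
        (pageIdx < pagesInFile / 2 ? lowBucket : highBucket).push(pageIdx);
    }

    /** Page to place new data on: prefer the first half of the file. */
    public Long takePageForPut() {
        if (!lowBucket.isEmpty())
            return lowBucket.poll();
        if (!highBucket.isEmpty())
            return highBucket.poll();
        return null; // No reusable page: the caller allocates a new page at the end of the file.
    }
}
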
>> > > > > > > > >> > >
>> > > > > > > > >> > Free lists are large and slow by themselves, and it is
>> > > > > > > > >> > expensive to checkpoint them and read them on start, so as
>> > > > > > > > >> > a long-term solution I would look into removing them.
>> > > > > > > > >> > Moreover, I'm not sure that adding yet another background
>> > > > > > > > >> > process will improve the codebase's reliability and
>> > > > > > > > >> > simplicity.
>> > > > > > > > >> >
>> > > > > > > > >> > If we want to go the hard path, I would look at a
>> > > > > > > > >> > free-page tracking bitmap: a special bitmask page where
>> > > > > > > > >> > each page in an adjacent block is marked as 0 (free) if it
>> > > > > > > > >> > has more free space than a certain configurable threshold
>> > > > > > > > >> > (say, 80%), and as 1 (full) if it has less. Some vendors
>> > > > > > > > >> > have successfully implemented this approach; it looks much
>> > > > > > > > >> > more promising, but is harder to implement.
>> > > > > > > > >> >
>> > > > > > > > >> > --AG
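
A minimal sketch of the free-page tracking bitmap, assuming one bitmask page covers an adjacent block of data pages; the 80% threshold comes from the message above, everything else is illustrative:

import java.util.BitSet;

public class FreePageBitmapSketch {
    /** Threshold from the discussion: a page stays "free" (bit = 0) while more
     * than this share of it is still free, and becomes "full" (bit = 1) otherwise. */
    private static final double FREE_SPACE_THRESHOLD = 0.8;

    private final int pageSize;
    private final int pagesInBlock;
    private final BitSet fullPages;   // set bit means the page is full

    public FreePageBitmapSketch(int pagesInBlock, int pageSize) {
        this.pageSize = pageSize;
        this.pagesInBlock = pagesInBlock;
        this.fullPages = new BitSet(pagesInBlock);
    }

    /** Re-evaluates the page's bit after its free space changes. */
    public void onPageUpdated(int pageIdx, int freeBytes) {
        fullPages.set(pageIdx, freeBytes <= pageSize * FREE_SPACE_THRESHOLD);
    }

    /** Index of the first page in the block that still has room, or -1 if all are full. */
    public int firstFreePage() {
        int idx = fullPages.nextClearBit(0);
        return idx < pagesInBlock ? idx : -1;
    }
}
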
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > > Alexei Scherbakov
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > >
>> > > > > > > > Best regards,
>> > > > > > > > Alexei Scherbakov
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > Sergey Kozlov
>> > GridGain Systems
>> > www.gridgain.com
>>
>
