Alexey,

Sounds good to me.

On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <alexey.goncha...@gmail.com>
wrote:

> Anton,
>
> Switching a partition to and from the SHRINKING state will require
> intricate synchronization to properly determine the start position for
> historical rebalance without PME (partition map exchange).
>
> I would still go with an offline-node approach, but instead of cleaning the
> persistence, we can do effective defragmentation when the node is offline
> because we are sure that there is no concurrent load. After the
> defragmentation completes, we bring the node back to the cluster and
> historical rebalance will kick in automatically. It will still require
> manual node restarts, but since the data is not removed, there are no
> additional risks. Also, this will be an excellent solution for those who
> can afford downtime and execute the defragment command on all nodes in the
> cluster simultaneously - this will be the fastest way possible.
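>
> A minimal sketch of the intended per-node sequence, using hypothetical
> maintenance hooks (none of these names exist in the Ignite public API):
>
>     // Hypothetical helpers only illustrating the proposed flow. The node
>     // is fully offline while its stores are rewritten, so there is no
>     // concurrent load to synchronize with.
>     interface NodeMaintenance {
>         void stopNode(String consistentId);
>         void defragmentStores(String consistentId); // rewrite partition files compactly
>         void startNode(String consistentId);        // rejoin the cluster
>     }
>
>     final class RollingDefrag {
>         static void run(NodeMaintenance m, Iterable<String> nodeIds) {
>             for (String id : nodeIds) {
>                 m.stopNode(id);         // take the node offline
>                 m.defragmentStores(id); // data is kept, only compacted
>                 m.startNode(id);        // historical rebalance catches the node up
>             }
>         }
>     }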
>
> --AG
>
> On Mon, Sep 30, 2019 at 09:29, Anton Vinogradov <a...@apache.org>:
>
> > Alexei,
> > >> stopping the fragmented node and removing its partition data, then
> > >> starting it again
> >
> > That's exactly what we're doing to solve the fragmentation issue.
> > The problem here is that we have to perform N/B restart-rebalance
> > operations (N is the cluster size, B is the backup count), which takes a
> > lot of time and carries a risk of data loss.
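> >
> > To make the cost concrete, a small sketch of the restart schedule (my own
> > illustration; it assumes at most B nodes may be emptied per round so that
> > at least one copy of every partition always survives):
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     final class RestartSchedule {
> >         /** Groups N nodes into ceil(N/B) restart-rebalance rounds. */
> >         static List<List<Integer>> rounds(int nodes, int backups) {
> >             List<List<Integer>> res = new ArrayList<>();
> >             for (int i = 0; i < nodes; i += backups) {
> >                 List<Integer> round = new ArrayList<>();
> >                 for (int j = i; j < Math.min(i + backups, nodes); j++)
> >                     round.add(j);
> >                 res.add(round);
> >             }
> >             return res;
> >         }
> >
> >         public static void main(String[] args) {
> >             // 12 nodes, 2 backups -> 6 sequential rounds.
> >             System.out.println(rounds(12, 2).size());
> >         }
> >     }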
> >
> > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <
> > alexey.scherbak...@gmail.com> wrote:
> >
> > > Probably this should be exposed via a public API; it is effectively the
> > > same as manual rebalancing.
> > >
> > > On Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <
> > > alexey.scherbak...@gmail.com>:
> > >
> > > > The poor man's solution for the problem would be stopping the
> > > > fragmented node and removing its partition data, then starting it
> > > > again, allowing a full state transfer that is free of the deleted
> > > > entries.
> > > > Rinse and repeat for all owners.
> > > >
> > > > Anton Vinogradov, would this work for you as a workaround?
> > > >
> > > > On Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <a...@apache.org>:
> > > >
> > > >> Alexey,
> > > >>
> > > >> Let's combine your and Ivan's proposals.
> > > >>
> > > >> >> vacuum command, which acquires an exclusive table lock, so no
> > > >> >> concurrent activities on the table are possible.
> > > >> and
> > > >> >> Could the problem be solved by stopping a node which needs to be
> > > >> >> defragmented, clearing persistence files and restarting the node?
> > > >> >> After rebalancing the node will receive all data back without
> > > >> >> fragmentation.
> > > >>
> > > >> How about having a special partition state, SHRINKING?
> > > >> This state would mean that the partition is unavailable for reads and
> > > >> updates, but it keeps its update counters and is not marked as lost,
> > > >> renting, or evicted.
> > > >> In this state we are able to iterate over the partition and apply its
> > > >> entries to another file in a compact way.
> > > >> Indexes should be updated during the copy-on-shrink procedure or upon
> > > >> shrink completion.
> > > >> Once the shrunk file is ready, we replace the original partition file
> > > >> with it and mark the partition as MOVING, which starts the historical
> > > >> rebalance.
> > > >> Shrinking should be performed during low-activity periods, but even if
> > > >> we find that activity was high and historical rebalance is not
> > > >> suitable, we may simply remove the file and use regular rebalance to
> > > >> restore the partition (which also results in a shrink).
> > > >>
> > > >> BTW, it seems we can implement partition shrink in a cheap way:
> > > >> we may just reuse the rebalancing code to apply the fat partition's
> > > >> entries to the new file.
> > > >> So, there are 3 stages here: local rebalance, index update, and global
> > > >> historical rebalance.
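> > > >>
> > > >> A rough sketch of the lifecycle I mean (hypothetical names; the real
> > > >> partition state machine has no SHRINKING member today):
> > > >>
> > > >>     // Sketch only, not existing Ignite code.
> > > >>     enum PartState { OWNING, SHRINKING, MOVING }
> > > >>
> > > >>     final class ShrinkFlow {
> > > >>         volatile PartState state = PartState.OWNING;
> > > >>
> > > >>         void shrink() {
> > > >>             // Reads/updates are rejected, update counters are kept.
> > > >>             state = PartState.SHRINKING;
> > > >>             // 1. Local "rebalance": copy live entries to a compact file.
> > > >>             // 2. Update indexes during the copy or on its completion.
> > > >>             // 3. Swap the compact file in place of the original one.
> > > >>             state = PartState.MOVING; // historical rebalance kicks in
> > > >>         }
> > > >>     }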
> > > >>
> > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
> > > >> alexey.goncha...@gmail.com> wrote:
> > > >>
> > > >> > Anton,
> > > >> >
> > > >> >
> > > >> > > >> The solution which Anton suggested does not look easy because
> > > >> > > >> it will most likely significantly hurt performance
> > > >> > > Mostly agree here, but what drop do we expect? What price are we
> > > >> > > ready to pay?
> > > >> > > Not sure, but it seems some vendors are ready to pay, for example,
> > > >> > > a 5% drop, for this.
> > > >> >
> > > >> > 5% may be a big drop for some use cases, so I think we should look
> > > >> > at how to improve performance, not how to make it worse.
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > >> it is hard to maintain a data structure to choose "page from
> > > >> > > >> free-list with enough space closest to the beginning of the
> > > >> > > >> file".
> > > >> > > We can just split each free-list bucket into a pair and use the
> > > >> > > first for pages in the first half of the file and the second for
> > > >> > > the rest.
> > > >> > > Only two buckets are required here since, during the file shrink,
> > > >> > > the first bucket's window will be shrunk too.
> > > >> > > It seems this gives us the same price on put: just use the first
> > > >> > > bucket when it's not empty.
> > > >> > > The remove price (with merge) will be increased, of course.
> > > >> > >
> > > >> > > The compromise solution is to have a priority put (to the first
> > > >> > > part of the file), keep removal as is, and schedule per-page
> > > >> > > migration of the rest of the data during low-activity periods.
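> > > >> > >
> > > >> > > Roughly something like this (just a sketch of the two-bucket idea,
> > > >> > > with page offsets as identifiers):
> > > >> > >
> > > >> > >     import java.util.ArrayDeque;
> > > >> > >     import java.util.Deque;
> > > >> > >
> > > >> > >     /** One free-list size bucket split in two by file position. */
> > > >> > >     final class SplitBucket {
> > > >> > >         private final Deque<Long> firstHalf = new ArrayDeque<>();
> > > >> > >         private final Deque<Long> secondHalf = new ArrayDeque<>();
> > > >> > >         private final long midpoint; // shrinks along with the file
> > > >> > >
> > > >> > >         SplitBucket(long fileSize) { midpoint = fileSize / 2; }
> > > >> > >
> > > >> > >         void release(long pageOff) { // page gained free space
> > > >> > >             (pageOff < midpoint ? firstHalf : secondHalf).push(pageOff);
> > > >> > >         }
> > > >> > >
> > > >> > >         /** Same price on put: prefer the first bucket if non-empty. */
> > > >> > >         Long takeForPut() {
> > > >> > >             if (!firstHalf.isEmpty()) return firstHalf.pop();
> > > >> > >             return secondHalf.isEmpty() ? null : secondHalf.pop();
> > > >> > >         }
> > > >> > >     }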
> > > >> > >
> > > >> > Free lists are large and slow by themselves, and they are expensive
> > > >> > to checkpoint and read on start, so as a long-term solution I would
> > > >> > look into removing them. Moreover, I am not sure that adding yet
> > > >> > another background process will improve the codebase's reliability
> > > >> > and simplicity.
> > > >> >
> > > >> > If we want to go the hard path, I would look at a free-page-tracking
> > > >> > bitmap: a special bitmask page where each page in an adjacent block
> > > >> > is marked as 0 (free) if it has more free space than a certain
> > > >> > configurable threshold (say, 80%), and as 1 (full) if it has less.
> > > >> > Some vendors have successfully implemented this approach; it looks
> > > >> > much more promising, but is harder to implement.
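> > > >> >
> > > >> > For illustration, a tiny approximation of such a tracking page with
> > > >> > a plain bitset (my own sketch, not any vendor's actual layout):
> > > >> >
> > > >> >     import java.util.BitSet;
> > > >> >
> > > >> >     /** One tracking page covering a block of adjacent data pages. */
> > > >> >     final class FreePageBitmap {
> > > >> >         private final BitSet full;      // bit = 1 -> page is "full"
> > > >> >         private final int pages;        // pages covered by this block
> > > >> >         private final double threshold; // e.g. 0.8 -> free above 80%
> > > >> >
> > > >> >         FreePageBitmap(int pages, double threshold) {
> > > >> >             this.full = new BitSet(pages);
> > > >> >             this.pages = pages;
> > > >> >             this.threshold = threshold;
> > > >> >         }
> > > >> >
> > > >> >         void onFreeSpaceChanged(int pageIdx, double freeFraction) {
> > > >> >             full.set(pageIdx, freeFraction < threshold); // 1 = full
> > > >> >         }
> > > >> >
> > > >> >         /** First page with enough free space, or -1 if none. */
> > > >> >         int firstFreePage() {
> > > >> >             int idx = full.nextClearBit(0);
> > > >> >             return idx < pages ? idx : -1;
> > > >> >         }
> > > >> >     }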
> > > >> >
> > > >> > --AG
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > Best regards,
> > > > Alexei Scherbakov
> > > >
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
> > >
> >
>
