A poor man's solution to the problem would be to stop the fragmented node,
remove its partition data, and start it again, so that a full state
transfer brings the data back without the deleted entries.
Rinse and repeat for all owners.

Anton Vinogradov, would this work for you as a workaround?

Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <a...@apache.org>:

> Alexey,
>
> Let's combine your and Ivan's proposals.
>
> >> vacuum command, which acquires exclusive table lock, so no concurrent
> >> activities on the table are possible.
> and
> >> Could the problem be solved by stopping a node which needs to be
> >> defragmented, clearing persistence files and restarting the node?
> >> After rebalancing the node will receive all data back without
> >> fragmentation.
>
> How about having a special partition state, SHRINKING?
> This state should mean that the partition is unavailable for reads and
> updates, but keeps its update counters and is not marked as lost,
> renting, or evicted.
> In this state we are able to iterate over the partition and write its
> entries to another file in a compact way.
> Indices should be updated during the copy-on-shrink procedure or at
> shrink completion.
> Once the shrunk file is ready, we should replace the original partition
> file with it and mark the partition as MOVING, which will start
> historical rebalance.
> Shrinking should be performed during low-activity periods, but even if
> we find that activity was high and historical rebalance is not
> suitable, we may just remove the file and use regular rebalance to
> restore the partition (this will also lead to a shrink).
>
> BTW, it seems we are able to implement partition shrink in a cheap way.
> We may just use the rebalancing code to apply the fat partition's entries
> to the new file.
> So, there are 3 stages here: local rebalance, indices update, and global
> historical rebalance.
>
> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
> alexey.goncha...@gmail.com> wrote:
>
> > Anton,
> >
> >
> > > >>  The solution which Anton suggested does not look easy because it
> > > >>  will most likely significantly hurt performance
> > > Mostly agree here, but what drop do we expect? What price are we ready
> > > to pay?
> > > Not sure, but it seems some vendors are ready to pay, for example, a 5%
> > > drop for this.
> >
> > 5% may be a big drop for some use-cases, so I think we should look at how
> > to improve performance, not how to make it worse.
> >
> >
> > >
> > > >> it is hard to maintain a data structure to choose "page from
> > > >> free-list with enough space closest to the beginning of the file".
> > > We can just split each free-list bucket in two and use the first for
> > > pages in the first half of the file and the second for pages in the
> > > second half.
> > > Only two buckets are required here since, during the file shrink, the
> > > first bucket's window will shrink too.
> > > It seems this gives us the same price on put: just use the first bucket
> > > if it's not empty.
> > > The remove price (with merge) will increase, of course.
> > >
> > > The compromise solution is to have priority put (to the first part of
> > > the file), keeping removal as is, and schedulable per-page migration
> > > for the rest of the data during low-activity periods.
> > >
> > Free lists are large and slow by themselves, and it is expensive to
> > checkpoint them and read them on start, so as a long-term solution I
> > would look into removing them. Moreover, I am not sure that adding yet
> > another background process will improve the reliability and simplicity
> > of the codebase.
> >
> > If we want to go the hard path, I would look at a free page tracking
> > bitmap: a special bitmask page, where each page in an adjacent block is
> > marked as 0 (free) if it has more free space than a certain configurable
> > threshold (say, 80%), and as 1 (full) otherwise. Some vendors have
> > successfully implemented this approach, which looks much more promising,
> > but is harder to implement.
> >
> > --AG
> >
>
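
For illustration only, the free page tracking bitmap described in the quoted
text above could be sketched roughly as follows. This is not Ignite code: the
class name FreePageBitmap, its methods, and the in-memory BitSet are all
assumptions made for the example, and a real implementation would keep the
bitmask in a dedicated page rather than on the heap. It follows the convention
from the quote: bit 0 means the page has more free space than the threshold
(free), bit 1 means it does not (full).

```java
import java.util.BitSet;

/** Hypothetical sketch of a free page tracking bitmap; not an Ignite API. */
final class FreePageBitmap {
    /** Bit set (1) = page is full, bit clear (0) = page is free. */
    private final BitSet full;
    private final int pageCount;
    /** Fraction of free space above which a page counts as free, e.g. 0.8. */
    private final double threshold;

    FreePageBitmap(int pageCount, double threshold) {
        this.pageCount = pageCount;
        this.threshold = threshold;
        // All bits start at 0: freshly allocated pages are assumed empty.
        this.full = new BitSet(pageCount);
    }

    /** Records the current free-space fraction (0.0..1.0) of a page. */
    void update(int pageIdx, double freeFraction) {
        if (pageIdx < 0 || pageIdx >= pageCount)
            throw new IndexOutOfBoundsException("pageIdx=" + pageIdx);
        if (freeFraction > threshold)
            full.clear(pageIdx); // 0 = free
        else
            full.set(pageIdx);   // 1 = full
    }

    /** Returns the free page closest to the start of the block, or -1. */
    int firstFree() {
        int idx = full.nextClearBit(0);
        return idx < pageCount ? idx : -1;
    }
}
```

Because the lookup scans from bit 0, new data naturally lands as close to the
beginning of the file as possible, which is the property the free-list
discussion above was trying to obtain.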


-- 

Best regards,
Alexei Scherbakov
