Hi Julian,

On Thu, Aug 6, 2015 at 3:14 PM, Julian Sedding <jsedd...@gmail.com> wrote:

> Hi Alex
>
> See inline.
>
> On Wed, Aug 5, 2015 at 7:57 PM, Alex Parvulescu
> <alex.parvule...@gmail.com> wrote:
> > Hi,
> >
> > see inline
> >
> >> On Wed, Aug 5, 2015 at 5:45 PM, Julian Sedding <jsedd...@gmail.com> wrote:
> >
> >> Hi Alex
> >>
> >> Thanks for your comments.
> >>
> >> On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
> >> <alex.parvule...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Just a few clarifications on the error you see
> >> >
> >> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> >> >> the previous checkpoint as stored in /:async/async. Of course this
> >> >> checkpoint is not present in the copied NodeStore and thus cannot be
> >> >> retrieved.
> >> >
> >> > The error comes from DocumentMk trying to parse the reference
> >> > checkpoint value. Basically what fails here is 'Revision.fromString'
> >> > receiving a malformed checkpoint value, because it comes from the
> >> > SegmentMk. The quick fix is to manually remove the properties on the
> >> > "/:async" hidden node. This will indeed trigger a full reindex, but
> >> > will help you get over this issue.
> >>
> >> Agreed. In this case parsing the revision is the first thing that
> >> fails. When copying DNS to SNS (i.e. DocumentNodeStore to
> >> SegmentNodeStore) a similar situation would arise, because no snapshot
> >> with the provided ID exists.
> >>
> >>
> > [alex] Not really, as the SegmentMk will not fail (no
> > IllegalArgumentException), but simply log a warning that the checkpoint
> > doesn't exist and perform a full reindex. So in this regard it is a bit
> > more lenient :)
>
> Ok, I didn't know that SegmentMK is more lenient here. Should we make
> DocumentMK degrade gracefully as well? Currently the AsyncIndex does
> not recover by itself. It would be more robust if it did.
>

I think this falls under 'nice to have' rather than a strictly needed
change. We are dealing with a specific case here, and ideally the sidegrade
process would take care of removing the checkpoint reference (or setting it
to a new value, depending on availability).
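
For anyone hitting this, a minimal sketch of the manual cleanup via the
NodeStore API (the class name is just for illustration; removing the
reference makes the next AsyncIndexUpdate run fall back to a full reindex):

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
    import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    public class AsyncCheckpointCleanup {
        public static void clearAsyncCheckpointReference(NodeStore store)
                throws CommitFailedException {
            NodeBuilder builder = store.getRoot().builder();
            NodeBuilder async = builder.getChildNode(":async");
            if (async.exists()) {
                // "async" holds the checkpoint reference that DocumentMk
                // fails to parse after the copy
                async.removeProperty("async");
            }
            store.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY);
        }
    }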


>
> >
> >
> >
> >> >
> >> >> IMHO it would be desirable to (optionally) copy the checkpoints as
> >> >> well. In the case of AsyncIndexUpdate, having the checkpoint can save
> >> >> a full re-index.
> >> >
> >> > This is very tricky, as the 2 representations of checkpoints between
> >> > SegmentMk and DocumentMk are quite different. I would strongly suggest
> >> > going for the reindex, after all you'd only migrate once, so you can
> >> > prepare for this lengthy process.
> >>
> >> I'm experimenting with the following approach:
> >> * retrieve the first checkpoint and copy the NodeState tree at that
> >> revision (available via CheckpointMBean impls)
> >> * after copying the tree, merge and create a checkpoint (expiration
> >> time can be calculated)
> >> * rinse and repeat until the head revision is reached
> >>
> >> My aim is to reduce the critical path for migrating one NodeStore
> >> (incl. JR2) to another. Indexing (especially async indexing) is a
> >> big part of the time, so if I can move that out of the critical path,
> >> it can save a lot of downtime.
> >>
> >
> > [alex] Interesting approach. I would reduce this to only the 'current'
> > indexed checkpoint (the async reference). So you'd first migrate that
> > over as the head state and create a checkpoint based on it (let's call
> > it 'c0'). Then diff & apply the SegmentMk head state on top of this,
> > update the async property to point to c0, and you might be good.
>
> Absolutely, only copying the checkpoints that are actually needed makes
> sense.
>
> Thinking out loud: it may be faster to run the async-index in the copy
> process, based on the diff from the source NodeStore between the
> checkpoint and the head. That should be feasible, right?
>

It should be doable, yes. But if I understand correctly, you'd like to
reduce the overall duration, not increase it :) Running the async indexes
synchronously would only add more time to the migration.
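
(For completeness: if someone does want to trigger the lane once from
migration code, something like the sketch below should work. The class name
is made up, and the composite of editor providers is an assumption; adjust
it to whichever index types you actually keep.)

    import org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate;
    import org.apache.jackrabbit.oak.plugins.index.CompositeIndexEditorProvider;
    import org.apache.jackrabbit.oak.plugins.index.IndexEditorProvider;
    import org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorProvider;
    import org.apache.jackrabbit.oak.plugins.index.property.PropertyIndexEditorProvider;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    public class OneShotAsyncIndex {
        // runs the "async" indexing lane once, synchronously, blocking
        // until this pass completes
        public static void runOnce(NodeStore store) {
            IndexEditorProvider provider = new CompositeIndexEditorProvider(
                    new LuceneIndexEditorProvider(),
                    new PropertyIndexEditorProvider());
            new AsyncIndexUpdate("async", store, provider).run();
        }
    }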



>
> >
> >
> >
> >>
> >> My current approach for a migration from JR2 to MongoMK is to:
> >> * copy JR2 to TarMK (TarMK is a lot faster for creating indexes etc.
> >> than MongoMK)
> >> * repeat the JR2 to TarMK copy every week or every 24h using incremental
> >> copy. This saves on CommitHook execution time - in theory this can
> >> reduce the time for one run to a single full repository traversal.
> >> * finally, on the day when the systems should be switched over, run a
> >> last JR2 to TarMK and then a TarMK to MongoMK copy. This is the
> >> critical path.
> >>
> >
> > [alex] Always going through the SegmentMk seems a bit convoluted. Why not
> > do the migration once, then apply the diffs on top of MongoMk directly
> > (AFAIK we have support for incremental updates now)? Are the 24h diffs so
> > big that it makes it unusable/unacceptable to go to MongoMk directly?
> > (I'd like to see this backed by some numbers.)
>
> We definitely need numbers. I aim to do some experiments after my
> holidays and provide some numbers at the beginning of September.
>
> Incremental upgrades definitely yield a huge benefit on the critical
> path with SegmentMK; I don't know about DocumentMK yet.
>
> Regarding incremental upgrades, I already have some numbers. The scenario
> is the migration from JR2 (TarPM) to Oak TarMK, copying 2.6mio regular
> nodes and 5.7mio versions (versions are copied via a commit editor, see
> OAK-2776 "copy all referenced versions"). The first source repository
> is 23 days older than the second source repository, i.e. the delta is
> ~3 weeks of content editing of a live website.
>
> initial upgrade
> - copy time: ~6min (2.6mio regular nodes) + ~30min (5.7mio referenced
> versions)
> - index creation (synchronous indexes): ~2h 20min
> - total time: ~2h 57min
>
> incremental upgrade (with OAK-3163 applied)
> - copy/compare time: ~9min (2.6mio regular nodes) + ~6.5min (0.7mio
> new/modified referenced versions)
> - index creation (synchronous indexes): ~7min
> - total time: ~23min
>
> finally copy from TarMK to MongoMK
> - total: ~3h 7min
>


Thanks for sharing the numbers!

Can you also add some info about the async indexes? How long does it take
for them to finish?

One quick item to consider is reevaluating the indexes that you are
building and maintaining during the main and incremental upgrades. There's
no greater waste of resources during migration than building a few indexes
only to throw them away later when the repo starts up. Depending on what
tools you use, I strongly suggest removing unneeded indexes as early as
possible (even if you mark an index as disabled, the TarMk -> MongoMK copy
will still move the unneeded content over).
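
A rough sketch of dropping index definitions up front via the NodeStore API
(class and index names below are placeholders, not a recommendation of what
to remove):

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
    import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    public class IndexCleanup {
        public static void removeIndexes(NodeStore store, String... names)
                throws CommitFailedException {
            NodeBuilder builder = store.getRoot().builder();
            NodeBuilder oakIndex = builder.getChildNode("oak:index");
            for (String name : names) {
                NodeBuilder def = oakIndex.getChildNode(name);
                if (def.exists()) {
                    // removes the definition and any index content stored
                    // below it, so the TarMK -> MongoMK copy never sees it
                    def.remove();
                }
            }
            store.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY);
        }
    }

    // e.g. removeIndexes(segmentStore, "lucene"); // example name only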


best,
alex


>
> >
> >
> > hope this helps,
> > alex
>
> Regards
> Julian
>
> >
> >
> >
> >
> >> Due to the above, copying at least the checkpoint of the async index
> >> will likely speed up the critical path. Of course measuring execution
> >> times will provide the definitive answer to this question.
> >>
> >> Regards
> >> Julian
> >>
> >> >
> >> > best,
> >> > alex
> >> >
> >> >
> >> > On Wed, Aug 5, 2015 at 3:35 PM, Julian Sedding <jsedd...@gmail.com> wrote:
> >> >
> >> >> Hi all
> >> >>
> >> >> I am working on a scenario, where I need to copy a SegmentNodeStore
> >> >> (TarMK) to a DocumentNodeStore (MongoDB).
> >> >>
> >> >> It is pretty straight forward to simply copy the NodeStore via the
> >> >> API. No problems here.
> >> >>
> >> >> In a recent experiment I successfully copied the NodeStore and got an
> >> >> exception in the logs (stacktrace below the email).
> >> >>
> >> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
> >> >> the previous checkpoint as stored in /:async/async. Of course this
> >> >> checkpoint is not present in the copied NodeStore and thus cannot be
> >> >> retrieved.
> >> >>
> >> >> IMHO it would be desirable to (optionally) copy the checkpoints as
> >> >> well. In the case of AsyncIndexUpdate, having the checkpoint can save
> >> >> a full re-index.
> >> >>
> >> >> The question that remains is how the internal state of
> >> >> AsyncIndexUpdate should be modified:
> >> >> * implementing the logic in oak-upgrade would be pragmatic, but
> >> >> distributes knowledge about AsyncIndexUpdate implementation details
> >> >> to different modules
> >> >> * having a CommitHook/Editor in oak-core that can be used in
> >> >> oak-upgrade might be cleaner, but would only get used in oak-upgrade
> >> >>
> >> >> Other ideas and opinions regarding this feature are more than welcome!
> >> >>
> >> >> Regards
> >> >> Julian
> >> >>
> >> >>
> >> >> 05.08.2015 00:03:19.133 *ERROR* [pool-6-thread-2]
> >> >> org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception
> >> >> during job execution of
> >> >> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@471e4b4b :
> >> >> 91f7e218-6cf5-4a44-a324-f094c29898e6
> >> >> java.lang.IllegalArgumentException: 91f7e218-6cf5-4a44-a324-f094c29898e6
> >> >>     at org.apache.jackrabbit.oak.plugins.document.Revision.fromString(Revision.java:236)
> >> >>     at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.retrieve(DocumentNodeStore.java:1570)
> >> >>     at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:279)
> >> >>     at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
> >> >>     at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
> >> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> >>     at java.lang.Thread.run(Thread.java:745)
> >> >>
> >>
>
