Re: Checkpoints and copying NodeStore instances (aka RepositorySidegrade)

Julian Sedding Thu, 06 Aug 2015 06:14:52 -0700

Hi Alex

See inline.


On Wed, Aug 5, 2015 at 7:57 PM, Alex Parvulescu
<alex.parvule...@gmail.com> wrote:
> Hi,
>
> see inline
>
> On Wed, Aug 5, 2015 at 5:45 PM, Julian Sedding <jsedd...@gmail.com> wrote:
>
>> Hi Alex
>>
>> Thanks for your comments.
>>
>> On Wed, Aug 5, 2015 at 3:48 PM, Alex Parvulescu
>> <alex.parvule...@gmail.com> wrote:
>> > Hi,
>> >
>> > Just a few clarifications on the error you see
>> >
>> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
>> > the previous checkpoint as stored in /:async/async. Of course this
>> > checkpoint is not present in the copied NodeStore and thus cannot be
>> > retrieved.
>> >
>> > The error comes from DocumentMk trying to parse the reference checkpoint
>> > value. Basically what fails here is 'Revision.fromString' receiving a
>> > malformed checkpoint value because it comes from the SegmentMk. The quick
>> > fix is to manually remove the properties on the "/:async" hidden node. This
>> > will indeed trigger a full reindex, but will help you get over this
>> > issue.
>>
>> Agreed. In this case parsing the revision is the first thing that
>> fails. When copying a DocumentNodeStore (DNS) to a SegmentNodeStore (SNS),
>> a similar situation would arise, because no snapshot with the provided
>> ID exists.
>>
>>
> [alex] Not really, as the SegmentMk will not fail (no
> IllegalArgumentException), but simply log a warning that the checkpoint doesn't
> exist and perform a full reindex. So in this regard it is a bit more
> lenient :)

Ok, I didn't know that SegmentMK is more lenient here. Should we make
DocumentMK degrade gracefully as well? Currently the AsyncIndexUpdate
does not recover by itself. It would be more robust if it did.
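
To make "degrade gracefully" a bit more concrete, here is a minimal
sketch. It is not Oak code: CheckpointStore, StrictStore and
LenientRetrieve are hypothetical stand-ins, and the id pattern in
StrictStore is only an assumed approximation of a DocumentMK revision
string. The idea is simply to treat an unparseable checkpoint like a
missing one, so the caller can fall back to a full reindex.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a NodeStore's checkpoint lookup; not the Oak API.
interface CheckpointStore {
    // Returns the state stored under the checkpoint, or null if it is unknown.
    // May throw IllegalArgumentException if the id cannot be parsed.
    String retrieve(String checkpoint);
}

// Mimics the strict behaviour: ids must look like "r<hex>-<hex>-<hex>"
// (assumed pattern, for illustration only).
class StrictStore implements CheckpointStore {
    private final Map<String, String> checkpoints = new HashMap<>();

    void put(String checkpoint, String state) {
        checkpoints.put(checkpoint, state);
    }

    @Override
    public String retrieve(String checkpoint) {
        if (!checkpoint.matches("r[0-9a-f]+-[0-9a-f]+-[0-9a-f]+")) {
            throw new IllegalArgumentException(checkpoint);
        }
        return checkpoints.get(checkpoint);
    }
}

class LenientRetrieve {
    // Treats an unparseable checkpoint like a missing one: returns null,
    // which the caller interprets as "no usable checkpoint, do a full reindex".
    static String retrieveOrNull(CheckpointStore store, String checkpoint) {
        try {
            return store.retrieve(checkpoint);
        } catch (IllegalArgumentException e) {
            System.err.println("Checkpoint " + checkpoint
                    + " is not valid for this store, full reindex needed");
            return null;
        }
    }

    public static void main(String[] args) {
        StrictStore store = new StrictStore();
        store.put("r14ef1a2b3c4-0-1", "state-at-checkpoint");
        // Checkpoint in the store's own format
        System.out.println(retrieveOrNull(store, "r14ef1a2b3c4-0-1"));
        // Checkpoint id copied over from a different kind of store
        System.out.println(retrieveOrNull(store, "91f7e218-6cf5-4a44-a324-f094c29898e6"));
    }
}
```

Whether the fallback belongs in the store or in the caller is an open
design question; the sketch puts it in the caller only to keep the
strict store untouched.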

>
>
>
>> >
>> >> IMHO it would be desirable to (optionally) copy the checkpoints as
>> > well. In the case of AsyncIndexUpdate, having the checkpoint can save
>> > a full re-index.
>> >
>> > This is very tricky, as the 2 representations of checkpoints between
>> > SegmentMk and DocumentMk are quite different. I would strongly suggest
>> > going for the reindex, after all you'd only migrate once, so you can
>> > prepare for this lengthy process.
>>
>> I'm experimenting with the following approach:
>> * retrieve the first checkpoint and copy the NodeState tree at that
>> revision (available via CheckpointMBean impls)
>> * after copying the tree, merge and create a checkpoint (expiration
>> time can be calculated)
>> * rinse and repeat until the head revision is reached
>>
>> My aim is to reduce the critical path for migrating one NodeStore
>> (incl JR2) to another. Indexing (especially async indexing) is a
>> big part of the time, so if I can move that out of the critical path,
>> it can save a lot of downtime.
>>
>
> [alex] interesting approach. I would only reduce this to the 'current'
> indexed checkpoint (the async reference). So you'd migrate that over first
> as the head state, create a checkpoint based on it (let's call it 'c0').
> then diff&apply the SegmentMk head state on top of this. update the async
> property to point to c0 and you might be good.

Absolutely, only copying the checkpoints that are actually needed makes sense.
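
To check that I understand the order of the steps, here is a toy
sketch. It is only an illustration: ToyStore models a store as a flat
map of path to value plus named snapshots, and all names in it are
hypothetical, not the Oak NodeStore API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Toy model of a store: a head state plus named snapshots (checkpoints).
class ToyStore {
    final Map<String, String> head = new TreeMap<>();
    final Map<String, Map<String, String>> checkpoints = new HashMap<>();
    private final String name;
    private int counter = 0;

    ToyStore(String name) {
        this.name = name;
    }

    // Snapshots the current head and returns the id of the snapshot.
    String checkpoint() {
        String id = name + "-c" + counter++;
        checkpoints.put(id, new TreeMap<>(head));
        return id;
    }
}

class CheckpointFirstCopy {
    static final String ASYNC_REF = "/:async/async";

    // Copies source to target so that the target ends up with the source's
    // head content and a checkpoint equal to the source's indexed checkpoint.
    static String copy(ToyStore source, ToyStore target) {
        String sourceRef = source.head.get(ASYNC_REF);
        Map<String, String> indexed = source.checkpoints.get(sourceRef);
        if (indexed == null) {
            throw new IllegalStateException("no indexed checkpoint: " + sourceRef);
        }

        // 1. migrate the indexed checkpoint state as the target's head
        target.head.clear();
        target.head.putAll(indexed);

        // 2. create a checkpoint on the target based on that state
        String c0 = target.checkpoint();

        // 3. apply the diff between the source checkpoint and the source head
        for (String path : indexed.keySet()) {
            if (!source.head.containsKey(path)) {
                target.head.remove(path);
            }
        }
        for (Map.Entry<String, String> e : source.head.entrySet()) {
            if (!Objects.equals(e.getValue(), indexed.get(e.getKey()))) {
                target.head.put(e.getKey(), e.getValue());
            }
        }

        // 4. point the async reference at the target's own checkpoint
        target.head.put(ASYNC_REF, c0);
        return c0;
    }

    public static void main(String[] args) {
        ToyStore source = new ToyStore("src");
        source.head.put("/content/a", "1");
        source.head.put(ASYNC_REF, source.checkpoint());
        source.head.put("/content/b", "2"); // edited after the last index run

        ToyStore target = new ToyStore("dst");
        String c0 = copy(source, target);
        System.out.println("target checkpoint " + c0 + ": " + target.checkpoints.get(c0));
        System.out.println("target head: " + target.head);
    }
}
```

In the toy, the async index on the target would then only need to
process the difference between c0 and the head.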

Thinking out loud: it may be faster to run the async-index in the copy
process, based on the diff from the source NodeStore between the
checkpoint and the head. That should be feasible, right?

>
>
>
>>
>> My current approach for a migration from JR2 to MongoMK is to:
>> * copy JR2 to TarMK (TarMK is a lot faster for creating indexes etc.
>> than MongoMK)
>> * repeat JR2 to TarMK copy every week or every 24h using incremental
>> copy. this saves on CommitHook execution time - in theory this can
>> reduce the time for one run to a single full repository traversal.
>> * finally on the day when the systems should be switched over, run a
>> last JR2 to TarMK and then a TarMK to MongoMK copy. this is the
>> critical path.
>>
>
> [alex] Always going through the SegmentMk seems a bit convoluted. Why not
> do the migration once, then apply the diffs on top of MongoMk directly
> (AFAIK we have support for incremental updates now)? Are the 24h diffs so
> big that it makes it unusable/unacceptable to go to MongoMk directly? (I'd
> like to see this backed by some numbers).

We definitely need numbers. I aim to do some experiments after my
holidays and provide some numbers at the beginning of September.

Incremental upgrades definitely yield a huge benefit on the critical
path with SegmentMK; I don't know about DocumentMK yet.

Regarding incremental upgrades, I already have some numbers. The scenario is
the migration from JR2 (TarPM) to Oak TarMK, copying 2.6mio regular
nodes and 5.7mio versions (versions are copied via commit editor, see
OAK-2776 "copy all referenced versions"). The first source repository
is 23 days older than the second source repository, i.e. the delta is
~3 weeks of content editing of a live website.

initial upgrade
- copy time: ~6min (2.6mio regular nodes) + ~30min (5.7mio referenced versions)
- index creation (synchronous indexes): ~2h 20min
- total time: ~2h 57min

incremental upgrade (with OAK-3163 applied)
- copy/compare time: ~9min (2.6mio regular nodes) + ~6.5min (0.7mio
new/modified referenced versions)
- index creation (synchronous indexes): ~7min
- total time: ~23min

finally copy from TarMK to MongoMK
- total: ~3h 7min
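
For what it's worth, the effect on the switch-over day can be summed up
with a few lines of arithmetic. This is only a back-of-the-envelope
sketch over the approximate totals above; the class name is made up,
and comparing against a full upgrade on the switch-over day is my
assumption about what the alternative would be.

```java
import java.time.Duration;

// Back-of-the-envelope sums over the approximate timings quoted above.
class CriticalPath {
    // Approximate totals taken from the measurements in this mail
    static final Duration INITIAL_UPGRADE = Duration.ofHours(2).plusMinutes(57);
    static final Duration INCREMENTAL_UPGRADE = Duration.ofMinutes(23);
    static final Duration TARMK_TO_MONGOMK = Duration.ofHours(3).plusMinutes(7);

    // Switch-over day if the JR2 to TarMK step is a full (initial) upgrade
    static Duration withoutIncremental() {
        return INITIAL_UPGRADE.plus(TARMK_TO_MONGOMK);
    }

    // Switch-over day if the JR2 to TarMK step is only an incremental run
    static Duration withIncremental() {
        return INCREMENTAL_UPGRADE.plus(TARMK_TO_MONGOMK);
    }

    static String format(Duration d) {
        return d.toHours() + "h " + (d.toMinutes() % 60) + "min";
    }

    public static void main(String[] args) {
        System.out.println("without incremental: ~" + format(withoutIncremental()));
        System.out.println("with incremental:    ~" + format(withIncremental()));
        System.out.println("saved:               ~"
                + format(withoutIncremental().minus(withIncremental())));
    }
}
```

With these numbers the critical path is roughly 3h 30min instead of
roughly 6h 4min, and the TarMK to MongoMK copy dominates what is left.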

>
>
> hope this helps,
> alex

Regards
Julian

>
>
>
>
>> Due to the above, copying at least the checkpoint of the async index
>> will likely speed up the critical path. Of course measuring execution
>> times will provide the definitive answer to this question.
>>
>> Regards
>> Julian
>>
>> >
>> > best,
>> > alex
>> >
>> >
>> > On Wed, Aug 5, 2015 at 3:35 PM, Julian Sedding <jsedd...@gmail.com> wrote:
>> >
>> >> Hi all
>> >>
>> >> I am working on a scenario, where I need to copy a SegmentNodeStore
>> >> (TarMK) to a DocumentNodeStore (MongoDB).
>> >>
>> >> It is pretty straightforward to simply copy the NodeStore via the
>> >> API. No problems here.
>> >>
>> >> In a recent experiment I successfully copied the NodeStore and got an
>> >> exception in the logs (stacktrace below the email).
>> >>
>> >> My interpretation is that the AsyncIndexUpdate is trying to retrieve
>> >> the previous checkpoint as stored in /:async/async. Of course this
>> >> checkpoint is not present in the copied NodeStore and thus cannot be
>> >> retrieved.
>> >>
>> >> IMHO it would be desirable to (optionally) copy the checkpoints as
>> >> well. In the case of AsyncIndexUpdate, having the checkpoint can save
>> >> a full re-index.
>> >>
>> >> The question that remains is how the internal state of
>> >> AsyncIndexUpdate should be modified:
>> >> * implementing the logic in oak-upgrade would be pragmatic, but
>> >> distributes knowledge about AsyncIndexUpdate implementation details to
>> >> different modules
>> >> * having a CommitHook/Editor in oak-core that can be used in
>> >> oak-upgrade might be cleaner, but would only get used in oak-upgrade
>> >>
>> >> Other ideas and opinions regarding this feature are more than welcome!
>> >>
>> >> Regards
>> >> Julian
>> >>
>> >>
>> >> 05.08.2015 00:03:19.133 *ERROR* [pool-6-thread-2]
>> >> org.apache.sling.commons.scheduler.impl.QuartzScheduler Exception
>> >> during job execution of
>> >> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@471e4b4b :
>> >> 91f7e218-6cf5-4a44-a324-f094c29898e6
>> >> java.lang.IllegalArgumentException: 91f7e218-6cf5-4a44-a324-f094c29898e6
>> >>         at org.apache.jackrabbit.oak.plugins.document.Revision.fromString(Revision.java:236)
>> >>         at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.retrieve(DocumentNodeStore.java:1570)
>> >>         at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:279)
>> >>         at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
>> >>         at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
>> >>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >>         at java.lang.Thread.run(Thread.java:745)
>> >>
>>
