We have ideas that might help improve on the current recovery approach,
but nothing is directly planned.
It's related to this ticket: https://github.com/akka/akka/issues/17837 but
it's still at the research stage.

Currently you'll need to clean the persistent data before restarting
sharding.
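
For example, with the bundled LevelDB journal a rough sketch of that
cleanup step could look like the snippet below (the directory names are
just assumptions; use whatever your journal and snapshot-store plugins
are configured with), and it must run only while every node is stopped:

  import java.io.File

  object CleanShardingData extends App {
    // Assumed locations; match them to your configured
    // akka.persistence.journal.leveldb.dir and snapshot-store dir.
    val journalDir  = new File("target/journal")
    val snapshotDir = new File("target/snapshots")

    def deleteRecursively(f: File): Unit = {
      if (f.isDirectory)
        Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
      f.delete()
    }

    // Only safe after a clean stop of ALL nodes in the cluster.
    deleteRecursively(journalDir)
    deleteRecursively(snapshotDir)
  }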

On Thu, Jun 25, 2015 at 6:27 PM, Diego Martinoia <diego.martin...@gmail.com>
wrote:

> (I mean, in a sense the corruption already gives us a bit of downtime,
> but it only prevents new instances from being created and rebalanced;
> the "old" cluster keeps working and accepting requests, so it's not
> technically a "downtime".)
>
> Also on this topic: do you reckon that dumping a snapshot of the cluster
> right before restart could be of any use, or will having already separated
> the journals for data and metadata be enough to keep the data consistent
> at restart?
>
> (Sorry if my questions are a bit vague; we are trying to understand all
> the possibilities and their implications.)
>
> Thanks,
>
> D.
>
>
> On Thursday, June 25, 2015 at 5:21:31 PM UTC+1, Diego Martinoia wrote:
>>
>> I understand.
>>
>> Is there (or is it planned for the future) any (even convoluted) way
>> to recover without downtime in a production environment? Or is there any
>> way to write a journal plugin that would allow recovering without
>> downtime?
>>
>> Thanks,
>>
>> D.
>>
>> On Thursday, June 25, 2015 at 1:36:38 PM UTC+1, Patrik Nordwall wrote:
>>>
>>>
>>>
>>> On Tue, Jun 23, 2015 at 11:12 AM, Diego Martinoia <diego.m...@gmail.com>
>>> wrote:
>>>
>>>> Hi Patrik,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cleaning all the metadata was the option I was thinking about too. I'm
>>>> just unsure whether there may be a race condition between wiping out the
>>>> metadata and spawning new actors / new instances in the cluster that
>>>> could lead to a split brain on restart. What do you think?
>>>>
>>>
>>> You can only do this with a clean stop of all nodes. Remove the data
>>> when all nodes are stopped. Only then can you start them up again.
>>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> D.
>>>>
>>>> On Tuesday, June 23, 2015 at 7:43:06 AM UTC+1, Patrik Nordwall wrote:
>>>>>
>>>>> It is safe to remove all data that the shard coordinator stored when
>>>>> you restart the cluster. Stop all nodes, remove the data and then start
>>>>> them again.
>>>>>
>>>>> You should probably investigate why your data got corrupted. The usual
>>>>> suspect is that you had multiple writers to the same persistenceId, i.e.
>>>>> you have split the cluster into two separate clusters. That can happen
>>>>> if you use auto-down.
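>>>>>
>>>>> (Just to illustrate the setting involved, this is the standard cluster
>>>>> configuration key, shown with its safe default value:)
>>>>>
>>>>>   import com.typesafe.config.ConfigFactory
>>>>>
>>>>>   // Automatic downing of unreachable nodes is what lets one cluster
>>>>>   // split into two independent clusters, each writing to the same
>>>>>   // persistenceIds. Keep it off in production.
>>>>>   val clusterConfig = ConfigFactory.parseString("""
>>>>>     akka.cluster.auto-down-unreachable-after = off
>>>>>   """)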
>>>>>
>>>>> Cheers,
>>>>> Patrik
>>>>>
>>>>> On Mon, Jun 22, 2015 at 2:09 PM, Diego Martinoia <diego.m...@ocado.com
>>>>> > wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> While using the cluster sharding extension, you have to provide some
>>>>>> sort of persistence journal so that the plugin can store its metadata
>>>>>> (ShardRegionAllocated, etc...).
>>>>>>
>>>>>> This metadata is used when new actors are instantiated / moved across
>>>>>> nodes, so that they can recover their previous state.
>>>>>>
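>>>>>> (For concreteness, with the bundled LevelDB plugins that journal is
>>>>>> typically wired up roughly like the snippet below; the directory
>>>>>> values are just placeholder paths:)
>>>>>>
>>>>>>   import com.typesafe.config.ConfigFactory
>>>>>>
>>>>>>   // Journal and snapshot-store plugins used by Akka Persistence, and
>>>>>>   // therefore by the sharding coordinator's metadata; the dirs are
>>>>>>   // example values only.
>>>>>>   val persistenceConfig = ConfigFactory.parseString("""
>>>>>>     akka.persistence.journal.plugin = "akka.persistence.journal.leveldb"
>>>>>>     akka.persistence.journal.leveldb.dir = "target/journal"
>>>>>>     akka.persistence.snapshot-store.plugin = "akka.persistence.snapshot-store.local"
>>>>>>     akka.persistence.snapshot-store.local.dir = "target/snapshots"
>>>>>>   """)
>>>>>>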
>>>>>> Suppose that for any reason your journal becomes corrupted (loses one
>>>>>> entry, duplicates an entry, whatever). This leads to pretty bad
>>>>>> exceptions at the actor's startup (persistence recovery failure),
>>>>>> possibly terminating the whole region if not correctly handled.
>>>>>>
>>>>>> What is the best way to manage this scenario? (I'm asking for ideas
>>>>>> at any level of the stack, from the supervisor policy to some sort of
>>>>>> direct intervention on the journal.)
>>>>>>
>>>>>> Any ideas welcome!
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> D.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Patrik Nordwall
>>>>> Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
>>>>> Twitter: @patriknw
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Patrik Nordwall
>>> Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
>>> Twitter: @patriknw
>>>
>



-- 
Cheers,
Konrad 'ktoso' Malawski
Akka <http://akka.io/> @ Typesafe <http://typesafe.com/>
