[ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-19 Thread Kostis Fardelas
Hello cephers,
this is the blog post about the outage our Ceph cluster experienced some
weeks ago and how we managed to revive the cluster and our clients' data.

I hope it will prove useful for anyone who finds themselves in a similar
position. Thanks to everyone on the ceph-users and ceph-devel lists who
contributed to our inquiries during troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-19 Thread Goncalo Borges
Hi Kostis...
That is a tale from the dark side. Glad you recovered it and that you were
willing to document it all and share it. Thank you for that.
Can I also ask which tool you used to recover the leveldb?
Cheers
Goncalo

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
Fardelas [dante1...@gmail.com]
Sent: 20 October 2016 09:09
To: ceph-users
Subject: [ceph-users] Surviving a ceph cluster outage: the hard way

Hello cephers,
this is the blog post about the outage our Ceph cluster experienced some
weeks ago and how we managed to revive the cluster and our clients' data.

I hope it will prove useful for anyone who finds themselves in a similar
position. Thanks to everyone on the ceph-users and ceph-devel lists who
contributed to our inquiries during troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-19 Thread Kostis Fardelas
We pulled leveldb from upstream and ran leveldb.RepairDB against the
OSD omap directory using a simple python script. Ultimately, that
didn't move things forward. We resorted to checking every object's
timestamp/md5sum/attributes on the crashed OSD against the replicas in
the cluster, and in the end discarded the journal once we had concluded
with as much confidence as possible that we would not lose data.

It would have been really useful at that point to have a tool to inspect
the contents of the crashed OSD's journal and limit the scope of the
verification process.
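
In case it helps anyone, the per-object verification boiled down to
something along these lines (a simplified sketch rather than the exact
script we ran; the OSD and PG paths are just placeholders, and you would
repeat it for every PG hosted on the crashed OSD, with the replica's PG
directory copied somewhere reachable):

```
#!/usr/bin/python2
# Simplified sketch: compare the objects of one PG on the crashed OSD
# against a copy of the same PG taken from a healthy replica.
# Paths are placeholders, not our real layout.
# (We also compared xattrs; that part is omitted here for brevity.)

import hashlib
import os

CRASHED_PG = '/var/lib/ceph/osd/ceph-12/current/3.7f_head'
REPLICA_PG = '/mnt/replica-copy/3.7f_head'

def md5sum(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

for root, _, files in os.walk(CRASHED_PG):
    for name in files:
        local = os.path.join(root, name)
        remote = os.path.join(REPLICA_PG, os.path.relpath(local, CRASHED_PG))
        if not os.path.exists(remote):
            print('missing on replica: %s' % local)
        elif md5sum(local) != md5sum(remote):
            print('content differs:   %s' % local)
        elif int(os.path.getmtime(local)) != int(os.path.getmtime(remote)):
            print('mtime differs:     %s' % local)
```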

On 20 October 2016 at 08:15, Goncalo Borges
 wrote:
> Hi Kostis...
> That is a tale from the dark side. Glad you recovered it and that you were
> willing to document it all and share it. Thank you for that.
> Can I also ask which tool you used to recover the leveldb?
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
> Fardelas [dante1...@gmail.com]
> Sent: 20 October 2016 09:09
> To: ceph-users
> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>
> Hello cephers,
> this is the blog post about the outage our Ceph cluster experienced some
> weeks ago and how we managed to revive the cluster and our clients' data.
>
> I hope it will prove useful for anyone who finds themselves in a similar
> position. Thanks to everyone on the ceph-users and ceph-devel lists who
> contributed to our inquiries during troubleshooting.
>
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-20 Thread Kris Gillespie
Kostis,

Excellent article, mate. This is the kind of war story that can really help
people out. Learning through (others') adversity.

Kris


> On 20 Oct 2016, at 00:09, Kostis Fardelas  wrote:
> 
> Hello cephers,
> this is the blog post about the outage our Ceph cluster experienced some
> weeks ago and how we managed to revive the cluster and our clients' data.
>
> I hope it will prove useful for anyone who finds themselves in a similar
> position. Thanks to everyone on the ceph-users and ceph-devel lists who
> contributed to our inquiries during troubleshooting.
> 
> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
> 
> Regards,
> Kostis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-20 Thread mj

Hi,

Interesting reading!

Any chance you could share some of the lessons (if any) you learned?

I can, for example, imagine your situation would have been much better
with a replication factor of three instead of two?


MJ

On 10/20/2016 12:09 AM, Kostis Fardelas wrote:

Hello cephers,
this is the blog post about the outage our Ceph cluster experienced some
weeks ago and how we managed to revive the cluster and our clients' data.

I hope it will prove useful for anyone who finds themselves in a similar
position. Thanks to everyone on the ceph-users and ceph-devel lists who
contributed to our inquiries during troubleshooting.

https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/

Regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-24 Thread Dan Jakubiec
Thanks Kostis, great read.  

We also had a Ceph disaster back in August and a lot of this experience looked 
familiar.  Sadly, in the end we were not able to recover our cluster but glad 
to hear that you were successful.

LevelDB corruptions were one of our big problems.  Your note below about 
running RepairDB from Python is interesting.  At the time we were looking for a 
Ceph tool to run LevelDB repairs in order to get our OSDs back up and couldn't 
find one.  I felt like this is something that should be in the standard toolkit.

Would be great to see this added some day, but in the meantime I will remember 
this option exists.  If you still have the Python script, perhaps you could 
post it as an example?

Thanks!

-- Dan


> On Oct 20, 2016, at 01:42, Kostis Fardelas  wrote:
> 
> We pulled leveldb from upstream and ran leveldb.RepairDB against the
> OSD omap directory using a simple python script. Ultimately, that
> didn't move things forward. We resorted to checking every object's
> timestamp/md5sum/attributes on the crashed OSD against the replicas in
> the cluster, and in the end discarded the journal once we had concluded
> with as much confidence as possible that we would not lose data.
>
> It would have been really useful at that point to have a tool to inspect
> the contents of the crashed OSD's journal and limit the scope of the
> verification process.
> 
> On 20 October 2016 at 08:15, Goncalo Borges
>  wrote:
>> Hi Kostis...
>> That is a tale from the dark side. Glad you recovered it and that you were
>> willing to document it all and share it. Thank you for that.
>> Can I also ask which tool you used to recover the leveldb?
>> Cheers
>> Goncalo
>> 
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
>> Fardelas [dante1...@gmail.com]
>> Sent: 20 October 2016 09:09
>> To: ceph-users
>> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>> 
>> Hello cephers,
>> this is the blog post about the outage our Ceph cluster experienced some
>> weeks ago and how we managed to revive the cluster and our clients' data.
>>
>> I hope it will prove useful for anyone who finds themselves in a similar
>> position. Thanks to everyone on the ceph-users and ceph-devel lists who
>> contributed to our inquiries during troubleshooting.
>> 
>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>> 
>> Regards,
>> Kostis
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-26 Thread Kostis Fardelas
It is no more than a three-line script. You will also need leveldb's
code in your working directory:

```
#!/usr/bin/python2

import leveldb
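# repair the leveldb store backing the OSD's omap directory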
leveldb.RepairDB('./omap')
```
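
For what it's worth, a quick sanity check on the repaired omap before
starting the OSD again can be as simple as walking the keys (again a
sketch, assuming the same leveldb Python bindings as above):

```
#!/usr/bin/python2
# Sketch: open the repaired omap and walk its keys as a basic sanity check.
# Run this with the OSD still stopped; './omap' matches the path above.

import leveldb

db = leveldb.LevelDB('./omap', create_if_missing=False)
count = sum(1 for _ in db.RangeIter(include_value=False))
print('omap keys readable: %d' % count)
```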

I totally agree that we need more repair tools to be officially
available, and also tools that provide better insight into components
that are a "black box" for the operator right now, i.e. the journal.

On 24 October 2016 at 19:36, Dan Jakubiec  wrote:
> Thanks Kostis, great read.
>
> We also had a Ceph disaster back in August and a lot of this experience 
> looked familiar.  Sadly, in the end we were not able to recover our cluster 
> but glad to hear that you were successful.
>
> LevelDB corruptions were one of our big problems.  Your note below about 
> running RepairDB from Python is interesting.  At the time we were looking for 
> a Ceph tool to run LevelDB repairs in order to get our OSDs back up and 
> couldn't find one.  I felt like this is something that should be in the 
> standard toolkit.
>
> Would be great to see this added some day, but in the meantime I will 
> remember this option exists.  If you still have the Python script, perhaps 
> you could post it as an example?
>
> Thanks!
>
> -- Dan
>
>
>> On Oct 20, 2016, at 01:42, Kostis Fardelas  wrote:
>>
>> We pulled leveldb from upstream and ran leveldb.RepairDB against the
>> OSD omap directory using a simple python script. Ultimately, that
>> didn't move things forward. We resorted to checking every object's
>> timestamp/md5sum/attributes on the crashed OSD against the replicas in
>> the cluster, and in the end discarded the journal once we had concluded
>> with as much confidence as possible that we would not lose data.
>>
>> It would have been really useful at that point to have a tool to inspect
>> the contents of the crashed OSD's journal and limit the scope of the
>> verification process.
>>
>> On 20 October 2016 at 08:15, Goncalo Borges
>>  wrote:
>>> Hi Kostis...
>>> That is a tale from the dark side. Glad you recovered it and that you were
>>> willing to document it all and share it. Thank you for that.
>>> Can I also ask which tool you used to recover the leveldb?
>>> Cheers
>>> Goncalo
>>> ________
>>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
>>> Fardelas [dante1...@gmail.com]
>>> Sent: 20 October 2016 09:09
>>> To: ceph-users
>>> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>>>
>>> Hello cephers,
>>> this is the blog post about the outage our Ceph cluster experienced some
>>> weeks ago and how we managed to revive the cluster and our clients' data.
>>>
>>> I hope it will prove useful for anyone who finds themselves in a similar
>>> position. Thanks to everyone on the ceph-users and ceph-devel lists who
>>> contributed to our inquiries during troubleshooting.
>>>
>>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>>
>>> Regards,
>>> Kostis
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Surviving a ceph cluster outage: the hard way

2016-10-27 Thread kefu chai
On Thu, Oct 27, 2016 at 1:26 PM, Kostis Fardelas  wrote:
> It is no more than a three-line script. You will also need leveldb's
> code in your working directory:
>
> ```
> #!/usr/bin/python2
>
> import leveldb
> leveldb.RepairDB('./omap')
> ```
>
> I totally agree that we need more repair tools to be officially
> available, and also tools that provide better insight into components
> that are a "black box" for the operator right now, i.e. the journal.
>
> On 24 October 2016 at 19:36, Dan Jakubiec  wrote:
>> Thanks Kostis, great read.
>>
>> We also had a Ceph disaster back in August and a lot of this experience 
>> looked familiar.  Sadly, in the end we were not able to recover our cluster 
>> but glad to hear that you were successful.
>>
>> LevelDB corruptions were one of our big problems.  Your note below about 
>> running RepairDB from Python is interesting.  At the time we were looking 
>> for a Ceph tool to run LevelDB repairs in order to get our OSDs back up and 
>> couldn't find one.  I felt like this is something that should be in the 
>> standard toolkit.
>>
>> Would be great to see this added some day, but in the meantime I will 
>> remember this option exists.  If you still have the Python script, perhaps 
>> you could post it as an example?

I just logged this feature at http://tracker.ceph.com/issues/17730, so
we don't forget it!

>>
>> Thanks!
>>
>> -- Dan
>>
>>
>>> On Oct 20, 2016, at 01:42, Kostis Fardelas  wrote:
>>>
>>> We pulled leveldb from upstream and ran leveldb.RepairDB against the
>>> OSD omap directory using a simple python script. Ultimately, that
>>> didn't move things forward. We resorted to checking every object's
>>> timestamp/md5sum/attributes on the crashed OSD against the replicas in
>>> the cluster, and in the end discarded the journal once we had concluded
>>> with as much confidence as possible that we would not lose data.
>>>
>>> It would have been really useful at that point to have a tool to inspect
>>> the contents of the crashed OSD's journal and limit the scope of the
>>> verification process.
>>>
>>> On 20 October 2016 at 08:15, Goncalo Borges
>>>  wrote:
>>>> Hi Kostis...
>>>> That is a tale from the dark side. Glad you recover it and that you were 
>>>> willing to doc it all up, and share it. Thank you for that,
>>>> Can I also ask which tool did you use to recover the leveldb?
>>>> Cheers
>>>> Goncalo
>>>> 
>>>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis 
>>>> Fardelas [dante1...@gmail.com]
>>>> Sent: 20 October 2016 09:09
>>>> To: ceph-users
>>>> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>>>>
>>>> Hello cephers,
>>>> this is the blog post about the outage our Ceph cluster experienced some
>>>> weeks ago and how we managed to revive the cluster and our clients' data.
>>>>
>>>> I hope it will prove useful for anyone who finds themselves in a similar
>>>> position. Thanks to everyone on the ceph-users and ceph-devel lists who
>>>> contributed to our inquiries during troubleshooting.
>>>>
>>>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>>>
>>>> Regards,
>>>> Kostis
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com