Hello,

It seems nothing has changed.

OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/


> On 3 Oct 2018, at 14:27, Darius Kasparavičius <daz...@gmail.com> wrote:
> 
> Hello,
> 
> 
> You can also reduce the OSD map updates by adding this to your ceph
> config file: "osd crush update on start = false". This should remove
> the update that is generated when an OSD starts.
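> 
> A minimal ceph.conf sketch (putting it in the [osd] section is an
> assumption; a [global] setting would work as well):
> 
>   [osd]
>   osd crush update on start = false
> 
> With this set, the OSD no longer issues the "osd crush create-or-move"
> command to the monitors on startup, which is the update visible in the
> audit log below.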
> 
> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> "root=default"]} v 0) v1
> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0
> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> 'osd.47' initial_weight 3.6396 at location
> {host=SRV-SEKUARK8,root=default}
> 2018-10-03 14:03:22.250 7fe1615e0700  1
> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> enabled
> 
> 
> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> <goktug.yildi...@gmail.com> wrote:
>> 
>> Hi Sage,
>> 
>> Thank you for your response. Now I am sure this incident is going to be 
>> resolved.
>> 
>> The problem started when 7 servers crashed at the same time and came back after
>> ~5 minutes.
>> 
>> Two of our 3 mon services were restarted in this crash. Since the mon services
>> are enabled, they started at nearly the same time. I don't know if
>> this makes any difference, but some people on IRC said they should be started
>> in order, not at the same time; otherwise it could break things badly.
>> 
>> After 9 days we still see 3400-3500 active+clean PGs. But in the end we have
>> so many stuck requests and our cluster cannot heal itself.
>> 
>> When we set the noup flag, OSDs can catch up on epochs easily. But when we unset
>> the flag we see so many stuck and slow ops within 1 hour.
>> I/O load on all of my OSD disks is at around 95% utilization and never
>> ends. CPU and RAM usage are OK.
>> OSDs get so stuck that we can't even run “ceph pg osd.0 query”.
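>> 
>> What we can still try on an OSD host while the daemon is unresponsive over
>> the network is its local admin socket (osd.0 is just an example ID; whether
>> it answers while the disk is saturated is not guaranteed):
>> 
>>   ceph daemon osd.0 status                # state plus oldest_map / newest_map
>>   ceph daemon osd.0 dump_ops_in_flight    # requests currently stuck inside the OSD
>>   ceph daemon osd.0 dump_historic_ops     # recently completed slow requests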
>> 
>> We also tried to change the RBD pool replication size from 2 to 1. Our goal was
>> to eliminate older PGs and leave the cluster with the good ones.
>> With replication size=1 we saw "13% PGs not active”. But it didn’t solve our
>> problem.
>> 
>> Of course we have to save 100% of the data. But we feel like even saving 50% of
>> our data would make us very happy right now.
>> 
>> This is what happens when the cluster starts. I believe it explains the 
>> whole story very nicely.
>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> 
>> This is our ceph.conf:
>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> 
>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> 
>> This is pg dump:
>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> 
>> This is iostat & perf top:
>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> 
>> This is the strace output of ceph-osd:
>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> 
>> This is OSD log (default debug):
>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> 
>> This is leader MON log (default debug):
>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> 
>> These are the OSDs that failed to start. The total number is 58.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
>> 
>> 
>> This is an OSD video with debug osd = 20, debug ms = 1 and debug_filestore =
>> 20.
>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> 
>> This is the OSD log file with debug osd = 20, debug ms = 1 and debug_filestore
>> = 20.
>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> 
>> As far as I understand, the OSD catches up with the mon epoch and then somehow
>> exceeds the mon epoch??
>> 
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 
>> e60196@2018-09-28 23:57:08.251119
>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 
>> build_initial_pg_history 66.f8 created 60196
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - 
>> loading and decoding 0x19da8400
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
>> to_process <> waiting <> waiting_peering {}
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 
>> OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
>> NullEvt +create_info) prio 255 cost 10 e72642) queued
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
>> to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 
>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> 
>> waiting <> waiting_peering {}
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 
>> OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
>> NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 
>> 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 
>> ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) 
>> [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 
>> peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 
>> epoch_requested: 72642 NullEvt +create_info
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru 
>> want 72642 <= queued 72642, currently 72206
>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, 
>> waiting
>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 
>> bytes
>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - 
>> loading and decoding 0x19da8880
>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 
>> bytes
>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - 
>> loading and decoding 0x19da9180
>> 
>> 
>> On 3 Oct 2018, at 05:14, Sage Weil <s...@newdream.net> wrote:
>> 
>> osd_find_best_info_ignore_history_les is a dangerous option and you should
>> only use it in very specific circumstances when directed by a developer.
>> In such cases it will allow a stuck PG to peer.  But you're not getting to
>> that point...you're seeing some sort of resource exhaustion.
>> 
>> The noup trick works when OSDs are way behind on maps and all need to
>> catch up.  The way to tell if they are behind is by looking at the 'ceph
>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch that
>> the mons have.  Were they really caught up when you unset noup?
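>> 
>> A rough sketch of that comparison (osd.NNN as above; the head filter is just
>> for brevity):
>> 
>>   ceph osd dump | head -1        # first line is "epoch NNNNN" as the mons see it
>>   ceph daemon osd.NNN status     # compare the "newest_map" field against it
>> 
>> If newest_map is far behind the mon epoch, the OSD is still catching up on maps.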
>> 
>> I'm just catching up and haven't read the whole thread but I haven't seen
>> anything that explains why the OSDs are doing lots of disk IO.  Catching up
>> on maps could explain it but not why they wouldn't peer once they were all
>> marked up...
>> 
>> sage
>> 
>> 
>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
>> 
>> Has anyone heard about osd_find_best_info_ignore_history_les = true?
>> Would it be useful here? There is so little information about it.
>> 
>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote the following (2 Oct 2018 22:11):
>> 
>> Hi,
>> 
>> Indeed, I let ceph-disk decide the WAL and DB partitions when I read
>> somewhere that it would do the proper sizing.
>> For the bluestore cache size I have plenty of RAM. I will increase it to 8GB for
>> each OSD and decide on a more calculated number after the cluster settles.
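>> 
>> A minimal sketch of that change (the option name is the Luminous/Mimic one;
>> the value is in bytes):
>> 
>>   [osd]
>>   # 8 GiB cache per HDD-backed OSD
>>   bluestore cache size hdd = 8589934592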
>> 
>> I have also looked at the OSD map loading, and it is in a loop. For
>> that reason I started the cluster with the noup flag and waited for the OSDs to
>> reach the up-to-date epoch number. After that I unset noup. But I did not pay
>> attention to the manager logs. Let me check them, thank you!
>> 
>> I am not forcing jemalloc or anything else, really. I have a very standard
>> installation with no tweaks or tuning. All we have asked for is stability over
>> speed from the beginning. And here we are :/
>> 
>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <daz...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> 
>> I can see some issues in the OSD log file. You have extremely small
>> DB and WAL partitions: only 1GB for the DB and 576MB for the WAL. I would
>> recommend cranking up the rocksdb cache size as much as possible. If you
>> have the RAM, you can also increase bluestore's cache size for HDDs. The
>> default is 1GB; be as liberal as you can without getting OOM kills. You also
>> have lots of OSD map loading and decoding in the log. Are you sure all
>> monitors/managers/OSDs are up to date? Also make sure you aren't
>> forcing jemalloc loading; I had a funny interaction with it after upgrading to
>> mimic.
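>> 
>> A quick, hedged way to check the version question (the command exists since
>> Luminous):
>> 
>>   ceph versions    # shows how many mons, mgrs and OSDs run each release
>> 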
>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
>> <goktug.yildi...@gmail.com> wrote:
>> 
>> 
>> Hello Darius,
>> 
>> Thanks for reply!
>> 
>> The main problem is that we cannot query PGs. “ceph pg 67.54f query” gets stuck
>> and waits forever since the OSD is unresponsive.
>> We are certain that an OSD becomes unresponsive as soon as it is up, and we are
>> certain that the OSD responds again after its disk utilization stops.
>> 
>> So we ran a small test like this:
>> * Stop all OSDs (168 of them).
>> * Start OSD1. 95% OSD disk utilization starts immediately. It takes 8 mins
>> to finish. Only after that does “ceph pg 67.54f query” work!
>> * While OSD1 is “up”, start OSD2. As soon as OSD2 starts, OSD1 & OSD2 jump to
>> 95% disk utilization. This takes 17 minutes to finish.
>> * Now start OSD3 and it is the same. All OSDs start heavy I/O and it takes 25
>> mins to settle.
>> * If you start 5 of them at the same time, all of the OSDs start heavy
>> I/O again. And it takes 1 hour to finish.
>> 
>> So in light of these findings we set the noup flag and started all OSDs. At
>> first there was no I/O. After 10 minutes we unset noup. All 168 OSDs
>> started doing heavy I/O. We thought that if we waited long enough it would
>> finish and the OSDs would be responsive again. After 24 hours they were not,
>> because the I/O did not finish or even slow down.
>> One might think there is a lot of data to scan, but it is just 33TB.
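>> 
>> For reference, the sequence above roughly as commands (a sketch; our hosts run
>> the OSDs under systemd with the stock unit names):
>> 
>>   ceph osd set noup
>>   systemctl start ceph-osd.target      # on every OSD host
>>   # wait until the OSDs report the current epoch, then:
>>   ceph osd unset noup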
>> 
>> So in short, we don't know which PG is stuck, so we cannot remove it.
>> 
>> However, we ran into a weird thing half an hour ago. We exported the same PG from
>> two different OSDs. One was 4.2GB and the other was 500KB! So we decided to
>> export all PGs from all OSDs as a backup. Then we will delete the strangely sized
>> ones and start the cluster all over. Maybe then we could resolve the stuck or
>> unfound PGs as you advise.
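>> 
>> A sketch of one such export as it would be done with ceph-objectstore-tool
>> against a stopped bluestore OSD (the OSD ID, PG ID and backup path are only
>> examples):
>> 
>>   systemctl stop ceph-osd@47
>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-47 \
>>       --pgid 67.54f --op export --file /backup/osd47-67.54f.export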
>> 
>> Any thought would be greatly appreciated.
>> 
>> 
>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <daz...@gmail.com> wrote:
>> 
>> Hello,
>> 
>> Currently you have 15 objects missing. I would recommend finding them
>> and making backups of them. Ditch all other OSDs that are failing to
>> start and concentrate on bringing online those that have the missing
>> objects. Then slowly turn off nodown and noout on the cluster and see
>> if it stabilises. If it stabilises, leave those settings off; if not, turn
>> them back on.
>> Now get some of the PGs that are blocked and query them to check
>> why they are blocked. Try removing as many blockers as possible and then
>> remove the norebalance/norecovery flags and see if it starts to fix
>> itself.
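>> 
>> A rough sketch of that loop (the PG ID and grep pattern are only illustrative;
>> the flag itself is spelled norecover):
>> 
>>   ceph health detail | grep -E 'stuck|blocked'   # find problem PGs
>>   ceph pg 67.54f query                           # inspect peering / blocked_by info
>>   ceph osd unset norebalance
>>   ceph osd unset norecover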
>> 
>> On Tue, Oct 2, 2018 at 5:14 PM by morphin
>> <morphinwith...@gmail.com> wrote:
>> 
>> 
>> One of the Ceph experts indicated that bluestore is somewhat of a preview tech
>> (as far as Red Hat is concerned).
>> So it could be best to check bluestore and rocksdb. There are some
>> tools to check health and also repair, but there is limited
>> documentation.
>> Does anyone have experience with them?
>> Any lead/help towards a proper check would be great.
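>> 
>> The tooling in question is presumably ceph-bluestore-tool; a hedged sketch run
>> against a stopped OSD (the OSD ID and path are examples):
>> 
>>   systemctl stop ceph-osd@0
>>   ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0     # metadata consistency check
>>   ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0   # attempt to fix what fsck finds
>> 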
>> Goktug Yildirim <goktug.yildi...@gmail.com> wrote the following on Mon, 1 Oct
>> 2018 at 22:55:
>> 
>> 
>> Hi all,
>> 
>> We have recently upgraded from luminous to mimic. It’s been 6 days since
>> this cluster went offline. The longer story is here:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
>> 
>> I’ve also CC’ed the developers since I believe this is a bug. If this is not
>> the correct way, I apologize; please let me know.
>> 
>> Over these 6 days lots of things happened and there were some findings about the
>> problem. Some of them were misjudged and some of them were not looked into deeply.
>> However the most certain diagnosis is this: each OSD causes very high disk
>> I/O to its bluestore disk (WAL and DB are fine). After that the OSDs become
>> unresponsive or respond very, very slowly. For example "ceph tell osd.x
>> version” gets stuck seemingly forever.
>> 
>> So, due to the unresponsive OSDs, the cluster does not settle. This is our problem!
>> 
>> This is the one we are very sure of. But we are not sure of the reason.
>> 
>> Here is the latest ceph status:
>> https://paste.ubuntu.com/p/2DyZ5YqPjh/
>> 
>> This is the status after we started all of the OSDs 24 hours ago.
>> Some of the OSDs did not start. However, it didn't make any difference when
>> all of them were online.
>> 
>> Here is the debug=20 log of an OSD, which is the same for all the others:
>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> As far as we can figure out, there is a loop pattern; I am sure it won't be caught by eye alone.
>> 
>> This is the full log of the same OSD.
>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
>> 
>> Here is the strace of the same OSD process:
>> https://paste.ubuntu.com/p/8n2kTvwnG6/
>> 
>> Recently we hear more and more advice to upgrade to mimic. I hope no one gets
>> hurt as we did. I am sure we have made lots of mistakes to let this happen. And
>> this situation may be an example for other users and could point to a potential
>> bug for the Ceph developers.
>> 
>> Any help to figure out what is going on would be great.
>> 
>> Best Regards,
>> Goktug Yildirim
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> 
>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
