Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?
I marked the PGs complete using the ceph-objectstore-tool and that fixed the issue with the gateways going down. They have been up for two days now without issue and made it through testing. I tried to extract the data from the failing server, but I was unable to import it. The failing server was on Hammer, and I had upgraded to Jewel and then Luminous shortly thereafter (I had all but the 4 PGs resolved before the upgrade). I don't know whether the tool supports restoring to newer versions, which might be the issue. I assume this is part of the reasoning behind only upgrading healthy clusters. We may try to extract the data directly from the drives at this point; all of it is replaceable, though.

-Brent

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brent Kennedy
Sent: Friday, January 12, 2018 10:27 AM
To: 'David Turner' <drakonst...@gmail.com>
Cc: 'Ceph Users' <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

Rgw.buckets (which is where the data is being sent). I am just surprised that a few incomplete PGs would grind three gateways to a halt. Granted, the incomplete PGs are part of a large hardware failure situation we had, and having a min_size setting of 1 didn't help the situation. We are not completely innocent, but I would hope that the system as a whole would work together to skip those incomplete PGs. Fixing them doesn't appear to be an easy task at this point, hence why we haven't fixed them yet (I wish that were easier, but I understand the counter-argument).

-Brent

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Thursday, January 11, 2018 8:22 PM
To: Brent Kennedy <bkenn...@cfl.rr.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

Which pools are the incomplete PGs a part of?
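Brent doesn't show the exact commands he ran. A dry-run sketch of the usual ceph-objectstore-tool export/import/mark-complete sequence looks roughly like the function below; the OSD path, PG id, and dump file are placeholder values, and on a real cluster the affected OSD must be stopped before the tool can open its data store:

```shell
# Dry-run sketch only: this function just prints the commands it would run,
# so nothing here touches a live cluster. The OSD path, PG id, and dump file
# are placeholders, not values taken from the thread.
pg_recovery_commands() {
  local osd_path="$1" pgid="$2" dump="$3"
  local osd_id="${osd_path##*-}"   # e.g. /var/lib/ceph/osd/ceph-12 -> 12
  echo "systemctl stop ceph-osd@${osd_id}"
  echo "ceph-objectstore-tool --data-path ${osd_path} --pgid ${pgid} --op export --file ${dump}"
  echo "ceph-objectstore-tool --data-path ${osd_path} --pgid ${pgid} --op import --file ${dump}"
  echo "ceph-objectstore-tool --data-path ${osd_path} --pgid ${pgid} --op mark-complete"
  echo "systemctl start ceph-osd@${osd_id}"
}

pg_recovery_commands /var/lib/ceph/osd/ceph-12 11.2f /root/pg11.2f.export
```

The export/import pair is how one would try to move PG data off a failing server onto a surviving OSD, and mark-complete tells the OSD to treat whatever copy it has as authoritative, which discards any unrecovered writes; that matches Brent's note that the import across a version jump failed but marking the PGs complete got the gateways stable.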
I would say it's very likely that if some of the RGW metadata was incomplete, the daemons wouldn't be happy.

On Thu, Jan 11, 2018, 6:17 PM Brent Kennedy <bkenn...@cfl.rr.com> wrote:

We have 3 RadosGW servers running behind HAProxy to enable clients to connect to the Ceph cluster like an Amazon bucket. After all the failures and upgrade issues were resolved, I cannot get the RadosGW servers to stay online. They were upgraded to Luminous; I even upgraded the OS on them to Ubuntu 16 (before upgrading to Luminous). They used to have Apache on them, as they ran Hammer and, before that, Firefly. I removed Apache before upgrading to Luminous. They start up and run for about 4-6 hours before all three start to go offline. Client traffic is light right now, as we are just testing file read/write before we reactivate them (they switched back to Amazon while we fix them). Could the 4 incomplete PGs be causing them to go offline? The last time I saw an issue like this was when recovery wasn't working 100%, so it seems related, since they haven't been stable since we upgraded (but that was also after the failures we had, which is why I am not trying to specifically blame the upgrade).
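David's question about which pools the incomplete PGs belong to can be answered from the PG ids alone: the number before the dot in a PG id is the pool id, which `ceph osd lspools` maps to pool names. A small sketch of that grouping; the pool ids, pool names, and PG ids below are invented placeholders, not values from the thread:

```python
def pools_of_incomplete_pgs(pg_ids, pools):
    """Group stuck PG ids by pool name; a PG id is '<pool-id>.<shard-hex>'."""
    by_pool = {}
    for pgid in pg_ids:
        pool_id = int(pgid.split(".", 1)[0])
        name = pools.get(pool_id, "unknown-pool-%d" % pool_id)
        by_pool.setdefault(name, []).append(pgid)
    return by_pool

# On a real cluster these inputs would come from `ceph osd lspools` and
# `ceph pg dump_stuck inactive` / `ceph health detail`; here they are made up.
pools = {10: ".rgw.root", 11: ".rgw.buckets.index", 12: ".rgw.buckets"}
print(pools_of_incomplete_pgs(["12.7a", "12.1f", "12.3c", "12.05"], pools))
# -> {'.rgw.buckets': ['12.7a', '12.1f', '12.3c', '12.05']}
```

The distinction matters for David's point: four incomplete PGs in a data pool like rgw.buckets would make some objects unreadable, while incomplete PGs in an index or metadata pool could plausibly hang the daemons themselves.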
When I look at the radosgw log, this is what I see (the first two lines show up plenty of times before this; they are health checks by the HAProxy servers. The next two are file requests that fail with a 404, I am guessing, and the last one is me restarting the service):

2018-01-11 20:14:36.640577 7f5826aa3700 1 == req done req=0x7f5826a9d1f0 op status=0 http_status=200 ==
2018-01-11 20:14:36.640602 7f5826aa3700 1 civetweb: 0x56202c567000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -
2018-01-11 20:14:36.640835 7f5816282700 1 == req done req=0x7f581627c1f0 op status=0 http_status=200 ==
2018-01-11 20:14:36.640859 7f5816282700 1 civetweb: 0x56202c61: 192.168.120.22 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -
2018-01-11 20:14:36.761917 7f5835ac1700 1 == starting new request req=0x7f5835abb1f0 =
2018-01-11 20:14:36.763936 7f5835ac1700 1 == req done req=0x7f5835abb1f0 op status=0 http_status=404 ==
2018-01-11 20:14:36.763983 7f5835ac1700 1 civetweb: 0x56202c4ce000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 - aws-sdk-dotnet-35/2.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO
2018-01-11 20:14:36.772611 7f5808266700 1 == starting new request req=0x7f58082601f0 =
2018-01-11 20:14:36.773733 7f5808266700 1 == req done req=0x7f58082601f0 op status=0 http_status=404 ==
2018-01-11 20:14:36.773769 7f5808266700 1 civetweb: 0x56202c6aa000: 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 - aws-sdk-dotnet
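The repeated "HEAD / HTTP/1.0" entries from 192.168.120.21 and .22 are consistent with HAProxy `option httpchk` probes against the gateways. A minimal sketch of such a backend; the server names, backend addresses, and the default civetweb port 7480 are assumptions, not details given in the thread:

```
# Hypothetical HAProxy backend producing the "HEAD / HTTP/1.0" probes in the
# log above; all names, addresses, and timings here are placeholder values.
backend rgw
    balance roundrobin
    option httpchk HEAD /
    server rgw1 192.168.120.31:7480 check inter 5s fall 3 rise 2
    server rgw2 192.168.120.32:7480 check inter 5s fall 3 rise 2
    server rgw3 192.168.120.33:7480 check inter 5s fall 3 rise 2
```

One implication: a health check against "/" only proves civetweb is answering, not that RADOS requests complete, so all three gateways can stay "up" in HAProxy while requests touching the incomplete PGs hang.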