Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-16 Thread Brent Kennedy
I marked the PGs complete using ceph-objectstore-tool and that fixed the issues 
with the gateways going down.  They have been up for 2 days now without issue 
and made it through testing.  I tried to extract the data from the failing 
server, but I was unable to import it.  The failing server was on Hammer, and I 
had upgraded to Jewel and then Luminous shortly thereafter (I had all but the 4 
PGs resolved before the upgrade).  I don't know whether the tool supports 
importing into newer versions, which might be the issue.  I assume this is part 
of the reasoning behind only upgrading healthy clusters.  We may try to extract 
the data directly from the drives at this point, though all of it is replaceable.
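
For anyone who hits this later: the commands involved are roughly of the form 
below (the OSD data paths and the PG id are placeholders, and the OSD has to be 
stopped before running the tool against it), so treat this as a sketch rather 
than a recipe:

  # assert that whatever this OSD has for the PG is the complete copy
  # (only sane when the missing data is expendable, as it is in our case)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --pgid 14.3a --op mark-complete

  # what I attempted for the failing server: export there, import on a live OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --pgid 14.3a --op export --file /tmp/14.3a.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-MM \
      --pgid 14.3a --op import --file /tmp/14.3a.export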

 

-Brent

 


Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-12 Thread Brent Kennedy
Rgw.buckets (which is where the data is being sent).  I am just surprised 
that a few incomplete PGs would grind three gateways to a halt.  Granted, the 
incomplete PGs came out of a large hardware failure situation we had, and having 
a min_size setting of 1 didn't help the situation.  We are not completely 
innocent, but I would hope that the system as a whole would work together to 
skip those incomplete PGs.  Fixing them doesn't appear to be an easy task at 
this point, hence why we haven't fixed them yet (I wish that were easier, but I 
understand the counter-argument).
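
For what it's worth, checking and tightening min_size is simple once the 
cluster is healthy again; on a replicated size-3 pool the usual guard is 
min_size 2.  Something like this (the pool name here is an assumption, adjust 
to your data pool):

  ceph osd pool get .rgw.buckets min_size
  ceph osd pool set .rgw.buckets min_size 2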

 

-Brent

 


Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-11 Thread David Turner
Which pools are the incomplete PGs a part of? I would say it's very likely
that, if some of the RGW metadata were incomplete, the daemons wouldn't
be happy.
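
A quick way to see which pool they belong to (the pool id is the number before
the dot in each PG id) would be something along these lines:

  ceph health detail | grep incomplete
  ceph pg ls incomplete
  ceph osd lspools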


[ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-11 Thread Brent Kennedy
We have 3 RadosGW servers running behind HAProxy to let clients connect to the
ceph cluster as if it were an Amazon S3 bucket.  After all the failures and
upgrade issues were resolved, I cannot get the RadosGW servers to stay online.
They were upgraded to Luminous; I even upgraded the OS on them to Ubuntu 16
(before upgrading to Luminous).  They used to have Apache on them when they ran
Hammer and, before that, Firefly.  I removed Apache before upgrading to
Luminous.  They start up and run for about 4-6 hours before all three start to
go offline.  Client traffic is light right now, as we are just testing file
read/write before we reactivate them (the clients switched back to Amazon while
we fix them).
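
In case it matters, the HAProxy side is nothing exotic; stripped down, the
backend looks roughly like this (the backend IPs and the civetweb port 7480
are placeholders/assumptions here):

  backend rgw
      balance roundrobin
      option httpchk HEAD /
      server rgw1 192.168.120.31:7480 check
      server rgw2 192.168.120.32:7480 check
      server rgw3 192.168.120.33:7480 check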

 

Could the 4 incomplete PGs be causing them to go offline?  The last time I saw
an issue like this was when recovery wasn't working 100%, so it seems related,
since they haven't been stable since we upgraded (but that was also after the
failures we had, which is why I am not trying to specifically blame the
upgrade).
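
For anyone wanting more detail, the incomplete PGs show up with something like
this (the PG id below is a placeholder):

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg 14.3a query   # shows the peering state and which OSDs the PG is waiting on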

 

When I look at the radosgw log, this is what I see (the first 2 lines show up
plenty of times before this; they are health checks from the HAProxy servers.
The next two are file requests that 404, I am guessing, and the last one is me
restarting the service):

 

2018-01-11 20:14:36.640577 7f5826aa3700  1 == req done
req=0x7f5826a9d1f0 op status=0 http_status=200 ==

2018-01-11 20:14:36.640602 7f5826aa3700  1 civetweb: 0x56202c567000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.640835 7f5816282700  1 == req done
req=0x7f581627c1f0 op status=0 http_status=200 ==

2018-01-11 20:14:36.640859 7f5816282700  1 civetweb: 0x56202c61:
192.168.120.22 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.761917 7f5835ac1700  1 == starting new request
req=0x7f5835abb1f0 =

2018-01-11 20:14:36.763936 7f5835ac1700  1 == req done
req=0x7f5835abb1f0 op status=0 http_status=404 ==

2018-01-11 20:14:36.763983 7f5835ac1700  1 civetweb: 0x56202c4ce000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
/Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:36.772611 7f5808266700  1 == starting new request
req=0x7f58082601f0 =

2018-01-11 20:14:36.773733 7f5808266700  1 == req done
req=0x7f58082601f0 op status=0 http_status=404 ==

2018-01-11 20:14:36.773769 7f5808266700  1 civetweb: 0x56202c6aa000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
/Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.163617 7f5836ac3700  1 == starting new request
req=0x7f5836abd1f0 =

2018-01-11 20:14:38.165352 7f5836ac3700  1 == req done
req=0x7f5836abd1f0 op status=0 http_status=404 ==

2018-01-11 20:14:38.165401 7f5836ac3700  1 civetweb: 0x56202c4e2000:
192.168.120.21 - - [11/Jan/2018:20:14:38 +] "HEAD
/Jobimages/vendor05/10/3445645/3445645_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.170551 7f5807a65700  1 == starting new request
req=0x7f5807a5f1f0 =

2018-01-11 20:14:40.322236 7f58352c0700  1 == starting new request
req=0x7f58352ba1f0 =

2018-01-11 20:14:40.323468 7f5834abf700  1 == starting new request
req=0x7f5834ab91f0 =

2018-01-11 20:14:41.643365 7f58342be700  1 == starting new request
req=0x7f58342b81f0 =

2018-01-11 20:14:41.643358 7f58312b8700  1 == starting new request
req=0x7f58312b21f0 =

2018-01-11 20:14:50.324196 7f5829aa9700  1 == starting new request
req=0x7f5829aa31f0 =

2018-01-11 20:14:50.325622 7f58332bc700  1 == starting new request
req=0x7f58332b61f0 =

2018-01-11 20:14:51.645678 7f58362c2700  1 == starting new request
req=0x7f58362bc1f0 =

2018-01-11 20:14:51.645671 7f582e2b2700  1 == starting new request
req=0x7f582e2ac1f0 =

2018-01-11 20:15:00.326452 7f5815a81700  1 == starting new request
req=0x7f5815a7b1f0 =

2018-01-11 20:15:00.328787 7f5828aa7700  1 == starting new request
req=0x7f5828aa11f0 =

2018-01-11 20:15:01.648196 7f580ea73700  1 == starting new request
req=0x7f580ea6d1f0 =

2018-01-11 20:15:01.648698 7f5830ab7700  1 == starting new request
req=0x7f5830ab11f0 =

2018-01-11 20:15:10.328810 7f5832abb700  1 == starting new request
req=0x7f5832ab51f0 =

2018-01-11 20:15:10.329541 7f582f2b4700  1 == starting new request
req=0x7f582f2ae1f0 =

2018-01-11 20:15:11.650655 7f582d2b0700  1 == starting new request
req=0x7f582d2aa1f0 =

2018-01-11 20:15:11.651401 7f582aaab700  1 == starting new request
req=0x7f582aaa51f0 =

2018-01-11 20:15:20.332032 7f582c2ae700  1 == starting new request
req=0x7f582c2a81f0