Re: [ceph-users] MDS flapping: how to increase MDS timeouts?

2017-01-30 Thread John Spray
On Mon, Jan 30, 2017 at 7:09 AM, Burkhard Linke
 wrote:
> Hi,
>
>
>
> On 01/26/2017 03:34 PM, John Spray wrote:
>>
>> On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
>>  wrote:
>>>
>>> Hi,
>>>
>>>
>>> we are running two MDS servers in an active/standby-replay setup.
>>> Recently we had to disconnect the active MDS server, and failover to
>>> the standby worked as expected.
>>>
>>>
>>> The filesystem currently contains over 5 million files, so reading all
>>> the metadata information from the data pool took too long, since the
>>> information was not available in the OSD page caches. The MDS was timed
>>> out by the mons, and a failover back to the former active MDS (which
>>> was available as a standby again) happened. This MDS in turn had to
>>> read the metadata, again running into a timeout, failover, etc. I
>>> resolved the situation by disabling one of the MDS daemons, which kept
>>> the mons from failing the only remaining MDS.
>>
>> The MDS does not re-read every inode on startup -- rather, it replays
>> its journal (the overall number of files in your system does not
>> factor into this).
>>
>>> So given a large filesystem, how do I prevent failover flapping between
>>> MDS instances that are in the rejoin state and reading the inode
>>> information?
>>
>> The monitor's decision to fail an unresponsive MDS is based on the MDS
>> not sending a beacon to the mon -- there is no limit on how long an
>> MDS is allowed to stay in a given state (such as rejoin).
>>
>> So there are two things to investigate here:
>>   * Why is the MDS taking so long to start?
>>   * Why is the MDS failing to send beacons to the monitor while it is
>> in whatever process that is taking it so long?
>
>
> Under normal operation our system has about 4.5-4.9 million active caps.
> Most of them (~4 million) are associated with the machine running the
> nightly backups.
>
> I assume that during the rejoin phase, the MDS is renewing the clients'
> caps. We see a massive amount of small I/O on the data pool (up to
> 30,000-40,000 IOPS) during the rejoin phase. Does the MDS need to access
> the inode information to renew a cap? This would explain the high number
> of IOPS and why the rejoin phase can take up to 20 minutes.

Ah, I see.  You've identified the issue -- the client is informing the
MDS about which inodes it has caps on, and the MDS is responding by
loading those inodes -- in order to dereference them it goes via the
data pool to read the backtrace on each of the inode objects.
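
(If you want to see what one of those reads looks like: the backtrace is
stored in the "parent" xattr of the file's first data object, whose name
is the hex inode number plus a chunk suffix. So each per-cap lookup is
roughly equivalent to -- pool name and inode number here are placeholders:

    rados -p <data pool> getxattr 10000000001.00000000 parent

and with millions of caps those small xattr reads add up quickly.)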

This is not a great behaviour from the MDS: doing O(files with caps)
IOs, especially to the data pool, is not something we want to be doing
during failovers.

Things to try to mitigate this with the current code (a rough config
sketch follows the list):
 * Using standby-replay daemons (if you're not already), so that the
standby has a better chance of already having the inodes in cache,
avoiding the need to load them.
 * Increasing the MDS journal size ("mds log max segments") so that
the MDS will tend to keep a longer journal and have a better chance of
still having the inodes in the journal when the failover happens.
 * Decreasing "mds cache size" to limit the number of caps that can be
out there at any one time.
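
As a very rough sketch of those knobs in ceph.conf (values are
illustrative, not recommendations, and defaults are quoted from memory):

    [mds]
    # keep a longer journal; the default is 30 segments
    mds log max segments = 120
    # bound the cache, and with it the number of caps clients can hold;
    # the default is 100000 inodes
    mds cache size = 500000

with standby-replay enabled per daemon via "mds standby replay = true"
in the standby's own [mds.<name>] section.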

I'll respond separately to ceph-devel about how we might change the
code to improve this case.

John



>
> Not sure about the second question, since the IOPS should not prevent
> beacons from reaching the monitors. We will have to move the MDS servers
> to different racks during this week; I'll try to bump up the debug level
> beforehand.
>
>
> Regards,
> Burkhard


Re: [ceph-users] MDS flapping: how to increase MDS timeouts?

2017-01-29 Thread Burkhard Linke

Hi,


On 01/26/2017 03:34 PM, John Spray wrote:

On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
 wrote:

Hi,


we are running two MDS servers in an active/standby-replay setup. Recently
we had to disconnect the active MDS server, and failover to the standby
worked as expected.


The filesystem currently contains over 5 million files, so reading all the
metadata information from the data pool took too long, since the information
was not available in the OSD page caches. The MDS was timed out by the mons,
and a failover back to the former active MDS (which was available as a
standby again) happened. This MDS in turn had to read the metadata, again
running into a timeout, failover, etc. I resolved the situation by disabling
one of the MDS daemons, which kept the mons from failing the only remaining
MDS.

The MDS does not re-read every inode on startup -- rather, it replays
its journal (the overall number of files in your system does not
factor into this).


So given a large filesystem, how do I prevent failover flapping between MDS
instances that are in the rejoin state and reading the inode information?

The monitor's decision to fail an unresponsive MDS is based on the MDS
not sending a beacon to the mon -- there is no limit on how long an
MDS is allowed to stay in a given state (such as rejoin).

So there are two things to investigate here:
  * Why is the MDS taking so long to start?
  * Why is the MDS failing to send beacons to the monitor while it is
in whatever process that is taking it so long?


Under normal operation our system has about 4.5-4.9 million active caps.
Most of them (~4 million) are associated with the machine running the
nightly backups.
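
(For what it's worth, we get the per-client cap counts from the MDS admin
socket -- the daemon name is a placeholder:

    ceph daemon mds.<name> session ls

which lists each client session along with its cap count.)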


I assume that during the rejoin phase, the MDS is renewing the clients'
caps. We see a massive amount of small I/O on the data pool (up to
30,000-40,000 IOPS) during the rejoin phase. Does the MDS need to access
the inode information to renew a cap? This would explain the high number
of IOPS and why the rejoin phase can take up to 20 minutes.


Not sure about the second question, since the IOPS should not prevent
beacons from reaching the monitors. We will have to move the MDS servers
to different racks during this week; I'll try to bump up the debug level
beforehand.


Regards,
Burkhard


Re: [ceph-users] MDS flapping: how to increase MDS timeouts?

2017-01-26 Thread John Spray
On Thu, Jan 26, 2017 at 8:18 AM, Burkhard Linke
 wrote:
> Hi,
>
>
> we are running two MDS servers in an active/standby-replay setup. Recently
> we had to disconnect the active MDS server, and failover to the standby
> worked as expected.
>
>
> The filesystem currently contains over 5 million files, so reading all the
> metadata information from the data pool took too long, since the information
> was not available in the OSD page caches. The MDS was timed out by the mons,
> and a failover back to the former active MDS (which was available as a
> standby again) happened. This MDS in turn had to read the metadata, again
> running into a timeout, failover, etc. I resolved the situation by disabling
> one of the MDS daemons, which kept the mons from failing the only remaining
> MDS.

The MDS does not re-read every inode on startup -- rather, it replays
its journal (the overall number of files in your system does not
factor into this).

> So given a large filesystem, how do I prevent failover flapping between MDS
> instances that are in the rejoin state and reading the inode information?

The monitor's decision to fail an unresponsive MDS is based on the MDS
not sending a beacon to the mon -- there is no limit on how long an
MDS is allowed to stay in a given state (such as rejoin).
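
(If you do want to loosen that timeout rather than fix the root cause,
the relevant knobs should be "mds beacon interval" on the MDS side and
"mds beacon grace" on the side doing the failing -- the values below are
illustrative only:

    [global]
    # how often the MDS sends a beacon (default 4 seconds)
    mds beacon interval = 4
    # how long missing beacons are tolerated before the MDS is
    # marked failed (default 15 seconds)
    mds beacon grace = 60

but raising the grace period only papers over the symptom.)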

So there are two things to investigate here:
 * Why is the MDS taking so long to start?
 * Why is the MDS failing to send beacons to the monitor while it is
in whatever process that is taking it so long?

The answer to both is likely to be found in an MDS log with the debug
level turned up, gathered as it starts up.
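
Something like the following in ceph.conf on the MDS hosts before the
next restart would do -- the levels are illustrative, and remember to
revert afterwards since this is very verbose:

    [mds]
    debug mds = 10
    debug ms = 1

The resulting log ends up in /var/log/ceph/ceph-mds.<name>.log by
default, with <name> being the daemon's id.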

John


>
> Regards,
> Burkhard


[ceph-users] MDS flapping: how to increase MDS timeouts?

2017-01-26 Thread Burkhard Linke

Hi,


we are running two MDS servers in an active/standby-replay setup. Recently
we had to disconnect the active MDS server, and failover to the standby
worked as expected.



The filesystem currently contains over 5 million files, so reading all
the metadata information from the data pool took too long, since the
information was not available in the OSD page caches. The MDS was timed
out by the mons, and a failover back to the former active MDS (which was
available as a standby again) happened. This MDS in turn had to read the
metadata, again running into a timeout, failover, etc. I resolved the
situation by disabling one of the MDS daemons, which kept the mons from
failing the only remaining MDS.



So given a large filesystem, how do I prevent failover flapping between 
MDS instances that are in the rejoin state and reading the inode 
information?


Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com