Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-10-01 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Hi List,
> 
> We are planning to move a filesystem workload (currently NFS) to CephFS.
> It's around 29 TB. The unusual thing here is the number of directories
> used to host the files. To combat a "too many files in one directory"
> scenario, a "let's make use of recursive directories" approach was taken.
> Not ideal either. This workload is supposed to be moved to (Ceph) S3
> sometime in the future, but until then, it has to go to a shared
> filesystem ...
> 
> So what is unusual about this? The directory layout looks like this:
> 
> /data/files/00/00/[0-8][0-9]/[0-9]/; from this point on, another 7 nested
> directories are created to store a single file.
> 
> The total number of directories in a file path is 14. There are around
> 150 M files in 400 M directories.
> 
> The working set won't be big. Most files will just sit around and will
> not be touched. The number of active files will be a few thousand.
> 
> We are wondering if this kind of directory structure is suitable for
> CephFS. Might the MDS have difficulties keeping up with that many
> inodes / dentries, or doesn't it care at all?
> 
> The metadata overhead might be horrible, but we will test that out.

This awkward dataset is "live" ... and the MDS has been happily
crunching away so far, peaking at 42.5 M caps. Multiple parallel rsyncs
(20+) to fill CephFS were no issue whatsoever.
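
For anyone who wants to watch the cap count themselves, a rough sketch of
how it could be summed over all client sessions via the MDS admin socket.
It assumes the ceph CLI is available on the MDS host and that "session ls"
exposes a num_caps field per entry; "mds.a" is just a placeholder:

#!/usr/bin/env python3
# Rough sketch: sum the caps held by all CephFS client sessions of one MDS.
# Run on the MDS host; replace "mds.a" with the actual daemon name.
# Assumes "session ls" returns a JSON list with a "num_caps" field per entry.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "mds.a", "session", "ls"])
sessions = json.loads(out)

total_caps = sum(s.get("num_caps", 0) for s in sessions)
print(f"{len(sessions)} client sessions, {total_caps} caps in total")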

Thanks Nathan Fish and Burkhard Linke for sharing helpful MDS insight!

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
Yes, definitely enable standby-replay. I saw sub-second failovers with
standby-replay, but when I restarted the new rank 0 (previously the
standby-replay daemon, 0-s) while the new standby was still syncing up to
become 0-s, the failover took several minutes. This was with ~30 GiB of
cache.

On Fri, Jul 26, 2019 at 12:41 PM Burkhard Linke wrote:
>
> Hi,
>
>
> One particularly interesting point in setups with a large number of
> active files/caps is the failover.
>
>
> If your MDS fails (assuming a single active MDS; setups with multiple
> active ranks behave in the same way for _each_ rank), the monitors will
> detect the failure and update the MDS map. CephFS clients will be
> notified about the update and connect to the new MDS the rank has failed
> over to (hopefully within the connect timeout...). They will also
> re-request all their currently active caps from the new MDS, allowing it
> to recreate the state from just before the failure.
>
>
> And this is where things can get "interesting". Assuming a cold standby
> MDS, the MDS will receive the information about all active files and
> capabilities assigned to the various clients. It also has to _stat_ all
> these files during the rejoin phase. And if millions of files have to be
> stat'ed, this may take time, put a lot of pressure on the metadata and
> data pools, and may even lead to timeouts and a subsequent failure or
> failover to another MDS.
>
>
> We had some problems with this in the past, but it became better and
> less failure-prone with every Ceph release (great work, Ceph
> developers!). Our current setup has up to 15 million cached inodes and
> several million caps in the worst case (during the nightly backup). The
> per-client caps limit (introduced in Luminous or Nautilus, I don't
> remember exactly) helps a lot with reducing the number of active files
> and caps.
>
> Prior to Nautilus we configured a secondary MDS as standby-replay, which
> allows it to cache the same inodes that are active on the primary.
> During rejoin the stat calls can then be served from cache, which makes
> the failover a lot faster and less demanding for the Ceph cluster
> itself. In Nautilus the standby-replay setup has moved from a daemon
> feature to a filesystem feature (one spare MDS becomes the designated
> standby-replay daemon for a rank). But there are also other caveats,
> e.g. such a daemon will not be selected as failover for another rank.
>
>
> So if you want to test CephFS for your use case, I would highly
> recommend testing failover too, both a controlled failover and an
> unexpected one. You may also want to use multiple active MDS daemons,
> but my experience with such setups is limited.
>
>
> Regards,
>
> Burkhard
>
>


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Burkhard Linke

Hi,


One particularly interesting point in setups with a large number of active
files/caps is the failover.



If your MDS fails (assuming a single active MDS; setups with multiple
active ranks behave in the same way for _each_ rank), the monitors will
detect the failure and update the MDS map. CephFS clients will be notified
about the update and connect to the new MDS the rank has failed over to
(hopefully within the connect timeout...). They will also re-request all
their currently active caps from the new MDS, allowing it to recreate the
state from just before the failure.



And this is where things can get "interesting". Assuming a cold standby
MDS, the MDS will receive the information about all active files and
capabilities assigned to the various clients. It also has to _stat_ all
these files during the rejoin phase. And if millions of files have to be
stat'ed, this may take time, put a lot of pressure on the metadata and
data pools, and may even lead to timeouts and a subsequent failure or
failover to another MDS.
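
If you want to see how long a failover spends in the replay and rejoin
phases, polling the MDS state once per second is enough. A rough sketch,
assuming the ceph CLI works from the host you run it on:

#!/usr/bin/env python3
# Rough sketch: print the MDS state once per second so the time spent in
# the replay/rejoin phases of a failover becomes visible.
import subprocess
import time

while True:
    # Example output: "cephfs:1 {0=b=up:rejoin} 1 up:standby"
    state = subprocess.check_output(["ceph", "mds", "stat"]).decode().strip()
    print(time.strftime("%H:%M:%S"), state)
    time.sleep(1)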



We had some problems with this in the past, but it became better and
less failure-prone with every Ceph release (great work, Ceph developers!).
Our current setup has up to 15 million cached inodes and several million
caps in the worst case (during the nightly backup). The per-client caps
limit (introduced in Luminous or Nautilus, I don't remember exactly) helps
a lot with reducing the number of active files and caps.
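
For reference, a rough sketch of how that limit could be lowered at
runtime via the central config; the option name mds_max_caps_per_client
and the value below are assumptions to double-check against your release:

#!/usr/bin/env python3
# Sketch only: lower the per-client caps limit at runtime via the central
# config store (Mimic and later). Option name, default and behaviour
# should be verified against the release you actually run.
import subprocess

subprocess.run(
    ["ceph", "config", "set", "mds", "mds_max_caps_per_client", "500000"],
    check=True)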


Prior to Nautilus we configured a secondary MDS as standby-replay, which
allows it to cache the same inodes that are active on the primary. During
rejoin the stat calls can then be served from cache, which makes the
failover a lot faster and less demanding for the Ceph cluster itself. In
Nautilus the standby-replay setup has moved from a daemon feature to a
filesystem feature (one spare MDS becomes the designated standby-replay
daemon for a rank). But there are also other caveats, e.g. such a daemon
will not be selected as failover for another rank.
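
To illustrate the difference (please check the documentation for your
exact release): before Nautilus standby-replay was a daemon option in
ceph.conf, from Nautilus on it is a flag on the filesystem. A sketch:

#!/usr/bin/env python3
# Sketch only: enable standby-replay as a filesystem property (Nautilus
# and later). On older releases this was a daemon option instead, e.g.
# "mds standby replay = true" in the [mds.<name>] section of ceph.conf.
import subprocess

fs_name = "cephfs"  # placeholder, use your filesystem's name
subprocess.run(
    ["ceph", "fs", "set", fs_name, "allow_standby_replay", "true"],
    check=True)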



So if you want to test CephFS for your use case, I would highly recommend
testing failover too, both a controlled failover and an unexpected one.
You may also want to use multiple active MDS daemons, but my experience
with such setups is limited.



Regards,

Burkhard




Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
Ok, great. Some numbers for you:
I have a filesystem of 50 million files, 5.4 TB.
The data pool is on HDD OSDs with Optane DB/WAL, size=3.
The metadata pool (Optane OSDs) has 17 GiB "stored" and 20 GiB "used" at
size=3, with 5.18 M objects.
When doing parallel rsyncs, with ~14 M inodes open, the MDS cache grows to
about 40 GiB but remains stable. MDS CPU usage goes to about 400%
(4 cores' worth, spread across 6-8 processes). Hope you find this useful.
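
Back-of-envelope, those figures work out to roughly the following per-item
costs (nothing more than arithmetic on the numbers above):

# Rough arithmetic on the figures above, nothing authoritative.
files = 50e6                 # files in the filesystem
meta_stored = 17 * 2**30     # metadata pool "stored", in bytes
open_inodes = 14e6           # inodes open during the parallel rsyncs
mds_cache = 40 * 2**30       # MDS cache size at that point, in bytes

print(f"~{meta_stored / files:.0f} bytes of metadata stored per file")
print(f"~{mds_cache / open_inodes / 1024:.1f} KiB of MDS cache per open inode")
# Prints roughly 365 bytes per file and 3.0 KiB per open inode.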

On Fri, Jul 26, 2019 at 11:05 AM Stefan Kooman  wrote:
>
> Quoting Nathan Fish (lordci...@gmail.com):
> > MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
> > proportional to the # of files (including directories) in the working set.
> > Metadata pool size is proportional to the total # of files, plus
> > everything in the RAM cache. I have seen the metadata pool balloon 8x
> > between being idle and having every inode open by a client.
> > The main thing I'd recommend is getting dedicated SSD OSDs for the
> > metadata pools, and SSDs for the HDD OSDs' DB/WAL. NVMe if you can. If
> > you put that much metadata on only HDDs, it's going to be slow.
>
> Only SSDs for the OSD data pool and NVMe for the metadata pool, so that
> should be fine. Besides the initial loading of that many files /
> directories, this workload shouldn't be a problem.
>
> Thanks for your feedback.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
Quoting Nathan Fish (lordci...@gmail.com):
> MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
> proportional to the # of files (including directories) in the working set.
> Metadata pool size is proportional to the total # of files, plus
> everything in the RAM cache. I have seen the metadata pool balloon 8x
> between being idle and having every inode open by a client.
> The main thing I'd recommend is getting dedicated SSD OSDs for the
> metadata pools, and SSDs for the HDD OSDs' DB/WAL. NVMe if you can. If
> you put that much metadata on only HDDs, it's going to be slow.

Only SSDs for the OSD data pool and NVMe for the metadata pool, so that
should be fine. Besides the initial loading of that many files /
directories, this workload shouldn't be a problem.

Thanks for your feedback.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Nathan Fish
MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
proportional to the # of files (including directories) in the working set.
Metadata pool size is proportional to the total # of files, plus
everything in the RAM cache. I have seen the metadata pool balloon 8x
between being idle and having every inode open by a client.
The main thing I'd recommend is getting dedicated SSD OSDs for the
metadata pools, and SSDs for the HDD OSDs' DB/WAL. NVMe if you can. If
you put that much metadata on only HDDs, it's going to be slow.
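
Plugging the numbers from your original mail into those rules of thumb
gives a very rough feel for the sizes involved; the per-item constants
below are assumptions loosely based on my cluster, not authoritative:

# Toy estimator only: the per-item constants are assumptions and will vary
# with workload, release and directory layout.
BYTES_META_PER_ITEM = 400      # metadata pool "stored" per file/dir (assumed)
BYTES_CACHE_PER_INODE = 3000   # MDS cache per cached inode (assumed)
IDLE_TO_OPEN_BALLOON = 8       # worst-case pool growth with every inode open

files_and_dirs = 150e6 + 400e6   # the dataset: 150 M files, 400 M directories
working_set = 5000               # "a few thousand" active files

meta_pool_idle = files_and_dirs * BYTES_META_PER_ITEM
print(f"metadata pool, idle: ~{meta_pool_idle / 2**30:.0f} GiB "
      f"(up to {IDLE_TO_OPEN_BALLOON}x if every inode gets opened)")
print(f"MDS cache for the working set: "
      f"~{working_set * BYTES_CACHE_PER_INODE / 2**20:.1f} MiB")

With constants in that ballpark the full tree dominates the metadata pool
size, while the small active working set barely registers in the MDS cache.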



On Fri, Jul 26, 2019 at 5:11 AM Stefan Kooman  wrote:
>
> Hi List,
>
> We are planning to move a filesystem workload (currently NFS) to CephFS.
> It's around 29 TB. The unusual thing here is the number of directories
> used to host the files. To combat a "too many files in one directory"
> scenario, a "let's make use of recursive directories" approach was taken.
> Not ideal either. This workload is supposed to be moved to (Ceph) S3
> sometime in the future, but until then, it has to go to a shared
> filesystem ...
>
> So what is unusual about this? The directory layout looks like this:
>
> /data/files/00/00/[0-8][0-9]/[0-9]/; from this point on, another 7 nested
> directories are created to store a single file.
>
> The total number of directories in a file path is 14. There are around
> 150 M files in 400 M directories.
>
> The working set won't be big. Most files will just sit around and will
> not be touched. The number of active files will be a few thousand.
>
> We are wondering if this kind of directory structure is suitable for
> CephFS. Might the MDS have difficulties keeping up with that many
> inodes / dentries, or doesn't it care at all?
>
> The metadata overhead might be horrible, but we will test that out.
>
> Thanks,
>
> Stefan
>
>
> --
> | BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
Hi List,

We are planning to move a filesystem workload (currently NFS) to CephFS.
It's around 29 TB. The unusual thing here is the number of directories
used to host the files. To combat a "too many files in one directory"
scenario, a "let's make use of recursive directories" approach was taken.
Not ideal either. This workload is supposed to be moved to (Ceph) S3
sometime in the future, but until then, it has to go to a shared
filesystem ...

So what is unusual about this? The directory layout looks like this:

/data/files/00/00/[0-8][0-9]/[0-9]/; from this point on, another 7 nested
directories are created to store a single file.

The total number of directories in a file path is 14. There are around
150 M files in 400 M directories.
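
Purely to illustrate the shape of the tree, a toy mapping from a numeric
id to such a path could look like the sketch below (illustrative only,
not necessarily how the real paths are generated):

#!/usr/bin/env python3
# Illustrative only: build a deeply nested path from a numeric id, roughly
# matching the /data/files/00/00/NN/N/... layout described above. The real
# application's mapping is unknown and probably different.
import os

def path_for(file_id: int) -> str:
    digits = f"{file_id:016d}"            # zero-pad the id to 16 digits
    parts = [digits[0:2], digits[2:4],    # /data/files/<00>/<00>/
             digits[4:6], digits[6:7]]    # /<[0-8][0-9]>/<[0-9]>/
    parts += list(digits[7:14])           # 7 more single-digit directories
    return os.path.join("/data/files", *parts, digits + ".dat")

print(path_for(123456789))
# -> /data/files/00/00/00/0/1/2/3/4/5/6/7/0000000123456789.dat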

The working set won't be big. Most files will just sit around and will
not be touched. The number of active files will be a few thousand.

We are wondering if this kind of directory structure is suitable for
CephFS. Might the MDS have difficulties keeping up with that many
inodes / dentries, or doesn't it care at all?

The metadata overhead might be horrible, but we will test that out.

Thanks,

Stefan


-- 
| BIT BV  https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl