Re: [ceph-users] CephFS very unstable with many small files

Stijn De Weirdt Sun, 25 Feb 2018 23:21:07 -0800

hi,

can you give some more details on the setup? number and size of osds.
are you using EC or not? and if so, what EC parameters?


thanks,

stijn

On 02/26/2018 08:15 AM, Linh Vu wrote:
> Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the 
> OSD nodes have 128GB each. Networking is 2x25GbE.
> 
> 
> We are on luminous 12.2.1, bluestore, and use CephFS for HPC, with about 
> 500-ish compute nodes. We have done stress testing with small files up to 2M 
> per directory as part of our acceptance testing, and encountered no problem.
> 
> ________________________________
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Oliver 
> Freyermuth <freyerm...@physik.uni-bonn.de>
> Sent: Monday, 26 February 2018 3:45:59 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] CephFS very unstable with many small files
> 
> Dear Cephalopodians,
> 
> in preparation for production, we have run very successful tests with large 
> sequential data,
> and just now a stress-test creating many small files on CephFS.
> 
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 
> hosts with 32 OSDs each, running in EC k=4 m=2.
> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 
> 12.2.3.
> There are (at the moment) only two MDSs: one is active, the other standby.
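
For reference, a small sketch of what the pool layout above implies. It uses only the numbers from the quoted mail; the variable names are illustrative, and the remark about one chunk per host assumes a host-level failure domain, which the mail does not state.

```python
# Pool layout arithmetic from the figures in the quoted mail.
k, m = 4, 2                  # EC data and coding chunks (data pool)
hosts = 6
osds_per_host = 32
metadata_replicas = 4        # replicated metadata pool

total_osds = hosts * osds_per_host
chunks_per_object = k + m
raw_per_usable_ec = (k + m) / k         # raw bytes stored per usable byte
usable_fraction_ec = k / (k + m)

print(f"data-pool OSDs: {total_osds}")
print(f"EC raw/usable ratio: {raw_per_usable_ec:.2f}")
print(f"EC usable fraction: {usable_fraction_ec:.1%}")
print(f"metadata raw/usable ratio: {metadata_replicas:.2f}")

# ASSUMPTION: failure domain = host. Then every object needs k+m distinct hosts.
if chunks_per_object == hosts:
    print("k+m equals the host count: no spare host to re-place a chunk on")
```

With k+m equal to the number of hosts, a host that is down or being reinstalled leaves its chunks degraded until it returns, again under the host failure domain assumption.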
> 
> For the test, we had 1120 client processes on 40 client machines (all 
> cephfs-fuse!) extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a 
> separate subdirectory.
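
A back-of-envelope sketch of the scale of this test, using only the numbers quoted above. Treating "150k" as exactly 150,000 is an assumption, and directories inside the tarball are not counted.

```python
# Rough size of the stress test described in the quoted mail.
client_processes = 1120
client_machines = 40
files_per_tarball = 150_000   # ASSUMPTION: "150k" taken as exactly 150,000

processes_per_machine = client_processes // client_machines
total_files = client_processes * files_per_tarball

print(f"tar processes per client machine: {processes_per_machine}")
print(f"files created if every extraction finishes: {total_files:,}")
```

Every one of those files is an inode the MDS has to track at some point, which gives a feel for the metadata load compared with the large sequential tests.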
> 
> Things started out rather well (though, as expected, slowly). We had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
> 
> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for 
> metadata) and so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
> 
> After a while, we tested MDS failover and realized we entered a flip-flop 
> situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
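
The MDS settings mentioned so far can be collected into one config fragment. The sketch below only renders them as text with Python's `configparser`. Placing them under `[global]` and reading "4 GB" as 4 GiB are assumptions, not something stated in the mail.

```python
import configparser
import io

GIB = 1024 ** 3

# Option names and values as quoted in the mail.
settings = {
    "mds_log_max_segments": 240,
    "mds_log_max_expiring": 160,
    "mds_cache_memory_limit": 4 * GIB,   # bytes; ASSUMPTION: "4 GB" means 4 GiB
    "mds_beacon_grace": 240,             # seconds
}

config = configparser.ConfigParser()
# ASSUMPTION: the section is chosen for illustration only.
config["global"] = {name: str(value) for name, value in settings.items()}

buffer = io.StringIO()
config.write(buffer)
rendered = buffer.getvalue()
print(rendered)
```

The output is a fragment to review by hand, not something to apply blindly to a running cluster.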
> 
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, 
> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
> but rejoin took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
> 
> I stopped half of the stress-test tar processes, which did not help - then I rebooted 
> half of the clients, which did help and let the MDS recover just fine.
> So it seems the client caps have been too many for the MDS to handle. I'm 
> unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
> Now, I only lost some "stress test data", but later, it might be users' 
> data...
> 
> 
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, <24 hours later, before backfill has 
> finished, several OSD hosts enter OOM condition.
> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
> default bluestore cache size of 1 GB. However, it seems the processes are 
> using much more,
> up to several GBs until memory is exhausted. They then become sluggish, are 
> kicked out of the cluster, come back, and finally at some point they are 
> OOMed.
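
The memory budget described above can be checked with simple arithmetic. This sketch uses the quoted figures (64 GB RAM, 32 OSDs, 1 GB bluestore cache), treats GB as GiB, and ignores the OS and page cache, so it is optimistic. The per-OSD usage levels in the loop are hypothetical.

```python
GIB = 1024 ** 3

host_ram = 64 * GIB
osds_per_host = 32
bluestore_cache = 1 * GIB

ram_per_osd = host_ram // osds_per_host
headroom_per_osd = ram_per_osd - bluestore_cache


def fits(per_osd_gib: float) -> bool:
    """True if all OSDs at this resident size fit in host RAM (OS ignored)."""
    return per_osd_gib * GIB * osds_per_host <= host_ram


print(f"RAM per OSD: {ram_per_osd / GIB:.1f} GiB")
print(f"headroom beyond the cache: {headroom_per_osd / GIB:.1f} GiB")
for usage in (1.5, 2, 3, 4):   # hypothetical per-OSD usage in GiB
    print(f"{usage} GiB per OSD -> fits: {fits(usage)}")
```

So the budget only works while each OSD stays at roughly cache size plus 1 GiB. "Several GBs" per process, as reported during backfill, exceeds the host's RAM well before all 32 OSDs reach that level.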
> 
> Now, I have restarted some OSD processes and hosts which helped to reduce the 
> memory usage - but now I have some OSDs crashing continuously,
> leading to PG unavailability, and preventing recovery from completion.
> I have filed a ticket about that, with stacktrace and log:
> http://tracker.ceph.com/issues/23120
> This might well be a consequence of a previous OOM killer condition.
> 
> However, my final question after these ugly experiences is:
> Has anybody ever stress-tested CephFS with many small files?
> Are those issues known? Can special configuration help?
> Are the memory issues known? Are there solutions?
> 
> We don't plan to use Ceph for many small files, but we don't have full 
> control of our users, which is why we wanted to test this "worst case" 
> scenario.
> It would be really bad if we lost a production filesystem due to such a 
> situation, so the plan was to test now to know what happens before we enter 
> production.
> As of now, this looks really bad, and I'm not sure the cluster will ever 
> recover.
> I'll give it some more time, but we'll likely kill off all remaining clients 
> next week and see what happens, and worst case recreate the Ceph cluster.
> 
> Cheers,
>         Oliver
> 
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
