hi,

can you give some more details on the setup? How many OSDs, and of what size?
Are you using EC or not, and if so, with which EC parameters?

thanks,

stijn

On 02/26/2018 08:15 AM, Linh Vu wrote:
> Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the 
> OSD nodes have 128GB each. Networking is 2x25GbE.
> 
> 
> We are on Luminous 12.2.1 with BlueStore, and use CephFS for HPC with about 
> 500 compute nodes. We have done stress testing with small files, up to 2 million 
> per directory, as part of our acceptance testing, and encountered no problems.
> 
> ________________________________
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Oliver 
> Freyermuth <freyerm...@physik.uni-bonn.de>
> Sent: Monday, 26 February 2018 3:45:59 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] CephFS very unstable with many small files
> 
> Dear Cephalopodians,
> 
> In preparation for production, we have run very successful tests with large 
> sequential data, and just now a stress test creating many small files on CephFS.
> 
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool spanning 
> 6 hosts with 32 OSDs each, running with EC k=4 m=2.
> Compression is activated (aggressive, snappy). Everything is BlueStore on LVM, 
> Luminous 12.2.3.
> There are (at the moment) only two MDSs, one active, the other standby.
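> 
> For reference, a layout like this can be set up roughly as follows (the profile 
> and pool names as well as the PG count below are only illustrative, not the 
> actual ones used here):
> 
>   ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
>   ceph osd pool create cephfs_data 2048 2048 erasure ec-4-2
>   ceph osd pool set cephfs_data allow_ec_overwrites true
>   ceph osd pool set cephfs_data compression_mode aggressive
>   ceph osd pool set cephfs_data compression_algorithm snappy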
> 
> For the test, we had 1120 client processes on 40 client machines (all using 
> ceph-fuse!) extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a 
> separate subdirectory.
> 
> Things started out rather well (though expectedly slow); we had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
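> 
> (These adjustments would typically go into ceph.conf on the MDS hosts or be 
> injected at runtime; the snippet below is just a sketch of the corresponding 
> config section, with the cache limit expressed in bytes:)
> 
>   [mds]
>   mds_log_max_segments = 240
>   mds_log_max_expiring = 160
>   mds_cache_memory_limit = 4294967296   # 4 GB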
> 
> The MDS machine has 32 GB, but it is also running 2 OSDs (for the metadata 
> pool), so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
> 
> After a while, we tested MDS failover and found ourselves in a flip-flop 
> situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
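> 
> (mds_beacon_grace is also evaluated by the monitors when deciding whether to 
> fail over an MDS, so setting it cluster-wide is the safe option - a sketch:)
> 
>   [global]
>   mds_beacon_grace = 240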
> 
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, 
> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and was OOM-killed shortly after. Replay was reasonably 
> fast, but the rejoin phase took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
> 
> I stopped half of the stress-test tar processes, which did not help - then I 
> rebooted half of the clients, which did help and let the MDS recover just fine.
> So it seems the clients held too many caps for the MDS to handle. I'm 
> unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
> This time, I only lost some "stress test data", but later it might be users' 
> data...
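> 
> (One way to check how many caps each client session holds is to query the 
> active MDS via its admin socket; the daemon name below is a placeholder:)
> 
>   ceph daemon mds.<name> session ls    # each session entry reports "num_caps"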
> 
> 
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, less than 24 hours later and before the 
> backfill has finished, several OSD hosts have entered an OOM condition.
> Our OSD hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
> default BlueStore cache size of 1 GB. However, the processes seem to use much 
> more, up to several GB each, until memory is exhausted. They then become 
> sluggish, are kicked out of the cluster, come back, and finally at some point 
> are OOM-killed.
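> 
> (One mitigation we could try is capping the per-OSD BlueStore cache explicitly; 
> the value below is a deliberately reduced, hypothetical one in bytes, and the 
> cache is of course not the only memory consumer during recovery/backfill:)
> 
>   [osd]
>   bluestore_cache_size = 536870912   # 512 MB per OSD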
> 
> Now, I have restarted some OSD processes and hosts, which helped to reduce the 
> memory usage - but I now have some OSDs crashing continuously,
> leading to PG unavailability and preventing recovery from completing.
> I have reported a ticket about that, with stacktrace and log:
> http://tracker.ceph.com/issues/23120
> This might well be a consequence of a previous OOM-killer condition.
> 
> However, my final questions after these ugly experiences are:
> Has anybody ever stress-tested CephFS with many small files?
> Are these issues known? Can special configuration help?
> Are the memory issues known? Are there solutions?
> 
> We don't plan to use Ceph for many small files, but we don't have full 
> control over our users, which is why we wanted to test this "worst case" 
> scenario.
> It would be really bad if we lost a production filesystem in such a 
> situation, so the plan was to test now and learn what happens before we enter 
> production.
> As of now, this looks really bad, and I'm not sure the cluster will ever 
> recover.
> I'll give it some more time, but we'll likely kill off all remaining clients 
> next week and see what happens, and in the worst case recreate the Ceph 
> cluster.
> 
> Cheers,
>         Oliver
> 
> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
