I'm talking about BlueStore DB+WAL caching. It's good to know that cache tiering is deprecated now; I should check why.
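To be concrete about the DB+WAL layout I have in mind, here is a minimal sketch of creating a BlueStore OSD with the data on a SATA SSD and block.db/block.wal on partitions of a PCIe NVMe. The device paths are placeholders for illustration, not my actual devices:

    # assumption: /dev/sdb is a SATA SSD, /dev/nvme0n1p1 and p2 are partitions on a PCIe NVMe
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1 \
        --block.wal /dev/nvme0n1p2

If only --block.db is given, the WAL is placed on the DB device automatically, so a separate WAL partition is optional.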
Adding new drives is not possible because I don't have enough free slots in the servers. I'm considering buying an NVMe in PCIe card form instead. Right now I'm trying to speed up the rep-2 pool for millions of small files in the 10 KB to 700 KB range. With compression, the write speed is reduced by 5% but the delete speed is increased by 30%. Do you have any tuning advice for me? (A sketch of the pool-level compression settings I mean follows the quoted thread below.)

Best regards,


On Tue, 9 May 2023 at 11:02, Frank Schilder <fr...@dtu.dk> wrote:
>
> When you say cache device, do you mean a ceph cache pool as a tier to a rep-2 pool? If so, you might want to reconsider, cache pools are deprecated and will be removed from ceph at some point.
>
> If you have funds to buy new drives, you can just as well deploy a beegfs (or something else) on these. It is no problem to run ceph and beegfs on the same hosts. The disks should not be shared, but that's all. This might still be a simpler config than introducing a cache tier just to cover up for rep-2 overhead.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: mhnx <morphinwith...@gmail.com>
> Sent: Friday, May 5, 2023 9:26 PM
> To: Frank Schilder
> Cc: Janne Johansson; Ceph Users
> Subject: Re: [ceph-users] Re: How can I use not-replicated pool (replication 1 or raid-0)
>
> Hello Frank.
>
> > If your only tool is a hammer ...
> > Sometimes it's worth looking around.
>
> You are absolutely right! But I have limitations because my customer is a startup and they want to build a hybrid system for all their needs with the current hardware. That's why I'm spending time looking for a workaround. They are using CephFS in their software, and I moved them onto this path from NFS. At the beginning they were only looking for a rep-2 pool for their important data, and Ceph was an absolutely great idea. Now the system is running smoothly, but they also want to move the [garbage data] onto the same system, and as I told you, the data flow is different and the current hardware (non-PLP SATA SSDs without a BlueStore cache device) cannot supply the required speed with replication 2. They are happy with the replication-1 speed, but I'm not, because when any network, disk or node goes down, the cluster will be suspended due to rep-1.
>
> Now I have advised at least adding low-latency PCIe NVMes as a cache device to make the rep-2 pool workable. I will solve the write latency with low-latency PLP NVMes, but I still need to solve the deletion speed too. Actually, with the random write/delete test I was trying to show the difference in delete speed. You are right, /dev/random requires CPU power, it will create latency, and it should not be used for write-speed tests.
>
> Currently I'm working on an automation script to fix any problem on the replication-1 pool.
> It is what it is.
>
> Best regards.
>
>
> On Wed, 3 May 2023 at 11:50, Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi mhnx.
> >
> > > I also agree with you, Ceph is not designed for this kind of use case
> > > but I tried to continue with what I know.
> > If your only tool is a hammer ...
> > Sometimes it's worth looking around.
> >
> > While your tests show that a rep-1 pool is faster than a rep-2 pool, the values are not exactly impressive. There are 2 things that are relevant here: ceph is a high-latency system, its software stack is quite heavy-weight. Even for a rep-1 pool it's doing a lot to ensure data integrity.
> > BeeGFS is a lightweight low-latency system that skips a lot of this magic, which makes it very suited to performance-critical tasks but less so for long-term archival applications.
> >
> > The second is that the device /dev/urandom is actually very slow (and even unpredictable on some systems, it might wait for more entropy to be created). Your times are almost certainly affected by that. If you want comparable and close-to-native storage performance, create the files you want to write to storage first in RAM and then copy them from RAM to storage. Using random data is a good idea to bypass potential built-in accelerations for special data, like all-zeros. However, exclude the random number generator from the benchmark and generate the data first, before timing its use.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: mhnx <morphinwith...@gmail.com>
> > Sent: Tuesday, May 2, 2023 5:25 PM
> > To: Frank Schilder
> > Cc: Janne Johansson; Ceph Users
> > Subject: Re: [ceph-users] Re: How can I use not-replicated pool (replication 1 or raid-0)
> >
> > Thank you for the explanation, Frank.
> >
> > I also agree with you, Ceph is not designed for this kind of use case but I tried to continue with what I know.
> > My idea was exactly what you described: I was trying to automate cleaning or recreating on any failure.
> >
> > As you can see below, the rep-1 pool is very fast:
> > - Create: time for i in {00001..99999}; do head -c 1K </dev/urandom >randfile$i; done
> >   replication 2 : 31m59.917s
> >   replication 1 : 7m6.046s
> > --------------------------------
> > - Delete: time rm -rf testdir/
> >   replication 2 : 11m56.994s
> >   replication 1 : 0m40.756s
> > -------------------------------------
> >
> > I started learning DRBD, and I will also check BeeGFS; thanks for the advice.
> >
> > Regards.
> >
> > On Mon, 1 May 2023 at 10:27, Frank Schilder <fr...@dtu.dk> wrote:
> > >
> > > I think you misunderstood Janne's reply. The main statement is at the end: ceph is not designed for an "I don't care about data" use case. If you need speed for temporary data where you can sustain data loss, go for something simpler. For example, we use beegfs with great success as a burst buffer for an HPC cluster. It is very lightweight and will pull out all the performance your drives can offer. In case of disaster it is easily possible to clean up. Beegfs does not care about lost data; such data will simply become inaccessible while everything else just moves on. It will not try to self-heal either. It doesn't even scrub data, so there is no competition between user and admin IO.
> > >
> > > It's pretty much your use case. We clean it up every 6-8 weeks, and if something breaks we just redeploy the whole thing from scratch. Performance is great and it's a very simple and economical system to administrate. No need for the whole ceph daemon engine with large RAM requirements and extra admin daemons.
> > >
> > > Use ceph for data you want to survive a nuclear blast. Don't use it for things it's not made for and then complain.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: mhnx <morphinwith...@gmail.com>
> > > Sent: Saturday, April 29, 2023 5:48 AM
> > > To: Janne Johansson
> > > Cc: Ceph Users
> > > Subject: [ceph-users] Re: How can I use not-replicated pool (replication 1 or raid-0)
> > >
> > > Hello Janne, thank you for your response.
> > >
> > > I understand your advice, and be sure that I've designed too many EC pools and I know the mess. This is not an option because I need SPEED.
> > >
> > > Please let me describe my hardware first, so we share the same picture.
> > > Server: R620
> > > CPU: 2 x Xeon E5-2630 v2 @ 2.60GHz
> > > RAM: 128GB - DDR3
> > > Disk1: 20x Samsung SSD 860 2TB
> > > Disk2: 10x Samsung SSD 870 2TB
> > >
> > > My SSDs do not have PLP. Because of that, every ceph write also waits for TRIM. I want to know how much latency we are talking about, because I'm thinking of adding a PLP NVMe for the WAL+DB cache to gain some speed.
> > > As you can see, I'm even trying to gain something from every TRIM command.
> > > Currently I'm testing a replication-2 pool, and even this speed is not enough for my use case.
> > > Now I'm trying to boost the deletion speed, because I'm writing and deleting files all the time and this never ends.
> > > I wrote this mail because replication 1 will cut the deletion time, but I'm still trying to tune some MDS+OSD parameters to increase delete speed.
> > >
> > > Any help and ideas will be great for me. Thanks.
> > > Regards.
> > >
> > >
> > > On Wed, 12 Apr 2023 at 10:10, Janne Johansson <icepic...@gmail.com> wrote:
> > > >
> > > > On Mon, 10 Apr 2023 at 22:31, mhnx <morphinwith...@gmail.com> wrote:
> > > > > Hello.
> > > > > I have a 10 node cluster. I want to create a non-replicated pool (replication 1) and I want to ask some questions about it:
> > > > >
> > > > > Let me tell you my use case:
> > > > > - I don't care about losing data,
> > > > > - All of my data is JUNK and these junk files are usually between 1KB to 32MB.
> > > > > - These files will be deleted in 5 days.
> > > > > - Writable space and I/O speed are more important.
> > > > > - I have high Write/Read/Delete operations, minimum 200GB a day.
> > > >
> > > > That is "only" 18MB/s, which should easily be doable even with repl=2,3,4 or EC. This of course depends on the speed of drives, network, CPUs and all that, but in itself it doesn't seem too hard to achieve in terms of average speeds. We have EC8+3 rgw backed by some 12-14 OSD hosts with hdd and nvme (for wal+db) that can ingest over 1GB/s if you parallelize the rgw streams, so 18MB/s seems totally doable with 10 decent machines. Even with replication.
> > > >
> > > > > I'm afraid that, in any failure, I won't be able to access the whole cluster. Losing data is okay but I have to ignore missing files,
> > > >
> > > > Even with repl=1, in case of a failure, the cluster will still aim at fixing itself rather than ignoring currently lost data and moving on, so any solution that involves "forgetting" about lost data would need a ceph operator telling the cluster to ignore all the missing parts and to recreate the broken PGs. This would not be automatic.
> > > >
> > > >
> > > > --
> > > > May the most significant bit of your life be positive.
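To make the compression numbers at the top concrete, this is a minimal sketch of the pool-level BlueStore compression settings I mean. The pool name cephfs_data and the threshold values are examples only, not the exact production settings:

    # assumptions: pool name "cephfs_data"; lz4 and the thresholds are example values
    ceph osd pool set cephfs_data compression_mode aggressive
    ceph osd pool set cephfs_data compression_algorithm lz4
    ceph osd pool set cephfs_data compression_required_ratio 0.875
    ceph osd pool set cephfs_data compression_min_blob_size 8192
    # verify
    ceph osd pool get cephfs_data compression_mode

compression_mode aggressive compresses all writes unless the client hints that data is incompressible; the required ratio and blob-size thresholds control which writes actually end up stored compressed.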
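Following Frank's suggestion above about keeping /dev/urandom out of the timing, the small-file test could be split like this; a sketch, assuming /dev/shm is a tmpfs and testdir/ sits on the CephFS pool being tested:

    # generate the random payload in RAM first (not timed)
    mkdir -p /dev/shm/seed testdir
    for i in {00001..99999}; do head -c 1K </dev/urandom >/dev/shm/seed/randfile$i; done
    # time only the copy to CephFS and the delete
    time cp -a /dev/shm/seed/. testdir/
    time rm -rf testdir/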
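On Janne's point that, with repl=1, forgetting about lost data needs an operator rather than happening automatically, the manual step would look roughly like the following. The OSD id and PG id are placeholders, and every command here is destructive, so this is exactly what my automation script would have to wrap:

    # placeholders: OSD 12 is the dead OSD, 5.1f is a PG whose only copy was on it
    ceph osd lost 12 --yes-i-really-mean-it                  # give up on the dead OSD's data
    ceph osd force-create-pg 5.1f --yes-i-really-mean-it     # recreate the broken PG empty
    ceph pg 5.1f mark_unfound_lost delete                    # drop any remaining unfound objects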
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io