I'm talking about bluestore DB+WAL caching. It's good to know that the
cache tier is deprecated now; I should check why.

It's not possible because I don't have enough slots in the servers. I'm
considering buying NVMe drives in PCIe add-in-card form.
Now I'm trying to speed up the rep-2 pool for millions of small files
with sizes between 10K and 700K.
With compression, the write speed is reduced by 5% but the delete speed
is increased by 30%.
Do you have any tuning advice for me?
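For reference, this is roughly how the compression above is set on the
pool; the pool name and the lz4 choice are just placeholders for my setup:

  # enable bluestore compression on the rep-2 data pool
  ceph osd pool set mypool compression_algorithm lz4
  ceph osd pool set mypool compression_mode aggressive
  # check what is active
  ceph osd pool get mypool compression_mode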

Best regards,

Frank Schilder <fr...@dtu.dk> wrote on Tue, 9 May 2023 at 11:02:
>
> When you say cache device, do you mean a ceph cache pool as a tier to a rep-2 
> pool? If so, you might want to reconsider: cache pools are deprecated and 
> will be removed from ceph at some point.
>
> If you have funds to buy new drives, you can just as well deploy a beegfs (or 
> something else) on these. It is no problem to run ceph and beegfs on the same 
> hosts. The disks should not be shared, but that's all. This might still be a 
> simpler config than introducing a cache tier just to cover up for rep-2 
> overhead.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: mhnx <morphinwith...@gmail.com>
> Sent: Friday, May 5, 2023 9:26 PM
> To: Frank Schilder
> Cc: Janne Johansson; Ceph Users
> Subject: Re: [ceph-users] Re: How can I use not-replicated pool (replication 
> 1 or raid-0)
>
> Hello Frank.
>
> >If your only tool is a hammer ...
> >Sometimes it's worth looking around.
>
> You are absolutely right! But I have limitations, because my customer
> is a startup and they want to build a hybrid system for all their
> needs with the current hardware. That's why I'm spending time looking
> for a workaround. They use CephFS in their software, and I moved them
> onto this path from NFS. At the beginning they were only looking for a
> rep-2 pool for their important data, and Ceph was an absolutely great
> idea. Now the system is running smoothly, but they also want to move
> the [garbage data] onto the same system, and as I told you, the data
> flow is different and the current hardware (non-PLP SATA SSDs without
> bluestore cache) cannot deliver the required speed with replication 2.
> They are happy with the replication-1 speed, but I'm not, because when
> any network, disk, or node goes down, the cluster will be suspended
> due to rep 1.
>
> Now I have advised at least adding low-latency PCIe NVMe drives as a
> cache device so that the rep-2 pool becomes viable. I can solve the
> write latency with PLP low-latency NVMe drives, but I still need to
> solve the deletion speed too. With the random write-delete test I was
> trying to show the difference in delete speed. You are right,
> /dev/random requires CPU power, adds latency, and should not be used
> for write-speed tests.
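>
> A minimal sketch of what I have in mind, assuming ceph-volume and one
> NVMe partition per OSD; all device paths are placeholders:
>
> # data on the SATA SSD, DB (and WAL, which follows the DB device
> # unless given separately) on a partition of the PLP NVMe
> ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1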
>
> Currently I'm working on an automation script to fix any problems
> with the replication-1 pool.
> It is what it is.
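>
> Roughly, the skeleton looks like this; it is a naive sketch and the
> stuck states and commands would have to be adapted to the actual failure:
>
> #!/bin/bash
> # if PGs of the rep-1 pool get stuck after an OSD loss, give up on the
> # lost data and recreate the affected PGs
> for pg in $(ceph pg dump_stuck inactive 2>/dev/null \
>             | awk '/incomplete|unknown/ {print $1}'); do
>     ceph osd force-create-pg "$pg" --yes-i-really-mean-it
> done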
>
> Best regards.
>
>
>
>
> Frank Schilder <fr...@dtu.dk> wrote on Wed, 3 May 2023 at 11:50:
>
>
> >
> > Hi mhnx.
> >
> > > I also agree with you, Ceph is not designed for this kind of use case,
> > > but I tried to continue with what I know.
> > If your only tool is a hammer ...
> > Sometimes it's worth looking around.
> >
> > While your tests show that a rep-1 pool is faster than a rep-2 pool, the 
> > values are not exactly impressive. There are 2 things that are relevant 
> > here: ceph is a high latency system, its software stack is quite 
> > heavy-weight. Even for a rep-1 pool it's doing a lot to ensure data 
> > integrity. BeeGFS is a lightweight low-latency system skipping a lot of 
> > magic, which makes it very suited for performance critical tasks but less 
> > for long-term archival applications.
> >
> > The second is that the device /dev/urandom is actually very slow (and even 
> > unpredictable on some systems, it might wait for more entropy to be 
> > created). Your times are almost certainly affected by that. If you want to 
> > have comparable and close to native storage performance, create the files 
> > you want to write to storage first in RAM and then copy from RAM to 
> > storage. Using random data is a good idea to bypass potential built-in 
> > accelerations for special data, like all-zeros. However, exclude the random 
> > number generator from the benchmark and generate the data first before 
> > timing its use.
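> >
> > Something along these lines, as a sketch; the tmpfs path, the mount
> > point of the storage under test and the file count/size are placeholders:
> >
> > # 1) generate the random files in RAM first (not timed)
> > mkdir -p /dev/shm/bench
> > for i in {00001..99999}; do
> >     head -c 1K </dev/urandom >/dev/shm/bench/randfile$i
> > done
> > # 2) time only the copy from RAM to the storage under test
> > time cp -r /dev/shm/bench /mnt/cephfs/testdir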
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: mhnx <morphinwith...@gmail.com>
> > Sent: Tuesday, May 2, 2023 5:25 PM
> > To: Frank Schilder
> > Cc: Janne Johansson; Ceph Users
> > Subject: Re: [ceph-users] Re: How can I use not-replicated pool 
> > (replication 1 or raid-0)
> >
> > Thank you for the explanation, Frank.
> >
> > I also agree with you, Ceph is not designed for this kind of use case,
> > but I tried to continue with what I know.
> > My idea was exactly what you described: I was trying to automate
> > cleaning up or recreating on any failure.
> >
> > As you can see below, rep1 pool is very fast:
> > - Create: time for i in {00001..99999}; do
> >     head -c 1K </dev/urandom >randfile$i; done
> >   replication 2 : 31m59.917s
> >   replication 1 : 7m6.046s
> > - Delete: time rm -rf testdir/
> >   replication 2 : 11m56.994s
> >   replication 1 : 0m40.756s
> >
> > I started learning DRBD, and I will also check BeeGFS. Thanks for the advice.
> >
> > Regards.
> >
> > Frank Schilder <fr...@dtu.dk> wrote on Mon, 1 May 2023 at 10:27:
> > >
> > > I think you misunderstood Janne's reply. The main statement is at the 
> > > end, ceph is not designed for an "I don't care about data" use case. If 
> > > you need speed for temporary data where you can sustain data loss, go for 
> > > something simpler. For example, we use beegfs with great success for a 
> > > burst buffer for an HPC cluster. It is very lightweight and will pull out 
> > > all performance your drives can offer. In case of disaster it is easily 
> > > possible to clean up. Beegfs does not care about lost data, such data 
> > > will simply become inaccessible while everything else just moves on. It 
> > > will not try to self-heal either. It doesn't even scrub data, so no 
> > > competition of users with admin IO.
> > >
> > > It's pretty much your use case. We clean it up every 6-8 weeks, and if 
> > > something breaks we just redeploy the whole thing from scratch. 
> > > Performance is great and it's a very simple and economical system to 
> > > administrate. No need for the whole ceph daemon engine with large RAM 
> > > requirements and extra admin daemons.
> > >
> > > Use ceph for data you want to survive a nuclear blast. Don't use it for 
> > > things it's not made for and then complain.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: mhnx <morphinwith...@gmail.com>
> > > Sent: Saturday, April 29, 2023 5:48 AM
> > > To: Janne Johansson
> > > Cc: Ceph Users
> > > Subject: [ceph-users] Re: How can I use not-replicated pool (replication 
> > > 1 or raid-0)
> > >
> > > Hello Janne, thank you for your response.
> > >
> > > I understand your advice, and rest assured that I've designed many EC
> > > pools and I know the mess. This is not an option because I need SPEED.
> > >
> > > Please let me describe my hardware first, so we share the same picture.
> > > Server: R620
> > > Cpu: 2 x Xeon E5-2630 v2 @ 2.60GHz
> > > Ram: 128GB - DDR3
> > > Disk1: 20x Samsung SSD 860 2TB
> > > Disk2: 10x Samsung SSD 870 2TB
> > >
> > > My SSDs do not have PLP. Because of that, every ceph write also
> > > waits for TRIM. I want to know how much latency we are talking about,
> > > because I'm thinking of adding a PLP NVMe for WAL+DB cache to gain
> > > some speed.
> > > As you can see, I even try to gain something from every TRIM command.
> > > Currently I'm testing a replication-2 pool, and even this speed is
> > > not enough for my use case.
> > > Now I'm trying to boost the deletion speed, because I'm writing and
> > > deleting files all the time and this never ends.
> > > I write this mail because replication 1 will decrease the deletion
> > > time, but I'm still trying to tune some MDS+OSD parameters to
> > > increase the delete speed.
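> > >
> > > The knobs I'm looking at first are the MDS purge-queue throttles, for
> > > example (the values are placeholders, not recommendations):
> > >
> > > ceph config set mds mds_max_purge_files 256
> > > ceph config set mds mds_max_purge_ops 16384
> > > ceph config set mds mds_max_purge_ops_per_pg 1.0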
> > >
> > > Any help and ideas would be great. Thanks.
> > > Regards.
> > >
> > >
> > >
> > > Janne Johansson <icepic...@gmail.com> wrote on Wed, 12 Apr 2023 at 10:10:
> > > >
> > > > On Mon, 10 Apr 2023 at 22:31, mhnx <morphinwith...@gmail.com> wrote:
> > > > > Hello.
> > > > > I have a 10 node cluster. I want to create a non-replicated pool
> > > > > (replication 1) and I want to ask some questions about it:
> > > > >
> > > > > Let me tell you my use case:
> > > > > - I don't care about losing data,
> > > > > - All of my data is JUNK, and these junk files are usually between
> > > > > 1KB and 32MB.
> > > > > - These files will be deleted within 5 days.
> > > > > - Writable space and I/O speed are more important.
> > > > > - I have a high write/read/delete load, a minimum of 200GB a day.
> > > >
> > > > That is "only" 18MB/s which should easily be doable even with
> > > > repl=2,3,4. or EC. This of course depends on speed of drives, network,
> > > > cpus and all that, but in itself it doesn't seem too hard to achieve
> > > > in terms of average speeds. We have EC8+3 rgw backed by some 12-14 OSD
> > > > hosts with hdd and nvme (for wal+db) that can ingest over 1GB/s if you
> > > > parallelize the rgw streams, so 18MB/s seems totally doable with 10
> > > > decent machines. Even with replication.
> > > >
> > > > > I'm afraid that, in any failure, I won't be able to access the whole
> > > > > cluster. Losing data is okay but I have to ignore missing files,
> > > >
> > > > Even with repl=1, in case of a failure, the cluster will still aim at
> > > > fixing itself rather than ignoring currently lost data and moving on,
> > > > so any solution that involves "forgetting" about lost data would need
> > > > a ceph operator telling the cluster to ignore all the missing parts
> > > > and to recreate the broken PGs. This would not be automatic.
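> > > >
> > > > Roughly the kind of manual intervention I mean, as a sketch; the OSD
> > > > id and PG id are placeholders:
> > > >
> > > > ceph osd lost 12 --yes-i-really-mean-it       # declare the dead OSD lost
> > > > ceph pg 3.1f mark_unfound_lost delete         # drop unfound objects
> > > > ceph osd force-create-pg 3.1f --yes-i-really-mean-it  # recreate a broken PG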
> > > >
> > > >
> > > > --
> > > > May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
