[ceph-users] Re: Performance improvement suggestion

2024-02-21 Thread Peter Grandi
> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite of
the current policy, which is to wait for all replicas to be
written before writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

   "After identifying the target placement group, the client
   writes the object to the identified placement group's primary
   OSD. The primary OSD then [...] confirms that the object was
   stored successfully in the secondary and tertiary OSDs, and
   reports to the client that the object was stored
   successfully."

A more revolutionary option would be for 'librados' to write in
parallel to all the "active set" OSDs and report this to the
primary, but that would greatly increase client-Ceph traffic,
while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k'
synchronous (the write completes to the client only when at
least 'k' replicas, including the primary, have been committed)
and 'm' asynchronous, instead of 'k' being just 1 or 2.
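Purely as an illustration of that acknowledgement policy (not of
any existing 'librados' or OSD interface), a minimal sketch where
the write is acknowledged once 'k' of the 'k'+'m' replica commits
have completed and the remaining 'm' finish asynchronously; the
'commit_to' helper is a hypothetical placeholder:

    import concurrent.futures

    def commit_to(osd, payload):
        # Hypothetical stand-in for a synchronous commit of 'payload'
        # on one replica OSD.
        pass

    def replicated_write(osds, payload, k):
        # Acknowledge the client once k of len(osds) replica commits
        # have completed; the remaining m = len(osds) - k commits keep
        # running in the background (a real system would still have to
        # track and repair any of those that fail).
        assert 1 <= k <= len(osds)
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(osds))
        futures = [pool.submit(commit_to, osd, payload) for osd in osds]
        committed = 0
        for fut in concurrent.futures.as_completed(futures):
            fut.result()              # re-raise any replica failure
            committed += 1
            if committed == k:
                return "ack"          # durable on k replicas

With 'k' equal to the size of the active set and 'm' = 0 this
degenerates into the current all-replicas policy.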
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrubbing?

2024-01-24 Thread Peter Grandi
> [...] After a few days, I have on our OSD nodes around 90MB/s
> read and 70MB/s write while 'ceph -s' have client io as
> 2,5MB/s read and 50MB/s write. [...]

This is one of my pet-peeves: that a storage system must have
capacity (principally IOPS) to handle both a maintenance
workload and a user workload, and since the former often
involves whole-storage or whole-metadata operations it can be
quite heavy, especially in the case of Ceph where rebalancing
and scrubbing and checking should be fairly frequent to detect
and correct inconsistencies.

> Is this activity OK? [...]

Indeed. Some "clever" people "save money" by "rightsizing" their
storage so it cannot run at the same time the maintenance and
the user workload, and so turn off the maintenance workload,
because they "feel lucky" I guess, but I do not recommend that.
:-). I have seen more than one Ceph cluster that did not have
the capacity even to run *just* the maintenance workload.
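As a back-of-the-envelope check of that point, a sketch like the
following (all the figures are illustrative assumptions, not
measurements from the cluster in question) shows how quickly the
maintenance workload alone can eat a device's IOPS budget:

    # Rough per-OSD capacity check; all inputs are assumptions for
    # illustration, not measurements.
    hdd_iops_budget    = 120   # sustained random IOPS of one 7.2k rpm HDD (assumed)
    user_iops_per_osd  = 80    # average user IOPS landing on each OSD (assumed)
    scrub_iops_per_osd = 60    # extra IOPS while that OSD is being scrubbed (assumed)

    headroom = hdd_iops_budget - (user_iops_per_osd + scrub_iops_per_osd)
    print(f"IOPS headroom per OSD during scrub: {headroom}")
    # A negative number means the device is saturated: either user
    # latency spikes or the maintenance work falls behind (or both).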
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Peter Grandi
>> So we were going to replace a Ceph cluster with some hardware we had
>> laying around using SATA HBAs but I was told that the only right way
>> to build Ceph in 2023 is with direct attach NVMe.

My impressions are somewhat different:

* Nowadays it is rather more difficult to find 2.5in SAS or SATA
  "Enterprise" SSDs than most NVMe types. NVMe as a host bus
  also has much greater bandwidth than SAS or SATA, but Ceph is
  mostly about IOPS rather than single-device bandwidth. So in
  general, willingly or not, one has to move to NVMe.

* Ceph was designed (and most people have forgotten it) for many
  small-capacity, cheap 1-OSD servers, and lots of them, but
  unfortunately it is not easy to find small cheap "enterprise"
  SSD servers. In part because many people rather unwisely use
  capacity per server-price as the figure of merit, most NVMe
  servers have many slots, which means either RAID-ing devices
  into a small number of large OSDs, which goes against all Ceph
  stands for, or running many OSD daemons on one system, which
  works-ish but is not best.

>> Does anyone have any recommendation for a 1U barebones server
>> (we just drop in ram disks and cpus) with 8-10 2.5" NVMe bays
>> that are direct attached to the motherboard without a bridge
>> or HBA for Ceph specifically?

> If you're buying new, Supermicro would be my first choice for
> vendor based on experience.
> https://www.supermicro.com/en/products/nvme

Indeed, SuperMicro does them fairly well, and there are also
GigaByte and, I think, Tyan; I have not yet seen Intel-based models.

> You said 2.5" bays, which makes me think you have existing
> drives. There are models to fit that, but if you're also
> considering new drives, you can get further density in E1/E3

BTW "NVMe" is a bus specification (something not too different
from SCSI-over-PCIe), and there are several different physical
specifications, like 2.5in U.2 (SFF-8639), 2.5in U.3
(SFF-TA-1001), and various types of EDSFF (SFF-TA-1006,7,8). U.3
is still difficult to find but its connector supports SATA, SAS
and NVMe U.2; I have not yet seen EDSFF boxes actually available
retail without enormous delivery times, I guess the big internet
companies buy all the available production.

https://nvmexpress.org/wp-content/uploads/Session-4-NVMe-Form-Factors-Developer-Day-SSD-Form-Factors-v8.pdf
https://media.kingston.com/kingston/content/ktc-content-nvme-general-ssd-form-factors-graph-en-3.jpg
https://media.kingston.com/kingston/pdf/ktc-article-understanding-ssd-technology-en.pdf
https://www.snia.org/sites/default/files/SSSI/OCP%20EDSFF%20JM%20Hands.pdf

> The only caveat is that you will absolutely want to put a
> better NIC in these systems, because 2x10G is easy to saturate
> with a pile of NVME.

That's one reason why Ceph was designed for many small 1-OSD
servers (ideally distributed across several racks) :-). Note: to
maximize the chances of many-to-many traffic instead of
many-to-one. Anyhow, Ceph again is all about lots of IOPS more
than bandwidth, but if you need bandwidth nowadays many 10Gb NICs
support 25Gb/s too, and 40Gb/s and 100Gb/s are no longer that
expensive (but the cables are horrible).
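To put rough numbers on why 2x10G is easy to saturate (the
per-drive figure is an assumed round value for large sequential
reads on a datacentre NVMe SSD, not a measurement):

    drives         = 10
    per_drive_GBps = 3.0              # assumed sequential read rate per NVMe drive
    nic_GBps       = 2 * 10 / 8       # 2x10Gb/s bonded = 2.5 GB/s

    aggregate_GBps = drives * per_drive_GBps
    print(f"aggregate drive bandwidth: {aggregate_GBps:.1f} GB/s")        # 30.0
    print(f"bonded NIC bandwidth:      {nic_GBps:.1f} GB/s")              # 2.5
    print(f"oversubscription:          {aggregate_GBps / nic_GBps:.0f}x") # 12x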
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS mirror very slow (maybe for small files?)

2023-11-13 Thread Peter Grandi
> the speed of data transfer is varying a lot over time (200KB/s
> – 120MB/s). [...] The FS in question, has a lot of small files
> in it and I suspect this is the cause of the variability – ie,
> the transfer of many small files will be more impacted by
> greater site-site latency.

200KB/s on small files across sites? That's pretty good. I have
seen rates of 3-5KB/s on some Ceph instances for reading local
small files, never mind remotely.

> If this suspicion is true, what options do I have to improve
> the overall throughput?

In practice not much. Perhaps switching to all-RAM storage (with
battery backup) for OSDs might help :-). In one case, by undoing
some of the more egregious issues, I managed to improve small
file transfer rates locally by 10 times, that is to 40-60KB/s.
In your case a 10-times improvement, if achievable, might get
you transfer rates of 2MB/s. Often the question is not just
longer network latency, but whether your underlying storage can
sustain the IOPS needed for "scan" type operations at the same
time as the user workload.
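To make the latency point concrete, a rough per-stream model (both
the RTT and the number of round trips per file are illustrative
assumptions, not measurements of this setup):

    # Latency-bound small-file transfer: per-stream throughput is roughly
    # file_size / (round_trips_per_file * RTT), regardless of link bandwidth.
    file_size_bytes      = 16 * 1024   # a "small" file (assumed)
    rtt_seconds          = 0.020       # 20 ms site-to-site (assumed)
    round_trips_per_file = 4           # open/create, write, setattr, close... (assumed)

    per_file_seconds = round_trips_per_file * rtt_seconds
    throughput_kBps  = file_size_bytes / per_file_seconds / 1000
    print(f"{throughput_kBps:.0f} kB/s per stream")   # ~200 kB/s with these numbers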

Perhaps it would go a lot faster if you just RSYNC, or even just
'tar -f - -c ... | ssh ... tar -f - -x' (or 'rclone' if you
don't use CephFS), and it would be worth doing a test of
transferring a directory (or bucket if you don't use CephFS)
with small files by RSYNC and/or 'tar' to a non-Ceph remote
target and a Ceph remote target to see what you could achieve.

No network/sharded filesystem (and very few local ones) handles
well small files. In some cases I have seen Ceph was used to
store a traditional filesystem image of a type more suitable for
small files, mounted on a loop device.

https://www.sabi.co.uk/blog/anno05-4th.html?051016#051016
https://www.sabi.co.uk/blog/0909Sep.html?090919#090919
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH Cluster performance review

2023-11-12 Thread Peter Grandi
>>> during scrubbing, OSD latency spikes to 300-600 ms,

>> I have seen Ceph clusters spike to several seconds per IO
>> operation as they were designed for the same goals.

>>> resulting in sluggish performance for all VMs. Additionally,
>>> some OSDs fail during the scrubbing process.

>> Most likely they time out because of IO congestion rather than
>> failing.

> bluestore(/var/lib/ceph/osd/ceph-10) log_latency slow operation observed for 
> next, latency = 74835459564ns
> bluestore(/var/lib/ceph/osd/ceph-10) log_latency slow operation observed for 
> next, latency = 42822161884ns

74.8s? 42.8s?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH Cluster performance review

2023-11-12 Thread Peter Grandi
> during scrubbing, OSD latency spikes to 300-600 ms,

I have seen Ceph clusters spike to several seconds per IO
operation as they were designed for the same goals.

> resulting in sluggish performance for all VMs. Additionally,
> some OSDs fail during the scrubbing process.

Most likely they time out because of IO congestion rather than
failing.

> In such instances, promptly halting the scrubbing resolves the
> issue.

> (6 SSD node + 6 HDD node) All nodes are connected through 10G
> bonded link, i.e. 10Gx2=20GB for each node. 64 SSD 42 HDD 106
> one-ssd 256 active+clean one-hdd 512 active+clean
> cloudstack.hdd 512 active+clean

Your Ceph cluster has been optimized for high latency and IO
congestion, goals that are surprisingly quite common, and is
performing well given its design parameters (it is far from
full; if it becomes fuller it will achieve its goals even
better).

https://www.sabi.co.uk/blog/15-one.html?150305#150305
"How many VMs per disk arm?"

https://www.sabi.co.uk/blog/15-one.html?150329#150329
"CERN's old large disk discussion and IOPS-per-TB"
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How do you handle large Ceph object storage cluster?

2023-10-19 Thread Peter Grandi
> [...] (>10k OSDs, >60 PB of data).

6TB on average per OSD? Hopefully SSDs or RAID10 (or low-number,
3-5 drive) RAID5.

> It is entirely dedicated to object storage with S3 interface.
> Maintenance and its extension are getting more and more
> problematic and time consuming.

Ah the joys of a single large unified storage pool :-).
https://www.sabi.co.uk/blog/0804apr.html?080417#080417

> We consider to split it to two or more completely separate
> clusters

I would suggest doing it 1-2 years ago...

> create S3 layer of abstraction with some additional metadata
> that will allow us to use these 2+ physically independent
> instances as a one logical cluster.

That's what the bucket hierarchy in a Ceph cluster instance
already does. What your layer is going to do is either:

 1) Lookup the object ID in a list of instances, and fetch the
object from the instance that validates the object ID;
 2) Maintain a huge table of all object IDs and which instances
they are in.

But 1) is basically what CRUSH already does and 2) means giving
up the Ceph "decentralized" philosophy based on CRUSH.

BTW one old practice that very few systems follow is to use as
object keys neither addresses nor identifiers, but *both*: first
access the address, treating it as a hint; check that the
identifier matches; if not, do a slower lookup using the object
identifier part to find the actual address.
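A minimal sketch of that hint-plus-identifier lookup (the store
layout and names here are hypothetical, not any existing Ceph
interface):

    def get_object(key, stores):
        # key = (hinted_store, object_id); 'stores' is a hypothetical
        # dict of store_name -> {object_id: data}. Try the hinted
        # address first; if the identifier is not there (the object
        # moved), fall back to a slower lookup by identifier.
        hinted_store, object_id = key
        data = stores.get(hinted_store, {}).get(object_id)
        if data is not None:
            return data                          # fast path: hint was correct
        for name, contents in stores.items():    # slow path: search by identifier
            if object_id in contents:
                return contents[object_id]
        raise KeyError(object_id)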

> Additionally, newest data is the most demanded data, so we
> have to spread it equally among clusters to avoid skews in
> cluster load.

I usually do the opposite, but that depends on your application.

My practice is to recognize that data is indeed usually
stratified by date, to regard filesystem instances as "silos",
to create a new filesystem instance every few months or years
and direct all new file creation to the latest instance, and
then to progressively get rid of the older instances, or copy
their "active" data onwards into the new instance and the
"inactive" data to offline storage.
http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b

If you really need to keep all data forever online, which is
usually not the case (that's why there are laws that expire
matters after N years), the second-best option is to keep the old
silos powered up indefinitely, and they will take very little
attention beyond refreshing the hardware periodically and
migrating the data to new instances when that stops being
economical.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to deal with increasing HDD sizes ? 1 OSD for 2 LVM-packed HDDs ?

2023-10-18 Thread Peter Grandi
> * Ceph cluster with old nodes having 6TB HDDs
> * Add new node with new 12TB HDDs

Halving IOPS-per-TB?

https://www.sabi.co.uk/blog/17-one.html?170610#170610
https://www.sabi.co.uk/blog/15-one.html?150329#150329
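The point is that a 12TB HDD delivers roughly the same random
IOPS as a 6TB one, so the IOPS-per-TB ratio halves; a quick
illustration (the per-spindle figure is an assumed round number
for a 7.2k rpm drive):

    spindle_iops = 120   # assumed sustained random IOPS per HDD

    for capacity_tb in (6, 12):
        print(f"{capacity_tb:>2} TB HDD: {spindle_iops / capacity_tb:.0f} IOPS per TB")
    # 6 TB -> ~20 IOPS/TB, 12 TB -> ~10 IOPS/TB: twice the data
    # behind the same disk arm.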

> Is it supported/recommended to pack 2 6TB HDDs handled by 2
> old OSDs into 1 12TB LVM disk handled by 1 new OSD ?

The OSDs are just random daemons; what matters for chunk
distribution in Ceph is buckets, and in this case leaf buckets.

So it all depends on the CRUSH map but I suspect that
manipulating it so that two existing leaf buckets become one is
not possible or too tricky to attempt.

One option would be to divide the 12TB disk into 2 partitions/LVs
of 6TB each and run 2 OSDs against them. It is not recommended, but
I don't see a big issue in this case other than IOPS-per-TB.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Time Estimation for cephfs-data-scan scan_links

2023-10-18 Thread Peter Grandi
[...]
> What is being done is a serial tree walk and copy in 3
> replicas of all objects in the CephFS metadata pool, so it
> depends on both the read and write IOPS rate for the metadata
> pools, but mostly in the write IOPS. [...] Wild guess:
> metadata is on 10x 3.84TB SSDs without persistent cache, data
> is on 48x 8TB devices probably HDDs. Very cost effective :-).

I do not know if those guesses are right, but in general most
Ceph instances I have seen have been designed with the "cost
effective" choice of providing enough IOPS to run the user
workload (but often not even that), but not also more in order to
run the admin workload quickly (checking, scanning, scrubbing,
migrating, 'fsck' or 'resilvering' of the underlying filesystem).
There is often a similar situation for non-HPC filesystem types,
but the scale and pressure on instances of those are usually much
lower than for HPC filesystem instances, so the consequences are
less obvious.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Time Estimation for cephfs-data-scan scan_links

2023-10-13 Thread Peter Grandi
>> However, I've observed that the cephfs-data-scan scan_links step has
>> been running for over 24 hours on 35 TB of data, which is replicated
>> across 3 OSDs, resulting in more than 100 TB of raw data.

What matters is the number of "inodes" (and secondarily their
size), that is the number of metadata objects, which is
proportional to the number of files and directories in the
CephFS instance.

>> Does anyone have an estimation on the duration for this step?

> scan_links has to iterate through every object in the metadata pool
> and for each object iterate over the omap key/values - so this step
> scales to the amount of objects in the metadata pool, i.e., the number
> of directories and files in the file system.

>> pools:   12 pools, 1475 pgs
>> objects: 50.89M objects, 72 TiB
>> usage:   207 TiB used, 148 TiB / 355 TiB avail
>> pgs: 579358/152674596 objects misplaced (0.379%)

51m between data and metadata objects, average object space used
4MiB, average metadata per object 230KiB (looks like 3-way
replication as per default).

>> POOL  TYPE USED  AVAIL
>> cephfs_metadata   metadata  1045G  35.6T
>> cephfs.c3sl.datadata 114T  35.6T
[...]
>> POOL  TYPE USED  AVAIL
>> cephfs.c3sl.meta  metadata  28.2G  35.6T
>> cephfs.c3sl.datadata 114T  35.6T

Total between data and metadata 142TiB, so CephFS uses around 2/3
of the 207TiB stored in this Ceph instance, so perhaps 2/3 of
the objects too, so maybe 35m objects in the CephFS instance.

What is being done is a serial tree walk and copy in 3 replicas
of all objects in the CephFS metadata pool, so it depends on
both the read and write IOPS rate for the metadata pools, but
mostly in the write IOPS.

Note: it is somewhat like an 'fsck' but an 'fsck' that makes 3
copies of each inode.

I wonder whether the source (presumably 'cephfs_metadata') and
target (presumably 'cephfs.c3sl.meta') pools are on the same
physical devices, whether they are SSDs with high small-write
rates or not, and what the physical storage properties are.

Wild guess: metadata is on 10x 3.84TB SSDs without persistent
cache, data is on 48x 8TB devices probably HDDs. Very cost
effective :-).

>  mds: 0/10 daemons up (10 failed), 9 standby
>  osd: 48 osds: 48 up (since 32h), 48 in (since 2M); 22 remapped pgs

Overall it looks like 1 day copied 1TB out of 28TB of metadata,
so it looks like it will take about a month.

1TB of metadata means 1.5m 230KiB metadata objects processed in 1
day, so around 15 metadata objects read and written in 3 copies
per second, with a 12MB/s metadata storage write rate, which are
plausible numbers for a metadata pool on SSDs with
non-persistent cache, so the estimate of just 3-4 more weeks
looks plausible again.
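That estimate can be reproduced in a few lines; the inputs are the
figures derived in this thread, i.e. estimates rather than direct
measurements of the actual pools:

    metadata_total_tb    = 28     # metadata still to process (figure used above)
    processed_per_day_tb = 1.0    # observed progress, ~1 TB/day
    object_kib           = 230    # estimated average metadata object size
    copies               = 3      # replication factor

    objects_per_day = processed_per_day_tb * 1e12 / (object_kib * 1024) / copies
    objects_per_sec = objects_per_day / 86400
    write_mbps      = processed_per_day_tb * 1e12 / 86400 / 1e6

    print(f"objects/day: {objects_per_day / 1e6:.1f} M")                  # ~1.4 M
    print(f"objects/s (each in {copies} copies): {objects_per_sec:.0f}")  # ~16
    print(f"metadata write rate: {write_mbps:.0f} MB/s")                  # ~12
    print(f"days remaining: {metadata_total_tb / processed_per_day_tb:.0f}")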

Running something like 'iostat -dk -zy 1' on one of the servers
with metadata drives might also help get an idea.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Decrepit ceph cluster performance

2023-08-14 Thread Peter Grandi
> We recently started experimenting with Proxmox Backup Server,
> which is really cool, but performs enough IO to basically lock
> out the VM being backed up, leading to IO timeouts, leading to
> user complaints. :-(

The two most common things I have had to fix over the years in
storage systems I have inherited have been:

* Too low IOPS-per-TB to handle a realistic workload.
* Too few total IOPS to handle the user and sysadmin (checking,
  scrubbing, backup, balancing, backfilling, ...) workloads.

Both happen because most sysadmins are heavily incentivized to
save money now even if there is a huge price to pay later when
the storage capacity fills up.

An SSD based storage cluster like the one you have to deal with
has plenty of IOPS, so your case is strange, in particular that
latencies in your tests are low at the same time as IO rates are
low; badly overloaded storage complexes have latencies 1 second
and way above.

That your test reports small average latencies but also a max
latency of 37s and long pauses with 0 IOPS is suspicious. It
could be that *some* OSD SSDs are not in good condition and they
slow down everything, as the Ceph daemons wait for the slowest
OSD to respond. 37s looks like retries on a failing SSD.

In an ideal world you would have on the cluster a capacity
monitor like Ganglia etc. showing year-long graphs of network
bandwidth and IO rates and latencies, but I guess this was not
setup like that.

> The SSDs are probably 5-8 years old. The OSDs were rebuilt to
> bluestore around the luminous timeframe. (Nautilus, maybe. It
> was a while ago.)

>> Newer SSD controllers / models are better than older models
>> at housekeeping over time, so the secure-erase might freshen
>> performance.

Indeed 5-8 year old firmware may not be as sophisticated as more
recent firmware, in particular as to needing periodic explicit
TRIMs. As to that I noticed this:

>>> Its primary use is serving RBD VM block devices for Proxmox

A VM workload, and in particular RBD, often involves very small
random writes, and "mixed-use" SSDs are not as suitable for that,
in particular if the usual and insane practice of having VM
operating systems log to virtual disks has been followed.

So the physical storage on the SSDs may have become hideously
fragmented, thus indeed requiring TRIMs, especially if the
endurance levels are low (which is dangerous), and especially if
the workload never pauses enough to run the firmware compaction
mechanism (which is likely given that the storage complex cannot
sustain both the user workload and backups).

In particular, check the logs of these OSDs to see which specific
SSDs are reporting the slowest IOPS:

>>> 36 slow ops, oldest one blocked for 37 sec, daemons 
>>> [osd.10,osd.12,osd.13,osd.14,osd.15,osd.17,osd.2,osd.25,osd.28,osd.3]...
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Workload that delete 100 M object daily via lifecycle

2023-07-18 Thread Peter Grandi
>>>> [...] S3 workload, that will need to delete 100M file
>>>> daily [...]

>> [...] average (what about peaks?) around 1,200 committed
>> deletions per second (across the traditional 3 metadata
>> OSDs) sustained, that may not leave a lot of time for file
>> creation, writing or reading. :-) [...]

>>> [...] So many people seem to think that distributed (or
>>> even local) filesystems (and in particular their metadata
>>> servers) can sustain the same workload as high volume
>>> transactional DBMSes. [...]

> Index pool distributed over a large number of NVMe OSDs?
> Multiple, dedicated RGW instances that only run LC?

As long as that guarantees a total maximum network+write
latency of well below 800µs across all of them, that might
result in a committed rate of a deletion every 800µs (provided
there are no peaks and the metadata server only does deletions
and does not do creations or opens or any "maintenance"
operations like checks and backups). :-)

Sometimes I suggest somewhat seriously entirely RAM based
metadata OSDs, which given a suitable environment may be
feasible. But I still wonder why "So many people seem to think
... can sustain the same workload as high volume transactional
DBMSes" :-).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Workload that delete 100 M object daily via lifecycle

2023-07-18 Thread Peter Grandi
>>> On Mon, 17 Jul 2023 19:19:34 +0700, Ha Nguyen Van
>>>  said:

> [...] S3 workload, that will need to delete 100M file daily [...]

So many people seem to think that distributed (or even local)
filesystems (and in particular their metadata servers) can
sustain the same workload as high volume transactional DBMSes.

PS 100m deletions per day given 86,400 second per day means on
average (what about peaks?) around 1,200 committed deletions per
second (across the traditional 3 metadata OSDs) sustained on the
metadata server, that may not leave a lot of time for file
creation, writing or reading. :-)
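Spelled out as a few lines of arithmetic:

    deletions_per_day = 100_000_000
    seconds_per_day   = 86_400

    rate      = deletions_per_day / seconds_per_day
    budget_us = 1e6 / rate
    print(f"average rate: {rate:.0f} deletions/s")                    # ~1157/s
    print(f"budget per committed deletion: {budget_us:.0f} µs")       # ~864 µs
    # And that average leaves nothing for peaks, creations, reads or
    # maintenance on the same metadata OSDs.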
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ls: cannot access '/cephfs': Stale file handle

2023-05-18 Thread Peter Grandi
>>> On Wed, 17 May 2023 16:52:28 -0500, Harry G Coin
>>>  said:

> I have two autofs entries that mount the same cephfs file
> system to two different mountpoints.  Accessing the first of
> the two fails with 'stale file handle'.  The second works
> normally. [...]

Something pretty close to that works for me... I would check the
related 'dmesg' lines. Also 'grep cephfs /proc/mounts' to double
check the actual mount lines.

Note: mountpoints just under '/' as in '/cephfs' have some
downsides: http://www.sabi.co.uk/blog/23-one.html?230123#230123

Note: given the above, I would anyhow mount the CephFS instance
once to a directory like '/mnt/cephfs' and use two symlinks to it
(or 'bind' mounts if not under '/').
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deleting millions of objects

2023-05-18 Thread Peter Grandi
> [...] We have this slow and limited delete issue also. [...]

That usually, apart from command list length limitations,
happens because so many Ceph storage backends have too low
committed IOPS (write, but not only) for mass metadata (and
equivalently small data) operations, never mind for running them
in parallel with the user workload. Such operations are very
expensive, and many Ceph storage backends are built to minimize
cost-per-TB only.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-27 Thread Peter Grandi
> On a 38 TB cluster, if you scrub 8 MB/s on 10 disks (using only
> numbers already divided by replication factor), you need 55 days
> to scrub it once.

> That's 8x larger than the default scrub factor [...] Also, even
> if I set the default scrub interval to 8x larger, my disks will
> still be thrashing seeks 100% of the time, affecting the
> cluster's throughput and latency performance.


Indeed! Every Ceph instance I have seen (not many) and almost every HPC
storage system I have seen have this problem, and that's because they
were never set up to have enough IOPS to support the maintenance load,
never mind the maintenance load plus the user load (and as a rule not
even the user load).


There is a simple reason why this happens: when a large Ceph (etc.)
storage instance is initially set up, it is nearly empty, so it appears
to perform well even if it was set up with inexpensive but slow/large
HDDs; then it becomes fuller and therefore heavily congested, but whoever
set it up has already changed jobs or been promoted because of their
initial success (or they invent excuses).


A figure-of-merit that matters is IOPS-per-used-TB, and making it large 
enough to support concurrent maintenance (scrubbing, backfilling, 
rebalancing, backup) and user workloads. That is *expensive*, so in my 
experience very few storage instance buyers aim for that.


The CERN IT people discovered long ago that quotes for storage servers
always used very slow/large HDDs that performed very poorly if the specs
were given as mere capacity, so they switched to requiring a different
metric: 18MB/s of *interleaved* read and write transfer rate per TB of
capacity, that is at least two parallel access streams per TB.


https://www.sabi.co.uk/blog/13-two.html?131227#131227
"The issue with disk drives with multi-TB capacities"

BTW I am not sure that a floor of 18MB/s of interleaved read and write 
per TB is high enough to support simultaneous maintenance and user loads 
for most Ceph instances, especially in HPC.
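To connect that metric to scrubbing, the sustained read rate per used
TB needed just to scrub every byte once per interval is easy to
compute (the one-week interval here is only an illustrative
assumption):

    interval_days = 7        # assumed scrub interval
    bytes_per_tb  = 1e12

    scrub_mbps_per_tb = bytes_per_tb / (interval_days * 86400) / 1e6
    print(f"{scrub_mbps_per_tb:.1f} MB/s per used TB just for scrub reads")  # ~1.7

    # CERN's floor of 18 MB/s of *interleaved* read+write per TB leaves
    # roughly a 10x margin over that, which is what has to absorb the
    # user workload plus backfilling, rebalancing and backups.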


I have seen HPC storage systems "designed" around 10TB and even 18TB 
HDDs, and the best that can be said about those HDDs is that they should 
be considered "tapes" with some random access ability.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

