[ceph-users] DM-Cache for spinning OSDs

2022-05-16 Thread Stolte, Felix
Hey guys,

I have three servers with 12x 12 TB SATA HDDs and 1x 3.4 TB NVMe. I am thinking
of putting DB/WAL on the NVMe as well as a 5 GB DM-Cache for each spinning
disk. Is anyone running something like this in a production environment?
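
Roughly what I have in mind per HDD looks like the sketch below (device and
partition names are placeholders, and I have not tested whether ceph-volume
accepts a cached LV as --data):

  # NVMe pre-partitioned: one DB/WAL partition plus one 5 GB cache partition per HDD
  pvcreate /dev/sdb /dev/nvme0n1p13
  vgcreate osd-sdb /dev/sdb /dev/nvme0n1p13
  lvcreate -n data -l 100%PVS osd-sdb /dev/sdb          # origin LV on the HDD
  lvcreate --type cache-pool -L 5G -n cpool osd-sdb /dev/nvme0n1p13
  lvconvert --type cache --cachepool osd-sdb/cpool osd-sdb/data
  ceph-volume lvm create --data osd-sdb/data --block.db /dev/nvme0n1p1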


best regards
Felix
-------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
-------------------------------------------------------------------------------------

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Trouble about reading gwcli disks state

2022-05-16 Thread icy chan
Hi,

I would like to ask if anybody knows how to handle the gwcli status below.
- The disk state in gwcli shows as "Unknown".
- Clients are still mounting the "Unknown" disks and seem to be working normally.

Two of the rbd disks show "Unknown" instead of "Online" in gwcli.
==
# gwcli ls /disks
o- disks ..................................................... [77312G, Disks: 10]
  o- ssd-rf2 ....................................................... [ssd-rf2 (6.0T)]
  | o- iscsi_01 .................................. [ssd-rf2/iscsi_01 (Unknown, 3.0T)]
  | o- iscsi_02 .................................. [ssd-rf2/iscsi_02 (Unknown, 3.0T)]
  o- ssd-rf3 ....................................................... [ssd-rf3 (8.0T)]
    o- iscsi_pool_01 ......................... [ssd-rf3/iscsi_pool_01 (Online, 4.0T)]
    o- iscsi_pool_02 ......................... [ssd-rf3/iscsi_pool_02 (Online, 4.0T)]
==

Both "Lock Owner" and "State" are "Unknown" inside info session.
==
# gwcli /disks/ssd-rf2/iscsi_01 info
Image .. iscsi_01
Ceph Cluster  .. ceph
Pool  .. ssd-rf2
Wwn   .. 7b441630-2868-47d2-94f1-35efea4cf258
Size H.. 3.0T
Feature List  .. RBD_FEATURE_LAYERING
 RBD_FEATURE_EXCLUSIVE_LOCK
 RBD_FEATURE_OBJECT_MAP
 RBD_FEATURE_FAST_DIFF
 RBD_FEATURE_DEEP_FLATTEN
Snapshots ..
Owner .. sds-ctt-gw1
Lock Owner.. Unknown
State .. Unknown
Backstore .. user:rbd
Backstore Object Name .. ssd-rf2.iscsi_01
Control Values
- hw_max_sectors .. 1024
- max_data_area_mb .. 8
- osd_op_timeout .. 30
- qfull_timeout .. 5
==

Below is the reference output from a normal rbd disk.
==
# gwcli /disks/ssd-rf3/iscsi_pool_01 info
Image .. iscsi_pool_01
Ceph Cluster  .. ceph
Pool  .. ssd-rf3
Wwn   .. 20396fed-2aba-422d-99c2-8353b8910fa4
Size H.. 4.0T
Feature List  .. RBD_FEATURE_LAYERING
 RBD_FEATURE_EXCLUSIVE_LOCK
 RBD_FEATURE_OBJECT_MAP
 RBD_FEATURE_FAST_DIFF
 RBD_FEATURE_DEEP_FLATTEN
Snapshots ..
Owner .. sds-ctt-gw2
Lock Owner.. sds-ctt-gw2
State .. Online
Backstore .. user:rbd
Backstore Object Name .. ssd-rf3.iscsi_pool_01
Control Values
- hw_max_sectors .. 1024
- max_data_area_mb .. 8
- osd_op_timeout .. 30
- qfull_timeout .. 5
==

Nothing special was found in the rbd settings.
==
root@sds-ctt-mon1:/# rbd ls -p ssd-rf2
iscsi_01
iscsi_02
root@sds-ctt-mon1:/# rbd -p ssd-rf2 info iscsi_01
rbd image 'iscsi_01':
size 3 TiB in 3145728 objects
order 20 (1 MiB objects)
snapshot_count: 0
id: 272654e71f95e9
block_name_prefix: rbd_data.272654e71f95e9
format: 2
features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten
op_features:
flags:
create_timestamp: Mon Mar  7 05:28:55 2022
access_timestamp: Tue May 17 02:17:16 2022
modify_timestamp: Tue May 17 02:17:16 2022
root@sds-ctt-mon1:/# rbd -p ssd-rf3 info iscsi_pool_01
rbd image 'iscsi_pool_01':
size 4 TiB in 4194304 objects
order 20 (1 MiB objects)
snapshot_count: 0
id: 29bebcd9d3b6aa
block_name_prefix: rbd_data.29bebcd9d3b6aa
format: 2
features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten
op_features:
flags:
create_timestamp: Tue Aug 11 02:32:37 2020
access_timestamp: Tue May 17 02:17:31 2022
modify_timestamp: Tue May 17 02:17:39 2022
root@sds-ctt-mon1:/#
==
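
For reference, here is what I plan to check next (the image names are taken from
the output above; the service restart is only a guess at a remedy):

  # which client/gateway currently watches and locks the image
  rbd status ssd-rf2/iscsi_01
  rbd lock ls ssd-rf2/iscsi_01

  # on the affected gateway, restarting the ceph-iscsi API service might
  # refresh the reported state (guess only; check impact on initiators first)
  systemctl restart rbd-target-api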

Clus

[ceph-users] Ceph User + Dev Monthly May Meetup

2022-05-16 Thread Neha Ojha
Hi everyone,

This month's Ceph User + Dev Monthly meetup is on May 19, 14:00-15:00
UTC. Please add topics to the agenda:
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes. We are hoping to
receive feedback on the Quincy release and hear more about your
general ops experience regarding upgrades, new use-cases, problems
etc.

Hope to see you there!

Thanks,
Neha

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stretch cluster questions

2022-05-16 Thread Gregory Farnum
I'm not quite clear where the confusion is coming from here, but there
are some misunderstandings. Let me go over it a bit:

On Tue, May 10, 2022 at 1:29 AM Frank Schilder  wrote:
>
> > What you are missing from stretch mode is that your CRUSH rule wouldn't
> > guarantee at least one copy in surviving room (min_size=2 can be
> > achieved with 2 copies in lost room).
>
> I'm afraid this deserves a bit more explanation. How would it be possible 
> that, when both sites are up and with a 4(2) replicated rule, that a 
> committed write does not guarantee all 4 copies to be present? As far as I 
> understood the description of ceph's IO path, if all members of a PG are up, 
> a write is only acknowledged to a client after all shards/copies have been 
> committed to disk.

So in a perfectly normal PG of size 4, min_size 2, the OSDs are happy
to end peering and go active with only 2 up OSDs. That's what min_size
means. A PG won't serve IO until it's active, and it requires min_size
participants to do so — but once it's active, it acknowledges writes
once the live participants have written them down.

> In other words, with a 4(2) rule with 2 copies per DC, if one DC goes down 
> you *always* have 2 life copies and still read access in the other DC. 
> Setting min-size to 1 would allow write access too, albeit with a risk of 
> data loss (a 4(2) rule is really not secure for a 2DC HA set-up as in 
> degraded state you end up with 2(1) in 1 DC, its much better to use a wide EC 
> profile with m>k to achieve redundant single-site writes).

Nope, there is no read access to a PG which doesn't have min_size
active copies. And if you have 4 *live* copies and lose a DC, yes, you
still have two copies. But consider an alternative scenario:
1) 2 copies in each of 2 DCs.
2) Two OSDs in DC 1 that happen to share PG x restart.
3) PG x goes active with the remaining two OSDs in DC 2.

Does (3) make sense there?

So now add in step 4:
4) DC 2 gets hit by a meteor.

Now, you have no current copies of PG x because the only current
copies got hit by a meteor.

>
> The only situation I could imagine this not being guaranteed (both DCs 
> holding 2 copies at all times in healthy condition) is that writes happen 
> while one DC is down, the down DC comes up and the other DC goes down before 
> recovery finishes. However, then stretch mode will not help either.

Right, so you just skipped over the part that it helps with: stretch
mode *guarantees* that a PG has OSDs from both DCs in its acting set
before the PG can finish peering. Redoing the scenario from before:
1) 2 copies in each of 2 DCs,
2) Two OSDs in DC 1 that happen to share PG x restart
3) PG x cannot go active because it lacks a replica in DC 1.
4) DC 2 gets hit by a meteor
5) All OSDs in DC 1 come back up
6) All PGs go active

So stretch mode adds another dimension to "the PG can finish peering
and go active" which includes the CRUSH buckets as a requirement, in
addition to a simple count of the replicas.
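
For reference, the moving parts look roughly like this (bucket, rule, and
monitor names are placeholders; the stretch mode docs have the authoritative
steps):

  # CRUSH rule that puts 2 copies in each of 2 datacenters
  rule stretch_rule {
          id 1
          type replicated
          step take default
          step choose firstn 0 type datacenter
          step chooseleaf firstn 2 type host
          step emit
  }

  # tag the monitors and enable stretch mode with a tiebreaker monitor
  ceph mon set election_strategy connectivity
  ceph mon set_location a datacenter=dc1
  ceph mon set_location b datacenter=dc2
  ceph mon set_location e datacenter=dc3        # tiebreaker
  ceph mon enable_stretch_mode e stretch_rule datacenter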

> My understanding of the useful part is, that stretch mode elects one monitor 
> to be special and act as a tie-breaker in case a DC goes down or a split 
> brain situation occurs between 2DCs. The change of min-size in the 
> stretch-rule looks a bit useless and even dangerous to me. A stretched 
> cluster should be designed to have a secure redundancy scheme per site and, 
> for replicated rules, that would mean size=6, min_size=2 (degraded 3(2)). 
> Much better seems to be something like an EC profile k=4, m=6 with 5 shards 
> per DC, which has only 150% overhead compared with 500% overhead of a 6(2) 
> replicated rule.

Yeah, the min_size change is because you don't want to become
unavailable when rebooting any of your surviving nodes. When you lost
a DC, you effectively go from running an ultra-secure 4-copy system to
a rather-less-secure 2-copy system. And generally people with 2-copy
systems want their data to still be available when doing maintenance.
;)
(Plus, well, hopefully you either get the other data center back
quickly, or you can expand the cluster to get to a nice safe 3-copy
system.)

But, yes. As Maximilian suggests, the use case for stretch mode is
pretty specific. If you're using RGW, you should be better-served by
its multisite feature, and if your application can stomach
asynchronous replication that will be much less expensive. RBD has
both sync and async replication options across clusters.
But sometimes you really just want exactly the same data in exactly
the same place at the same time. That's what stretch mode is for.
-Greg

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-16 Thread stéphane chalansonnet
Hello,

Yes, we got several slow ops stuck for many seconds.
What we noted: CPU/memory usage is lower than on Nautilus (
https://drive.google.com/file/d/1NGa5sA8dlQ65ld196Ku2hm_Y0xxvfvNs/view?usp=sharingt
)
Same behaviour as you.

For the moment, rebuilding one of our nodes seems to fix the latency issue
for it.
Example: disk write request average waiting time (HDD)
Nautilus : 8-11ms
Pacific before rebuild : 29-46ms
Pacific after rebuild : 4-5ms

disk Average queue size
Nautilus : 3-5ms
Pacific before rebuild : 6-10ms
Pacific after rebuild : 1-2ms

*As a part of this upgrade, did you also migrate the OSDs to sharded
rocksdb column families? This would have been done by setting bluestore's
"quick fix on mount" setting to true or by issuing a "ceph-bluestore-tool
repair" offline, perhaps in response to a BLUESTORE_NO_PER_POOL_OMAP
warning post-upgrade*
*=>* I'm going to let my colleague answer parts of that (he will probably
answer tomorrow).
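
For reference, that migration is normally triggered in one of two ways; a rough
sketch only (the OSD path is a placeholder):

  # online: convert at OSD startup, then restart the OSDs one by one
  ceph config set osd bluestore_fsck_quick_fix_on_mount true

  # offline alternative, per stopped OSD
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0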

Regards,

On Mon, May 16, 2022 at 17:20, Wesley Dillingham  wrote:

> In our case it appears that file deletes have a very high impact on osd
> operations. Not a significant delete either ~20T on a 1PB utilized
> filesystem (large files as well).
>
> We are trying to tune down cephfs delayed deletes via:
> "mds_max_purge_ops": "512",
> "mds_max_purge_ops_per_pg": "0.10",
>
> with some success but still experimenting with how we can reduce the
> throughput impact from osd slow ops.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
>
>
> On Mon, May 16, 2022 at 9:49 AM Wesley Dillingham 
> wrote:
>
>> We have a newly-built pacific (16.2.7) cluster running 8+3 EC jerasure
>> ~250 OSDS across 21 hosts which has significantly lower than expected IOPS.
>> Only doing about 30 IOPS per spinning disk (with appropriately sized SSD
>> bluestore db) around ~100 PGs per OSD. Have around 100 CephFS (ceph fuse
>> 16.2.7) clients using the cluster. Cluster regularly reports slow ops from
>> the OSDs but the vast majority, 90% plus of the OSDs, are only <50% IOPS
>> utilized. Plenty of cpu/ram/network left on all cluster nodes. We have
>> looked for hardware (disk/bond/network/mce) issues across the cluster with
>> no findings / checked send-qs and received-q's across the cluster to try
>> and narrow in on an individual failing component but nothing found there.
>> Slow ops are also spread equally across the servers in the cluster. Does
>> your cluster report any health warnings (slow ops etc) alongside your
>> reduced performance?
>>
>> Respectfully,
>>
>> *Wes Dillingham*
>> w...@wesdillingham.com
>> LinkedIn 
>>
>>
>> On Mon, May 16, 2022 at 2:00 AM Martin Verges 
>> wrote:
>>
>>> Hello,
>>>
>>> depending on your workload, drives and OSD allocation size, using the 3+2
>>> can be way slower than the 4+2. Maybe give it a small benchmark and try
>>> if
>>> you see a huge difference. We had some benchmarks with such and they
>>> showed
>>> quite ugly results in some tests. Best way to deploy EC in our findings
>>> is
>>> in power of 2, like 2+x, 4+x, 8+x, 16+x. Especially when you deploy OSDs
>>> before the Ceph allocation change patch, you might end up consuming way
>>> more space if you don't use power of 2. With the 4k allocation size at
>>> least this has been greatly improved for newer deployed OSDs.
>>>
>>> --
>>> Martin Verges
>>> Managing director
>>>
>>> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>
>>>
>>> On Sun, 15 May 2022 at 20:30, stéphane chalansonnet 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Thank you for your answer.
>>> > this is not a good news if you also notice a performance decrease on
>>> your
>>> > side
>>> > No, as far as we know, you cannot downgrade to Octopus.
>>> > Going forward seems to be the only way, so Quincy .
>>> > We have a a qualification cluster so we can try on it (but full virtual
>>> > configuration)
>>> >
>>> >
>>> > We are using 4+2 and 3+2 profile
>>> > Are you also on the same profile on your Cluster ?
>>> > Maybe replicated profile are not be impacted ?
>>> >
>>> > Actually, we are trying to recreate one by one the OSD.
>>> > some parameters can be only set by this way .
>>> > The first storage Node is almost rebuild, we will see if the latencies
>>> on
>>> > it are below the others ...
>>> >
>>> > Wait and see .
>>> >
>>> > On Sun, May 15, 2022 at 10:16, Martin Verges  wrote:
>>> >
>>> >> Hello,
>>> >>
>>> >> what exact EC level do you use?
>>> >>
>>> >> I can confirm, that our internal data shows a performance drop when
>>> using
>>> >> pacific. So far Octopus is faster and better than pacific but I doubt
>>> you
>>> >> can roll back to it. We haven't rerun our benchmarks on Quincy yet,
>>> but
>

[ceph-users] v16.2.8 Pacific released

2022-05-16 Thread David Galloway
We're happy to announce the 8th backport release in the Pacific series. 
We recommend that users update to this release. For detailed release 
notes with links and a changelog, please refer to the official blog entry at 
https://ceph.io/en/news/blog/2022/v16-2-8-pacific-released


Notable Changes
---

* MON/MGR: Pools can now be created with the `--bulk` flag. Any pool 
created with `--bulk` will use a profile of the `pg_autoscaler` that 
provides more performance from the start. However, any pool created 
without the `--bulk` flag will retain the old behavior by default 
(see the short command sketch after these notes). For more details, see: 
https://docs.ceph.com/en/latest/rados/operations/placement-groups/


* MGR: The pg_autoscaler can now be turned `on` and `off` globally with 
the `noautoscale` flag. By default this flag is unset and the default 
pg_autoscale mode remains the same. For more details, see: 
https://docs.ceph.com/en/latest/rados/operations/placement-groups/


* A health warning will now be reported if the ``require-osd-release`` 
flag is not set to the appropriate release after a cluster upgrade.


* CephFS: When upgrading Ceph Metadata Servers with multiple active 
MDSs, ensure that no pending stray entries that are directories are 
present for any active rank other than rank 0. See 
https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus.
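
A quick sketch of the new knobs mentioned in the notes above (the pool name is
just an example):

  # create a pool that starts with the bulk autoscaler profile
  ceph osd pool create mypool --bulk

  # turn the pg_autoscaler off/on globally
  ceph osd pool set noautoscale
  ceph osd pool unset noautoscale
  ceph osd pool get noautoscale

  # once the whole cluster runs 16.2.8, clear the new health warning
  ceph osd require-osd-release pacific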


Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-16.2.8.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 209e51b856505df4f2f16e54c0d7a9e070973185

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: S3 and RBD backup

2022-05-16 Thread William Edwards

> On May 16, 2022 at 13:41, Sanjeev Jha  wrote:
> 
> Hi,
> 
> Could someone please let me know how to take S3 and RBD backup from Ceph side 
> and possibility to take backup from Client/user side?
> 
> Which tool should I use for the backup?

It depends.

> 
> Best regards,
> Sanjeev Kumar Jha
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] client.admin crashed

2022-05-16 Thread farhad kh
I have an error in my Ceph cluster:

HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
 client.admin crashed on host node1 at 2022-05-16T08:30:41.205667Z
What does this mean?
How can I fix it?
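
For reference, the crash list can be inspected and the warning acknowledged
roughly like this (the crash ID is a placeholder):

  ceph crash ls
  ceph crash info <crash-id>
  # once reviewed, acknowledge it so the health warning clears
  ceph crash archive <crash-id>        # or: ceph crash archive-all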
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-16 Thread Wesley Dillingham
In our case it appears that file deletes have a very high impact on osd
operations. Not a significant delete either ~20T on a 1PB utilized
filesystem (large files as well).

We are trying to tune down cephfs delayed deletes via:
"mds_max_purge_ops": "512",
"mds_max_purge_ops_per_pg": "0.10",

with some success but still experimenting with how we can reduce the
throughput impact from osd slow ops.
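
For anyone following along, a sketch of how these can be applied at runtime
(the values are simply what we are experimenting with):

  ceph config set mds mds_max_purge_ops 512
  ceph config set mds mds_max_purge_ops_per_pg 0.10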

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Mon, May 16, 2022 at 9:49 AM Wesley Dillingham 
wrote:

> We have a newly-built pacific (16.2.7) cluster running 8+3 EC jerasure
> ~250 OSDS across 21 hosts which has significantly lower than expected IOPS.
> Only doing about 30 IOPS per spinning disk (with appropriately sized SSD
> bluestore db) around ~100 PGs per OSD. Have around 100 CephFS (ceph fuse
> 16.2.7) clients using the cluster. Cluster regularly reports slow ops from
> the OSDs but the vast majority, 90% plus of the OSDs, are only <50% IOPS
> utilized. Plenty of cpu/ram/network left on all cluster nodes. We have
> looked for hardware (disk/bond/network/mce) issues across the cluster with
> no findings / checked send-qs and received-q's across the cluster to try
> and narrow in on an individual failing component but nothing found there.
> Slow ops are also spread equally across the servers in the cluster. Does
> your cluster report any health warnings (slow ops etc) alongside your
> reduced performance?
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
>
>
> On Mon, May 16, 2022 at 2:00 AM Martin Verges 
> wrote:
>
>> Hello,
>>
>> depending on your workload, drives and OSD allocation size, using the 3+2
>> can be way slower than the 4+2. Maybe give it a small benchmark and try if
>> you see a huge difference. We had some benchmarks with such and they
>> showed
>> quite ugly results in some tests. Best way to deploy EC in our findings is
>> in power of 2, like 2+x, 4+x, 8+x, 16+x. Especially when you deploy OSDs
>> before the Ceph allocation change patch, you might end up consuming way
>> more space if you don't use power of 2. With the 4k allocation size at
>> least this has been greatly improved for newer deployed OSDs.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
>>
>> On Sun, 15 May 2022 at 20:30, stéphane chalansonnet 
>> wrote:
>>
>> > Hi,
>> >
>> > Thank you for your answer.
>> > this is not a good news if you also notice a performance decrease on
>> your
>> > side
>> > No, as far as we know, you cannot downgrade to Octopus.
>> > Going forward seems to be the only way, so Quincy .
>> > We have a a qualification cluster so we can try on it (but full virtual
>> > configuration)
>> >
>> >
>> > We are using 4+2 and 3+2 profile
>> > Are you also on the same profile on your Cluster ?
>> > Maybe replicated profile are not be impacted ?
>> >
>> > Actually, we are trying to recreate one by one the OSD.
>> > some parameters can be only set by this way .
>> > The first storage Node is almost rebuild, we will see if the latencies
>> on
>> > it are below the others ...
>> >
>> > Wait and see .
>> >
>> > On Sun, May 15, 2022 at 10:16, Martin Verges  wrote:
>> >
>> >> Hello,
>> >>
>> >> what exact EC level do you use?
>> >>
>> >> I can confirm, that our internal data shows a performance drop when
>> using
>> >> pacific. So far Octopus is faster and better than pacific but I doubt
>> you
>> >> can roll back to it. We haven't rerun our benchmarks on Quincy yet, but
>> >> according to some presentation it should be faster than pacific. Maybe
>> try
>> >> to jump away from the pacific release into the unknown!
>> >>
>> >> --
>> >> Martin Verges
>> >> Managing director
>> >>
>> >> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
>> >>
>> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> >> CEO: Martin Verges - VAT-ID: DE310638492
>> >> Com. register: Amtsgericht Munich HRB 231263
>> >> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> >>
>> >>
>> >> On Sat, 14 May 2022 at 12:27, stéphane chalansonnet <
>> schal...@gmail.com>
>> >> wrote:
>> >>
>> >>> Hello,
>> >>>
>> >>> After a successful update from Nautilus to Pacific on Centos8.5, we
>> >>> observed some high latencies on our cluster.
>> >>>
>> >>> We did not find very much thing on community related to latencies post
>> >>> migration
>> >>>
>> >>> Our setup is
>> >>> 6x storage Node (256GRAM, 2SSD OSD + 5*6To SATA HDD)
>> >>> Erasure coding profile
>> >>> We have two EC pool :
>> >>> -> Pool1 : Full HDD SAS Drive 6To
>> >>> -> Pool2 : Full SSD Drive
>> >>>
>> >>> Object S3 and RBD block workload
>> >>>
>> >>> Our performanc

[ceph-users] Re: Migration Nautilus to Pacifi : Very high latencies (EC profile)

2022-05-16 Thread Wesley Dillingham
We have a newly-built pacific (16.2.7) cluster running 8+3 EC jerasure ~250
OSDS across 21 hosts which has significantly lower than expected IOPS. Only
doing about 30 IOPS per spinning disk (with appropriately sized SSD
bluestore db) around ~100 PGs per OSD. Have around 100 CephFS (ceph fuse
16.2.7) clients using the cluster. Cluster regularly reports slow ops from
the OSDs but the vast majority, 90% plus of the OSDs, are only <50% IOPS
utilized. Plenty of cpu/ram/network left on all cluster nodes. We have
looked for hardware (disk/bond/network/mce) issues across the cluster with
no findings / checked send-qs and received-q's across the cluster to try
and narrow in on an individual failing component but nothing found there.
Slow ops are also spread equally across the servers in the cluster. Does
your cluster report any health warnings (slow ops etc) alongside your
reduced performance?
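
For comparison, this is roughly what we look at on our side (the OSD id is a
placeholder; the daemon commands need to run on the host that owns the OSD):

  ceph health detail
  ceph daemon osd.<id> dump_blocked_ops
  ceph daemon osd.<id> dump_historic_ops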

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Mon, May 16, 2022 at 2:00 AM Martin Verges 
wrote:

> Hello,
>
> depending on your workload, drives and OSD allocation size, using the 3+2
> can be way slower than the 4+2. Maybe give it a small benchmark and try if
> you see a huge difference. We had some benchmarks with such and they showed
> quite ugly results in some tests. Best way to deploy EC in our findings is
> in power of 2, like 2+x, 4+x, 8+x, 16+x. Especially when you deploy OSDs
> before the Ceph allocation change patch, you might end up consuming way
> more space if you don't use power of 2. With the 4k allocation size at
> least this has been greatly improved for newer deployed OSDs.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
>
> On Sun, 15 May 2022 at 20:30, stéphane chalansonnet 
> wrote:
>
> > Hi,
> >
> > Thank you for your answer.
> > this is not a good news if you also notice a performance decrease on your
> > side
> > No, as far as we know, you cannot downgrade to Octopus.
> > Going forward seems to be the only way, so Quincy .
> > We have a a qualification cluster so we can try on it (but full virtual
> > configuration)
> >
> >
> > We are using 4+2 and 3+2 profile
> > Are you also on the same profile on your Cluster ?
> > Maybe replicated profile are not be impacted ?
> >
> > Actually, we are trying to recreate one by one the OSD.
> > some parameters can be only set by this way .
> > The first storage Node is almost rebuild, we will see if the latencies on
> > it are below the others ...
> >
> > Wait and see .
> >
> > On Sun, May 15, 2022 at 10:16, Martin Verges  wrote:
> >
> >> Hello,
> >>
> >> what exact EC level do you use?
> >>
> >> I can confirm, that our internal data shows a performance drop when
> using
> >> pacific. So far Octopus is faster and better than pacific but I doubt
> you
> >> can roll back to it. We haven't rerun our benchmarks on Quincy yet, but
> >> according to some presentation it should be faster than pacific. Maybe
> try
> >> to jump away from the pacific release into the unknown!
> >>
> >> --
> >> Martin Verges
> >> Managing director
> >>
> >> Mobile: +49 174 9335695  | Chat: https://t.me/MartinVerges
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492
> >> Com. register: Amtsgericht Munich HRB 231263
> >> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> >>
> >>
> >> On Sat, 14 May 2022 at 12:27, stéphane chalansonnet  >
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> After a successful update from Nautilus to Pacific on Centos8.5, we
> >>> observed some high latencies on our cluster.
> >>>
> >>> We did not find very much thing on community related to latencies post
> >>> migration
> >>>
> >>> Our setup is
> >>> 6x storage Node (256GRAM, 2SSD OSD + 5*6To SATA HDD)
> >>> Erasure coding profile
> >>> We have two EC pool :
> >>> -> Pool1 : Full HDD SAS Drive 6To
> >>> -> Pool2 : Full SSD Drive
> >>>
> >>> Object S3 and RBD block workload
> >>>
> >>> Our performances in nautilus, before the upgrade , are acceptable.
> >>> However , the next day , performance dropped by 3 or 4
> >>> Benchmark showed 15KIOPS on flash drive , before upgrade we had
> >>> almost 80KIOPS
> >>> Also, HDD pool is almost down (too much lantencies
> >>>
> >>> We suspected , maybe, an impact on erasure Coding configuration on
> >>> Pacific
> >>> Anyone observed the same behaviour ? any tuning ?
> >>>
> >>> Thank you for your help.
> >>>
> >>> ceph osd tree
> >>> ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT
> >>> PRI-AFF
> >>>  -1 347.61304  root default
> >>>  -3  56.71570  host cnp31tcephosd01
> >>>   0hdd5.63399  osd.0 up   1.0
> >>> 1.0
> >>>   1

[ceph-users] Re: repairing damaged cephfs_metadata pool

2022-05-16 Thread Gregory Farnum
On Tue, May 10, 2022 at 2:47 PM Horvath, Dustin Marshall
 wrote:
>
> Hi there, newcomer here.
>
> I've been trying to figure out if it's possible to repair or recover cephfs 
> after some unfortunate issues a couple of months ago; these couple of nodes 
> have been offline most of the time since the incident.
>
> I'm sure the problem is that I lack the ceph expertise to quite sus out where 
> the broken bits are. This was a 2-node cluster (I know I know) that had a 
> hypervisor primary disk fail, and the entire OS was lost. I reinstalled the 
> hypervisor, rejoined it to the cluster (proxmox), rejoined ceph to the other 
> node, re-added the OSDs. It came back with quorum problems and some PGs were 
> inconsistent and some were lost. Some of that is due to my own fiddling 
> around, which possibly exacerbated things. Eventually I had to edit the 
> monmap down to 1 monitor, which had all kinds of screwy journal issues...it's 
> been a while since I've tried resuscitating this, so the details in my memory 
> are fuzzy.
>
> My cluster health isn't awful. Output is basically this:
> ```
> root@pve02:~# ceph -s
>   cluster:
> id: 8b31840b-5706-4c92-8135-0d6e03976af1
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 filesystem is offline
> 1 mds daemon damaged
> noout flag(s) set
> 16 daemons have recently crashed
>
>   services:
> mon: 1 daemons, quorum pve02 (age 3d)
> mgr: pve01(active, since 4d)
> mds: 0/1 daemons up
> osd: 7 osds: 7 up (since 2d), 7 in (since 7w)
>  flags noout
>
>   data:
>volumes: 0/1 healthy, 1 recovering; 1 damaged
> pools:   5 pools, 576 pgs
> objects: 1.51M objects, 4.0 TiB
> usage:   8.2 TiB used, 9.1 TiB / 17 TiB avail
> pgs: 575 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   241 KiB/s wr, 0 op/s rd, 10 op/s wr
> ```
>
> I've tried a couple times running down the steps in here 
> (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/), but I 
> always hit an error at scan_links, where I get a crash dump of sorts. If I 
> try and mark the cephfs as repaired/joinable, MDS daemons will try and replay 
> and then fail.

Yeah, that generally won't work until the process is fully complete —
otherwise the MDS starts hitting the metadata inconsistencies from
having a halfway-done FS!

> The only occurrences of err/ERR in the MDS logs are a line like this:
> ```
> 2022-05-07T18:31:26.342-0500 7f22b44d8700  1 mds.0.94  waiting for osdmap 
> 301772 (which blocklists prior instance)
> 2022-05-07T18:31:26.346-0500 7f22adccb700 -1 log_channel(cluster) log [ERR] : 
> failed to read JournalPointer: -1 ((1) Operation not permitted)
> 2022-05-07T18:31:26.346-0500 7f22af4ce700  0 mds.0.journaler.pq(ro) error 
> getting journal off disk

That pretty much means the mds log/journal doesn't actually exist. I'm
actually surprised that this is the thing that causes the crash, since you
probably did the "cephfs-journal-tool --rank=0 journal reset" command
in that doc.
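
If you want to double-check what is actually on disk for rank 0, something like
this should show the journal and header state (the filesystem name is a
placeholder):

  cephfs-journal-tool --rank=<fs_name>:0 journal inspect
  cephfs-journal-tool --rank=<fs_name>:0 header get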

But as the page says, these are advanced tools which can wreck your
filesystem if you do them wrong, and the details matter. You'll have
to share as much as you can of what's been done to the cluster. Even
if you did some aborted recovery procedures, just running through it
again may work out. We'd need the scan_links error for certain,
though.
-Greg

> ```
>
> I haven't had much luck on the googles with diagnosing that error; seems 
> uncommon. My hope is that the cephfs_data pool is fine. I actually never had 
> any inconsistent PG issues on a pool other than the metadata pool, so that's 
> the only one that suffered actual acute injury during the hardware 
> failure/quorum loss.
> If I had more experience with the rados tools, I'd probably be more helpful. 
> I have plenty of logs lying about and can perform any diagnoses that might 
> help, but I hate to spam too much here right out of the gate.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] S3 and RBD backup

2022-05-16 Thread Sanjeev Jha
Hi,

Could someone please let me know how to take S3 and RBD backups from the Ceph 
side, and whether it is possible to take backups from the client/user side?

Which tool should I use for the backup?

Best regards,
Sanjeev Kumar Jha
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reasonable MDS rejoin time?

2022-05-16 Thread Felix Lee

Hi, Jos,
Many thanks for your reply.
And sorry, I forgot to mention the version, which is 14.2.22.

Here is the log:
https://drive.google.com/drive/folders/1qzPf64qw16VJDKSzcDoixZ690KL8XSoc?usp=sharing

Here, ceph01 (active) and ceph11 (standby-replay) were the ones that 
suffered the crash. The log didn't tell us much, but several slow requests 
were occurring. And ceph11 had a "cache is too large" warning by the 
time it crashed; I suppose that could happen while doing recovery (each 
MDS has 64 GB of memory, BTW).
ceph16 is the one currently in rejoin; I've turned debug_mds up to 20 for a 
while, captured as ceph-mds.ceph16.log-20220516.gz.



Thanks
&
Best regards,
Felix Lee ~



On 5/16/22 14:45, Jos Collin wrote:
It's hard to suggest without the logs. Do verbose logging debug_mds=20. 
What's the ceph version? Do you have the logs why the MDS crashed?


On 16/05/22 11:20, Felix Lee wrote:

Dear all,
We currently have 7 multi-active MDS, with another 7 standby-replay.
We thought this should cover most of disasters, and it actually did. 
But things just got happened, here is the story:
One of MDS crashed and standby-replay took over, but got stuck at 
resolve state.
Then, the other two MDS(rank 0 and 5) received tones of slow requests, 
and my colleague restarted them, thinking the standby-replay would 
take over immediately (this seemed to be wrong or at least unnecessary 
action, I guess...). Then, it resulted three of them in resolve state...
In the meanwhile, I realized that the first failed rank(rank 2) had 
abnormal memory usage and kept getting crashed, after couple 
restarting, the memory usage was back to normal, and then, those tree 
MDS entered into rejoin state.
Now, this rejoin state is there for three days and keeps going as 
we're speaking. Here, no significant error message shows up even with 
"debug_mds 10", so, we have no idea when it's gonna end and if it's 
really running on the track.
So, I am wondering how do we check MDS rejoin progress/status to make 
sure if it's running normally? Or, how do we estimate the rejoin time 
and maybe improve it? because we always need to tell user the time 
estimation of its recovery.



Thanks
&
Best regards,
Felix Lee ~



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
Felix H.T Lee   Academia Sinica Grid & Cloud.
Tel: +886-2-27898308
Office: Room P111, Institute of Physics, 128 Academia Road, Section 2, 
Nankang, Taipei 115, Taiwan



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd mirroring - journal growing and snapshot high io load

2022-05-16 Thread ronny.lippold




On 2022-05-12 15:29, Arthur Outhenin-Chalandre wrote:


We are going towards mirror snapshots, but we didn't advertise
internally so far and we won't enable it on every images; it would only
be for new volumes if people want explicitly that feature. So we are
probably not going to hit these performance issues that you suffer for
quite some time and the scope of it should be limited...


Thanks for all the information.
We tried snapshot-based mirroring again on Friday. After 30 images, the 
high load came back.

I do not understand what happened.
Six days before the problem started, we added one SSD per host,
and there were no problems. But one week later, performance went down.
We are running completely out of ideas.

Currently, we are trying journal-based mirroring again.
The journal space is growing, and we will wait and see if there is a limit.

Do you have any idea how we can get the space back?
I am thinking about something like resyncing the images that are behind 
the master.
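
Something like this is what I have in mind for the images that are behind
(a sketch; pool and image names are placeholders):

  rbd mirror image status <pool>/<image>
  # request a full resync of this image (run against the non-primary copy)
  rbd mirror image resync <pool>/<image>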


Is anybody else out there using rbd-mirror replication with Ceph?

ronny
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io