[ceph-users] Re: Data loss on appends, prod outage

2021-09-10 Thread 胡 玮文
Thanks for sharing this. Following this thread, I realize we are also affected by this bug. We have multiple reports of corrupted TensorBoard event files, which I think are caused by this bug. We are using Ubuntu 20.04; the affected kernel versions should be the HWE kernels > 5.11 and < 5.11.0-34. The

[ceph-users] mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread mk
Hi CephFolks, I have a cluster on 14.2.21-22/Ubuntu 18.04 with 3 mons. After one mon (amon3) went down and was restarted, it is stuck on probing and is also out of quorum. We have changed nothing and it was working; regardless, we have checked that the TCP ports / MTUs of the mons are open and reachable. Appreciate any

[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-10 Thread Luis Domingues
Thanks for your observation. Indeed, I do not get a drop in performance when upgrading from Nautilus to Octopus. But even using Pacific 16.1.0, the performance just goes down, so I guess we run into the same issue somehow. I do not think just staying on Octopus is a solution, as it will reach EOL

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread Eugen Block
I don't have an explanation, but removing the mon store from the failed mon has resolved similar issues in the past. Could you give that a try? Quote from mk: Hi CephFolks, I have a cluster on 14.2.21-22/Ubuntu 18.04 with 3 mons. After one mon (amon3) went down and was restarted, it is stuck on probing

[ceph-users] The best way to back up S3 buckets

2021-09-10 Thread huxia...@horebdata.cn
Dear Ceph folks, This is closely related to my previous questions on how to do RadosGW remote replication safely and reliably. My major task is to back up S3 buckets. One obvious method is to use Ceph RadosGW multisite replication. I am wondering whether this is the best way to do S3 storage

[ceph-users] Re: The best way to back up S3 buckets

2021-09-10 Thread Janne Johansson
On Fri 10 Sep 2021 at 12:56, huxia...@horebdata.cn wrote: > Dear Ceph folks, > This is closely related to my previous questions on how to do RadosGW > remote replication safely and reliably. > My major task is to back up S3 buckets. One obvious method is to use Ceph > RadosGW multisite replicat

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread mk
Thx Eugen, so just stop the mon, remove/rename only store.db and start the mon again? BR Max > On 10. Sep 2021, at 12:50, Eugen Block wrote: > > I don't have an explanation, but removing the mon store from the failed mon > has resolved similar issues in the past. Could you give that a try? > > > Zitat v

[ceph-users] Re: The best way to back up S3 buckets

2021-09-10 Thread mhnx
If you need instant backup and lifecycle rules, then multisite is the best choice. If you need daily backup and do not have a different Ceph cluster, then rclone will be your best mate. On Fri, 10 Sep 2021 at 13:56, huxia...@horebdata.cn wrote: > Dear Ceph folks, > > This is closely related
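
For reference, a minimal rclone setup for mirroring an RGW bucket to a second S3 target could look roughly like this; the remote names, endpoints and credentials below are placeholders, not taken from the thread:

    # ~/.config/rclone/rclone.conf -- two hypothetical S3 remotes
    [ceph-src]
    type = s3
    provider = Ceph
    endpoint = http://rgw-src.example.com:7480
    access_key_id = SRCKEY
    secret_access_key = SRCSECRET

    [ceph-dst]
    type = s3
    provider = Ceph
    endpoint = http://rgw-dst.example.com:7480
    access_key_id = DSTKEY
    secret_access_key = DSTSECRET

    # one-way copy of a bucket; --checksum compares MD5s instead of size/mtime
    rclone sync ceph-src:mybucket ceph-dst:mybucket-backup --transfers 32 --checksum

A cron-friendly wrapper around a command like the last line is what the "write your own script" remark later in the thread refers to.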

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread Eugen Block
Yes, give it a try. If the cluster is otherwise healthy it shouldn't be a problem. Quote from mk: Thx Eugen, so just stop the mon, remove/rename only store.db and start the mon again? BR Max On 10. Sep 2021, at 12:50, Eugen Block wrote: I don't have an explanation, but removing the mon store from th

[ceph-users] Re: The best way to back up S3 buckets

2021-09-10 Thread huxia...@horebdata.cn
Thanks a lot for the quick response. Will rclone be able to handle PB-scale data backup? Does anyone have experience using rclone to back up a massive S3 object store, and what lessons were learned? best regards, Samuel huxia...@horebdata.cn From: mhnx Date: 2021-09-10 13:07 To: huxiaoyu CC: ceph-users Subj

[ceph-users] SSDs/HDDs in ceph Octopus

2021-09-10 Thread Luke Hall
Hi, We have six OSD machines, each containing 6x4TB HDDs plus one NVMe for RocksDB. I need to plan upgrading these machines to all or partly SSDs. The question I have is: I know that Ceph recognises SSDs as distinct from HDDs from their physical device IDs etc. In a setup with 50/50 HDDs/SS

[ceph-users] Re: The best way to back up S3 buckets

2021-09-10 Thread mhnx
It's great. I have moved millions of objects between two clusters and it's a piece of artwork by an awesome weirdo. Memory and CPU usage is epic. It is very fast and it can use metadata, md5, etc. But you need to write your own script if you want a cron job. On Fri, 10 Sep 2021 at 14:19, huxia...@horebdata

[ceph-users] Re: SSDs/HDDs in ceph Octopus

2021-09-10 Thread Robert Sander
Hi Luke, On 10.09.21 at 13:27, Luke Hall wrote: I know that Ceph recognises SSDs as distinct from HDDs from their physical device IDs etc. In a setup with 50/50 HDDs/SSDs does Ceph do anything natively to distinguish between the two speeds of storage? I.e. do you need to create separate pools

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread mk
No, the mon daemon doesn't start: Sep 10 13:35:55 amon3 systemd[1]: ceph-mon@amon3.service: Start request repeated too quickly. Sep 10 13:35:55 amon3 systemd[1]: ceph-mon@amon3.service: Failed with result 'exit-code'. Sep 10 13:35:55 amon3 systemd[1]: Failed to start Ceph cluster monitor daemon.

[ceph-users] List pg with heavily degraded objects

2021-09-10 Thread George Shuklin
Hello. I wonder if there is a way to see how many replicas are available for each object (or, at least, PG-level statistics). Basically, if I have a damaged cluster, I want to see the scale of the damage, and I want to see the most degraded objects (those which have 1 copy, then objects with 2 copies, etc

[ceph-users] Re: SSDs/HDDs in ceph Octopus

2021-09-10 Thread Luke Hall
It is best practice to have rulesets that select either the hdd or the ssd class and then assign these rules to different pools. It is not good practice to just mix these classes in one pool, except for a transition period like with your project. The performance difference is just too large. P
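
As a rough illustration of the device-class approach (the pool names here are placeholders), a per-class rule can be created and assigned like this:

    # replicated CRUSH rules that each select only one device class
    ceph osd crush rule create-replicated replicated-hdd default host hdd
    ceph osd crush rule create-replicated replicated-ssd default host ssd

    # point each pool at the class it should live on
    ceph osd pool set mypool-hdd crush_rule replicated-hdd
    ceph osd pool set mypool-ssd crush_rule replicated-ssd

    # verify the auto-detected device classes
    ceph osd crush class ls
    ceph osd tree

Changing a pool's crush_rule keeps the pool online; the data migrates in the background to the OSDs selected by the new rule.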

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread Eugen Block
Is there anything wrong with the directory permissions? What does the mon log tell you? Quote from mk: No, the mon daemon doesn't start: Sep 10 13:35:55 amon3 systemd[1]: ceph-mon@amon3.service: Start request repeated too quickly. Sep 10 13:35:55 amon3 systemd[1]: ceph-mon@amon3.service: Fa

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread mk
I have just seen that on the failed mon the store.db size is 50K, but on both other healthy mons it is 151M. What is the best practice? Redeploy the failed mon? > On 10. Sep 2021, at 13:08, Eugen Block wrote: > > Yes, give it a try. If the cluster is otherwise healthy it shouldn't be a > problem. > > > Zitat

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread Eugen Block
Redeploying would probably be the fastest way if you don't want your cluster in a degraded state for too long. You can check the logs afterwards to see what went wrong. Quote from mk: I have just seen that on the failed mon the store.db size is 50K, but on both other healthy mons it is 151M. What is th
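
For a package-based (non-cephadm) Nautilus install, redeploying essentially follows the documented remove-and-re-add-a-monitor procedure; a rough sketch, using the hostname from the thread and default paths:

    # on the broken mon host
    systemctl stop ceph-mon@amon3
    mv /var/lib/ceph/mon/ceph-amon3 /var/lib/ceph/mon/ceph-amon3.bak

    # from a host with an admin keyring that can reach the surviving quorum
    # (copy the resulting files to amon3 if fetched elsewhere)
    ceph mon remove amon3
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap

    # back on amon3: rebuild the store and let it sync from the quorum
    mkdir /var/lib/ceph/mon/ceph-amon3
    ceph-mon -i amon3 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-amon3
    systemctl start ceph-mon@amon3
    # depending on mon_host/monmap settings the mon may also need to be
    # re-added explicitly: ceph mon add amon3 <mon-ip>

Wido's script referenced later in the thread is another way to approach this.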

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread Janne Johansson
On Fri 10 Sep 2021 at 13:55, George Shuklin wrote: > Hello. > I wonder if there is a way to see how many replicas are available for > each object (or, at least, PG-level statistics). Basically, if I have a > damaged cluster, I want to see the scale of the damage, and I want to see > the most degraded o

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread Konstantin Shalygin
Hi, One time I searched for undersized PGs with only one replica (as I remember). The snippet is left in my notes, so maybe it will help you: ceph pg dump | grep undersized | awk '{print $1 " " $17 " " $18 " " $19}' | awk -vOFS='\t' '{ print length($4), $0 }' | sort -k1,1n | cut -f2- | head k > On 10 Sep 2021, at

[ceph-users] Re: mon stuck on probing and out of quorum, after down and restart

2021-09-10 Thread Konstantin Shalygin
Yes, try using Wido's script (remove the quorum logic or execute the commands by hand): https://gist.github.com/wido/561c69dc2ec3a49d1dba10a59b53dfe5 k > On 10 Sep 2021, at 14:57, mk wrote: > > I have just seen that on failed mon store

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin
On 10/09/2021 15:19, Janne Johansson wrote: On Fri 10 Sep 2021 at 13:55, George Shuklin wrote: Hello. I wonder if there is a way to see how many replicas are available for each object (or, at least, PG-level statistics). Basically, if I have a damaged cluster, I want to see the scale of the damage,

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread Janne Johansson
On Fri 10 Sep 2021 at 14:27, George Shuklin wrote: > On 10/09/2021 15:19, Janne Johansson wrote: > >> Is there a way? pg list is not very informative, as it does not show > >> how badly 'unreplicated' the data are. > > ceph pg dump should list all PGs and how many active OSDs they have in > > a list
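
A rough way to sort PGs by how few OSDs are actually serving them; the column positions of pgs_brief output can shift between releases, so treat this as a sketch:

    # pgs_brief columns: PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
    # count the acting-set members and list the most degraded PGs first
    ceph pg dump pgs_brief 2>/dev/null \
      | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { n = split($5, a, ","); print n, $1, $2 }' \
      | sort -n | head -20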

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin
On 10/09/2021 14:49, George Shuklin wrote: Hello. I wonder if there is a way to see how many replicas are available for each object (or, at least, PG-level statistics). Basically, if I have a damaged cluster, I want to see the scale of the damage, and I want to see the most degraded objects (which

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread Janne Johansson
On Fri 10 Sep 2021 at 14:39, George Shuklin wrote: > > On 10/09/2021 14:49, George Shuklin wrote: > > Hello. > > > > I wonder if there is a way to see how many replicas are available for > > each object (or, at least, PG-level statistics). Basically, if I have a > > damaged cluster, I want to see t

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin
On 10/09/2021 15:37, Janne Johansson wrote: On Fri 10 Sep 2021 at 14:27, George Shuklin wrote: On 10/09/2021 15:19, Janne Johansson wrote: Is there a way? pg list is not very informative, as it does not show how badly 'unreplicated' the data are. ceph pg dump should list all PGs and how many ac

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread George Shuklin
On 10/09/2021 15:54, Janne Johansson wrote: On Fri 10 Sep 2021 at 14:39, George Shuklin wrote: On 10/09/2021 14:49, George Shuklin wrote: Hello. I wonder if there is a way to see how many replicas are available for each object (or, at least, PG-level statistics). Basically, if I have a damaged

[ceph-users] OSD Service Advanced Specification db_slots

2021-09-10 Thread Edward R Huyer
I recently upgraded my existing cluster to Pacific and cephadm, and need to reconfigure all the (rotational) OSDs to use NVMe drives for DB storage. I think I have a reasonably good idea of how that's going to work, but the use of db_slots and limit in the OSD service specification has me scratch
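
For context, the relevant part of a cephadm OSD service specification looks roughly like this; the service_id, placement and slot count are made up for illustration, and whether db_slots is actually honoured depends on the ceph-volume version behind the orchestrator:

    service_type: osd
    service_id: hdd-osds-with-nvme-db
    placement:
      host_pattern: 'osd*'
    spec:
      data_devices:
        rotational: 1
      db_devices:
        rotational: 0
      db_slots: 6   # intended: carve roughly 6 DB volumes per NVMe device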

[ceph-users] Re: List pg with heavily degraded objects

2021-09-10 Thread Janne Johansson
> No, I'm worried about observability of the situation when data are in a > single copy (which I consider a bit of an emergency). I've just created a scenario > where only a single server (2 OSDs) got data on it, and right after > replication started, I can't detect that it's THAT bad. I've updated the If you h

[ceph-users] How many concurrent users can be supported by a single Rados gateway

2021-09-10 Thread huxia...@horebdata.cn
Dear Cephers, I am planning a Ceph cluster (Luminous 12.2.13) for hosting online courses for a university. The data would mostly be video media, and thus a 4+2 EC-coded object store together with the CivetWeb RADOS gateway will be utilized. We plan to use 4 physical machines as Rados gateway sole
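
For a Luminous-era gateway the main concurrency knobs are the CivetWeb thread count and the RGW thread pool, roughly one thread per in-flight request per gateway; the values below are illustrative, not sizing advice:

    # ceph.conf on each gateway host (section name is a placeholder)
    [client.rgw.gw1]
    rgw_frontends = civetweb port=7480 num_threads=512
    rgw_thread_pool_size = 512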

[ceph-users] Re: How many concurrent users can be supported by a single Rados gateway

2021-09-10 Thread Konstantin Shalygin
> On 10 Sep 2021, at 18:04, huxia...@horebdata.cn wrote: > > I am planning a Ceph cluster (Luminous 12.2.13) for hosting online courses > for a university. The data would mostly be video media, and thus a 4+2 EC- > coded object store together with the CivetWeb RADOS gateway will be utilized. > >

[ceph-users] Re: OSD Service Advanced Specification db_slots

2021-09-10 Thread Matthew Vernon
On 10/09/2021 15:20, Edward R Huyer wrote: Question 2: If db_slots still *doesn't* work, is there a coherent way to divide up a solid state DB drive for use by a bunch of OSDs when the OSDs may not all be created in one go? At first I thought it was related to limit, but re-reading the advance

[ceph-users] Re: How many concurrent users can be supported by a single Rados gateway

2021-09-10 Thread Eugen Block
The first suggestion is not to use Luminous since it's already EOL. We noticed major improvements in performance when upgrading from L to Nautilus, and N will also be EOL soon. Since there are some reports about performance degradation when upgrading to Pacific, I would recommend using Octo

[ceph-users] Re: How many concurrent users can be supported by a single Rados gateway

2021-09-10 Thread Konstantin Shalygin
Nautilus is already EOL too; commits are not backported to this branch, only by companies who made products on this release and can verify patches themselves. k Sent from my iPhone > On 10 Sep 2021, at 18:23, Eugen Block wrote: > Nautilus, and N will also be EOL soon

[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-10 Thread Igor Fedotov
Hi Luis, there is some chance that you're hit by https://tracker.ceph.com/issues/52089. What is your physical DB volume configuration - are there fast standalone disks for that? If so, are they showing high utilization during the benchmark? It makes sense to try 16.2.6 once it is available - would the prob
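
To answer the utilization question, something along these lines during the benchmark is usually enough; the device path and OSD id are placeholders:

    # watch the standalone DB/WAL device while the benchmark runs
    iostat -xm 1 /dev/nvme0n1

    # BlueFS counters for one OSD (bytes written to the DB device,
    # spillover to the slow device, etc.)
    ceph daemon osd.0 perf dump bluefs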

[ceph-users] Ignore Ethernet interface

2021-09-10 Thread Dominik Baack
Hi, we are currently trying to deploy CephFS on 7 storage nodes connected by two InfiniBand ports and an Ethernet port for external communication. For various reasons the network interfaces are mapped to the same IP range, e.g. x.x.x.15y (eno1), x.x.x.17y (ib1), x.x.x.18y (ib2), with x constan
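
Since all three interfaces sit in the same subnet, public_network alone cannot tell them apart; one possible (untested here) approach is to pin addresses per daemon, e.g.:

    # ceph.conf fragment on one storage node; the osd id is a placeholder
    # and the x.x.x.* values are kept as in the post. Monitor addresses
    # are taken from the monmap / mon_host rather than per-interface options.
    [osd.12]
    public_addr = x.x.x.17y
    cluster_addr = x.x.x.18y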

[ceph-users] Re: Drop of performance after Nautilus to Pacific upgrade

2021-09-10 Thread Lomayani S. Laizer
Hello, I might be hit by the same bug. After upgrading from Octopus to Pacific my cluster is slower by around 2-3 times. I will try 16.2.6 when it is out. On Fri, Sep 10, 2021 at 6:58 PM Igor Fedotov wrote: > Hi Luis, > > there is some chance that you're hit by https://tracker.ceph.com/issues/52089. > What

[ceph-users] Re: How many concurrent users can be supported by a single Rados gateway

2021-09-10 Thread huxia...@horebdata.cn
Thanks for the suggestions. My viewpoints may be wrong, but I think stability is paramount for us, and an older version such as Luminous may be much better battle-tested than recent ones. Unless there are some instability or bug reports, I would still trust older versions. Just my own preferen