[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal or WAL/DB on SSD or HDD? My understanding is that there is no benefit to putting the journal or WAL/DB on SSD with such a solution. It would also eliminate the single point of failure of having all WAL/DBs on one SSD. Just want to confirm.

Another thought is to have separate pools, like an all-SSD pool and an all-HDD pool, each used for a different purpose. For example, images, backups and objects can be in the all-HDD pool and VM volumes in the all-SSD pool.

Thanks!
Tony

> -----Original Message-----
> From: 胡 玮文
> Sent: Monday, October 26, 2020 9:20 AM
> To: Frank Schilder
> Cc: Anthony D'Atri ; ceph-users@ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>
> > On 2020-10-26 at 15:43, Frank Schilder wrote:
> >
> >> I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering.
> >
> > This is actually a good question. I believed that I had seen/heard that somewhere, but I might be wrong.
> >
> > Looking at the definition of a PG, it states that a PG is an ordered set of OSD (IDs) and the first up OSD will be the primary. In other words, it seems that the lowest OSD ID is decisive. If the SSDs were deployed before the HDDs, they have the smallest IDs and will hence be preferred as primary OSDs.
>
> I don’t think this is correct. From my experiments with the previously mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are always on SSD.
>
> I also had a look at the code. If I understand it correctly:
>
> * If the default primary affinity is not changed, the logic about primary affinity is skipped and the primary is the first OSD returned by the CRUSH algorithm [1].
> * The order of OSDs returned by CRUSH still matters if you change the primary affinity. The affinity represents the probability that a test succeeds.
> The first OSD is tested first and therefore has a higher probability of becoming primary. [2]
> * If any OSD has primary affinity = 1.0, its test always succeeds, and no OSD after it will ever be primary.
> * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5. Then the 2nd OSD has a probability of 0.25 of becoming primary and the 3rd a probability of 0.125. Otherwise, the 1st will be primary.
> * If no test succeeds (suppose all OSDs have an affinity of 0), the 1st OSD becomes primary as a fallback.
>
> [1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> [2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
>
> So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient for them to be the primaries in my case.
>
> Do you think I should contribute these findings to the documentation?
>
> > This, however, is not a sustainable situation. Any addition of OSDs will mess this up and the distribution scheme will fail in the future.
> > A way out seems to be:
> >
> > - subdivide your HDD storage using device classes:
> >   * define a device class for HDDs with primary affinity = 0; for example, pick 5 HDDs and change their device class to hdd_np (for "no primary")
> >   * set the primary affinity of these HDD OSDs to 0
> >   * modify your crush rule to use "step take default class hdd_np"
> >   * this will create a pool with primaries on SSD and a balanced storage distribution between SSD and HDD
> >   * all-HDD pools are deployed as usual on class hdd
> >   * when increasing capacity, one needs to take care to add the disks to the hdd_np class and set their primary affinity to 0
> >   * somewhat increased admin effort, but a fully working solution
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________
> > From: Anthony D'Atri
> > Sent: 25 October 2020 17:07:15
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
> >
> >> I'm not entirely sure if primary on SSD will actually make the read happen on SSD.
> >
> > My understanding is that by default reads always happen from the lead OSD in the acting set. Octopus seems to (finally) have an option to spread the reads around, which IIRC defaults to false.
> >
> > I’ve never seen anything that implies that lead OSDs within an acting set are a function of CRUSH rule ordering. I’m not asserting that they aren’t, though, but I’m … skeptical.
> >
> > Setting primary affinity would do the job, and you’d want to have cron continually update it across the cluster to react to topology changes. I was told of this strategy back in 2014, but haven’t personally seen it implemented.
> >
> > That said, HDDs are more of a bottleneck for writes than reads and just might be fine for your application. Tiny reads are going to limit you to some degree regardless.
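Frank's hdd_np recipe could look roughly like this on the command line. This is a sketch only: the OSD IDs and the rule name are examples, and the mixed rule itself has to be written by hand in the decompiled CRUSH map, since it cannot be expressed with `crush rule create-replicated`:

```shell
# Reclassify 5 chosen HDD OSDs into a "no primary" class and make
# sure they are never selected as primary:
for id in 11 12 13 14 15; do          # example OSD IDs
    ceph osd crush rm-device-class osd.$id
    ceph osd crush set-device-class hdd_np osd.$id
    ceph osd primary-affinity osd.$id 0
done

# Edit the CRUSH map to add the mixed SSD/hdd_np rule:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add a rule along these lines to crushmap.txt:
#   rule ssd-primary {
#       id 5
#       type replicated
#       step take default class ssd
#       step chooseleaf firstn 1 type host
#       step emit
#       step take default class hdd_np
#       step chooseleaf firstn -1 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
```

The `firstn 1` step picks the (SSD) primary; `firstn -1` fills the remaining pool-size-minus-one replica slots from the hdd_np class.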
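The primary-affinity selection logic described in the thread (first OSD tested first, affinity = probability the test succeeds, first OSD as fallback if all tests fail) can be sketched as a small probability model. This is a simplified model of the selection loop, not the actual OSDMap.cc implementation:

```python
def primary_probabilities(affinities):
    """Probability of each OSD (in CRUSH return order) becoming primary,
    given its primary affinity in [0, 1]."""
    probs = []
    remaining = 1.0  # probability that no earlier OSD passed its test
    for a in affinities:
        probs.append(remaining * a)  # this OSD is the first to pass
        remaining *= 1.0 - a
    probs[0] += remaining  # fallback: 1st OSD if every test fails
    return probs

# Three OSDs, each with affinity 0.5 -> [0.625, 0.25, 0.125],
# matching the 1/4 and 1/8 figures quoted in the thread.
print(primary_probabilities([0.5, 0.5, 0.5]))

# An OSD with affinity 1.0 shadows every OSD after it.
print(primary_probabilities([1.0, 0.5]))
```

This also illustrates why setting affinity 1.0 on all SSD OSDs is enough in the setup discussed above: as long as an SSD comes first in the CRUSH result, nothing after it can become primary.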
[ceph-users] pg xyz is stuck undersized for long time
Hi all,

I moved the crush location of 8 OSDs and rebalancing went on happily (misplaced objects only). Today, osd.1 crashed, restarted and rejoined the cluster. However, it seems not to re-join some PGs it was a member of. I now have undersized PGs for no real reason I can see:

PG_DEGRADED Degraded data redundancy: 52173/2268789087 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
    pg 11.52 is stuck undersized for 663.929664, current state active+undersized+remapped+backfilling, last acting [237,60,2147483647,74,233,232,292,86]

The up and acting sets are:

    "up": [ 237, 2, 74, 289, 233, 232, 292, 86 ],
    "acting": [ 237, 60, 2147483647, 74, 233, 232, 292, 86 ],

How can I get the PG to complete peering and osd.1 to join? I have an unreasonable number of degraded objects where the missing part is on this OSD.

For completeness, here is the cluster status:

# ceph status
  cluster:
    id:     ...
    health: HEALTH_ERR
            noout,norebalance flag(s) set
            1 large omap objects
            35815902/2268938858 objects misplaced (1.579%)
            Degraded data redundancy: 46122/2268938858 objects degraded (0.002%), 2 pgs degraded, 7 pgs undersized
            Degraded data redundancy (low space): 28 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
    mgr: ceph-01(active), standbys: ceph-03, ceph-02
    mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
    osd: 299 osds: 275 up, 275 in; 301 remapped pgs
         flags noout,norebalance

  data:
    pools:   11 pools, 3215 pgs
    objects: 268.8 M objects, 675 TiB
    usage:   854 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     46122/2268938858 objects degraded (0.002%)
             35815902/2268938858 objects misplaced (1.579%)
             2907 active+clean
             219  active+remapped+backfill_wait
             47   active+remapped+backfilling
             28   active+remapped+backfill_wait+backfill_toofull
             6    active+clean+scrubbing+deep
             5    active+undersized+remapped+backfilling
             2    active+undersized+degraded+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   13 MiB/s rd, 196 MiB/s wr, 2.82 kop/s rd, 1.81 kop/s wr
    recovery: 57 MiB/s, 14 objects/s

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
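A few first steps one might try for a PG stuck like the one above (a sketch using standard Ceph admin tools; the PG and OSD IDs are the ones from the status output). Note that 2147483647 in an acting set is CRUSH_ITEM_NONE, i.e. "no OSD mapped to this slot":

```shell
# What does the PG itself report about peering and the empty slot?
ceph pg 11.52 query

# Ask the PG to re-run peering (available on recent Ceph releases);
# often enough after an OSD has restarted
ceph pg repeer 11.52

# Alternatively, bounce the OSD so it re-registers with its PGs
systemctl restart ceph-osd@1

# The cluster has noout/norebalance set; remember to unset the flags
# once the situation is resolved:
ceph osd unset norebalance
ceph osd unset noout
```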
[ceph-users] Re: Not able to read file from ceph kernel mount
Hi,

At last, the problem is fixed for now by adding the cluster network IP to the second interface. But it looks weird why the client wants to communicate with the cluster IP. Does anyone have an idea why we need to provide the cluster IP to a client mounting through the kernel?

Initially, when the cluster was set up, it had only a public network. A cluster network with cluster IPs was added later, and it was working fine until the restart of the entire cluster.

regards
Amudhan P

On Fri, Nov 6, 2020 at 12:02 AM Amudhan P wrote:
>
> Hi,
> I am trying to read a file from my ceph kernel mount; the file read stays in bytes for a very long time and I am getting the below error msg in dmesg:
>
> [  167.591095] ceph: loaded (mds proto 32)
> [  167.600010] libceph: mon0 10.0.103.1:6789 session established
> [  167.601167] libceph: client144519 fsid f8bc7682-0d11-11eb-a332-0cc47a5ec98a
> [  272.132787] libceph: osd1 10.0.104.1:6891 socket closed (con state CONNECTING)
>
> The Ceph cluster status is healthy, no errors. It was working fine until my entire cluster went down.
>
> Using Ceph Octopus on Debian.
>
> Regards
> Amudhan P
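One way to investigate this (a hedged sketch, not a diagnosis): kernel clients learn OSD addresses from the OSD map, so if an OSD advertises a cluster-network address where clients expect its public address (here, the client tries 10.0.104.1, which looks like the cluster network), clients end up knocking on the wrong network. The advertised addresses can be checked with standard commands:

```shell
# Which addresses does each OSD advertise? Clients connect to the
# public address; the cluster address is only for OSD-to-OSD traffic.
ceph osd dump | grep '^osd\.'

# Current network settings from the config database
# (available on Mimic and later):
ceph config get osd public_network
ceph config get osd cluster_network
```

If an OSD's public address shows up in the 10.0.104.x range, that would explain why the client needed a route to the cluster network.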
[ceph-users] Re: Ceph as a distributed filesystem and kerberos integration
NFS also works. I recommend NFS 4.1+ for performance reasons.

On Sat, Nov 7, 2020 at 4:51 AM Marco Venuti wrote:
>
> Hi,
> I have the same use-case.
> Is there some alternative to Samba in order to export CephFS to the end user? I am somewhat concerned with its potential security vulnerabilities, which appear to be quite frequent.
> Specifically, I need server-side enforced permissions, and possibly Kerberos authentication and server-side enforced quotas.
>
> Thank you,
> Marco
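For the record, exporting CephFS over NFS is typically done with NFS-Ganesha's CEPH FSAL. A minimal export block might look like this (a sketch only; the pseudo-path is made up, and the Kerberos flavours in SecType require a working krb5/idmap setup on both server and clients):

```
EXPORT {
    Export_ID = 1;
    Path = /;                # path within CephFS
    Pseudo = /cephfs;        # where clients see it in the NFSv4 namespace
    Access_Type = RW;
    Protocols = 4;           # NFSv4 only, as recommended above
    Squash = No_Root_Squash;
    SecType = krb5, krb5i, krb5p;   # Kerberos auth/integrity/privacy
    FSAL {
        Name = CEPH;
    }
}
```

A client would then mount with something like `mount -t nfs4 -o sec=krb5 ganesha.example.com:/cephfs /mnt` (hostname hypothetical), giving server-side enforced permissions with Kerberos authentication.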