[ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool

2020-11-07 Thread Tony Liu
Is it FileStore or BlueStore? With this SSD-HDD solution, is the journal
or WAL/DB on SSD or HDD? My understanding is that there is no benefit to
putting the journal or WAL/DB on SSD with such a solution. It would also
eliminate the single point of failure of having all WAL/DBs on one SSD.
Just want to confirm.

Another thought is to have separate pools, like an all-SSD pool and an
all-HDD pool. Each pool would be used for a different purpose. For example,
images, backups and objects could go in the all-HDD pool, and VM volumes in
the all-SSD pool.
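
Something like the following is what I have in mind (just a sketch; pool
names, PG counts and rule names are only examples):

    # one CRUSH rule per device class
    ceph osd crush rule create-replicated rule-ssd default host ssd
    ceph osd crush rule create-replicated rule-hdd default host hdd
    # new pools pinned to a class
    ceph osd pool create volumes 256 256 replicated rule-ssd
    ceph osd pool create backups 256 256 replicated rule-hdd
    # or move an existing pool
    ceph osd pool set images crush_rule rule-hdd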


Thanks!
Tony
> -Original Message-
> From: 胡 玮文 
> Sent: Monday, October 26, 2020 9:20 AM
> To: Frank Schilder 
> Cc: Anthony D'Atri ; ceph-users@ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
> replicated pool
> 
> 
> > On 2020-10-26, at 15:43, Frank Schilder  wrote:
> >
> > 
> >> I’ve never seen anything that implies that lead OSDs within an acting
> set are a function of CRUSH rule ordering.
> >
> > This is actually a good question. I believed that I had seen/heard
> that somewhere, but I might be wrong.
> >
> > Looking at the definition of a PG, it states that a PG is an ordered
> > set of OSD IDs and that the first up OSD will be the primary. In other
> > words, it seems that the lowest OSD ID is decisive. If the SSDs were
> > deployed before the HDDs, they have the smallest IDs and, hence, will
> > be preferred as primary OSDs.
> 
> I don’t think this is correct. From my experiments, using the previously
> mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the
> primary OSDs are always SSDs.
> 
> I also had a look at the code; if I understand it correctly:
> 
> * If the default primary affinity is not changed, then the logic about
> primary affinity is skipped, and the primary will be the first OSD
> returned by the CRUSH algorithm [1].
> 
> * The order of OSDs returned by CRUSH still matters if you have changed
> the primary affinity. The affinity represents the probability of a test
> succeeding. The first OSD is tested first and therefore has a higher
> probability of becoming primary. [2]
>   * If any OSD has primary affinity = 1.0, its test will always succeed,
> and any OSD after it will never be primary.
>   * Suppose CRUSH returned 3 OSDs, each with primary affinity set to
> 0.5. Then the 2nd OSD has a probability of 0.25 of becoming primary and
> the 3rd a probability of 0.125; otherwise, the 1st will be primary.
>   * If no test succeeds (suppose all OSDs have an affinity of 0), the
> 1st OSD will be primary as a fallback.
> 
> [1]:
> https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
> [2]:
> https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
> 
> So, setting the primary affinity of all SSD OSDs to 1.0 should be
> sufficient for them to be the primaries in my case.
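> 
> For reference, the affinity can be set per OSD with something like this
> (OSD IDs are only examples):
> 
>   ceph osd primary-affinity osd.12 1.0   # SSD, eligible to be primary
>   ceph osd primary-affinity osd.34 0     # HDD, never chosen as primary
> 
> (On older releases mon_osd_allow_primary_affinity may have to be enabled
> first.)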
> 
> Do you think I should contribute these findings to the documentation?
> 
> > This, however, is not a sustainable situation. Any addition of OSDs
> > will mess this up and the distribution scheme will fail in the future.
> > A way out seems to be:
> >
> > - subdivide your HDD storage using device classes:
> > * define a device class for HDDs with primary affinity=0, for example,
> > pick 5 HDDs and change their device class to hdd_np (for no primary)
> > * set the primary affinity of these HDD OSDs to 0
> > * modify your crush rule to use "step take default class hdd_np"
> > * this will create a pool with primaries on SSD and balanced storage
> > distribution between SSD and HDD
> > * all-HDD pools deployed as usual on class hdd
> > * when increasing capacity, one needs to take care of adding new disks
> > to the hdd_np class and setting their primary affinity to 0
> > * somewhat increased admin effort, but a fully working solution (see
> > the command sketch below)
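> >
> > A rough command sketch (untested; OSD IDs are only examples):
> >
> >   ceph osd crush rm-device-class osd.5 osd.6 osd.7 osd.8 osd.9
> >   ceph osd crush set-device-class hdd_np osd.5 osd.6 osd.7 osd.8 osd.9
> >   for i in 5 6 7 8 9; do ceph osd primary-affinity osd.$i 0; done
> >   # decompile the CRUSH map, change the rule to
> >   # "step take default class hdd_np", recompile and inject it:
> >   ceph osd getcrushmap -o cm.bin && crushtool -d cm.bin -o cm.txt
> >   crushtool -c cm.txt -o cm.new && ceph osd setcrushmap -i cm.new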
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Anthony D'Atri 
> > Sent: 25 October 2020 17:07:15
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD
> > replicated pool
> >
> >> I'm not entirely sure if primary on SSD will actually make the read
> happen on SSD.
> >
> > My understanding is that by default reads always happen from the lead
> OSD in the acting set.  Octopus seems to (finally) have an option to
> spread the reads around, which IIRC defaults to false.
> >
> > I’ve never seen anything that implies that lead OSDs within an acting
> set are a function of CRUSH rule ordering. I’m not asserting that they
> aren’t though, but I’m … skeptical.
> >
> > Setting primary affinity would do the job, and you’d want to have cron
> continually update it across the cluster to react to topology changes.
> I was told of this strategy back in 2014, but haven’t personally seen it
> implemented.
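> >
> > Something like this in cron, I suppose (just a sketch, assuming the
> > hdd/ssd device classes are set):
> >
> >   for id in $(ceph osd crush class ls-osd hdd); do
> >       ceph osd primary-affinity "$id" 0
> >   done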
> >
> > That said, HDDs are more of a bottleneck for writes than reads and
> just might be fine for your application.  Tiny reads are going to limit
> you to some degree regardless 

[ceph-users] pg xyz is stuck undersized for long time

2020-11-07 Thread Frank Schilder
Hi all,

I moved the crush location of 8 OSDs and rebalancing went on happily (misplaced 
objects only). Today, osd.1 crashed, restarted and rejoined the cluster. 
However, it seems not to have re-joined some PGs it was a member of. I now have 
undersized PGs for no good reason, as far as I can tell:

PG_DEGRADED Degraded data redundancy: 52173/2268789087 objects degraded 
(0.002%), 2 pgs degraded, 7 pgs undersized
pg 11.52 is stuck undersized for 663.929664, current state 
active+undersized+remapped+backfilling, last acting 
[237,60,2147483647,74,233,232,292,86]

The up and acting sets are:

"up": [
237,
2,
74,
289,
233,
232,
292,
86
],
"acting": [
237,
60,
2147483647,
74,
233,
232,
292,
86
],

How can I get the PG to complete peering and osd.1 to join? I have an 
unreasonable number of degraded objects where the missing part is on this OSD.
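
For reference, the 2147483647 in the acting set is just the placeholder for 
"no OSD here" (CRUSH_ITEM_NONE). Unless someone has a better idea, what I am 
considering is to look at the peering state and, if nothing obvious shows up, 
nudge the PG to re-peer by marking the OSD down (it should re-assert itself as 
up right away):

# ceph pg 11.52 query | less    # check recovery_state for what blocks peering
# ceph osd down 1               # forces the PGs on osd.1 to re-peer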

For completeness, here the cluster status:

# ceph status
  cluster:
id: ...
health: HEALTH_ERR
noout,norebalance flag(s) set
1 large omap objects
35815902/2268938858 objects misplaced (1.579%)
Degraded data redundancy: 46122/2268938858 objects degraded 
(0.002%), 2 pgs degraded, 7 pgs undersized
Degraded data redundancy (low space): 28 pgs backfill_toofull
 
  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-01(active), standbys: ceph-03, ceph-02
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 299 osds: 275 up, 275 in; 301 remapped pgs
 flags noout,norebalance
 
  data:
pools:   11 pools, 3215 pgs
objects: 268.8 M objects, 675 TiB
usage:   854 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 46122/2268938858 objects degraded (0.002%)
 35815902/2268938858 objects misplaced (1.579%)
 2907 active+clean
 219  active+remapped+backfill_wait
 47   active+remapped+backfilling
 28   active+remapped+backfill_wait+backfill_toofull
 6active+clean+scrubbing+deep
 5active+undersized+remapped+backfilling
 2active+undersized+degraded+remapped+backfilling
 1active+clean+scrubbing
 
  io:
client:   13 MiB/s rd, 196 MiB/s wr, 2.82 kop/s rd, 1.81 kop/s wr
recovery: 57 MiB/s, 14 objects/s

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Re: Not able to read file from ceph kernel mount

2020-11-07 Thread Amudhan P
Hi,

At last, the problem is fixed for now by adding a cluster-network IP to the
second interface.

But it looks weird that the client wants to communicate with the cluster IP.

Does anyone have an idea why we need to provide the cluster IP to a client
mounting through the kernel?

Initially, when the cluster was set up, it had only a public network. Later I
added a cluster network with cluster IPs, and it was working fine until the
restart of the entire cluster.
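
In case it helps with debugging, this is roughly how the addresses an OSD
advertises can be checked (osd.1 taken as an example):

ceph config dump | grep -E 'public_network|cluster_network'
ceph osd dump | grep '^osd.1 '    # shows both the public and cluster address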

regards
Amudhan P

On Fri, Nov 6, 2020 at 12:02 AM Amudhan P  wrote:

> Hi,
> I am trying to read a file from my ceph kernel mount; the read stays stuck
> at just a few bytes for a very long time and I am getting the below error
> msg in dmesg.
>
> [  167.591095] ceph: loaded (mds proto 32)
> [  167.600010] libceph: mon0 10.0.103.1:6789 session established
> [  167.601167] libceph: client144519 fsid f8bc7682-0d11-11eb-a332-
> 0cc47a5ec98a
> [  272.132787] libceph: osd1 10.0.104.1:6891 socket closed (con state
> CONNECTING)
>
> Ceph cluster status is healthy, no errors. It was working fine until my
> entire cluster went down.
>
> Using Ceph Octopus on Debian.
>
> Regards
> Amudhan P
>


[ceph-users] Re: Ceph as a distributed filesystem and kerberos integration

2020-11-07 Thread Nathan Fish
NFS also works. I recommend NFS 4.1+ for performance reasons.
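
With a CephFS directory re-exported that way (e.g. via nfs-ganesha's Ceph
FSAL), a client mount could look roughly like this (hostname and export path
are made up; pick sec=krb5/krb5i/krb5p to match your Kerberos setup):

mount -t nfs -o vers=4.1,sec=krb5p nfs-gw.example.com:/cephfs /mnt/cephfs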

On Sat, Nov 7, 2020 at 4:51 AM Marco Venuti  wrote:
>
> Hi,
> I have the same use-case.
> Is there some alternative to Samba in order to export CephFS to the end
> user? I am somewhat concerned with its potential security
> vulnerabilities, which appear to be quite frequent.
> Specifically, I need server-side enforced permissions and possibly
> Kerberos authentication and server-side enforced quotas.
>
> Thank you,
> Marco


[ceph-users] Re: Ceph as a distributed filesystem and kerberos integration

2020-11-07 Thread Marco Venuti
Hi,
I have the same use-case.
Is there some alternative to Samba in order to export CephFS to the end
user? I am somewhat concerned with its potential security
vulnerabilities, which appear to be quite frequent.
Specifically, I need server-side enforced permissions and possibly
Kerberos authentication and server-side enforced quotas.

Thank you,
Marco