Re: [ceph-users] Recommended fs to use with rbd

2019-03-31 Thread Vitaliy Filippov
...which only works when mapped with `virtio-scsi` (not with the regular  
virtio driver) :)



The only important thing is to enable discard/trim on the file system.
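A minimal sketch of what that looks like with libvirt/QEMU (device names, pool
and image are just placeholders, adjust to your setup):

  <controller type='scsi' model='virtio-scsi'/>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' discard='unmap'/>
    <source protocol='rbd' name='rbd/vm-disk'/>
    <target dev='sda' bus='scsi'/>
  </disk>

and inside the guest either mount with the discard option or run fstrim
periodically, e.g.:

  /dev/sda1  /  xfs  defaults,discard  0 0   (in /etc/fstab)
  fstrim -av                                 (or via a cron/systemd timer)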


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] PG stuck in active+clean+remapped

2019-03-31 Thread huang jun
Seems like CRUSH cannot get enough OSDs for this PG.
What is the output of 'ceph osd crush dump', and especially the values in
its 'tunables' section?
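For example (all read-only, safe to run); the query output for the stuck PG
would also help:

  ceph osd crush show-tunables
  ceph osd crush dump
  ceph pg 20.a2 query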

Vladimir Prokofev wrote on Wednesday, March 27, 2019 at 4:02 AM:
>
> CEPH 12.2.11, pool size 3, min_size 2.
>
> One node went down today (the private network interface started flapping, and
> after a while the OSD processes crashed). No big deal, the cluster recovered, but not
> completely: 1 PG is stuck in the active+clean+remapped state.
>
> PG_STAT:            20.a2
> OBJECTS:            511
> MISSING_ON_PRIMARY:  0
> DEGRADED:            0
> MISPLACED:           511
> UNFOUND:             0
> BYTES:               1584410172
> LOG:                 1500
> DISK_LOG:            1500
> STATE:               active+clean+remapped
> STATE_STAMP:         2019-03-26 20:50:18.639452
> VERSION:             96149'189204
> REPORTED:            96861:935872
> UP:                  [26,14]
> UP_PRIMARY:          26
> ACTING:              [26,14,9]
> ACTING_PRIMARY:      26
> LAST_SCRUB:          96149'189204
> SCRUB_STAMP:         2019-03-26 10:47:36.174769
> LAST_DEEP_SCRUB:     95989'187669
> DEEP_SCRUB_STAMP:    2019-03-22 23:29:02.322848
> SNAPTRIMQ_LEN:       0
>
> It states it's placed on OSDs 26,14, but should be on 26,14,9. As far as I can
> see there's nothing wrong with any of those OSDs: they work, host other PGs,
> peer with each other, etc. I tried restarting all of them one after another,
> but without any success.
> OSD 9 hosts 95 other PGs, so I don't think it's PG overdose.
>
> Last line of log from osd.9 mentioning PG 20.a2:
> 2019-03-26 20:50:16.294500 7fe27963a700  1 osd.9 pg_epoch: 96860 pg[20.a2( v 
> 96149'189204 (95989'187645,96149'189204] local-lis/les=96857/96858 n=511 
> ec=39164/39164 lis/c 96857/96855 les/c/f 96858/96856/66611 96859/96860/96855) 
> [26,14]/[26,14,9] r=2 lpr=96860 pi=[96855,96860)/1 crt=96149'189204 lcod 0'0 
> remapped NOTIFY mbc={}] state: transitioning to Stray
>
> Nothing else out of the ordinary, just the usual scrub/deep-scrub notifications.
> Any ideas what it can be, or any other steps to troubleshoot this?



-- 
Thank you!
HuangJun


Re: [ceph-users] how to force backfill a pg in ceph jewel

2019-03-31 Thread huang jun
The force-recovery/backfill commands were introduced in the Luminous release,
if I remember right.
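For reference, on Luminous and later the commands are of the form (the PG id
is a placeholder):

  ceph pg force-backfill <pgid>
  ceph pg force-recovery <pgid>
  ceph pg cancel-force-backfill <pgid>
  ceph pg cancel-force-recovery <pgid>

As far as I know Jewel has no direct equivalent.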

Nikhil R wrote on Sunday, March 31, 2019 at 7:59 AM:
>
> Team,
> Is there a way to force backfill a PG in Ceph Jewel? I know this is available
> in Mimic. Is it available in Jewel?
> I tried ceph pg backfill  but no luck.
>
> Any help would be appreciated as we have a prod issue.
> in.linkedin.com/in/nikhilravindra
>



-- 
Thank you!
HuangJun


Re: [ceph-users] Erasure Pools.

2019-03-31 Thread huang jun
What's the output of 'ceph osd dump', 'ceph osd crush dump' and
'ceph health detail'?
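If the data pool is erasure coded, it is also worth checking that the profile
fits the cluster (profile name taken from your commands below):

  ceph osd erasure-code-profile get ec-42-profile2
  ceph pg dump_stuck inactive

A k+m profile needs at least k+m separate failure domains (hosts by default);
otherwise its PGs can never map to a full set and stay incomplete.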

Andrew J. Hutton wrote on Saturday, March 30, 2019 at 7:05 AM:
>
> I have tried to create erasure pools for CephFS using the examples given
> at
> https://swamireddy.wordpress.com/2016/01/26/ceph-diff-between-erasure-and-replicated-pool-type/
> but this is resulting in some weird behaviour.  The only number in
> common is the one used when creating the metadata store; is this related?
>
> [ceph@thor ~]$ ceph -s
>cluster:
>  id: b688f541-9ad4-48fc-8060-803cb286fc38
>  health: HEALTH_WARN
>  Reduced data availability: 128 pgs inactive, 128 pgs incomplete
>
>services:
>  mon: 3 daemons, quorum thor,odin,loki
>  mgr: odin(active), standbys: loki, thor
>  mds: cephfs-1/1/1 up  {0=thor=up:active}, 1 up:standby
>  osd: 5 osds: 5 up, 5 in
>
>data:
>  pools:   2 pools, 256 pgs
>  objects: 21 objects, 2.19KiB
>  usage:   5.08GiB used, 7.73TiB / 7.73TiB avail
>  pgs: 50.000% pgs not active
>   128 creating+incomplete
>   128 active+clean
>
> Pretty sure these were the commands used.
>
> ceph osd pool create storage 1024 erasure ec-42-profile2
> ceph osd pool create storage 128 erasure ec-42-profile2
> ceph fs new cephfs storage_metadata storage
> ceph osd pool create storage_metadata 128
> ceph fs new cephfs storage_metadata storage
> ceph fs add_data_pool cephfs storage
> ceph osd pool set storage allow_ec_overwrites true
> ceph osd pool application enable storage cephfs
> fs add_data_pool default storage
> ceph fs add_data_pool cephfs storage



-- 
Thank you!
HuangJun


Re: [ceph-users] Ceph block storage cluster limitations

2019-03-31 Thread Christian Balzer

Hello,

essentially Anthony has given all the answers here.

Seeing as I'm sketching out a double-digit petabyte cluster ATM, I'm piping
up anyway.

On Sat, 30 Mar 2019 15:53:14 -0700 Anthony D'Atri wrote:

> > Hello,
> > 
> > I wanted to know if there are any max limitations on
> > 
> > - Max number of Ceph data nodes
> > - Max number of OSDs per data node
> > - Global max on number of OSDs
> > - Any limitations on the size of each drive managed by OSD?
> > - Any limitation on number of client nodes?
> > - Any limitation on maximum number of RBD volumes that can be created?  
>
For something you want to implement, it probably wouldn't hurt to shovel
some gold towards RedHat and get replies in writing from them.
_IF_ they're willing to make such statements. ^o^

  
> I don’t think there any *architectural* limits, but there can be *practical* 
> limits.  There are a lot of variables and everyone has a unique situation, 
> but some thoughts:
> 
> > Max number of Ceph data nodes  
> 
> May be limited at some extreme by networking.  Don’t cheap out on your 
> switches.
> 
Not every OSD needs to talk to every other OSD, and the MON communication
is light, but yeah, it adds up.
Note that I said OSDs, not actual hosts; the hosts themselves don't count in
any traffic equations.

> > - Max number of OSDs per data node  
> 
> People have run at least 72.  Consider RAM required for a given set of 
> drives, and that a single host/chassis isn’t a big percentage of your 
> cluster.  Ie., don’t have a huge fault domain that will bite you later.  For 
> a production cluster at scale I would suggest at least 12 OSD nodes, but this 
> depends on lots of variables.  Conventional wisdom is 1GB RAM per 1TB of OSD; 
> in practice for a large cluster I would favor somewhat more.  A cluster with, 
> say, 3 nodes of 72 OSDs each is going to be in bad way when one fails.
> 

Yes, proper sizing of failure domains is a must.
Looking at the recent BlueStore discussions, it feels like 4GB RAM per OSD
is a good starting point. That's also affected by your use case and
requirements (with more RAM, more caching can be configured).
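For what it's worth, on BlueStore releases recent enough to have it, that
per-OSD budget is steered with osd_memory_target, e.g. in ceph.conf:

  [osd]
  osd_memory_target = 4294967296   # 4GiB, roughly the starting point above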

> > - Global max on number of OSDs  
> 
> A cluster with at least 10,800 OSDs has existed.
> 
> https://indico.cern.ch/event/542464/contributions/2202295/attachments/1289543/1921810/cephday-dan.pdf
> https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
> 
> The larger a cluster becomes, the more careful attention must be paid to 
> topology and tuning.
> 
Amen to that. 

For the same 10PB cluster I've come up with designs ranging from nearly 800
nodes and 13000 OSDs (3x replica, 16x 2.4TB HDDs per 2U node) to slightly
more than half of these numbers with the same HW but 10+4 erasure encoding.

These numbers made me more than slightly anxious and had other people (who
wanted to re-use the existing HW) outright faint or run away screaming.

A different design with 60 effective OSDs (10TB HDDs) per 4U node requires
just 28 nodes and about 1800 OSDs with 10+4 EC, much more manageable with
regard to numbers and rack space.

The same 4U hardware with 6 RAID6 (10 HDDs each) and thus 6 OSDs per node
and 3x replica on the Ceph level requires 64 nodes and results in only 384
OSDs.
Whether this is a feasible design will no doubt be debated here by purists.
It does, however, have the distinct advantage of extreme resilience and is very
unlikely to ever require a Ceph rebuild due to a failed HDD.
And given 1800 disks, failures are statistically going to be a common
occurrence.
It also requires significantly fewer CPU resources, since there is no EC and
far fewer OSDs.
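Back of the envelope for the raw capacity behind those numbers (rough,
ignoring headroom for fill level and failure capacity):

  3x replica: 10 PB x 3         = 30 PB raw -> ~12500 x 2.4 TB OSDs
  10+4 EC:    10 PB x (10+4)/10 = 14 PB raw -> ~1400  x 10 TB OSDs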

For the record and to forestall some comments, this is for an object
storage (RGW) cluster dealing with largish (3MB average) objects, so IOPS
aren't the prime objective here.

See also my "Erasure Coding failure domain" mail just now.

> > Also, any advise on using NVMes for OSD drives?  
> 
> They rock.  Evaluate your servers carefully:
> * Some may route PCI through a multi-mode SAS/SATA HBA
> * Watch for PCI bridges or multiplexing
> * Pinning, minimize data over QPI links
> * Faster vs more cores can squeeze out more performance 
> 
> AMD Epyc single-socket systems may be very interesting for NVMe OSD nodes.
> 
I'm happy that somebody else spotted this. ^o^

Regards,

Christian


> > What is the known maximum cluster size that Ceph RBD has been deployed to?  
> 
> See above.


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


[ceph-users] Erasure Coding failure domain (again)

2019-03-31 Thread Christian Balzer



Hello,

I'm considering erasure coding for the first time (so excuse the seemingly
obvious questions) and have been staring at the various previous posts
and the documentation, in particular:
http://docs.ceph.com/docs/master/dev/osd_internals/erasure_coding/

Am I correct that, unlike with replication, there isn't a maximum size
for the critical set of OSDs?

Meaning that with 3x replication and a typical value of 100 PGs per OSD, at
most 300 OSDs form a set out of which 3 OSDs need to fail for data loss.
The statistical likelihood of that, based on some assumptions,
is significant, but not nightmarishly so.
A cluster with 1500 OSDs in total is thus only as susceptible as one with just
300.
Meaning that 3 disk losses in the big cluster don't necessarily mean data
loss at all.

However it feels that with EC all OSDs can essentially be in the same set
and thus having 6 out of 1500 OSDs fail in a 10+5 EC pool with 100 PGs per
OSD would affect every last object in that cluster, not just a subset.

If these ramblings are correct (or close to it), then an obvious risk
mitigation would be to pick smallish encoding sets and low numbers of PGs.
For example a 4+2 encoding and 50 PGs per OSD would reduce things down to
the same risk as a 3x replica pool.
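Rough arithmetic behind that (simplistically assuming non-overlapping PG
placements): the set an OSD shares fate with is roughly PGs-per-OSD times the
pool width, i.e. 100 x 3 = 300 OSDs for 3x replica versus 50 x 6 = 300 OSDs
for 4+2 EC, the same order of exposure.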

Feedback welcome.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications