Re: [ceph-users] RGW Swift metadata dropped when S3 bucket versioning enabled

2018-12-05 Thread Maxime Guyot
Hi Florian,

Thanks for the help. I did further testing and narrowed it down to objects
that were uploaded while the bucket had versioning enabled.
Objects created before that are not affected: all metadata operations are
still possible.

Here is a simple way to reproduce this:
http://paste.openstack.org/show/736713/
And here is the snippet to easily turn on/off S3 versioning on a given
bucket: https://gist.github.com/Miouge1/b8ae19b71411655154e74e609b61f24e
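In case the paste links go stale, here is a rough sketch of the reproduction
steps (not the exact pastebin content); it assumes a user with both Swift and
S3 credentials on the same RGW, an endpoint of http://rgw:8080 and local files
test.dat/test2.dat, all of which are just example names:

  # Before versioning: Swift metadata behaves as expected
  openstack container create test
  openstack object create test test.dat
  openstack object set --property foo=bar test test.dat
  openstack object show test test.dat     # properties show Foo='bar'

  # Enable S3 versioning on the same bucket
  aws --endpoint-url http://rgw:8080 s3api put-bucket-versioning \
      --bucket test --versioning-configuration Status=Enabled

  # Objects uploaded from now on silently lose Swift metadata
  openstack object create test test2.dat
  openstack object set --property foo=bar test test2.dat
  openstack object show test test2.dat    # properties are missing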

Cheers,
Maxime

On Fri, 30 Nov 2018 at 22:28 Florian Haas  wrote:

> On 28/11/2018 19:06, Maxime Guyot wrote:
> > Hi Florian,
> >
> > You assumed correctly, the "test" container (private) was created with
> > the "openstack container create test", then I am using the S3 API to
> > enable/disable object versioning on it.
> > I use the following Python snippet to enable/disable S3 bucket
> versioning:
> >
> > import boto, boto.s3, boto.s3.connection
> > conn = boto.connect_s3(aws_access_key_id='***',
> > aws_secret_access_key='***', host='***', port=8080,
> > calling_format=boto.s3.connection.OrdinaryCallingFormat())
> > bucket = conn.get_bucket('test')
> > bucket.configure_versioning(True) # Or False to disable S3 bucket
> versioning
> > bucket.get_versioning_status()
>
> Thanks for making this so easy to reproduce! I must confess upfront that
> I've found myself unable to reproduce your problem, but I've retraced
> your steps and maybe you'll find this useful to develop a hypothesis as
> to what's happening in your case.
>
> $ openstack object show -f shell foo bar
> account="AUTH_5ed51981f4a8468292bf2c578806ebf7"
> container="foo"
> content_length="12"
> content_type="text/plain"
> last_modified="Thu, 22 Nov 2018 15:02:57 GMT"
> object="bar"
>
> properties="S3cmd-Attrs='atime:1542629253/ctime:1542629253/gid:1000/gname:florian/md5:6f5902ac237024bdd0c176cb93063dc4/mode:33204/mtime:1542629253/uid:1000/uname:florian'"
>
> See the properties that are set there? These are obviously not
> properties ever set through the Swift API, but instead they were set
> when I uploaded this object into the corresponding bucket, using the S3
> API.
>
> I can double check that property with boto:
>
> >>> foo = conn.get_bucket('foo')
> >>> bar = foo.get_key('bar')
> >>> bar.metadata
> {'s3cmd-attrs':
>
> u'atime:1542629253/ctime:1542629253/gid:1000/gname:florian/md5:6f5902ac237024bdd0c176cb93063dc4/mode:33204/mtime:1542629253/uid:1000/uname:florian'}
>
> Now I enable versioning:
>
> >>> foo.configure_versioning(True)
> True
> >>> foo.get_versioning_status()
> {'Versioning': 'Enabled'}
>
> Check if the metadata is still there:
>
> >>> bar.metadata
> {'s3cmd-attrs':
>
> u'atime:1542629253/ctime:1542629253/gid:1000/gname:florian/md5:6f5902ac237024bdd0c176cb93063dc4/mode:33204/mtime:1542629253/uid:1000/uname:florian'}
>
> Refetch object to be sure:
>
> >>> bar = foo.get_key('bar')
> >>> bar.metadata
> {'s3cmd-attrs':
>
> u'atime:1542629253/ctime:1542629253/gid:1000/gname:florian/md5:6f5902ac237024bdd0c176cb93063dc4/mode:33204/mtime:1542629253/uid:1000/uname:florian'}
>
> Disable versioning again:
>
> >>> foo.configure_versioning(False)
> True
> >>> foo.get_versioning_status()
> {'Versioning': 'Suspended'}
>
> Now add a property using the Swift API:
>
> $ openstack object set --property spam=eggs foo bar
>
> And read it back:
>
> $ openstack object show -f shell foo bar
> account="AUTH_5ed51981f4a8468292bf2c578806ebf7"
> container="foo"
> content_length="12"
> content_type="text/plain"
> last_modified="Wed, 28 Nov 2018 19:52:48 GMT"
> object="bar"
> properties="Spam='eggs'"
>
> Notice that not only has the property been set, it has *overwritten* the
> S3 properties that were set before. I am not sure if this is meant to be
> this way, i.e. if native Swift acts this way too, but it appears to be
> how radosgw does it.
>
> However, now that I have the "spam" property set, I go ahead and re-enable
> versioning:
>
> >>> foo.configure_versioning(True)
> True
>
> >>> foo.get_versioning_status()
> {'Versioning': 'Enabled'}
>
> And then I re-query my object:
>
> $ openstack object show -f shell foo bar
> account="AUTH_5ed51981f4a8468292bf2c578806ebf7"
> container="foo"
> content_length="12"
> content_type="text/plain"
> last_modified="Thu, 29 Nov 2018 11:47:41 GMT"
> object="bar"
> prop

Re: [ceph-users] RGW Swift metadata dropped when S3 bucket versioning enabled

2018-11-28 Thread Maxime Guyot
Hi Florian,

You assumed correctly, the "test" container (private) was created with the
"openstack container create test", then I am using the S3 API to
enable/disable object versioning on it.
I use the following Python snippet to enable/disable S3 bucket versioning:

import boto, boto.s3, boto.s3.connection
conn = boto.connect_s3(aws_access_key_id='***',
aws_secret_access_key='***', host='***', port=8080,
calling_format=boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket('test')
bucket.configure_versioning(True) # Or False to disable S3 bucket versioning
bucket.get_versioning_status()

> Semi-related: I've seen some interesting things when mucking around with
> a single container/bucket while switching APIs, when it comes to
> container properties and metadata. For example, if you set a public read
> ACL on an S3 bucket, the corresponding Swift container is also
> publicly readable but its read ACL looks empty (i.e. private) when you
> ask via the Swift API.

This can definitely become a problem if the Swift API says "private" but the
data is actually publicly available.
Since the doc says "S3 and Swift APIs share a common namespace, so you may
write data with one API and retrieve it with the other", it might be useful
to document this kind of limitation somewhere.
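For reference, one way to compare what each API reports for the same
bucket/container; the endpoint URL and bucket name here are examples, not from
the thread:

  swift stat test                          # check the "Read ACL:" field
  aws --endpoint-url http://rgw:8080 s3api get-bucket-acl --bucket test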

Cheers,
/ Maxime

On Wed, 28 Nov 2018 at 17:58 Florian Haas  wrote:

> On 27/11/2018 20:28, Maxime Guyot wrote:
> > Hi,
> >
> > I'm running into an issue with the RadosGW Swift API when the S3 bucket
> > versioning is enabled. It looks like it silently drops any metadata sent
> > with the "X-Object-Meta-foo" header (see example below).
> > This is observed on a Luminous 12.2.8 cluster. Is that a normal thing?
> > Am I misconfiguring something here?
> >
> >
> > With S3 bucket versioning OFF:
> > $ openstack object set --property foo=bar test test.dat
> > $ os object show test test.dat
> > ++--+
> > | Field  | Value|
> > ++--+
> > | account| v1   |
> > | container  | test |
> > | content-length | 507904   |
> > | content-type   | binary/octet-stream  |
> > | etag   | 03e8a398f343ade4e1e1d7c81a66e400 |
> > | last-modified  | Tue, 27 Nov 2018 13:53:54 GMT|
> > | object | test.dat |
> > | properties | Foo='bar'|  <= Metadata is
> here
> > ++--+
> >
> > With S3 bucket versioning ON:
>
> Can you elaborate on what exactly you're doing here to enable S3 bucket
> versioning? Do I assume correctly that you are creating the "test"
> container using the swift or openstack client, then sending a
> VersioningConfiguration request against the "test" bucket, as explained
> in
>
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html#how-to-enable-disable-versioning-intro
> ?
>
> > $ openstack object set --property foo=bar test test2.dat
> > $ openstack object show test test2.dat
> > ++--+
> > | Field  | Value|
> > ++--+
> > | account| v1   |
> > | container  | test |
> > | content-length | 507904   |
> > | content-type   | binary/octet-stream  |
> > | etag   | 03e8a398f343ade4e1e1d7c81a66e400 |
> > | last-modified  | Tue, 27 Nov 2018 13:56:50 GMT|
> > | object | test2.dat| <= Metadata is
> absent
> > ++--+
>
> Semi-related: I've seen some interesting things when mucking around with
> a single container/bucket while switching APIs, when it comes to
> container properties and metadata. For example, if you set a public read
> ACL on an S3 bucket, the corresponding Swift container is also
> publicly readable but its read ACL looks empty (i.e. private) when you
> ask via the Swift API.
>
> Cheers,
> Florian
>


[ceph-users] RGW Swift metadata dropped when S3 bucket versioning enabled

2018-11-27 Thread Maxime Guyot
Hi,

I'm running into an issue with the RadosGW Swift API when the S3 bucket
versioning is enabled. It looks like it silently drops any metadata sent
with the "X-Object-Meta-foo" header (see example below).
This is observed on a Luminous 12.2.8 cluster. Is that a normal thing? Am I
misconfiguring something here?


With S3 bucket versioning OFF:
$ openstack object set --property foo=bar test test.dat
$ os object show test test.dat
+----------------+----------------------------------+
| Field          | Value                            |
+----------------+----------------------------------+
| account        | v1                               |
| container      | test                             |
| content-length | 507904                           |
| content-type   | binary/octet-stream              |
| etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
| last-modified  | Tue, 27 Nov 2018 13:53:54 GMT    |
| object         | test.dat                         |
| properties     | Foo='bar'                        |  <= Metadata is here
+----------------+----------------------------------+

With S3 bucket versioning ON:
$ openstack object set --property foo=bar test test2.dat
$ openstack object show test test2.dat
+----------------+----------------------------------+
| Field          | Value                            |
+----------------+----------------------------------+
| account        | v1                               |
| container      | test                             |
| content-length | 507904                           |
| content-type   | binary/octet-stream              |
| etag           | 03e8a398f343ade4e1e1d7c81a66e400 |
| last-modified  | Tue, 27 Nov 2018 13:56:50 GMT    |
| object         | test2.dat                        | <= Metadata is absent
+----------------+----------------------------------+

Cheers,

/ Maxime


Re: [ceph-users] after reboot node appear outside the root root tree

2017-09-13 Thread Maxime Guyot
Hi,

This is a common problem when using a custom CRUSHmap: the default behavior
is to update the OSD's location in the CRUSHmap on start. Did you keep the
defaults there?

If that is the problem, you can either:
1) Disable the update on start option: "osd crush update on start = false"
(see
http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-location)
2) Customize the script defining the location of OSDs with "crush location
hook = /path/to/customized-ceph-crush-location" (see
https://github.com/ceph/ceph/blob/master/src/ceph-crush-location.in).
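For what it's worth, a quick sketch of what option 2 could look like; the
script path, root, rack and host names below are examples only and need
adapting to your map:

  # ceph.conf on the OSD nodes:
  #   [osd]
  #   osd crush location hook = /usr/local/bin/custom-crush-location

  # /usr/local/bin/custom-crush-location -- ceph-osd calls it with
  # "--cluster <name> --id <osd-id> --type osd" and uses whatever it
  # prints on stdout as the OSD's CRUSH location:
  #!/bin/bash
  echo "root=root rack=rack2 host=$(hostname -s)"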

Cheers,
Maxime

On Wed, 13 Sep 2017 at 18:35 German Anders  wrote:

> *# ceph health detail*
> HEALTH_OK
>
> *# ceph osd stat*
> 48 osds: 48 up, 48 in
>
> *# ceph pg stat*
> 3200 pgs: 3200 active+clean; 5336 MB data, 79455 MB used, 53572 GB / 53650
> GB avail
>
>
> *German*
>
> 2017-09-13 13:24 GMT-03:00 dE :
>
>> On 09/13/2017 09:08 PM, German Anders wrote:
>>
>> Hi cephers,
>>
>> I'm having an issue with a newly created cluster 12.2.0
>> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc). Basically when I
>> reboot one of the nodes, and when it come back, it come outside of the root
>> type on the tree:
>>
>> root@cpm01:~# ceph osd tree
>> ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>> -15   12.0 *root default*
>> * 36  nvme  1.0 osd.36 up  1.0 1.0*
>> * 37  nvme  1.0 osd.37 up  1.0 1.0*
>> * 38  nvme  1.0 osd.38 up  1.0 1.0*
>> * 39  nvme  1.0 osd.39 up  1.0 1.0*
>> * 40  nvme  1.0 osd.40 up  1.0 1.0*
>> * 41  nvme  1.0 osd.41 up  1.0 1.0*
>> * 42  nvme  1.0 osd.42 up  1.0 1.0*
>> * 43  nvme  1.0 osd.43 up  1.0 1.0*
>> * 44  nvme  1.0 osd.44 up  1.0 1.0*
>> * 45  nvme  1.0 osd.45 up  1.0 1.0*
>> * 46  nvme  1.0 osd.46 up  1.0 1.0*
>> * 47  nvme  1.0 osd.47 up  1.0 1.0*
>>  -7   36.0 *root root*
>>  -5   24.0 rack rack1
>>  -1   12.0 node cpn01
>>   01.0 osd.0  up  1.0 1.0
>>   11.0 osd.1  up  1.0 1.0
>>   21.0 osd.2  up  1.0 1.0
>>   31.0 osd.3  up  1.0 1.0
>>   41.0 osd.4  up  1.0 1.0
>>   51.0 osd.5  up  1.0 1.0
>>   61.0 osd.6  up  1.0 1.0
>>   71.0 osd.7  up  1.0 1.0
>>   81.0 osd.8  up  1.0 1.0
>>   91.0 osd.9  up  1.0 1.0
>>  101.0 osd.10 up  1.0 1.0
>>  111.0 osd.11 up  1.0 1.0
>>  -3   12.0 node cpn03
>>  241.0 osd.24 up  1.0 1.0
>>  251.0 osd.25 up  1.0 1.0
>>  261.0 osd.26 up  1.0 1.0
>>  271.0 osd.27 up  1.0 1.0
>>  281.0 osd.28 up  1.0 1.0
>>  291.0 osd.29 up  1.0 1.0
>>  301.0 osd.30 up  1.0 1.0
>>  311.0 osd.31 up  1.0 1.0
>>  321.0 osd.32 up  1.0 1.0
>>  331.0 osd.33 up  1.0 1.0
>>  341.0 osd.34 up  1.0 1.0
>>  351.0 osd.35 up  1.0 1.0
>>  -6   12.0 rack rack2
>>  -2   12.0 node cpn02
>>  121.0 osd.12 up  1.0 1.0
>>  131.0 osd.13 up  1.0 1.0
>>  141.0 osd.14 up  1.0 1.0
>>  151.0 osd.15 up  1.0 1.0
>>  161.0 osd.16 up  1.0 1.0
>>  171.0 osd.17 up  1.0 1.0
>>  181.0 osd.18 up  1.0 1.0
>>  191.0 osd.19 up  1.0 1.0
>>  201.0 osd.20 up  1.0 1.0
>>  211.0 osd.21 up  1.0 1.0
>>  221.0 osd.22 up  1.0 1.0
>>  231.0 osd.23 up  1.0 1.0
>> * -4  0 node cpn04*
>>
>> Any ideas of why this happen? and how can I fix it? It supposed to be
>> inside rack2
>>
>> Thanks in advance,
>>
>> Best,
>>
>> *German*
>>
>>

Re: [ceph-users] where is a RBD in use

2017-08-31 Thread Maxime Guyot
Hi Götz,

Something like "rbd status image-spec" usually works for me. Man page says:
"Show the status of the image, including which clients have it open."
I'll tell you which IPs have it open which should help you to track it down.

Cheers,
Maxime

On Thu, 31 Aug 2017 at 16:26 Götz Reinicke 
wrote:

> Hi,
>
> Is it possible to see which clients are using an RBD? … I found an RBD in
> one of my pools but cant remember if I ever use / mounted it to a client.
>
> Thx for feedback ! Regards . Götz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] New cluster - configuration tips and reccomendation - NVMe

2017-07-05 Thread Maxime Guyot
Hi Massimiliano,

I am a little surprised to see 6x NVMe, 64GB of RAM, 2x100G NICs and the E5-2603
v4: that's one of the cheapest Intel E5 CPUs mixed with some pretty high-end
gear, it does not make sense. Wido's right, go with a much higher frequency:
E5-2637 v4, E5-2643 v4, E5-1660 v4, E5-1650 v4. If you need to go on the
cheap, the E3 series is interesting (E3-1220 v6, E3-1230 v6, ...) if you can
work with the limitations: max 64GB of RAM, max 4 cores and a single CPU.

A higher frequency should reduce latency when communicating with NICs and
SSDs, which benefits Ceph's performance.

100G NICs are overkill for throughput, but they should reduce latency. 25G
NICs are becoming popular for servers (replacing 10G NICs).

Cheers,
Maxime

On Wed, 5 Jul 2017 at 10:55 Massimiliano Cuttini  wrote:

> Dear all,
>
> luminous is coming and sooner we should be allowed to avoid double writing.
> This means use 100% of the speed of SSD and NVMe.
> Cluster made all of SSD and NVMe will not be penalized and start to make
> sense.
>
> Looking forward I'm building the next pool of storage which we'll setup on
> next term.
> We are taking in consideration a pool of 4 with the following single node
> configuration:
>
>- 2x E5-2603 v4 - 6 cores - 1.70GHz
>- 2x 32Gb of RAM
>- 2x NVMe M2 for OS
>- 6x NVMe U2 for OSD
>- 2x 100Gib ethernet cards
>
> We have yet not sure about which Intel and how much RAM we should put on
> it to avoid CPU bottleneck.
> Can you help me to choose the right couple of CPU?
> Did you see any issue on the configuration proposed?
>
>
> Thanks,
> Max
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] 300 active+undersized+degraded+remapped

2017-07-01 Thread Maxime Guyot
Hi Deepak,

As Wido pointed out in the thread you linked, "osd crush update on
start" and "osd crush location" are quick ways to fix this. If you are doing
custom locations (like for tiering NVMe vs HDD), "osd crush location hook"
(doc:
http://docs.ceph.com/docs/master/rados/operations/crush-map/#custom-location-hooks
) is a good option as well: it allows you to configure the crush location
of the OSD based on a script, and it shouldn't be too hard to detect whether
the OSD is NVMe or SATA and set its location based on that. It's really nice
when you add new OSDs to see them arrive in the right location automatically.
Shameless plug: you can find an example in this blog post:
http://www.root314.com/2017/01/15/Ceph-storage-tiers/#tiered-crushmap
I hope it helps.
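To illustrate the idea, here is a rough sketch of such a hook; the device
detection and the naming convention (a "-nvme" suffix on the host bucket,
matching your dummy host entries) are assumptions to adapt to your CRUSHmap:

  #!/bin/bash
  # custom crush location hook: ceph-osd invokes it with
  # "--cluster <name> --id <osd-id> --type osd"
  while [ $# -gt 0 ]; do
      [ "$1" = "--id" ] && ID="$2"
      shift
  done
  # find the device backing this OSD's data dir and check whether it is NVMe
  DEV=$(findmnt -n -o SOURCE "/var/lib/ceph/osd/ceph-${ID}")
  if [[ "$DEV" == /dev/nvme* ]]; then
      echo "host=$(hostname -s)-nvme"
  else
      echo "host=$(hostname -s)"
  fi

Since the "<host>-nvme" buckets already sit under the right rack in your map,
printing only the host should be enough to keep the OSDs in place.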

Cheers,
Maxime

On Sat, 1 Jul 2017 at 03:28 Deepak Naidu  wrote:

> OK, so looks like its ceph crushmap behavior
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
>
>
>
> --
>
> Deepak
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Deepak Naidu
> *Sent:* Friday, June 30, 2017 7:06 PM
> *To:* David Turner; ceph-users@lists.ceph.com
>
>
> *Subject:* Re: [ceph-users] 300 active+undersized+degraded+remapped
>
>
>
> OK, I fixed the issue. But this is very weird. But will list them so its
> easy for other to check when there is similar issue.
>
>
>
> 1)  I had create rack aware osd tree
>
> 2)  I have SATA OSD’s and NVME OSD
>
> 3)  I created rack aware policy for both SATA and NVME OSD
>
> 4)  NVME OSD was used for CEPH FS Meta
>
> 5)  Recently: When I tried reboot of OSD node, it seemed that my
> journal volumes which were on NVME didn’t startup bcos of the UDEV rules
> and I had to create startup script to fix them.
>
> 6)  With that. I had rebooted all the OSD one by one monitoring the
> ceph status.
>
> 7)  I was at the 3rd last node, then I notice the pgstuck warning.
> Not sure when and what happened, but I started getting this PG stuck
> issue(which is listed in my original email)
>
> 8)  I wasted time to look at the issue/error, but then I found the
> pool 100% used issue.
>
> 9)  Now when I tried ceph osd tree. It looks like my NVME OSD’s went
> back to the host level OSD’s rather than the newly created/mapped NVME rack
> level. Ie no OSD’s under nvme-host name. This was the issue.
>
> 10)   Luckily I had created the backup of compiled version. I imported
> them in crushmap rule and now pool status is OK.
>
>
>
> But, my question is how did ceph re-map the CRUSH rule ?
>
>
>
> I had to create “new host entry” for NVME in crushmap ie
>
>
>
> host OSD1-nvme  -- This is just dummy entry in crushmap ie it
> doesn’t resolve to any hostname
>
> host OSD1  -- This is the actual hostname and
> resolves to IP and has an hostname
>
>
>
> Is that the issue ?
>
>
>
> Current status
>
>
>
> health HEALTH_OK
>
> osdmap e5108: 610 osds: 610 up, 610 in
>
> flags sortbitwise,require_jewel_osds
>
>   pgmap v247114: 15450 pgs, 3 pools, 322 GB data, 86102 objects
>
> 1155 GB used, 5462 TB / 5463 TB avail
>
>15450 active+clean
>
>
>
>
>
> Pool1  15   233M  0  1820T
>   3737
>
> Pool2 16  00
> 1820T  0
>
> Pool Meta   17 34928k0
>   2357G28
>
>
>
>
>
> *Partial list of my osd tree*
>
>
>
> -152.76392 rack
> rack1-nvme
>
> -180.69098 host OSD1-nvme
>
>  600.69098 osd.60 up  1.0
> 1.0
>
> -210.69098 host OSD2-nvme
>
> 2430.69098 osd.243up  1.0
> 1.0
>
> -240.69098 host
> OSD3-NGN1-nvme
>
> 4260.69098 osd.426up  1.0
> 1.0
>
> -1 5456.27734 root
> default
>
> -12 2182.51099 rack
> rack1-sata
>
>  -2  545.62775 host OSD1
>
>   09.09380 osd.0  up  1.0
> 1.0
>
>   19.09380 osd.1  up  1.0
> 1.0
>
>   29.09380 osd.2  up  1.0
> 1.0
>
>   39.09380 osd.3  up  1.0
> 1.0
>
> -2  545.62775 host OSD2
>
>   09.09380 osd.0  up  1.0
> 1.0
>
>   19.09380 osd.1  up  1.0
> 1.0
>
>   29.09380 osd.2  up  1.0
> 1.0
>
>   39.09380 osd.3  up  1.0
> 1.0
>
> -2  545.62775 host OSD2
>
>   09.09380 osd.0  up  1.0
> 1.0
>
>   19.09380 osd.1  up  1.0
> 1.0
>
>   29.09380 osd.2  up  

Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-22 Thread Maxime Guyot
Hi,

One of the benefits of PCIe NVMe is that it does not take a disk slot,
resulting in a higher density. For example, a 6048R-E1CR36N with 3x PCIe
NVMe yields 36 OSDs per server (12 OSDs per NVMe), whereas it yields 30 OSDs
per server if using SATA SSDs (6 OSDs per SSD).

Since you say that you used 10% of the P3700's endurance in 1 year (7.3PB of
endurance, so 0.73PB/year), a 400GB P3600 would work for 3 years. Maybe
good enough until BlueStore is more stable.

Cheers,
Maxime

On Thu, 22 Jun 2017 at 03:59 Christian Balzer  wrote:

>
> Hello,
>
> Hmm, gmail client not grokking quoting these days?
>
> On Wed, 21 Jun 2017 20:40:48 -0500 Brady Deetz wrote:
>
> > On Jun 21, 2017 8:15 PM, "Christian Balzer"  wrote:
> >
> > On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote:
> >
> > > Hello,
> > > I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I
> have
> > 12
> > > osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
> > > drives providing 10GB journals for groups of 12 6TB spinning rust
> drives
> > > and 2x lacp 40gbps ethernet.
> > >
> > > Our hardware provider is recommending that we start deploying P4600
> drives
> > > in place of our P3700s due to availability.
> > >
> > Welcome to the club and make sure to express your displeasure about
> > Intel's "strategy" to your vendor.
> >
> > The P4600s are a poor replacement for P3700s and also still just
> > "announced" according to ARK.
> >
> > Are you happy with your current NVMes?
> > Firstly as in, what is their wearout, are you expecting them to easily
> > survive 5 years at the current rate?
> > Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the
> > NVMe I'd expect them to not be a bottleneck in nearly all real life
> > situations.
> >
> > Keep in mind that 1.6TB P4600 is going to last about as long as your
> 400GB
> > P3700, so if wear-out is a concern, don't put more stress on them.
> >
> >
> > Oddly enough, the Intel tools are telling me that we've only used about
> 10%
> > of each drive's endurance over the past year. This honestly surprises me
> > due to our workload, but maybe I'm thinking my researchers are doing more
> > science than they actually are.
> >
> That's pretty impressive still, but also lets you do numbers as to what
> kind of additional load you _may_ be able to consider, obviously not more
> than twice the current amount to stay within 5 years before wearing
> them out.
>
>
> >
> > Also the P4600 is only slightly faster in writes than the P3700, so
> that's
> > where putting more workload onto them is going to be a notable issue.
> >
> > > I've seen some talk on here regarding this, but wanted to throw an idea
> > > around. I was okay throwing away 280GB of fast capacity for the
> purpose of
> > > providing reliable journals. But with as much free capacity as we'd
> have
> > > with a 4600, maybe I could use that extra capacity as a cache tier for
> > > writes on an rbd ec pool. If I wanted to go that route, I'd probably
> > > replace several existing 3700s with 4600s to get additional cache
> > capacity.
> > > But, that sounds risky...
> > >
> > Risky as in high failure domain concentration and as mentioned above a
> > cache-tier with obvious inline journals and thus twice the bandwidth
> needs
> > will likely eat into the write speed capacity of the journals.
> >
> >
> > Agreed. On the topic of journals and double bandwidth, am I correct in
> > thinking that btrfs (as insane as it may be) does not require double
> > bandwidth like xfs? Furthermore with bluestore being close to stable,
> will
> > my architecture need to change?
> >
> BTRFS at this point is indeed a bit insane, given the current levels of
> support, issues (search the ML archives) and future developments.
> And you'll still wind up with double writes most likely, IIRC.
>
> These aspects of Bluestore have been discussed here recently, too.
> Your SSD/NVMe space requirements will go down, but if you want to have the
> same speeds and more importantly low latencies you'll wind up with all
> writes going through them again, so endurance wise you're still in that
> "Lets make SSDs great again" hellhole.
>
> >
> > If (and seems to be a big IF) you can find them, the Samsung PM1725a
> 1.6TB
> > seems to be a) cheaper and b) at 2GB/s write speed more likely to be
> > suitable for double duty.
> > Similar (slightly better on paper) endurance than then P4600, so keep
> that
> > in mind, too.
> >
> >
> > My vendor is an HPC vendor so /maybe/ they have access to these elusive
> > creatures. In which case, how many do you want? Haha
> >
> I was just looking at availability with a few google searches, our current
> needs are amply satisfied with S37xx SSDs, no need for NVMes really.
> But as things are going, maybe I'll be forced to Optane and friends simply
> by lack of alternatives.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications

Re: [ceph-users] design guidance

2017-06-06 Thread Maxime Guyot
Hi Daniel,

The flexibility of Ceph is that you can start with your current config,
scale out and upgrade (CPUs, journals etc...) as your performance
requirement increase.

6x 1.7GHz, are we speaking about the Xeon E5-2603L v4? Any chance to bump
that to a 2620 v4 or 2630 v4?
Test how the 6x 1.7GHz handles 36 OSDs, then based on that take a decision
on RAID0/LVM or not.
If you need large, low-performance block storage, it could be
worth doing a hybrid setup with *some* OSDs in RAID0/LVM.

Since this is a virtualisation use case (VMware and KVM), did you consider
journals? A 256GB SATA SSD is not enough for 36 filestore journals.
Assuming those 256GB SSDs have a performance profile compatible with
journals, a storage tier of OSDs with SSD journals (20%) and OSDs with
collocated journals (80%) could be nice. Then you place the VMs in different
tiers based on write latency requirements.

If you have the budget for it, you can fit 3x PCIe SSD/NVMe cards into
those StorageServers, which would make a 1:12 ratio and give pretty good write
latency.
Another option is to start with filestore and then upgrade to BlueStore when
it is stable.

IMO a single network for cluster and public is easier to manage. Since you
already have a 10G cluster, continue with that. Either:
1) If you are tight on 10G ports, do 2x10G per node and skip the 40G NIC
2) If you have plenty of ports, do 4x10G per node: split the 40G NIC into
4x10G.
13 servers (9+3) is usually too small to run in a single ToR setup. So you
should be good with an LACP pair of standard 10G switches as ToR, which you
probably already have?

Cheers,
Maxime

On Tue, 6 Jun 2017 at 08:33 Adrian Saul 
wrote:

> > > Early usage will be CephFS, exported via NFS and mounted on ESXi 5.5
> > > and
> > > 6.0 hosts(migrating from a VMWare environment), later to transition to
> > > qemu/kvm/libvirt using native RBD mapping. I tested iscsi using lio
> > > and saw much worse performance with the first cluster, so it seems
> > > this may be the better way, but I'm open to other suggestions.
> > >
> > I've never seen any ultimate solution to providing HA iSCSI on top of
> Ceph,
> > though other people here have made significant efforts.
>
> In our tests our best results were with SCST - also because it provided
> proper ALUA support at the time.  I ended up developing my own pacemaker
> cluster resources to manage the SCST orchestration and ALUA failover.  In
> our model we have  a pacemaker cluster in front being an RBD client
> presenting LUNs/NFS out to VMware (NFS), Solaris and Hyper-V (iSCSI).  We
> are using CephFS over NFS but performance has been poor, even using it just
> for VMware templates.  We are on an earlier version of Jewel so its
> possibly some later versions may improve CephFS for that but I have not had
> time to test it.
>
> We have been running a small production/POC for over 18 months on that
> setup, and gone live into a much larger setup in the last 6 months based on
> that model.  It's not without its issues, but most of that is a lack of
> test resources to be able to shake out some of the client compatibility and
> failover shortfalls we have.
>
> Confidentiality: This email and any attachments are confidential and may
> be subject to copyright, legal or some other professional privilege. They
> are intended solely for the attention and use of the named addressee(s).
> They may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been
> sent to you by mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] handling different disk sizes

2017-06-06 Thread Maxime Guyot
Hi Félix,

Changing the failure domain to OSD is probably the easiest option if this
is a test cluster. I think the commands would go like:
- ceph osd getcrushmap -o map.bin
- crushtool -d map.bin -o map.txt
- sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0
type osd/' map.txt
- crushtool -c map.txt -o map.bin
- ceph osd setcrushmap -i map.bin
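If you want to sanity-check the edited map before injecting it, crushtool can
simulate the mappings; the rule id 0 below is an assumption, check yours with
"ceph osd crush rule dump":

  crushtool -i map.bin --test --rule 0 --num-rep 3 --show-statistics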

Moving HDDs around to reach ~8TB/server would be a good option if this is a
capacity-focused use case. It will allow you to reboot 1 server at a time
without radosgw downtime. You would target 26/3 = 8.66TB/node, so:
- node1: 1x8TB
- node2: 1x8TB +1x2TB
- node3: 2x6 TB + 1x2TB

If you are more concerned about performance then set the weights to 1 on
all HDDs and forget about the wasted capacity.

Cheers,
Maxime


On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig 
wrote:

> Yet another option is to change the failure domain to OSD instead host
> (this avoids having to move disks around and will probably meet you initial
> expectations).
> Means your cluster will become unavailable when you loose a host until you
> fix it though. OTOH you probably don't have too much leeway anyway with
> just 3 hosts so it might be an acceptable trade-off. It also means you can
> just add new OSDs to the servers wherever they fit.
>
> On Tue, Jun 6, 2017 at 1:51 AM, David Turner 
> wrote:
>
>> If you want to resolve your issue without purchasing another node, you
>> should move one disk of each size into each server.  This process will be
>> quite painful as you'll need to actually move the disks in the crush map to
>> be under a different host and then all of your data will move around, but
>> then your weights will be able to utilize the weights and distribute the
>> data between the 2TB, 3TB, and 8TB drives much more evenly.
>>
>> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary  wrote:
>>
>>>
>>>
>>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>>> >
>>> > Hello,
>>> >
>>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> We have a small cluster for radosgw use only. It has three nodes,
>>> witch 3
>>> > ^  ^
>>> >> osds each. Each node has different disk sizes:
>>> >>
>>> >
>>> > There's your answer, staring you right in the face.
>>> >
>>> > Your default replication size is 3, your default failure domain is
>>> host.
>>> >
>>> > Ceph can not distribute data according to the weight, since it needs
>>> to be
>>> > on a different node (one replica per node) to comply with the replica
>>> size.
>>>
>>> Another way to look at it is to imagine a situation where 10TB worth of
>>> data
>>> is stored on node01 which has 8x3 24TB. Since you asked for 3 replicas,
>>> this
>>> data must be replicated to node02 but ... there only is 2x3 6TB
>>> available.
>>> So the maximum you can store is 6TB and remaining disk space on node01
>>> and node03
>>> will never be used.
>>>
>>> python-crush analyze will display a message about that situation and
>>> show which buckets
>>> are overweighted.
>>>
>>> Cheers
>>>
>>> >
>>> > If your cluster had 4 or more nodes, you'd see what you expected.
>>> > And most likely wouldn't be happy about the performance with your 8TB
>>> HDDs
>>> > seeing 4 times more I/Os than then 2TB ones and thus becoming the
>>> > bottleneck of your cluster.
>>> >
>>> > Christian
>>> >
>>> >> node01 : 3x8TB
>>> >> node02 : 3x2TB
>>> >> node03 : 3x3TB
>>> >>
>>> >> I thought that the weight handle the amount of data that every osd
>>> receive.
>>> >> In this case for example the node with the 8TB disks should receive
>>> more
>>> >> than the rest, right? All of them receive the same amount of data and
>>> the
>>> >> smaller disk (2TB) reaches 100% before the bigger ones. Am I doing
>>> >> something wrong?
>>> >>
>>> >> The cluster is jewel LTS 10.2.7.
>>> >>
>>> >> # ceph osd df
>>> >> ID WEIGHT  REWEIGHT SIZE   USE   AVAIL  %USE  VAR  PGS
>>> >>  0 7.27060  1.0  7445G 1012G  6432G 13.60 0.57 133
>>> >>  3 7.27060  1.0  7445G 1081G  6363G 14.52 0.61 163
>>> >>  4 7.27060  1.0  7445G  787G  6657G 10.58 0.44 120
>>> >>  1 1.81310  1.0  1856G 1047G   809G 56.41 2.37 143
>>> >>  5 1.81310  1.0  1856G  956G   899G 51.53 2.16 143
>>> >>  6 1.81310  1.0  1856G  877G   979G 47.24 1.98 130
>>> >>  2 2.72229  1.0  2787G 1010G  1776G 36.25 1.52 140
>>> >>  7 2.72229  1.0  2787G  831G  1955G 29.83 1.25 130
>>> >>  8 2.72229  1.0  2787G 1038G  1748G 37.27 1.56 146
>>> >>   TOTAL 36267G 8643G 27624G 23.83
>>> >> MIN/MAX VAR: 0.44/2.37  STDDEV: 18.60
>>> >> #
>>> >>
>>> >> # ceph osd tree
>>> >> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> >> -1 35.41795 root default
>>> >> -2 21.81180 host node01
>>> >>  0  7.27060 osd.0   up  1.0  1.0
>>> >>  3  7.27060 osd.3   up  1.0  1.0
>>> >>  4  7.27060  

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-01 Thread Maxime Guyot
Hi,

Lots of good info on SSD endurance in this thread.

For Ceph journals you should also consider the size of the backing OSDs: the
SSD journal won't last as long backing 5x8TB OSDs as it would backing 5x1TB OSDs.

For example, the S3510 480GB (275TB of endurance), if backing 5x8TB (40TB of)
OSDs, will provide very little endurance: assuming triple replication, you
will be able to fill the OSDs twice and that's about it (275/(5x8)/3).
On the other end of the scale, a 1.2TB S3710 backing 5x1TB will be able to
fill them 1620 times before running out of endurance (24300/(5x1)/3).

Ultimately it depends on your workload. Some people can get away with the
S3510 as a journal if the workload is read intensive, but in most cases the
higher endurance is a safe bet (S3710 or S3610).

Cheers,
Maxime


On Mon, 1 May 2017 at 11:04 Jens Dueholm Christensen 
wrote:

> Sorry for topposting, but..
>
> The Intel 35xx drives are rated for a much lower DWPD
> (drive-writes-per-day) than the 36xx or 37xx models.
>
> Keep in mind that a single SSD that acts as journal for 5 OSDs will
> recieve ALL writes for those 5 OSDs before the data is moved off to the
> OSDs actual data drives.
>
> This makes for quite a lot of writes, and along with the
> consumer/enterprise advice others have written about, your SSD journal
> devices will recieve quite a lot of writes over time.
>
> The S3510 is rated for 0.3 DWPD for 5 years (
> http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3510-spec.html
> )
> The S3610 is rated for 3 DWPD for 5 years  (
> http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3610-spec.html
> )
> The S3710 is rated for 10 DWPD for 5 years (
> http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html
> )
>
> A 480GB S3510 has no endurance left once you have written 0.275PB to it.
> A 480GB S3610 has no endurance left once you have written 3.7PB to it.
> A 400GB S3710 has no endurance left once you have written 8.3PB to it.
>
> This makes for quite a lot of difference over time - even if a S3510 wil
> only act as journal for 1 or 2 OSDs, it will wear out much much much faster
> than others.
>
> And I know I've used the xx10 models above, but the xx00 models have all
> been replaced by those newer models now.
>
> And yes, the xx10 models are using MLC NAND, but so were the xx00 models,
> that have a proven trackrecord and delivers what Intel promised in the
> datasheet.
>
> You could try and take a look at some of the enterprise SSDs that Samsung
> has launched.
> Price-wise they are very competitive to Intel, but I want to see (or at
> least hear from others) if they can deliver what their datasheet promises.
> Samsungs consumer SSDs did not (840/850 Pro), so I'm only using S3710s in
> my cluster.
>
>
> Before I created our own cluster some time ago, I found these threads from
> the mailinglist regarding the exact same disks we had been expecting to use
> (Samsung 840/850 Pro), that was quickly changed to Intel S3710s:
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17369.html
>
> A longish thread about Samsung consumer drives:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000572.html
> - highlights from that thread:
>   -
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000610.html
>   -
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000611.html
>   -
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000798.html
>
> Regards,
> Jens Dueholm Christensen
> Rambøll Survey IT
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adam Carheden
> Sent: Wednesday, April 26, 2017 5:54 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Sharing SSD journals and SSD drive choice
>
> Thanks everyone for the replies.
>
> I will be avoiding TLC drives, it was just something easy to benchmark
> with existing equipment. I hadn't though of unscrupulous data durability
> lies or performance suddenly tanking in unpredictable ways. I guess it
> all comes down to trusting the vendor since it would be expensive in
> time and $$ to test for such things.
>
> Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have "DC"
> prefixes and are listed in the Data Center section of their marketing
> pages, so I assume they'll all have the same quality underlying NAND.
>
> --
> Adam Carheden
>
>
> On 04/26/2017 09:20 AM, Chris Apsey wrote:
> > Adam,
> >
> > Before we deployed our cluster, we did extensive testing on all kinds of
> > SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME
> > Drives.  We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB
> > to 12x HGST 10TB SAS3 HDDs.  It provided the best
> > price/performance/density balance for us overall.  As a frame of
> > reference, we have 384 OSDs spread across 16 nodes.
> >
> > A few 

Re: [ceph-users] Data not accessible after replacing OSD with larger volume

2017-05-01 Thread Maxime Guyot
Hi,

"Yesterday I replaced one of the 100 GB volumes with a new 2 TB volume
which includes creating a snapshot, detaching the old volume, attaching the
new volume, then using parted to correctly set the start/end of the data
partition. This all went smoothly and no issues reported from AWS or the
server."
While this method should work, I think you would be better off adding the
new 2TB OSD and changing the weight of the old OSD to 0 before unmounting,
detaching and deleting the old one.

David is right, your weight and reweight values are off.

Do you have more info on your cluster status? Maybe something like OSD
nearfull? Is the data in a pool with triple replication?

Side note: since you run Ceph in AWS, you might be interested in this piece
from the folks at GitLab:
https://about.gitlab.com/2016/11/10/why-choose-bare-metal/

Cheers,
Maxime

On Mon, 1 May 2017 at 06:40 David Turner  wrote:

> The crush weight should match the size of your osds. The 100GB osds having
> 0.090 probably based on GiB vs GB. Your 2TB osds should have a weight of
> 2.000, or there about.  Your reweight values will be able to go back much
> closer to 1 once you fix the weights of the larger osds.  Fixing that might
> allow your cluster to finish backfilling.
>
> How do you access your images? Is it through cephfs, rgw, or rbd? Your
> current health doesn't look like it should prevent access to your images.
> The only thing I can think of other than mds or rgw not running would be to
> issue a deep scrub on some of the pgs on the newly increased osd to see if
> there are any inconsistent pgs on it.
>
> On Sun, Apr 30, 2017, 10:40 AM Scott Lewis  wrote:
>
>> Hi,
>>
>> I am a complete n00b to CEPH and cannot seem to figure out why my cluster
>> isn't working as expected. We have 39 OSDs, 36 of which are 100 GB
>> volumes and 3 are 2 TB volumes managed under AWS EC2.
>>
>> Yesterday I replaced one of the 100 GB volumes with a new 2 TB volume
>> which includes creating a snapshot, detaching the old volume, attaching the
>> new volume, then using parted to correctly set the start/end of the data
>> partition. This all went smoothly and no issues reported from AWS or the
>> server.
>>
>> However, when I started reweighting the OSDs, the health status went to
>> HEALTH_WARN with over 500 pgs stuck unclean, and about 14% of objects
>> misplaced. I am adding the health detail, crushmap, and OSD tree here:
>>
>> Crushmap: https://pastebin.com/HxiAChP3
>> Health Detail: https://pastebin.com/K7ZqLQH9
>> OSD Tree: https://pastebin.com/qGRk3R8S
>>
>> We use CEPH to storage our image inventory which is about 5 million or so
>> images. If you do a search on our site, https://iconfinder.com, none of
>> the images is showing up.
>>
>> This all started after doing the reweights when the new volume was added.
>> I tried setting all of the weights back to their original settings but this
>> did not help.
>>
>> The only other thing that I changed was to set the max PID threads to the
>> max allowed. I reset this to the original setting but that didn't work
>> either.
>>
>> sudo sysctl -w kernel.pid_max=32768
>>
>> Thanks in advance for any help.
>>
>> Scott Lewis
>> Sr. Developer & Head of Content
>> Iconfinder Aps
>>
>> http://iconfinder.com
>> http://twitter.com/iconfinder
>>
>> "Helping Designers Make a Living Doing What They Love"
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Ceph with Clos IP fabric

2017-04-22 Thread Maxime Guyot
Hi,

>That only makes sense if you're running multiple ToR switches per rack for the 
>public leaf network. Multiple public ToR switches per rack is not very common; 
>most Clos crossbar networks run a single ToR switch. Several >guides on the 
>topic (including Arista & Cisco) suggest that you use something like MLAG in a 
>layer 2 domain between the switches if you need some sort of switch redundancy 
>inside the rack. This increases complexity, and most people decide that it's 
>not worth it and instead scale out across racks to gain the redundancy and 
>survivability that multiple ToR offer.
If you use MLAG for L2 redundancy, you'll still want 2 BGP sessions for L3 
redundancy, so why not skip the MLAG altogether and terminate your BGP 
sessions on each ToR?

Judging by the routes (169.254.0.1), you are using BGP unnumbered?

It sounds like the “ip route get” output you get when using dummy0 is caused by 
a fallback on the default route, supposedly on eth0? Can you check the exact routes 
received on server1 with “show ip bgp neighbors  received-routes” 
(once you enable “neighbor  soft-reconfiguration inbound”), and what’s 
installed in the routing table with “ip route”?


Intrigued by this problem, I tried to reproduce it in a lab with virtualbox. I 
ran into the same problem.

Side note: Configuring the loopback IP on the physical interfaces is workable 
if you set it on **all** parallel links. Example with server1:

“iface enp3s0f0 inet static
  address 10.10.100.21/32
iface enp3s0f1 inet static
  address 10.10.100.21/32
iface enp4s0f0 inet static
  address 10.10.100.21/32
iface enp4s0f1 inet static
  address 10.10.100.21/32”

This should guarantee that the loopback IP is advertised as long as one of the 4 links 
to switch1 and switch2 is up, but I am not sure if that’s workable for Ceph’s 
listening address.


Cheers,
Maxime

From: Richard Hesse <richard.he...@weebly.com>
Date: Thursday 20 April 2017 16:36
To: Maxime Guyot <maxime.gu...@elits.com>
Cc: Jan Marquardt <j...@artfiles.de>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph with Clos IP fabric

On Thu, Apr 20, 2017 at 2:13 AM, Maxime Guyot <maxime.gu...@elits.com> wrote:
>2) Why did you choose to run the ceph nodes on loopback interfaces as opposed 
>to the /24 for the "public" interface?
I can’t speak for this example, but in a clos fabric you generally want to 
assign the routed IPs on loopback rather than physical interfaces. This way if 
one of the links goes down (e.g. the public interface), the routed IP is still 
advertised on the other link(s).

That only makes sense if you're running multiple ToR switches per rack for the 
public leaf network. Multiple public ToR switches per rack is not very common; 
most Clos crossbar networks run a single ToR switch. Several guides on the 
topic (including Arista & Cisco) suggest that you use something like MLAG in a 
layer 2 domain between the switches if you need some sort of switch redundancy 
inside the rack. This increases complexity, and most people decide that it's 
not worth it and instead  scale out across racks to gain the redundancy and 
survivability that multiple ToR offer.

On Thu, Apr 20, 2017 at 4:04 AM, Jan Marquardt <j...@artfiles.de> wrote:

Maxime, thank you for clarifying this. Each server is configured like this:

lo/dummy0: Loopback interface; Holds the ip address used with Ceph,
which is announced by BGP into the fabric.

enp5s0: Management Interface, which is used only for managing the box.
There should not be any Ceph traffic on this one.

enp3s0f0: connected to sw01 and used for BGP
enp3s0f1: connected to sw02 and used for BGP
enp4s0f0: connected to sw01 and used for BGP
enp4s0f1: connected to sw02 and used for BGP

These four interfaces are supposed to transport the Ceph traffic.

See above. Why are you running multiple public ToR switches in this rack? I'd 
suggest switching them to a single layer 2 domain and participate in the Clos 
fabric as a single unit, or scale out across racks (preferred). Why bother with 
multiple switches in a rack when you can just use multiple racks? That's the 
beauty of Clos: just add more spines if you need more leaf to leaf bandwidth.

How many OSD, servers, and racks are planned for this deployment?

-richard



Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Maxime Guyot
Hi,

>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>1:4-5 is common but depends on your needs and the devices in question, ie. 
>assuming LFF drives and that you aren’t using crummy journals.

You might be speaking about different ratios here. I think that Anthony is 
speaking about the journal:OSD ratio and Reed about the capacity ratio between 
the HDD and SSD tiers/roots. 

I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on HDD); 
like Richard says, you’ll get much better random read performance with the primary 
OSD on SSD, but write performance won’t be amazing since you still have 2 HDD 
copies to write before the ACK. 

I know the doc suggests using primary affinity, but since it’s an OSD-level 
setting it does not play well with other storage tiers, so I searched for other 
options. From what I have tested, a rule that selects the first/primary OSD 
from the ssd-root and then the rest of the copies from the hdd-root works, though 
I am not sure it is *guaranteed* that the first OSD selected will be primary.

“rule hybrid {
  ruleset 2
  type replicated
  min_size 1
  max_size 10
  step take ssd-root
  step chooseleaf firstn 1 type host
  step emit
  step take hdd-root
  step chooseleaf firstn -1 type host
  step emit
}”
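One way to spot-check this is to map a few object names and see which OSD
comes back as primary; the pool name "hybrid-pool" and the object names are
just examples:

  for obj in obj1 obj2 obj3; do
      # prints e.g. "... up ([3,12,17], p3) ...": p3 is the primary OSD
      ceph osd map hybrid-pool $obj
  done
  # then check with "ceph osd tree" that the primary sits under the ssd-root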

Cheers,
Maxime





Re: [ceph-users] Adding a new rack to crush map without pain?

2017-04-19 Thread Maxime Guyot
Hi Matthew,

I would expect the osd_crush_location parameter to take effect from OSD 
activation. Maybe ceph-ansible would have info there?
A workaround might be: "set noin", restart all the OSDs once the ceph.conf 
includes the crush location, and enjoy the automatic CRUSHmap update (if you 
have osd crush update on start = true).
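A rough sketch of that workaround in commands; the rack/host names reuse your
example, osd.104 is taken from your snippet, and the systemd target assumes a
systemd-based deployment:

  ceph osd crush add-bucket rack4 rack
  ceph osd crush move rack4 root=default
  ceph osd set noin                  # new OSDs activate but are not marked "in"
  # ... deploy the new node; once its ceph.conf contains
  # "osd crush location = root=default rack=4 host=sto-4-1", restart its OSDs
  # so they place themselves under the new rack:
  systemctl restart ceph-osd.target
  ceph osd unset noin
  ceph osd in osd.104                # then mark each new OSD in, on your schedule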

Cheers,
Maxime

On 12/04/17 18:46, "ceph-users on behalf of Matthew Vernon" 
 wrote:

Hi,

Our current (jewel) CRUSH map has rack / host / osd (and the default
replication rule does step chooseleaf firstn 0 type rack). We're shortly
going to be adding some new hosts in new racks, and I'm wondering what
the least-painful way of getting the new osds associated with the
correct (new) rack will be.

We deploy with ceph-ansible, which can add bits of the form
[osd.104]
osd crush location = root=default rack=1 host=sto-1-1

to ceph.conf, but I think this doesn't help for new osds, since
ceph-disk will activate them before ceph.conf is fully assembled (and
trying to arrange it otherwise would be serious hassle).

Would making a custom crush location hook be the way to go? then it'd
say rack=4 host=sto-4-x and new osds would end up allocated to rack 4?
And would I need to have done ceph osd crush add-bucket rack4 rack
first, presumably?

I am planning on adding osds to the cluster one box at a time, rather
than going with the add-everything-at-crush-weight-0 route; if nothing
else it seems easier to automate. And I'd rather avoid having to edit
the crush map directly...

Any pointers welcomed :)

Regards,

Matthew


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Ceph extension - how to equilibrate ?

2017-04-19 Thread Maxime Guyot
Hi Pascal,

I ran into the same situation some time ago: a small cluster where I added a node 
with HDDs double the size of the existing ones. I wrote about it here: 
http://ceph.com/planet/the-schrodinger-ceph-cluster/

When adding OSDs to a cluster rebalancing/data movement is unavoidable in most 
cases. Since you will be going from a 144TB cluster to a 240 TB cluster, you 
can estimate that +66% of your data will be rebalanced/moved.

Peter already covered how to move the HDDs from one server to another (incl. 
journal). I just want to point out that you can do the “ceph osd crush set" 
before you do the physical move of the drives. This lets you rebalance on your 
own terms (schedule, rollback etc…).

The easy way:

-  Create the new OSDs (8TB) with weight 0

-  Move each OSD to its desired location and weight: “ceph osd crush 
set osd.X  root= host=”

-  Monitor and wait for the rebalance to be done (a few days or weeks 
depending on performance)

-  Set noout && physically move the drives && unset noout

In production, you want to consider the op priority and the granularity of the 
increase (increasing weights progressively etc…).
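For illustration, a rough sketch of a progressive increase for a single OSD;
the OSD id and target weight are examples (roughly an 8TB drive), and the loop
simply waits for the cluster to settle between steps:

  for w in 1.0 2.0 3.0 4.0 5.0 6.0 7.28; do
      ceph osd crush reweight osd.36 $w
      # wait for backfill/recovery to finish before the next step
      while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
  done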

Cheers,
Maxime

From: ceph-users  on behalf of Peter Maloney 

Date: Tuesday 18 April 2017 20:26
To: "pascal.pu...@pci-conseil.net" , 
"ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Ceph extension - how to equilibrate ?

On 04/18/17 16:31, 
pascal.pu...@pci-conseil.net wrote:

Hello,

Just an advise : next time, I will extend my Jewel ceph cluster with a fourth 
node.

Actually, we have 3 x nodes of 12 x OSD with 4TB DD (36 x DD 4TB).

I will add a new node with 12 x 8TB DD (will add 12 new OSD => 48 OSD).
I hope those aren't SMR disks... make sure they're not or it will be very slow, 
to the point where osds will time out and die.


So, how to simply equilibrate ?

How to just unplug 3 x DD 4TB per node and add to fourth node  and just plug 3 
x 8TB in each node coming from fourth node ?
I think you only have to stop them (hopefully not enough to cause missing 
objects, and optionally set noout first), unmount them, move the disks, mount 
them and start them on the new node. Then change the crush rule:

ceph osd crush move osd.X host=nodeY

If your journals aren't being moved too, then flush the journals after the osds 
are stopped:

sync
ceph-osd --id $n --setuser ceph --setgroup ceph --flush-journal

(if that crashes, start the osd, then stop again, and retry)

and before starting them, make new journals.

ceph-osd --id $n --setuser ceph --setgroup ceph --mkjournal



I want at the end : 3 x DD 8TB per node and 9 x DD 4TB per node ?

How to do that in the easyest way ?

I don't want move all data : It will take a long time per OSD...
I don't know how much data this will move if any... but if it moves data, you 
probably don't have a choice.


Is there a way to just switch OSD between node ?

Thanks for your help.
Pascal,





___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Flapping OSDs

2017-04-03 Thread Maxime Guyot
Hi Vlad,

I am curious whether those OSDs are flapping all at once. If a single host is 
affected, I would consider the network connectivity (bottlenecks and 
misconfigured bonds can generate strange situations), the storage controller and 
firmware.
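A few quick things worth checking on the affected node; the bond interface
name and tools are examples (iperf3 needs a server running on the other end),
the IP is one of your cluster addresses:

  cat /proc/net/bonding/bond0          # bond and slave link state, if bonding is used
  ping -M do -s 8972 172.50.50.101     # path MTU check if jumbo frames are enabled
  iperf3 -c 172.50.50.101              # raw throughput on the cluster network
  dmesg | egrep -i 'link|nic|reset|error' | tail -n 50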

Cheers,
Maxime

From: ceph-users  on behalf of Vlad Blando 

Date: Sunday 2 April 2017 16:28
To: ceph-users 
Subject: [ceph-users] Flapping OSDs

Hi,

One of my ceph nodes have flapping OSDs, network between nodes are fine, it's 
on a 10GBase-T network. I don't see anything wrong with the network, but these 
OSDs are going up/down.

[root@avatar0-ceph4 ~]# ceph osd tree
# idweight  type name   up/down reweight
-1  174.7   root default
-2  29.12   host avatar0-ceph2
16  3.64osd.16  up  1
17  3.64osd.17  up  1
18  3.64osd.18  up  1
19  3.64osd.19  up  1
20  3.64osd.20  up  1
21  3.64osd.21  up  1
22  3.64osd.22  up  1
23  3.64osd.23  up  1
-3  29.12   host avatar0-ceph0
0   3.64osd.0   up  1
1   3.64osd.1   up  1
2   3.64osd.2   up  1
3   3.64osd.3   up  1
4   3.64osd.4   up  1
5   3.64osd.5   up  1
6   3.64osd.6   up  1
7   3.64osd.7   up  1
-4  29.12   host avatar0-ceph1
8   3.64osd.8   up  1
9   3.64osd.9   up  1
10  3.64osd.10  up  1
11  3.64osd.11  up  1
12  3.64osd.12  up  1
13  3.64osd.13  up  1
14  3.64osd.14  up  1
15  3.64osd.15  up  1
-5  29.12   host avatar0-ceph3
24  3.64osd.24  up  1
25  3.64osd.25  up  1
26  3.64osd.26  up  1
27  3.64osd.27  up  1
28  3.64osd.28  up  1
29  3.64osd.29  up  1
30  3.64osd.30  up  1
31  3.64osd.31  up  1
-6  29.12   host avatar0-ceph4
32  3.64osd.32  up  1
33  3.64osd.33  up  1
34  3.64osd.34  up  1
35  3.64osd.35  up  1
36  3.64osd.36  up  1
37  3.64osd.37  up  1
38  3.64osd.38  up  1
39  3.64osd.39  up  1
-7  29.12   host avatar0-ceph5
40  3.64osd.40  up  1
41  3.64osd.41  up  1
42  3.64osd.42  up  1
43  3.64osd.43  up  1
44  3.64osd.44  up  1
45  3.64osd.45  up  1
46  3.64osd.46  up  1
47  3.64osd.47  up  1
[root@avatar0-ceph4 ~]#


Here is my ceph.conf
---
[root@avatar0-ceph4 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 2f0d1928-2ee5-4731-a259-64c0dc16110a
mon_initial_members = avatar0-ceph0, avatar0-ceph1, avatar0-ceph2
mon_host = 172.40.40.100,172.40.40.101,172.40.40.102
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
osd_pool_default_min_size = 1
cluster_network = 172.50.50.0/24
public_network = 172.40.40.0/24
max_open_files = 131072
mon_clock_drift_allowed = .15
mon_clock_drift_warn_backoff = 30
mon_osd_down_out_interval = 300
mon_osd_report_timeout = 300
mon_osd_min_down_reporters = 3


[osd]
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 8
osd_max_backfills = 1
osd_recovery_op_priority = 1
osd_recovery_max_active = 1

[client]
rbd_cache = true
rbd_cache_writethrough_until_flush = true
---

Here's the log snippet on osd.34
---
2017-04-02 22:26:10.371282 7f1064eab700  0 -- 
172.50.50.105:6816/117130897 >> 
172.50.50.101:6808/190698 pipe(0x156a1b80 
sd=124 :46536 s=2 pgs=966 cs=1 l=0 c=0x13ae19c0).fault with nothing to send, 
going to standby
2017-04-02 22:26:10.371360 7f106ed5c700  0 -- 
172.50.50.105:6816/117130897 >> 
172.50.50.104:6822/181109 pipe(0x1018c2c0 
sd=75 :34196 s=2 pgs=1022 cs=1 l=0 c=0x1098fa20).fault with nothing 

Re: [ceph-users] How to think a two different disk's technologies architecture

2017-03-23 Thread Maxime Guyot
Hi Alexandro,

As I understand it, you are planning NVMe journals for the SATA HDDs and collocated 
journals for the SATA SSDs?

Option 1:
- 24x SATA SSDs per server will hit a bottleneck on the storage bus/controller. 
Also consider the network capacity: 24x SSDs will deliver more performance than 
24x HDDs with journals, yet both types of nodes have the same network capacity.
- This option is a little easier to implement: just place the nodes in different 
CRUSHmap roots
- Failure of a server (assuming size = 3) will impact all PGs
Option 2:
- You may get a noisy-neighbour effect between HDDs and SSDs if the HDDs are able 
to saturate your NICs or storage controller, so be mindful of this in the 
hardware design
- To configure the CRUSHmap for this you need to split each server in 2; I 
usually use “server1-hdd” and “server1-ssd” and map the right OSDs into the right 
bucket. It is a little extra work, but you can easily set up a “crush location 
hook” script for it (see the example at 
http://www.root314.com/2017/01/15/Ceph-storage-tiers/ and the sketch below)
- In case of a server failure, recovery will be faster than option 1 and will 
impact fewer PGs
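For reference, here is a minimal sketch of what such a hook could look like. It
assumes a filestore layout under /var/lib/ceph/osd/ceph-<id>, plain sdX device
names, and buckets named <host>-ssd / <host>-hdd under roots "ssd" and "hdd";
point "osd crush location hook" in ceph.conf at the script and adapt to taste:

    #!/bin/sh
    # Hypothetical crush location hook: Ceph calls it with
    #   --cluster <name> --id <osd id> --type osd
    # and expects a location string such as "host=server1-ssd root=ssd" on stdout.
    while [ $# -ge 1 ]; do
        [ "$1" = "--id" ] && { shift; ID="$1"; }
        shift
    done

    # Block device backing this OSD's data dir, e.g. /dev/sdb1 -> sdb
    PART=$(df /var/lib/ceph/osd/ceph-"$ID" | awk 'NR==2 {print $1}')
    DISK=$(basename "$PART" | sed 's/[0-9]*$//')

    if [ "$(cat /sys/block/"$DISK"/queue/rotational)" = "0" ]; then
        echo "host=$(hostname -s)-ssd root=ssd"
    else
        echo "host=$(hostname -s)-hdd root=hdd"
    fi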

Some general notes:
- SSD pools perform better with higher frequency CPUs
- the 1GB of RAM per TB is a little outdated, the current consensus for HDD 
OSDs is around 2GB/OSD (see 
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf)
- Network-wise, if the SSD OSDs are rated for 500MB/s and use collocated 
journals, you could generate up to 250MB/s of traffic per SSD OSD (24Gbps for 12x 
or 48Gbps for 24x), therefore I would consider doing 4x10G and consolidating both 
client and cluster networks on that

Cheers,
Maxime

On 23/03/17 18:55, "ceph-users on behalf of Alejandro Comisario" 
 wrote:

Hi everyone!
I have to install a ceph cluster (6 nodes) with two "flavors" of
disks, 3 servers with SSD and 3 servers with SATA.

I will purchase 24-disk servers (the SATA ones with an NVMe SSD for
the SATA journals).
Processors will be 2 x E5-2620v4 with HT, and RAM will be 20GB for the
OS plus 1.3GB of RAM per TB of storage.

The servers will have 2 x 10Gb bonding for the public network and 2 x 10Gb
for the cluster network.
My doubts reside here, and I want to ask the community about the experiences,
pains and gains of choosing between:

Option 1
3 x servers just for SSD
3 x servers just for SATA

Option 2
6 x servers with 12 SSD and 12 SATA each

Regarding crushmap configuration and rules everything is clear to make
sure that two pools (poolSSD and poolSATA) uses the right disks.

But, what about performance, maintenance, architecture scalability, etc ?

thank you very much !

-- 
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need erasure coding, pg and block size explanation

2017-03-21 Thread Maxime Guyot
Hi Vincent,

There is no buffering until the object reaches 8MB. When the object is written, 
it has a given size. RADOS just splits the object into K chunks; padding occurs 
if the object size is not a multiple of K.

See also: 
http://docs.ceph.com/docs/master/dev/osd_internals/erasure_coding/developer_notes/
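To make the splitting/padding concrete, here is a purely illustrative Python
sketch. It is not the actual RADOS/jerasure code (which also aligns chunks to
the stripe/packet size and computes real coding chunks), just the idea:

    # Conceptual only: split an object into k equally sized chunks, padding with zeros.
    def split_object(data, k):
        chunk_size = -(-len(data) // k)               # ceil(len / k)
        padded = data.ljust(chunk_size * k, b"\x00")  # pad so the size divides evenly
        return [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]

    obj = b"x" * (9 * 1024 * 1024)   # a 9 MB object written in one go
    chunks = split_object(obj, k=2)  # two ~4.5 MB data chunks
    # m coding chunks of the same size are then computed, and each of the k+m
    # chunks is stored on a different OSD of the PG's acting set.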

Cheers,
Maxime

From: ceph-users  on behalf of Vincent Godin 

Date: Tuesday 21 March 2017 17:16
To: ceph-users 
Subject: [ceph-users] Need erasure coding, pg and block size explanation

When we use a replicated pool of size 3, for example, each piece of data (a block of 4MB) 
is written to one PG which is distributed over 3 hosts (by default). The OSD 
holding the primary will copy the block to the OSDs holding the secondary and third 
copies of the PG.
With erasure code, let's take a RAID5-like schema with k=2 and m=1. Does Ceph buffer 
the data until it reaches an amount of 8 MB, which it can then divide into two 
blocks of 4MB and a parity block of 4MB? Or does it just divide the data into 
two chunks whatever the size? Will it then use PG1 on OSD.A to store the 
first block, PG1 on OSD.X to store the second block of data and PG1 on OSD.Z to 
store the parity?
Thanks for your explanation, because I didn't find any clear explanation of how 
data chunks and parity are placed.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] total storage size available in my CEPH setup?

2017-03-14 Thread Maxime Guyot
Hi,

>> My question is how much total CEPH storage does this allow me? Only 2.3TB? 
>> or does the way CEPH duplicates data enable more than 1/3 of the storage?
> 3 means 3, so 2.3TB. Note that Ceph is sparse (thin provisioned), so that can help quite a bit.

To expand on this, you probably want to keep some margin and not run your 
cluster at 100% :) (especially if you are running RBD with thin provisioning). By 
default, “ceph status” will issue a warning at 85% full (osd nearfull ratio). 
You should also consider that you need some free space for self-healing to work 
(if you plan to use more than 3 OSDs on a size=3 pool).
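For reference, the relevant knobs are (the values below are the defaults):

    mon osd nearfull ratio = 0.85   # HEALTH_WARN threshold
    mon osd full ratio = 0.95       # writes are blocked beyond this

and “ceph df” / “ceph osd df” are handy to keep an eye on how full individual
OSDs are, since usage is rarely perfectly balanced.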

Cheers,
Maxime 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs and erasure coding

2017-03-08 Thread Maxime Guyot
Hi,

>“The answer as to how to move an existing cephfs pool from replication to 
>erasure coding (and vice versa) is to create the new pool and rsync your data 
>between them.”
Shouldn’t it be possible to just do the “ceph osd tier add  ecpool cachepool && 
ceph osd tier cache-mode cachepool writeback” and let Ceph redirect the 
requests (CephFS or other) to the cache pool?
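For reference, the usual cache-tier sequence would look something like this
(pool names and values are just examples, and I have not tried it with an
existing CephFS data pool, so take it with a grain of salt):

    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool
    ceph osd pool set cachepool hit_set_type bloom
    ceph osd pool set cachepool target_max_bytes 1099511627776   # 1 TiB, example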

Cheers,
Maxime

From: ceph-users  on behalf of David Turner 

Date: Wednesday 8 March 2017 22:27
To: Rhian Resnick , "ceph-us...@ceph.com" 

Subject: Re: [ceph-users] cephfs and erasure coding

I use CephFS on erasure coding at home using a cache tier.  It works fine for 
my use case, but we know nothing about your use case to know if it will work 
well for you.

The answer as to how to move an existing cephfs pool from replication to 
erasure coding (and vice versa) is to create the new pool and rsync your data 
between them.


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Rhian Resnick 
[rresn...@fau.edu]
Sent: Wednesday, March 08, 2017 12:54 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] cephfs and erasure coding

Two questions on Cephfs and erasure coding that Google couldn't answer.





1) How well does cephfs work with erasure coding?



2) How would you move an existing cephfs pool that uses replication to erasure 
coding?



Rhian Resnick

Assistant Director Middleware and HPC

Office of Information Technology



Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Shrinking lab cluster to free hardware for a new deployment

2017-03-08 Thread Maxime Guyot
Hi Kevin,

I don’t know about those flags, but if you want to shrink your cluster you can 
simply set the weight of the OSDs to be removed to 0 like so: “ceph osd 
reweight osd.X 0”
You can either do it gradually if you are concerned about client I/O (probably 
not needed, since you speak of a test / semi-prod cluster) or all at once.
This should take care of all the data movements.

Once the cluster is back to HEALTH_OK, you can then proceed with the standard 
remove OSD procedure: 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#removing-osds-manual
You should be able to delete all the OSDs in a short period of time since the 
data movement has already been taken care of with the reweight.
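In short, for each OSD X to be removed, something like this (the stop command
depends on your init system):

    ceph osd reweight osd.X 0     # drain the data off the OSD
    # wait for HEALTH_OK, then:
    ceph osd out osd.X
    systemctl stop ceph-osd@X     # or: stop ceph-osd id=X
    ceph osd crush remove osd.X
    ceph auth del osd.X
    ceph osd rm osd.X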

I hope that helps.

Cheers,
Maxime

From: ceph-users  on behalf of Kevin Olbrich 

Date: Wednesday 8 March 2017 14:39
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] Shrinking lab cluster to free hardware for a new 
deployment

Hi!

Currently I have a cluster with 6 OSDs (5 hosts, 7TB RAID6 each).
We want to shut down the cluster but it holds some semi-productive VMs we might 
or might not need in the future.
To keep them, we would like to shrink our cluster from 6 to 2 OSDs (we use size 
2 and min_size 1).

Should I set the OSDs out one by one or with norefill, norecovery flags set but 
all at once?
If last is the case, which flags should be set also?

Thanks!

Kind regards,
Kevin Olbrich.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication vs Erasure Coding with only 2 elements in the failure-domain.

2017-03-08 Thread Maxime Guyot
Hi,

If using Erasure Coding, I think that should be using “choose indep” rather 
than “firstn” (according to 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-January/007306.html)

“- min_size 4
- max_size 4
- step take <room1>
- step chooseleaf firstn 2 type host
- step emit
- step take <room2>
- step chooseleaf firstn 2 type host
- step emit

Unfortunately I'm not aware of a solution. It would require replacing the fixed 
'step take <room1>' / 'step take <room2>' steps with some form of iteration over 
the rooms. Iteration is not part of crush as far as I 
know. Maybe someone else can give some more insight into this.”

How about something like this:

“rule eck2m2_ruleset {
  ruleset 0
  type erasure
  min_size 4
  max_size 4
  step take default
  step choose indep 2 type room
  step chooseleaf indep 2 type host
  step emit
}”
Such a rule should put 2 shards in each room, on 4 different hosts in total.
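You can sanity check such a rule offline with crushtool before injecting it
(file names and the rule number are examples):

    ceph osd getcrushmap -o map.bin
    crushtool -d map.bin -o map.txt        # edit map.txt, add the rule
    crushtool -c map.txt -o map.new
    crushtool -i map.new --test --rule 0 --num-rep 4 --show-mappings --x 1
    crushtool -i map.new --test --rule 0 --num-rep 4 --show-bad-mappings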

If you are serious about surviving the loss of one of the rooms, you might want 
to consider the recovery time and how likely it is to have an OSD failure in 
the surviving room during the recovery phase. Something like EC(n,n+1) or  LRC 
(http://docs.ceph.com/docs/master/rados/operations/erasure-code-lrc/) might 
help.

Cheers,
Maxime

From: ceph-users  on behalf of Burkhard 
Linke 
Date: Wednesday 8 March 2017 08:05
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Replication vs Erasure Coding with only 2 
elements in the failure-domain.


Hi,

On 03/07/2017 05:53 PM, Francois Blondel wrote:

Hi all,



We have (only) 2 separate "rooms" (crush bucket) and would like to build a 
cluster being able to handle the complete loss of one room.

*snipsnap*

Second idea would be to use Erasure Coding, as it fits our performance 
requirements and would use less raw space.



Creating an EC profile like:

   “ceph osd erasure-code-profile set eck2m2room k=2 m=2 
ruleset-failure-domain=room”

and a pool using that EC profile, with “ceph osd pool create ecpool 128 128 
erasure eck2m2room” of course leads to having “128 creating+incomplete” PGs, as 
we only have 2 rooms.



Is there somehow a way to store the “parity chunks” (m) in both rooms, so that 
the loss of a room could be survived?



If I understood correctly, an Erasure Coding of for example k=2, m=2, would use 
the same space as a replication with a size of 2, but be more reliable, as we 
could afford the loss of more OSDs at the same time.

Would it be possible to instruct the crush rule to store the first k and m 
chunks in room 1, and the second k and m chunks in room 2?

As far as I understand erasure coding there's no special handling for parity or 
data chunks. To assemble an EC object you just need k chunks, regardless 
whether they are data or parity chunks.

You should be able to distribute the chunks among two rooms by creating a new 
crush rule:

- min_size 4
- max_size 4
- step take <room1>
- step chooseleaf firstn 2 type host
- step emit
- step take <room2>
- step chooseleaf firstn 2 type host
- step emit

I'm not 100% sure whether chooseleaf is correct or whether another choose step is 
necessary to ensure that two OSDs from different hosts are chosen (if 
necessary). The important point is using two choose-emit cycles and using the 
correct start points. Just insert the crush labels for the rooms.

This approach should work, but it has two drawbacks:

- crash handling
In case of a host failing in a room, the PGs from that host will be replicated to 
another host in the same room. You have to ensure that there's enough capacity 
in each room (vs. having enough capacity in the complete cluster), which might 
be tricky.

- bandwidth / host utilization
Almost all ceph based applications/libraries use the 'primary' osd for 
accessing data in a PG. The primary OSD is the first one generated by the crush 
rule. In the upper example, the primary OSDs will all be located in the first 
room. All client traffic will be heading to hosts in that room. Depending on 
your setup this might not be a desired solution.

Unfortunately I'm not aware of a solution. It would require replacing the fixed 
'step take <room1>' / 'step take <room2>' steps with some form of iteration over 
the rooms. Iteration is not part of crush as far as I 
know. Maybe someone else can give some more insight into this.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replica questions

2017-03-03 Thread Maxime Guyot
Hi Henrik and Matteo,

I agree with Henrik: increasing your replication factor won’t improve 
recovery or read performance on its own. However, if you are changing from replica 2 to 
replica 3, you might need to scale out your cluster to have enough space for 
the additional replicas, and that scale-out would improve recovery and read 
performance.

Cheers,
Maxime

From: ceph-users  on behalf of Henrik Korkuc 

Date: Friday 3 March 2017 11:35
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] replica questions

On 17-03-03 12:30, Matteo Dacrema wrote:
Hi All,

I’ve a production cluster made of 8 nodes, 166 OSDs and 4 Journal SSD every 5 
OSDs with replica 2 for a total RAW space of 150 TB.
I’ve few question about it:


  *   It’s critical to have replica 2? Why?
Replica size 3 is highly recommended. I do not know the exact numbers, but it 
decreases the chance of data loss, as double disk failures appear to be quite a 
frequent thing, especially in larger clusters.


  *   Does replica 3 makes recovery faster?
no


  *   Does replica 3 makes rebalancing and recovery less heavy for customers? 
If I lose 1 node does replica 3 reduce the IO impact respect a replica 2?
no


  *   Does read performance increase with replica 3?
no


Thank you
Regards
Matteo


This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CrushMap Rule Change

2017-03-02 Thread Maxime Guyot
Hi Ashley,

The rule you indicated, with “step choose indep 0 type osd”, should select 13 
different OSDs but not necessarily on 13 different servers. So you should be able 
to test that on say 4 servers if you have ~4 OSDs per server.

To split the selected OSDs across 4 hosts, I think you would do something like:
“step take fourtb
step choose indep 4 type host
step choose indep 4 type osd
step emit”
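You can also check how the 13 chunks would land before applying the change, e.g.
(the map file name is an example, and rule 2 being your “sas” ruleset):

    crushtool -i map.bin --test --rule 2 --num-rep 13 --show-mappings --min-x 1 --max-x 10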

Cheers,
Maxime


From: ceph-users  on behalf of Ashley 
Merrick 
Date: Thursday 2 March 2017 11:34
To: "ceph-us...@ceph.com" 
Subject: [ceph-users] CrushMap Rule Change

Hello,

I am currently doing some erasure code tests in a dev environment.

I have set the following by “default”

rule sas {
ruleset 2
type erasure
min_size 3
max_size 13
step set_chooseleaf_tries 5
step set_choose_tries 100
step take fourtb
step choose indep 0 type osd
step emit
}

As I am splitting the file into 13 chunks it is placing these across 13 
different OSD’s.

In the DEV environment I do not have 13 hosts to do full host replication, 
however I am sure I can change the crush map rule to try and split evenly 
across the 4 HOST I have.

I think I will need to tell it to pick 4 HOSTs, and then a second line to 
pick OSDs; however, as 13 does not divide by 4 exactly, what would be the best 
way to lay out this crushmap rule?

Thanks,
Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase number of replicas per node

2017-02-27 Thread Maxime Guyot
Hi Massimiliano,

You’ll need to update the rule with something like that:

rule rep6 {
ruleset 1
type replicated
min_size 6
max_size 6
step take root
step choose firstn 3 type host
step choose firstn 2 type osd
step emit
}

Testing it with crushtool, and assuming a crush map where osd.0-3 are in host1, 
osd.4-7 in host2 and osd.8-11 in host3, I get the following results:
crushtool -i map.bin --test --rule 1 --show-mappings --x 1 --num-rep 6
CRUSH rule 1 x 1 [3,0,9,11,5,7]

Cheers,
Maxime

On 27/02/17 13:22, "ceph-users on behalf of Massimiliano Cuttini" 
 wrote:

Dear all,

i have 3 nodes with 4 OSD each.
And I would like to have 6 replicas.
So 2 replicas for nodes.

Does anybody know how to allow CRUSH to use twice the same node but 
different OSD?

Thanks,
Max


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-18 Thread Maxime Guyot
Hi Rick,


If you have some numbers and info on the setup that would be greatly 
appreciated.

I noticed Wido's blog post about SMR drives: 
https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/ so I guess he 
ran into some problems?


Cheers,
Maxime

On Feb 18, 2017 23:04, rick stehno <rs3...@me.com> wrote:
I work for Seagate and have done over a hundred tests using SMR 8TB disks in 
a cluster. Whether SMR HDDs would be the best choice all depends on what your 
access pattern is. Remember SMR HDDs don't perform well doing random writes, but are 
excellent for reads and sequential writes.
I have many tests where I added an SSD or PCIe flash card to place the journals 
on, and SMR performed better than a typical CMR disk and was overall 
cheaper than using all CMR HDDs. You can also use some type of caching like the Ceph 
cache tier or other caching with very good results.
By placing the journals on flash or adopting some type of caching you are 
eliminating the double writes to the SMR HDDs and performance should be fine. I 
have test results if you would like to see them.

Rick
Sent from my iPhone, please excuse any typing errors.

> On Feb 17, 2017, at 8:49 PM, Mike Miller <millermike...@gmail.com> wrote:
>
> Hi,
>
> don't go there, we tried this with SMR drives, which will slow down to 
> somewhere around 2-3 IOPS during backfilling/recovery and that renders the 
> cluster useless for client IO. Things might change in the future, but for 
> now, I would strongly recommend against SMR.
>
> Go for normal SATA drives with only slightly higher price/capacity ratios.
>
> - mike
>
>> On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:
>> On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
>> <ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com> wrote:
>>>
>>>> On 3 February 2017 at 11:03, Maxime Guyot
>>>> <maxime.gu...@elits.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Interesting feedback!
>>>>
>>>>  > In my opinion the SMR can be used exclusively for the RGW.
>>>>  > Unless it's something like a backup/archive cluster or pool with
>>>> little to none concurrent R/W access, you're likely to run out of IOPS
>>>> (again) long before filling these monsters up.
>>>>
>>>> That’s exactly the use case I am considering those archive HDDs for:
>>>> something like AWS Glacier, a form of offsite backup probably via
>>>> radosgw. The classic Seagate enterprise class HDD provide “too much”
>>>> performance for this use case, I could live with 1/4 of the performance
>>>> for that price point.
>>>>
>>>
>>> If you go down that route I suggest that you make a mixed cluster for RGW.
>>>
>>> A (small) set of OSDs running on top of proper SSDs, eg Samsung SM863 or
>>> PM863 or a Intel DC series.
>>>
>>> All pools by default should go to those OSDs.
>>>
>>> Only the RGW buckets data pool should go to the big SMR drives. However,
>>> again, expect very, very low performance of those disks.
>> One of the other concerns you should think about is recovery time when one
>> of these drives fails.  The more OSDs you have, the less of an issue this
>> becomes, but on a small cluster it might take over a day to fully recover
>> from an OSD failure.  Which is a decent amount of time to have degraded
>> PGs.
>> Bryan
>> E-MAIL CONFIDENTIALITY NOTICE:
>> The contents of this e-mail message and any attachments are intended solely 
>> for the addressee(s) and may contain confidential and/or legally privileged 
>> information. If you are not the intended recipient of this message or if 
>> this message has been addressed to you in error, please immediately alert 
>> the sender by reply e-mail and then delete this message and any attachments. 
>> If you are not the intended recipient, you are notified that any use, 
>> dissemination, distribution, copying, or storage of this message or any 
>> attachment is strictly prohibited.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-03 Thread Maxime Guyot
Hi,

Interesting feedback!

 > In my opinion the SMR can be used exclusively for the RGW.
 > Unless it's something like a backup/archive cluster or pool with little to 
 > none concurrent R/W access, you're likely to run out of IOPS (again) long 
 > before filling these monsters up.

That’s exactly the use case I am considering those archive HDDs for: something 
like AWS Glacier, a form of offsite backup probably via radosgw. The classic 
Seagate enterprise class HDD provide “too much” performance for this use case, 
I could live with ¼ of the performance for that price point.

Cheers,
Maxime

On 03/02/17 09:40, "ceph-users on behalf of Wido den Hollander" 
<ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com> wrote:


> On 3 February 2017 at 8:39, Christian Balzer <ch...@gol.com> wrote:
> 
> 
> 
> Hello,
> 
> On Fri, 3 Feb 2017 10:30:28 +0300 Irek Fasikhov wrote:
> 
> > Hi, Maxime.
> > 
> > Linux SMR support only starts with kernel version 4.9.
> >
> What Irek said.
> 
> Also, SMR in general is probably a bad match for Ceph.
> Drives like that really want to be treated more like a tape than anything
> else.
>  

Yes, they are damn slow.

> 
> In general, do you really need all this space, what's your use case?
> 
> Unless it's something like a backup/archive cluster or pool with little to
> none concurrent R/W access, you're likely to run out of IOPS (again) long
> before filling these monsters up.
> 

I fully agree. These large disks have very low IOps specs and will probably 
work very, very bad with Ceph.

Wido

> Christian
    > > 
> > Best regards, Фасихов Ирек Нургаязович
> > Mobile: +79229045757
> > 
> > 2017-02-03 10:26 GMT+03:00 Maxime Guyot <maxime.gu...@elits.com>:
> > 
> > > Hi everyone,
> > >
> > >
> > >
> > > I’m wondering if anyone in the ML is running a cluster with archive 
type
> > > HDDs, like the HGST Ultrastar Archive (10TB@7.2k RPM) or the Seagate
> > > Enterprise Archive (8TB@5.9k RPM)?
> > >
> > > As far as I read they both fall in the enterprise class HDDs so 
**might**
> > > be suitable for a low performance, low cost cluster?
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Maxime
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >  
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Experience with 5k RPM/archive HDDs

2017-02-02 Thread Maxime Guyot
Hi everyone,

I’m wondering if anyone in the ML is running a cluster with archive type HDDs, 
like the HGST Ultrastar Archive (10TB@7.2k RPM) or the Seagate Enterprise 
Archive (8TB@5.9k RPM)?
As far as I read they both fall in the enterprise class HDDs so *might* be 
suitable for a low performance, low cost cluster?

Cheers,
Maxime
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minimize data lost with PG incomplete

2017-01-31 Thread Maxime Guyot
Hi José

If you have some of the original OSDs (not zapped or erased) then you might be 
able to just re-add them to your cluster and have a happy cluster.
If you attempt the ceph-objectstore-tool --op export & import, make sure to do it 
on a temporary OSD of weight 0, as recommended in the link provided.
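For reference, with the OSD stopped, an export/import looks something like this
(filestore default paths; the PG id and OSD ids are just examples):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --journal-path /var/lib/ceph/osd/ceph-0/journal \
        --pgid 2.1f --op export --file /tmp/2.1f.export

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
        --journal-path /var/lib/ceph/osd/ceph-42/journal \
        --op import --file /tmp/2.1f.export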

Either way, from what I can see in the pg dump you provided, restoring 
osd.0, osd.3, osd.20, osd.21 and osd.22 should be enough to bring back the 
PGs that are down.

Cheers,
 
On 31/01/17 11:48, "ceph-users on behalf of José M. Martín" 
<ceph-users-boun...@lists.ceph.com on behalf of jmar...@onsager.ugr.es> wrote:

Any idea how I could recover files from the filesystem mount?
Doing a cp, it hangs when it finds a damaged file/folder. I would be happy
just getting the undamaged files.

Thanks

On 31/01/17 at 11:19, José M. Martín wrote:
> Thanks.
> I just realized I kept some of the original OSDs. If they contain some of
> the incomplete PGs, would it be possible to add them to the new disks?
> Maybe by following these steps? 
http://ceph.com/community/incomplete-pgs-oh-my/
>
> On 31/01/17 at 10:44, Maxime Guyot wrote:
>> Hi José,
>>
>> Too late, but you could have updated the CRUSHmap *before* moving the 
disks. Something like: “ceph osd crush set osd.0 0.90329 root=default 
rack=sala2.2  host=loki05” would move the osd.0 to loki05 and would trigger the 
appropriate PG movements before any physical move. Then the physical move is 
done as usual: set noout, stop the osd, physically move it, start the osd, unset noout.
>>
>> It’s a way to trigger the data movement overnight (maybe with a cron) 
and do the physical move at your own convenience in the morning.
>>
>> Cheers, 
>> Maxime 
>>
>> On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" 
<ceph-users-boun...@lists.ceph.com on behalf of jmar...@onsager.ugr.es> wrote:
>>
>> Already min_size = 1
>> 
>> Thanks,
>> Jose M. Martín
>> 
>> El 31/01/17 a las 09:44, Henrik Korkuc escribió:
>> > I am not sure about "incomplete" part out of my head, but you can 
try
>> > setting min_size to 1 for pools toreactivate some PG, if they are
>> > down/inactive due to missing replicas.
>> >
>> > On 17-01-31 10:24, José M. Martín wrote:
>> >> # ceph -s
>> >>  cluster 29a91870-2ed2-40dc-969e-07b22f37928b
>> >>   health HEALTH_ERR
>> >>  clock skew detected on mon.loki04
>> >>  155 pgs are stuck inactive for more than 300 seconds
>> >>  7 pgs backfill_toofull
>> >>  1028 pgs backfill_wait
>> >>  48 pgs backfilling
>> >>  892 pgs degraded
>> >>  20 pgs down
>> >>  153 pgs incomplete
>> >>  2 pgs peering
>> >>  155 pgs stuck inactive
>> >>  1077 pgs stuck unclean
>> >>  892 pgs undersized
>> >>  1471 requests are blocked > 32 sec
>> >>  recovery 3195781/36460868 objects degraded (8.765%)
>> >>  recovery 5079026/36460868 objects misplaced (13.930%)
>> >>  mds0: Behind on trimming (175/30)
>> >>  noscrub,nodeep-scrub flag(s) set
>> >>  Monitor clock skew detected
>> >>   monmap e5: 5 mons at
>> >> 
{loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
>> >>
>> >>  election epoch 4028, quorum 0,1,2,3,4
>> >> loki01,loki02,loki03,loki04,loki05
>> >>fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
>> >>   osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
>> >>  flags noscrub,nodeep-scrub
>> >>pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 
kobjects
>> >>  45892 GB used, 34024 GB / 79916 GB avail
>> >>  3195781/36460868 objects degraded (8.765%)
>> >>  5079026/36460868 objects misplaced (13.930%)
&

Re: [ceph-users] Minimize data lost with PG incomplete

2017-01-31 Thread Maxime Guyot
Hi José,

Too late, but you could have updated the CRUSHmap *before* moving the disks. 
Something like: “ceph osd crush set osd.0 0.90329 root=default rack=sala2.2  
host=loki05” would move the osd.0 to loki05 and would trigger the appropriate 
PG movements before any physical move. Then the physical move is done as usual: 
set noout, stop the osd, physically move it, start the osd, unset noout.
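i.e. something like (osd.0 and the init commands are examples, adjust to your
init system):

    ceph osd set noout
    systemctl stop ceph-osd@0     # or: stop ceph-osd id=0
    # physically move the disk to the new host
    systemctl start ceph-osd@0    # or: start ceph-osd id=0
    ceph osd unset noout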

It’s a way to trigger the data movement overnight (maybe with a cron) and do 
the physical move at your own convenience in the morning.

Cheers, 
Maxime 

On 31/01/17 10:35, "ceph-users on behalf of José M. Martín" 
 wrote:

Already min_size = 1

Thanks,
Jose M. Martín

On 31/01/17 at 09:44, Henrik Korkuc wrote:
> I am not sure about "incomplete" part out of my head, but you can try
> setting min_size to 1 for pools toreactivate some PG, if they are
> down/inactive due to missing replicas.
>
> On 17-01-31 10:24, José M. Martín wrote:
>> # ceph -s
>>  cluster 29a91870-2ed2-40dc-969e-07b22f37928b
>>   health HEALTH_ERR
>>  clock skew detected on mon.loki04
>>  155 pgs are stuck inactive for more than 300 seconds
>>  7 pgs backfill_toofull
>>  1028 pgs backfill_wait
>>  48 pgs backfilling
>>  892 pgs degraded
>>  20 pgs down
>>  153 pgs incomplete
>>  2 pgs peering
>>  155 pgs stuck inactive
>>  1077 pgs stuck unclean
>>  892 pgs undersized
>>  1471 requests are blocked > 32 sec
>>  recovery 3195781/36460868 objects degraded (8.765%)
>>  recovery 5079026/36460868 objects misplaced (13.930%)
>>  mds0: Behind on trimming (175/30)
>>  noscrub,nodeep-scrub flag(s) set
>>  Monitor clock skew detected
>>   monmap e5: 5 mons at
>> 
{loki01=192.168.3.151:6789/0,loki02=192.168.3.152:6789/0,loki03=192.168.3.153:6789/0,loki04=192.168.3.154:6789/0,loki05=192.168.3.155:6789/0}
>>
>>  election epoch 4028, quorum 0,1,2,3,4
>> loki01,loki02,loki03,loki04,loki05
>>fsmap e95494: 1/1/1 up {0=zeus2=up:active}, 1 up:standby
>>   osdmap e275373: 42 osds: 42 up, 42 in; 1077 remapped pgs
>>  flags noscrub,nodeep-scrub
>>pgmap v36642778: 4872 pgs, 4 pools, 24801 GB data, 17087 kobjects
>>  45892 GB used, 34024 GB / 79916 GB avail
>>  3195781/36460868 objects degraded (8.765%)
>>  5079026/36460868 objects misplaced (13.930%)
>>  3640 active+clean
>>   838 active+undersized+degraded+remapped+wait_backfill
>>   184 active+remapped+wait_backfill
>>   134 incomplete
>>48 active+undersized+degraded+remapped+backfilling
>>19 down+incomplete
>> 6
>> active+undersized+degraded+remapped+wait_backfill+backfill_toofull
>> 1 active+remapped+backfill_toofull
>> 1 peering
>> 1 down+peering
>> recovery io 93909 kB/s, 10 keys/s, 67 objects/s
>>
>>
>>
>> # ceph osd tree
>> ID  WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>   -1 77.22777 root default
>>   -9 27.14778 rack sala1
>>   -2  5.41974 host loki01
>>   14  0.90329 osd.14   up  1.0  1.0
>>   15  0.90329 osd.15   up  1.0  1.0
>>   16  0.90329 osd.16   up  1.0  1.0
>>   17  0.90329 osd.17   up  1.0  1.0
>>   18  0.90329 osd.18   up  1.0  1.0
>>   25  0.90329 osd.25   up  1.0  1.0
>>   -4  3.61316 host loki03
>>0  0.90329 osd.0up  1.0  1.0
>>2  0.90329 osd.2up  1.0  1.0
>>   20  0.90329 osd.20   up  1.0  1.0
>>   24  0.90329 osd.24   up  1.0  1.0
>>   -3  9.05714 host loki02
>>1  0.90300 osd.1up  0.90002  1.0
>>   31  2.72198 osd.31   up  1.0  1.0
>>   29  0.90329 osd.29   up  1.0  1.0
>>   30  0.90329 osd.30   up  1.0  1.0
>>   33  0.90329 osd.33   up  1.0  1.0
>>   32  2.72229 osd.32   up  1.0  1.0
>>   -5  9.05774 host loki04
>>3  0.90329  

Re: [ceph-users] All SSD cluster performance

2017-01-16 Thread Maxime Guyot
Hi Kees,

Assuming 3 replicas and collocated journals, each RBD write will trigger 6 SSD 
writes (excluding FS overhead and occasional rebalancing).
Intel has 4 tiers of Data center SATA SSD (other manufacturers may have fewer):
- S31xx: ~0.1 DWPD (counted on 3 years): Very read intensive
- S35xx: ~1 DWPD: Read intensive
- S36xx: ~3 DWPD: Mixed workloads
- S37xx: ~10 DWPD: Write intensive
(DWPD = drive writes per day)

For example, a cluster of 90x 960GB S3520s has a write endurance of 26.25 PB, so 
around 14 TB/day.
IMO the S3610 (maybe soon the S3620 :D) is a good enough middle-of-the-road 
option if you don’t know the write volume of the RBD-backed VMs. Then, after a 
few months in production, you can use the SMART data and re-evaluate.
I cannot stress enough how important it is to monitor the SSD wear level.
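For example, on Intel DC drives the interesting attributes can be pulled with
smartctl (the device name is an example) and graphed/alerted on from there:

    smartctl -A /dev/sdb | egrep 'Media_Wearout_Indicator|Available_Reservd_Space|Total_LBAs_Written'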

Cheers,
Maxime

On 16/01/17 11:36, "ceph-users on behalf of Kees Meijs" 
<ceph-users-boun...@lists.ceph.com on behalf of k...@nefos.nl> wrote:

Hi Maxime,

Given your remark below, what kind of SATA SSD do you recommend for OSD
usage?

Thanks!

Regards,
Kees

On 15-01-17 21:33, Maxime Guyot wrote:
> I don’t have firsthand experience with the S3520, as Christian pointed 
out their endurance doesn’t make them suitable for OSDs in most cases. I can 
only advise you to keep a close eye on the SMART status of the SSDs.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] All SSD cluster performance

2017-01-15 Thread Maxime Guyot
Hi,

I don’t have firsthand experience with the S3520; as Christian pointed out, 
their endurance doesn’t make them suitable for OSDs in most cases. I can only 
advise you to keep a close eye on the SMART status of the SSDs.

Anyway, the S3520 960GB is advertised at 380 MB/s for writes.
Assuming this cluster has collocated journals and a replicated pool of size 
3, that would be a maximum theoretical throughput of roughly 60MB/s per OSD 
(380 MB/s / 2 for the journal / 3 for replication), so about 5.7GB/s 
theoretical maximum for 90 OSDs. IMO, and for reasonably configured hosts, you 
can expect around 50% of the theoretical maximum throughput for 4M I/O.

Maybe you want to share more info on your cluster and benchmark procedure?

Cheers,
Maxime

On 14/01/17 10:09, "ceph-users on behalf of Wido den Hollander" 
 wrote:


> On 14 January 2017 at 6:41, Christian Balzer wrote:
> 
> 
> 
> Hello,
> 
> On Fri, 13 Jan 2017 13:18:35 -0500 Mohammed Naser wrote:
> 
> > These Intel SSDs are more than capable of handling the workload, in 
addition, this cluster is used as an RBD backend for an OpenStack cluster. 
> >
> 
> I haven't tested the S3520s yet, them being the first 3D NAND offering
> from Intel they are slightly slower than the predecessors in terms of BW
> and IOPS, but have supposedly a slightly lower latency if the specs are to
> believed.
> 
> Given the history of Intel DC S SSDs I think it is safe to assume that 
they
> use the same/similar controller setup as their predecessors, meaning a
> large powercap backed cache which enables them to deal correctly and
> quickly with SYNC and DIRECT writes. 
> 
> What would worry me slight more (even at their 960GB size) is the 
endurance
> of 1 DWPD, which with journals inline comes down to 0.5 and with FS
> overhead and write amplification (depends a lot on the type of operations)
> you're looking a something along 0.3 DWPD to base your expectations on.
> Mind, that still leaves you with about 9.6TB per day, which is a decent
> enough number, but only equates to about 112MB/s.
> 
> Finally, most people start with looking at bandwidth/throughput when
> penultimately they discover it was IOPS they needed first and foremost.

Yes! Bandwidth isn't what people usually need, they need IOps. Low latency.

I see a lot of clusters doing 10k ~ 20k IOps with somewhere around 1Gbit/s 
of traffic.

Wido

> 
> Christian
> 
> > Sent from my iPhone
> > 
> > > On Jan 13, 2017, at 1:08 PM, Somnath Roy  
wrote:
> > > 
> > > Also, there are lot of discussion about SSDs not suitable for Ceph 
write workload (with filestore) in community as those are not good for 
odirect/odsync kind of writes. Hope your SSDs are tolerant of that.
> > > 
> > > -Original Message-
> > > From: Somnath Roy
> > > Sent: Friday, January 13, 2017 10:06 AM
> > > To: 'Mohammed Naser'; Wido den Hollander
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: RE: [ceph-users] All SSD cluster performance
> > > 
> > > << Both OSDs are pinned to two cores on the system Is there any 
reason you are pinning osds like that ? I would say for object workload there 
is no need to pin osds.
> > > The configuration you mentioned , Ceph with 4M object PUT it should 
be saturating your network first.
> > > 
> > > Have you run say 4M object GET to see what BW you are getting ?
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
Of Mohammed Naser
> > > Sent: Friday, January 13, 2017 9:51 AM
> > > To: Wido den Hollander
> > > Cc: ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] All SSD cluster performance
> > > 
> > > 
> > >> On Jan 13, 2017, at 12:41 PM, Wido den Hollander  
wrote:
> > >> 
> > >> 
> > >>> Op 13 januari 2017 om 18:39 schreef Mohammed Naser 
:
> > >>> 
> > >>> 
> > >>> 
> >  On Jan 13, 2017, at 12:37 PM, Wido den Hollander  
wrote:
> >  
> >  
> > > Op 13 januari 2017 om 18:18 schreef Mohammed Naser 
:
> > > 
> > > 
> > > Hi everyone,
> > > 
> > > We have a deployment with 90 OSDs at the moment which is all SSD 
that’s not hitting quite the performance that it should be in my opinion, a 
`rados bench` run gives something along these numbers:
> > > 
> > > Maintaining 16 concurrent writes of 4194304 bytes to objects of
> > > size 4194304 for up to 10 seconds or 0 objects Object prefix: 
benchmark_data_bench.vexxhost._30340
> > > sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  
avg 

Re: [ceph-users] All SSD Ceph Journal Placement

2016-12-20 Thread Maxime Guyot
Hi Jeldrik,

You are right. In this situation, you are better off collocating the journals on 
the new SSD OSDs and recycling each journal SSD as an OSD (if its wear level allows 
it) once all of its attached HDD OSDs have been replaced.
As a side note, make sure to monitor the write endurance/wear level of the SSDs.

Cheers,
Maxime Guyot <mailto:maxime.gu...@elits.se>

On 20/12/16 15:59, "ceph-users on behalf of Jeldrik" 
<ceph-users-boun...@lists.ceph.com on behalf of jeld...@kopfsalat.org> wrote:

Hi all,

i know this topic has been discussed a few times from different
perspectives here, but I could not really get to the answer I need.

We're running a ceph cluster with the following setup:

3 Nodes with 6 OSDs (HDD) and 2 Journal Disks (SSD) each. This is a more
or less small setup for a private cluster environment. We now want to
replace the HDDs with SSDs because the customer needs more performance.
We use INTEL DC SSDs as journal devices and we want to use the same
model as OSDs. Because of hardware limitations we are not able to
upgrade the journal devices to let's say PCIe NVMe.

We could easily just go and replace the HDDs one by one. But the question
is: wouldn't the journal be the new bottleneck? The OSDs are the same
SSD model, so they would have the same read/write performance as the
journal, and every OSD could only get to about 1/3 of its performance
capabilities, am I right? Wouldn't it be better to place the journal of
each OSD on the very same SSD and use the old journals as additional
OSDs? We would get 6 more OSDs and they would only drop to 1/2 of their
performance capabilities. At least this is what I think :-)

So, am I right here that it would be better to place journal and OSD on
the same SSD in this setup?

Thanks and regards,

Jeldrik

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Production System Evaluation / Problems

2016-11-28 Thread Maxime Guyot
Hi,


1.   It is possible to do that with the primary affinity setting (see the example 
below). The documentation gives an example with SSDs as primary OSDs and HDDs as 
secondary. I think it would work for an Active/Passive DC scenario but might be 
tricky for Active/Active. If you run Ceph across 2 DCs you might have problems 
with quorum; a third location with 1 MON can help break ties.

2.   Zap & re-create?

3.   It is common to use 2 VLANs on a LACP bond instead of 1 NIC on each 
VLAN. You just need to size the pipes accordingly to avoid bottlenecks.
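For point 1, as an example: to stop the OSDs of the passive DC from ever acting
as primaries you would set their primary affinity to 0 (osd.3 is just an example):

    ceph osd primary-affinity osd.3 0

On older releases you may need “mon osd allow primary affinity = true” for the
monitors to accept it.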

Cheers,

Maxime Guyot<mailto:maxime.gu...@elits.se>

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Stefan 
Lissmats <ste...@trimmat.se>
Date: Monday 28 November 2016 11:12
To: "Strankowski, Florian" <fstrankow...@stadtwerke-norderstedt.de>, 
"'ceph-users@lists.ceph.com'" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Production System Evaluation / Problems

Hey!

I have been using ceph for a while but am not a real expert; still, I will give you some 
pointers so that everyone is better able to help you further.

1. The crush map is divided into two parts: the topology description 
(which you provided us with) and the crush rules that define how the data 
is placed in the topology. Have you made any changes to the rules? If you have 
made any changes, it would be great if you showed how the rules are defined. 
However, I think you can get the data placed the way you want with some more 
advanced crush rules, but I don't think there is any possibility to have a read 
only copy. Guess you have seen this? 
http://docs.ceph.com/docs/jewel/rados/operations/crush-map/


2.  Have you looked into the osd logs on the server that osd.0 resides on? That could 
give some information on why osd.0 never comes up. It should normally be in 
/var/log/ceph/ceph-osd.0.log

Other notes:
You have 6 mons, but you normally want an odd number and do not normally need 
more than 5 (often even 3 is enough).


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Strankowski, Florian 
[fstrankow...@stadtwerke-norderstedt.de]
Sent: 28 November 2016 10:29
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] Production System Evaluation / Problems
Hey guys,

we’re evaluating ceph at the moment for a bigger production-ready 
implementation. So far we’ve had some success and
some problems with ceph. In combination with Proxmox, CEPH works quite well 
out of the box. I’ve tried to cover my questions
with existing answers and solutions, but I still find some things unclear. Here 
are the things I’m having problems with:


1.   The first question is just for my understanding: how does CEPH account for 
failure domains? From what I’ve read so far,
I create a new CRUSH map with, for example, 2 datacenters; each DC has a 
rack and in this rack there is a chassis with nodes.
By using my own CRUSH map, CEPH will „see“ it and deal with the data 
automatically. What I am missing here is some more possible adjustment.
For example, with a replica of 3 I want to define that CEPH stores 
the data 2 times in datacenter A and one time in datacenter B. Furthermore,
I want read access exclusively within 1 datacenter (if possible and the data is 
available) to keep RTT low. Is this possible?

2.   I’ve built my own CRUSH map and tried to get it working. No success at 
all. I’m literally „done with this s…“ ☺ that’s why I’m here right now. Here is 
the state
of the cluster:
of the cluster:



cluster 42f04e55-0a3f-4644-8543-516cd46cd4e9

 health HEALTH_WARN

79 pgs degraded

262 pgs stale

79 pgs stuck degraded

262 pgs stuck stale

512 pgs stuck unclean

79 pgs stuck undersized

79 pgs undersized

 monmap e8: 6 mons at 
{0=192.168.40.20:6789/0,1=192.168.40.21:6789/0,2=192.168.40.22:6789/0,3=192.168.40.23:6789/0,4=192.168.40.24:6789/0,5=192.168.40.25:6789/0}

election epoch 86, quorum 0,1,2,3,4,5 0,1,2,3,4,5

 mdsmap e2: 0/0/1 up

 osdmap e212: 6 osds: 5 up, 5 in; 250 remapped pgs

  pgmap v366013: 512 pgs, 2 pools, 0 bytes data, 0 objects

278 MB used, 900 GB / 901 GB avail

 250 active+remapped

 183 stale+active+remapped

  79 stale+active+undersized+degraded+remapped



Here the config:





ID  WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY

-27 1.07997 root default

-25 0.53998 datacenter datacenter1

-23 0.53998 chassis chassis1

-1 0.17999 blade blade3

  0 0.17999 osd.0 down0  1.0

-2 0.17999 blade blade4

  1 0.17999 osd.1   up  1.0  1.0

-3 0.17999 blade blade5

  2 0.17999 osd.2   up  1.0  1.0

-26 0.53999 datacenter datacent

Re: [ceph-users] general ceph cluster design

2016-11-25 Thread Maxime Guyot
Hi Nick,

See inline comments.

Cheers,
Maxime 

On 25/11/16 16:01, "ceph-users on behalf of nick" 
 wrote:

>Hi,
>we are currently planning a new ceph cluster which will be used for 
>virtualization (providing RBD storage for KVM machines) and we have some 
>general questions.
>
>* Is it advisable to have one ceph cluster spread over multiple 
> datacenters 
>(latency is low, as they are not so far from each other)? Is anybody doing 
>this in a production setup? We know that any network issue would affect 
> virtual 
>machines in all locations instead just one, but we can see a lot of 
> advantages 
>as well.

I think the general consensus is to limit the size of the failure domain. That 
said, it depends on the use case and on what you mean by “multiple datacenters” and 
“latency is low”: writes will have to be journal-ACKed by the OSDs in the 
other datacenter. If there is 10ms latency between Location1 and Location2, 
then it would add 10ms to each write operation if the crushmap requires replicas in 
each location. Speaking of which, a 3rd location would help with sorting out 
quorum (1 mon at each location) in a “triangle” configuration.

If this is for DR: RBD-mirroring is supposed to address that, you might not 
want to have 1 big cluster ( = failure domain).
If this is for VM live migration: it usually requires stretched L2 adjacency (failure 
domain) or overlays (VXLAN and the like), and the “network trombone” effect can be a 
problem depending on the setup.

I know of Nantes University who used/is using a 3 datacenter Ceph cluster:  
http://dachary.org/?p=2087 

>
>* We are planning to combine the hosts for ceph and KVM (so far we are 
> using 
>seperate hosts for virtual machines and ceph storage). We see the big 
>advantage (next to the price drop) of an automatic ceph expansion when 
> adding 
>more compute nodes as we got into situations in the past where we had too 
> many 
>compute nodes and the ceph cluster was not expanded properly (performance 
>dropped over time). On the other side there would be changes to the crush 
> map 
>every time we add a compute node and that might end in a lot of data 
> movement 
>in ceph. Is anybody using combined servers for compute and ceph storage 
> and 
>has some experience?

The challenge is to avoid ceph-osd becoming a noisy neighbor for the VMs 
hosted on the hypervisor, especially during recovery. I’ve heard of people using 
CPU pinning, containers, and QoS to keep it under control.
Sébastien has an article on his blog on this topic: 
https://www.sebastien-han.fr/blog/2016/07/11/Quick-dive-into-hyperconverged-architecture-with-OpenStack-and-Ceph/
 

Regarding the performance that dropped over time, you can look at improving your 
capacity:performance ratio.

>* is there a maximum amount of OSDs in a ceph cluster? We are planning to 
> use 
>a minimum of 8 OSDs per server and going to have a cluster with about 100 
>servers which would end in about 800 OSDs.

There are a couple of thread from the ML about this: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028371.html  and 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-November/014246.html 

>
>Thanks for any help...
>
>Cheers
>Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool slooooooowness

2016-08-16 Thread Maxime Guyot
Hi Simon,

If everything is in the same Ceph cluster and you want to move the whole 
“.rgw.buckets” pool (I assume your RBD traffic is targeted at a “data” or “rbd” 
pool) to your cold storage OSDs, maybe you could just edit the CRUSH map; then it’s 
simply a matter of rebalancing.
You can check the ssd/platter example in the doc: 
http://docs.ceph.com/docs/master/rados/operations/crush-map/ or this article 
detailing different maps: 
http://cephnotes.ksperis.com/blog/2015/02/02/crushmap-example-of-a-hierarchical-cluster-map
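For example (the rule and root names are just placeholders): once the
cold-storage OSDs sit under their own CRUSH root, something like

    ceph osd crush rule create-simple cold_ruleset cold host
    ceph osd pool set .rgw.buckets crush_ruleset <rule id>

would retarget the pool and let Ceph move the data in the background, instead
of copying 8 million objects by hand.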

Cheers,
Maxime
From: ceph-users  on behalf of Simon Murray 

Date: Tuesday 16 August 2016 12:25
To: "ceph-users@lists.ceph.com" 
Subject: [ceph-users] rados cppool slooowness

Morning guys,
I've got about 8 million objects sat in .rgw.buckets that want moving out of 
the way of OpenStack RBD traffic onto their own (admittedly small) cold storage 
pool on separate OSDs.
I attempted to do this over the weekend during a 12h scheduled downtime, 
however my estimates had this pool completing in a rather un-customer friendly 
(think no backups...) 7 days.
Anyone had any experience in doing this quicker?  Any obvious reasons why I 
can't hack do_copy_pool() to spawn a bunch of threads and bang this off in a 
few hours?
Cheers
Si

DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High-performance way for access Windows of users to Ceph.

2016-08-12 Thread Maxime Guyot
Hi,


> “Clients run program written by them, which generates files of various sizes 
> - from 1 KB to 200 GB”
If the clients are running custom software on Windows and if at all possible, I 
would consider using 
librados. The 
library is available for C/C++, Java, PHP and Python. The object API is fairly 
simple and would lift the CephFS requirement.
Using Rados your client will be able to talk directly to the cluster (OSDs).
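As a sketch of what that looks like with the Python binding (python-rados); the
pool name, client name, keyring path and object/file names below are all
placeholders:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          name='client.myapp',
                          conf={'keyring': '/etc/ceph/ceph.client.myapp.keyring'})
    cluster.connect()

    ioctx = cluster.open_ioctx('mypool')
    with open('report.bin', 'rb') as f:
        ioctx.write_full('report-2016-08-12', f.read())   # fine for small objects
    data = ioctx.read('report-2016-08-12', length=16 * 1024 * 1024)  # read back up to 16 MB
    ioctx.close()
    cluster.shutdown()

For the multi-GB files you would not write a single object in one call like
this, but chunk the writes over several objects (ioctx.write() with offsets) or
use libradosstriper; single objects in the hundreds of GB range are not a good idea.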

There are some other options to access Ceph from Windows, but they require a gateway (CephFS 
to NFS/Samba or RBD to NFS/Samba), which usually ends up being a bottleneck and 
a SPOF.

Regarding the performance, you mentioned 160GB/min, so that is 2.7 GB/s. That 
shouldn’t be too difficult to reach with journals on SSDs.
In a previous thread you mentioned 468 OSDs. Doing a quick napkin calculation 
with a journal:OSD ratio of 1:6 (usually 1:4 to 1:6), that should be 78 
journals. If you estimate 400MB/s journal write speed (like the Intel S3710 series) 
and a replica factor of 3, you have a maximum theoretical write speed of 
~10GB/s. Say you get ~50% of the theoretical write speed (I usually reach 
50~60%); you are still above your target of 2.7 
GB/s.

Regards
Maxime G.


From: ceph-users  on behalf of Nick Fisk 

Reply-To: "n...@fisk.me.uk" 
Date: Friday 12 August 2016 09:33
To: 'Александр Пивушков' , 'ceph-users' 

Subject: Re: [ceph-users] High-performance way for access Windows of users to 
Ceph.

I’m not sure how stable that ceph dokan is, I would imagine the best way to 
present ceph-fs to windows users would be through samba.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Александр Пивушков
Sent: 12 August 2016 07:53
To: ceph-users 
Subject: [ceph-users] High-performance way for access Windows of users to Ceph.

Hello,

I continue to design high-performance cluster Ceph, petascale.

We are scheduled to purchase a high-performance server, running Windows 2016, for the clients. 
Clients run in Docker containers.
https://docs.docker.com/engine/installation/windows/
Virtualization. It does not matter...

Clients run a program written by them, which generates files of various sizes - 
from 1 KB to 200 GB (yes, a scary single-file size). We plan to use 
Infiniband 40 GB/s networking between the clients and Ceph. Clients work with Ceph 
one at a time, and always only either writing or reading.

So far I do not understand which Ceph technology is appropriate to use: object, 
block, or file storage (CephFS)?
For now it seems to me that I need to use MDS, CephFS and ceph-dokan
https://github.com/ketor/ceph-dokan

Please share your experience of how it is possible to give Windows Ceph users 
access to the server with minimal overhead (preferably zero :(  )?
I.e. how to make sure that the files generated by the program on Windows very 
quickly end up in Ceph.

--
Александр Пивушков


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Maxime Guyot
Hi,

I haven’t had problems with Power_Loss_Cap_Test so far. 

Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the “Available 
Reserved Space” (SMART ID: 232/E8h); the data sheet 
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
reads:
"This attribute reports the number of reserve blocks remaining. The normalized 
value begins at 100 (64h), which corresponds to 100 percent availability of the 
reserved space. The threshold value for this attribute is 10 percent availability."

According to the SMART data you copied, there should be about 84% of the 
over-provisioning left? Since the drive is pretty young, it might be some form of 
defect?
I have a number of S3610s with ~150 DW; all SMART counters are at their initial 
values (except for the temperature).

Cheers,
Maxime








On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick" 
 wrote:

>Hi Christian,
>
>Intel drives are good, but apparently not infallible. I'm watching a DC
>S3610 480GB die from reallocated sectors.
>
>ID# ATTRIBUTE_NAME  FLAGSVALUE WORST THRESH FAIL RAW_VALUE
>  5 Reallocated_Sector_Ct   -O--CK   081   081   000-756
>  9 Power_On_Hours  -O--CK   100   100   000-1065
> 12 Power_Cycle_Count   -O--CK   100   100   000-7
>175 Program_Fail_Count_Chip PO--CK   100   100   010-17454078318
>183 Runtime_Bad_Block   -O--CK   100   100   000-0
>184 End-to-End_ErrorPO--CK   100   100   090-0
>187 Reported_Uncorrect  -O--CK   100   100   000-0
>190 Airflow_Temperature_Cel -O---K   070   065   000-30 (Min/Max
>25/35)
>192 Power-Off_Retract_Count -O--CK   100   100   000-6
>194 Temperature_Celsius -O---K   100   100   000-30
>197 Current_Pending_Sector  -O--C-   100   100   000-1288
>199 UDMA_CRC_Error_Count-OSRCK   100   100   000-0
>228 Power-off_Retract_Count -O--CK   100   100   000-63889
>232 Available_Reservd_Space PO--CK   084   084   010-0
>233 Media_Wearout_Indicator -O--CK   100   100   000-0
>241 Total_LBAs_Written  -O--CK   100   100   000-20131
>242 Total_LBAs_Read -O--CK   100   100   000-92945
>
>The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>sure how many reserved sectors the drive has, i.e., how soon before it
>starts throwing write IO errors.
>
>It's a very young drive, with only 1065 hours on the clock, and has not
>even done two full drive-writes:
>
>Device Statistics (GP Log 0x04)
>Page Offset Size Value  Description
>  1  =  ==  == General Statistics (rev 2) ==
>  1  0x008  4  7           Lifetime Power-On Resets
>  1  0x018  6  1319318736  Logical Sectors Written
>  1  0x020  6  137121729   Number of Write Commands
>  1  0x028  6  6091245600  Logical Sectors Read
>  1  0x030  6  115252407   Number of Read Commands
>
>Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>RAID5 array :-|
>
>Cheers,
>Daniel
>
>On 03/08/16 07:45, Christian Balzer wrote:
>> 
>> Hello,
>> 
>> not a Ceph specific issue, but this is probably the largest sample size of
>> SSD users I'm familiar with. ^o^
>> 
>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>> religious experience.
>> 
>> It turns out that the SMART check plugin I run to mostly get an early
>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>> 200GB DC S3700 used for journals.
>> 
>> While SMART is of the opinion that this drive is failing and will explode
>> spectacularly any moment, that particular failure is of little worry to
>> me, never mind that I'll eventually replace this unit.
>> 
>> What brings me here is that this is the first time in over 3 years that an
>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>> this particular failure has been seen by others.
>> 
>> That of course entails people actually monitoring for these things. ^o^
>> 
>> Thanks,
>> 
>> Christian
>> 
>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Maxime Guyot
Hi Adrian,

Looking at the documentation, RadosGW has multi-region support via the 
“federated gateways” feature 
(http://docs.ceph.com/docs/master/radosgw/federated-config/):
"When you deploy a Ceph Object Store service that spans geographical locales, 
configuring Ceph Object Gateway regions and metadata synchronization agents 
enables the service to maintain a global namespace, even though Ceph Object 
Gateway instances run in different geographic locales and potentially on 
different Ceph Storage Clusters.”

Maybe that could do the trick for your multi-metro EC pools?

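Very roughly, going by that page, the setup boils down to defining regions and 
zones and then running the sync agent, something like this (untested sketch; 
file and zone names are placeholders - check the linked page for the exact 
syntax and the extra --name options per gateway instance):

  radosgw-admin region set --infile region.json
  radosgw-admin zone set --rgw-zone=us-east --infile zone-us-east.json
  radosgw-admin regionmap update
  # then run radosgw-agent between the master and secondary zones for
  # metadata (and optionally data) synchronization
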
Disclaimer: I haven't tested RadosGW federated gateways myself.

Best Regards 

Maxime Guyot
System Engineer


On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul" 
<ceph-users-boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au> 
wrote:

>Hello again Christian :)
>
>
>> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
>> > will be distributed over every major capital in Australia.The config
>> > will be dual sites in each city that will be coupled as HA pairs - 12
>> > sites in total.   The vast majority of CRUSH rules will place data
>> > either locally to the individual site, or replicated to the other HA
>> > site in that city.   However there are future use cases where I think we
>> > could use EC to distribute data wider or have some replication that puts
>> > small data sets across multiple cities.
>> This will very, very, VERY much depend on the data (use case) in question.
>
>The EC use case would be using RGW and to act as an archival backup store
>
>> > The concern I have is around the placement of mons.  In the current
>> > design there would be two monitors in each site, running separate to the
>> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
>> > will also be a "tiebreaker" mon placed on a separate host which will
>> > house some management infrastructure for the whole platform.
>> >
>> Yes, that's the preferable way, might want to up this to 5 mons so you can
>> lose one while doing maintenance on another one.
>> But if that would be a coupled, national cluster you're looking both at
>> significant MON traffic, interesting "split-brain" scenarios and latencies as
>> well (MONs get chosen randomly by clients AFAIK).
>
>In the case I am setting up it would be 2 per site plus the extra, so 25 - but
>I fear that would make the mon syncing become too heavy.  Once we build
>up to multiple sites though we can maybe reduce to one per site to reduce the
>workload of keeping the mons in sync.
>
>> > Obviously a concern is latency - the east coast to west coast latency
>> > is around 50ms, and on the east coast it is 12ms between Sydney and
>> > the other two sites, and 24ms Melbourne to Brisbane.
>> In any situation other than "write speed doesn't matter at all" combined with
>> "large writes, not small ones" and "read-mostly" you're going to be in severe
>> pain.
>
>For data yes, but the main case for that would be backup data, where it would
>be large writes, read rarely, and as long as streaming performance keeps up
>latency won't matter.   My concern with the latency would be how it impacts
>the monitors having to keep in sync and how that would affect client
>operations, especially with the rate of change that would occur with the
>predominant RBD use in most sites.
>
>> > Most of the data
>> > traffic will remain local but if we create a single national cluster
>> > then how much of an impact will it be having all the mons needing to
>> > keep in sync, as well as monitor and communicate with all OSDs (in the
>> > end goal design there will be some 2300+ OSDs).
>> >
>> Significant.
>> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
>> run/setup and sharing the experience with us. ^.^
>
>Someone has to be the canary right :)
>
>> > The other options I  am considering:
>> > - split into east and west coast clusters; most of the cross-city need
>> > is on the east coast, and any data moves between clusters can be done
>> > with snap replication
>> > - city-based clusters (tightest latency), but lose the multi-DC EC
>> > option; do cross-city replication using snapshots
>> >
>> The latter; I seem to remember that there was work in progress to do this
>> (snapshot replication) in an automated fashion.
>>
>> > Just want to get a feel for what I need to consider when we start
>> > building at this scale.
>> >

Re: [ceph-users] 800TB - Ceph Physical Architecture Proposal

2016-04-08 Thread Maxime Guyot
Hello,

On 08/04/16 04:47, "ceph-users on behalf of Christian Balzer" 
 wrote:





>
>> 11 OSD nodes:
>> -SuperMicro 6047R-E1R36L
>> --2x E5-2603v2
>Vastly underpowered for 36 OSDs.
>> --128GB RAM
>> --36x 6TB OSD
>> --2x Intel P3700 (journals)
>Which exact model?
>If it's the 400GB one, that's 2GB/s maximum write speed combined.
>Slightly below what I'd expect your 36 HDDs to be able to write at about
>2.5GB/s (36*70MB/s), but not unreasonably so.
>However your initial network thoughts are massively overspec'ed for this
>kind of performance.

What I have seen for OSD server sizing is:
- 1 GB of RAM per TB of OSD capacity (here 36x 6TB), for replicated pools
- 0.5 core or 1 GHz per OSD disk, for replicated pools
- 1 or 2 cores per SSD OSD

Source:
- Minimum hardware recommendations: 
http://docs.ceph.com/docs/hammer/start/hardware-recommendations/#minimum-hardware-recommendations
- Video (timestamp 12:00): https://www.youtube.com/watch?v=XBfYY-VhzpY
- Slides (slide 20): http://www.slideshare.net/mirantis/ceph-talk-vancouver-20

So you might want to increase the RAM to around 192-256GB and the CPU to 
something like dual 10-core 2 GHz (or more), e.g. E5-2660 v2.
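As a back-of-the-envelope check for the 36x 6TB node above (a rough sketch 
based only on the rules of thumb listed, assuming replicated pools):

  OSDS=36; TB_PER_OSD=6
  # RAM: 1 GB per TB of OSD capacity
  echo "RAM: $((OSDS * TB_PER_OSD)) GB"          # -> 216 GB
  # CPU: 0.5 core or 1 GHz per OSD disk
  echo "CPU: $((OSDS / 2)) cores / ${OSDS} GHz"  # -> 18 cores / 36 GHz

which is roughly in line with the 192-256GB RAM / dual 10-core suggestion.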



>
>> 
>> 3 MDS nodes:
>> -SuperMicro 1028TP-DTR (one node from scale-out chassis)
>> --2x E5-2630v4
>> --128GB RAM
>> --2x 120GB SSD (RAID 1 for OS)
>Not using CephFS, but if the MDS are like all the other Ceph bits (MONs in
>particular) they are likely to do happy writes to leveldbs or the likes, do
>verify that.
>If that's the case, fast and durable SSDs will be needed.
>
>> 
>> 5 MON nodes:
>> -SuperMicro 1028TP-DTR (one node from scale-out chassis)
>> --2x E5-2630v4
>> --128GB RAM
>> --2x 120GB SSD (RAID 1 for OS)
>> 
>Total overkill, are you sure you didn't mix up the CPUs for the OSDs with
>the ones for the MONs?
>Also, while dedicated MONs are nice, they really can live rather frugally,
>except for the lust for fast, durable storage.
>If I were you, I'd get 2 dedicated MON nodes (with few, fastish cores) and
>32-64GB RAM, then put the other 3 on your MDS nodes which seem to have
>plenty resources to go around.
>You will want the dedicated MONs to have the lowest IPs in your network,
>the monitor leader is chosen by that.
>
>Christian
>> We'd use our existing Zabbix deployment for monitoring and ELK for log
>> aggregation.
>> 
>> Provisioning would be through puppet-razor (PXE) and puppet.
>> 
>> Again, thank you for any information you can provide
>> 
>> --Brady
>
>
>-- 
>Christian BalzerNetwork/Systems Engineer
>ch...@gol.com  Global OnLine Japan/Rakuten Communications
>http://www.gol.com/
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Regards,
Maxime G
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS and nobarriers on Intel SSD

2016-03-03 Thread Maxime Guyot
Hello,

It looks like this thread is one of the main Google hits on this issue, so let 
me bring an update. I experienced the same symptoms with Intel S3610 SSDs and 
an LSI 2208 controller.

The logs reported “task abort!” messages on a daily basis since November:
Write(10): 2a 00 0e 92 88 90 00 00 10 00
scsi target6:0:1: handle(0x000a), sas_address(0x443322110100), phy(1)
scsi target6:0:1: enclosure_logical_id(0x500304801c84e000), slot(2)
sd 6:0:1:0: task abort: SUCCESS scmd(8805b30fa200)
sd 6:0:1:0: attempting task abort! scmd(8807ef9e9800)
sd 6:0:1:0: [sdf] CDB:

OSDs would go down from time to time with:
XFS (sdf3): xfs_log_force: error 5 returned.
lost page write due to I/O error on sdf3


I was able to reproduce the “task abort!” messages with "rados -p data bench 30 
write -b 1048576". The OSDs going down and the XFS errors, on the other hand, 
were harder to reproduce systematically.
To solve the problem I followed Christian’s recommendation to update the S3610 
SSDs’ firmware from G2010110 to G2010140 using the isdct utility 
(https://downloadcenter.intel.com/download/23931/Intel-Solid-State-Drive-Data-Center-Tool).
It was easy to convert the RPM package released by Intel into a .deb package 
using “alien”. Then it was just a matter of “isdct show -intelssd” and “isdct 
load -intelssd 0”.
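For reference, the sequence was roughly the following (a sketch from memory, 
not an exact transcript; package file names are examples and the SSD index 
may differ on your system):

  alien --to-deb isdct-*.rpm      # convert Intel's RPM release to a .deb
  dpkg -i isdct_*.deb             # install the Data Center Tool
  isdct show -intelssd            # list Intel SSDs and their current firmware
  isdct load -intelssd 0          # load the new firmware on SSD index 0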

The cluster has been running with the latest firmware for a week now and I 
can’t reproduce the problem, so it looks like the issue is solved.

Thank you Christian for the info!

Regards

Maxime Guyot <maxime.gu...@elits.se>
System Engineer



> Hello,
>
> On Tue, 8 Sep 2015 13:40:36 +1200 Richard Bade wrote:
>
> > Hi Christian,
> > Thanks for the info. I'm just wondering, have you updated your S3610's
> > with the new firmware that was released on 21/08 as referred to in the
> > thread?
> I did so earlier today, see below.
>
> >We thought we weren't seeing the issue on the intel controller
> > also to start with, but after further investigation it turned out we
> > were, but it was reported as a different log item such as this:
> > ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
> > ata5.00: failed command: READ FPDMA QUEUED
> > ata5.00: cmd 60/10:a0:18:ca:ca/00:00:32:00:00/40 tag 20 ncq 8192 in
> >   res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata5.00: status: { DRDY }
> > ata5.00: failed command: READ FPDMA QUEUED
> > ata5.00: cmd 60/40:a8:48:ca:ca/00:00:32:00:00/40 tag 21 ncq 32768 in
> >  res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > ata5.00: status: { DRDY }
> > ata5: hard resetting link
> > ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> > ata5.00: configured for UDMA/133
> > ata5.00: device reported invalid CHS sector 0
> > ata5.00: device reported invalid CHS sector 0
> > ata5: EH complete
> > ata5.00: Enabling discard_zeroes_data
> >
> Didn't see any of these, but admittedly I tested this with fewer SSDs on
> the onboard controller and with fio/bonnie++, which do not trigger that
> behavior as easily.
>
> > I believe this to be the same thing as the LSI3008 which gives these log
> > messages:
> > sd 0:0:6:0: attempting task abort! scmd(8804cac00600)
> > sd 0:0:6:0: [sdg] CDB:
> > Read(10): 28 00 1c e7 76 a0 00 01 30 00
> > scsi target0:0:6: handle(0x000f), sas_address(0x443322110600), phy(6)
> > scsi target0:0:6: enclosure_logical_id(0x50030480), slot(6)
> > sd 0:0:6:0: task abort: SUCCESS scmd(8804cac00600)
> > sd 0:0:6:0: attempting task abort! scmd(8804cac03780)
> >
> Yup, I know that message all too well.
>
> > I appreciate your info with regards to nobarriers. I assume by "alleviate
> > it, but didn't fix" you mean the number of occurrences is reduced?
> >
> Indeed. But first a word about the setup where I'm seeing this.
> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
> with the Samsung DC SSDs, one with the Intel S3610.
> 2 of these chassis to be precise:
> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm
>
> Of course latest firmware and I tried this with any kernel from Debian
> 3.16 to stock 4.1.6.
>
> With nobarrier I managed to trigger the error only once yesterday on the
> DRBD replication target, not the machine that actually has the FS mounted.
> Usually I'd be able to trigger quite a bit more often during those tests.
>
> So this morning I updated the firmware of all S3610s on one node and
> removed the nobarrier flag. It took a lot of punishment, but eventually
> this happened:
> ---
> Sep  8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: at