[ceph-users] Receiving "failed to parse date for auth header"

2015-09-04 Thread Ramon Marco Navarro
Good day everyone!

I'm having a problem using aws-java-sdk to connect to Ceph via radosgw. I
am seeing a "NOTICE: failed to parse date for auth header" message in the
logs. HTTP_DATE is "Fri, 04 Sep 2015 09:25:33 +00:00", which I think is a
valid RFC 1123 date...
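
Side by side, the strict RFC 1123 / HTTP-date forms for comparison (a sketch using GNU date; note the numeric-zone variant prints a four-digit offset with no colon, unlike the "+00:00" above - whether that colon is what trips radosgw is a guess, not confirmed):

date -u +"%a, %d %b %Y %H:%M:%S GMT"   # "Fri, 04 Sep 2015 09:25:33 GMT"
date -u +"%a, %d %b %Y %H:%M:%S %z"    # numeric form prints "+0000", not "+00:00"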

Here's a link to the related lines in the log file:
https://gist.github.com/ramonmaruko/96e841167eda907f768b

Thank you for any help in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Vasiliy Angapov
Hi all,

Not sure where this bug actually belongs - OpenStack or Ceph -
but I'm writing here in the humble hope that someone else has faced this issue too.

I configured a test OpenStack instance with Glance images stored in Ceph
0.94.3. Nova has local storage.
But when I try to launch an instance from a large image stored in Ceph,
it fails to spawn with this error in nova-conductor.log:

2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
[req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
-] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
last):\n', u'  File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
in _do_build_and_run_instance\nfilter_properties)\n', u'  File
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
in _build_and_run_instance\ninstance_uuid=instance.uuid,
reason=six.text_type(e))\n', u'RescheduledException: Build of instance
18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
expected 4a7de2fbbd01be5c6a9e114df145b027\n']

So nova tries 3 different hosts, gets the same error message on every
single one, and then fails to spawn the instance.
I've tried the little Cirros image and it works fine. The issue
happens with large images, around 10GB in size.
I also looked into the /var/lib/nova/instances/_base folder and
found that the image actually starts downloading, but at some point
the download is interrupted for some unknown reason and the instance
gets deleted.

I looked at the syslog and found many messages like this:
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)
Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
(cutoff 2015-09-04 12:51:32.735011)

I've also tried monitoring the number of nova-compute file descriptors,
but it is never more than 102 ("echo /proc/NOVA_COMPUTE_PID/fd/* | wc -w",
as Jan advised on this ML).
It also seems the problem appeared only in 0.94.3; in 0.94.2
everything worked just fine!

Would be very grateful for any help!

Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS and caching

2015-09-04 Thread Les
CephFS can use fscache. I am testing it at the moment.

Some lines from my deployment process:

sudo apt-get install linux-generic-lts-utopic cachefilesd
sudo reboot
sudo mkdir /mnt/cephfs
sudo mkdir /mnt/ceph_cache
sudo mkfs -t xfs /dev/md3 # A 100GB local RAID partition
sudo bash -c "echo /dev/md/3 /mnt/ceph_cache xfs defaults,noatime 0 0 >> /etc/fstab"
sudo bash -c "echo M1:6789,M2:6789:/ /mnt/cephfs ceph name=fsuser,secretfile=/etc/ceph/fsuser.secret,noatime,fsc,_netdev 0 0 >> /etc/fstab"
sudo bash -c "echo REDACTED > /etc/ceph/fsuser.secret"
sudo chmod 400 /etc/ceph/fsuser.secret
sudo mount /mnt/ceph_cache/
sudo sed -i 's/#RUN=yes/RUN=yes/g' /etc/default/cachefilesd
sudo vim /etc/cachefilesd.conf # Change dir to /mnt/ceph_cache, tag to ceph_cache, and comment everything else
sudo service cachefilesd start
sudo mount /mnt/cephfs
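
For reference, the cachefilesd.conf edit above boils down to something like this (a sketch; only the two active directives are shown, with the culling thresholds left commented out at their upstream defaults):

# /etc/cachefilesd.conf
dir /mnt/ceph_cache
tag ceph_cache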

Cheers, Les

On 04.09.2015 00:58, Kyle Hutson wrote:
I was wondering if anybody could give me some insight as to how CephFS does its 
caching - read-caching in particular.

We are using CephFS with an EC pool on the backend with a replicated cache pool 
in front of it. We're seeing some very slow read times. Trying to compute an 
md5sum on a 15GB file twice in a row (so it should be in cache) takes the time 
from 23 minutes down to 17 minutes, but this is over a 10Gbps network and with 
a crap-ton of OSDs (over 300), so I would expect it to be down in the 2-3 
minute range.

I'm just trying to figure out what we can do to increase the performance. I 
have over 300 TB of live data that I have to be careful with, though, so I have 
to have some level of caution.

Is there some other caching we can do (client-side or server-side) that might 
give us a decent performance boost?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Client parallized access?

2015-09-04 Thread Alexander Walker

Hi,
I've configured a CephFS and mounted it in fstab:

ceph1:6789,ceph2:6789,ceph3:6789:/ /cephfs ceph name=admin,secret=AQDVOOhVxEI7IBAAM+4el6WYbCwKvFxmW7ygcA==,noatime 0 2


Does this mean:

1. The Ceph client can write data to all three servers at the same time?
2. The client will access the second server if the first server is not reachable?


best regards
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Nick Fisk
Actually, just thinking about this some more: shouldn't the PGs-per-OSD "golden 
rule" also depend on the size of the OSD? If this directory splitting is a big 
deal, then an 8TB OSD is going to need a lot more PGs than, say, a 1TB OSD.
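
For context, the usual back-of-envelope guideline is purely count-based, which is what makes the question interesting. A sketch of the commonly cited formula (a guideline, not an official rule):

# total PGs per pool ~= (num_osds * 100) / replica_count, rounded to a power of two
echo $(( (56 * 100) / 3 ))   # e.g. 56 OSDs, 3x replication -> 1866, round to 2048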

Any thoughts Mark?

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 04 September 2015 13:08
> To: 'Wang, Warren' ; 'Mark Nelson'
> ; 'Ben Hines' 
> Cc: 'ceph-users' 
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
> I've just made the same change ( 4 and 40 for now) on my cluster which is a
> similar size to yours. I didn't see any merging happening, although most of
> the directory's I looked at had more files in than the new merge threshold, so
> I guess this is to be expected
> 
> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
> bring
> things back into order.
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Wang, Warren
> > Sent: 04 September 2015 01:21
> > To: Mark Nelson ; Ben Hines 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] Ceph performance, empty vs part full
> >
> > I'm about to change it on a big cluster too. It totals around 30
> > million, so I'm a bit nervous on changing it. As far as I understood,
> > it would indeed move them around, if you can get underneath the
> > threshold, but it may be hard to do. Two more settings that I highly
> > recommend changing on a big prod cluster. I'm in favor of bumping these
> two up in the defaults.
> >
> > Warren
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Mark Nelson
> > Sent: Thursday, September 03, 2015 6:04 PM
> > To: Ben Hines 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] Ceph performance, empty vs part full
> >
> > Hrm, I think it will follow the merge/split rules if it's out of whack
> > given the new settings, but I don't know that I've ever tested it on
> > an existing cluster to see that it actually happens.  I guess let it
> > sit for a while and then check the OSD PG directories to see if the
> > object counts make sense given the new settings? :D
> >
> > Mark
> >
> > On 09/03/2015 04:31 PM, Ben Hines wrote:
> > > Hey Mark,
> > >
> > > I've just tweaked these filestore settings for my cluster -- after
> > > changing this, is there a way to make ceph move existing objects
> > > around to new filestore locations, or will this only apply to newly
> > > created objects? (i would assume the latter..)
> > >
> > > thanks,
> > >
> > > -Ben
> > >
> > > On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
> > wrote:
> > >> Basically for each PG, there's a directory tree where only a
> > >> certain number of objects are allowed in a given directory before
> > >> it splits into new branches/leaves.  The problem is that this has a
> > >> fair amount of overhead and also there's extra associated dentry
> > >> lookups to get at any
> > given object.
> > >>
> > >> You may want to try something like:
> > >>
> > >> "filestore merge threshold = 40"
> > >> "filestore split multiple = 8"
> > >>
> > >> This will dramatically increase the number of objects per directory
> > allowed.
> > >>
> > >> Another thing you may want to try is telling the kernel to greatly
> > >> favor retaining dentries and inodes in cache:
> > >>
> > >> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
> > >>
> > >> Mark
> > >>
> > >>
> > >> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
> > >>>
> > >>> If I create a new pool it is generally fast for a short amount of time.
> > >>> Not as fast as if I had a blank cluster, but close to.
> > >>>
> > >>> Bryn
> > 
> >  On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:
> > 
> >  I think you're probably running into the internal PG/collection
> >  splitting here; try searching for those terms and seeing what
> >  your OSD folder structures look like. You could test by creating
> >  a new pool and seeing if it's faster or slower than the one
> >  you've already filled
> > up.
> >  -Greg
> > 
> >  On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
> >   wrote:
> > >
> > > Hi All,
> > >
> > >
> > > I’m perf testing a cluster again, This time I have re-built the
> > > cluster and am filling it for testing.
> > >
> > > on a 10 min run I get the following results from 5 load
> > > generators, each writing though 7 iocontexts, with a queue depth
> > > of
> > 50 async writes.
> > >
> > >
> > > Gen1
> > > Percentile 100 = 0.729775905609
> > > Max latencies = 0.729775905609, Min 

[ceph-users] Deep scrubbing OSD

2015-09-04 Thread Межов Игорь Александрович
Hi!

Just one simple question: how can we see when a deep scrub of an OSD completes, 
if we execute the 'ceph osd deep-scrub ' command?
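
One way to watch for completion (a sketch, not an authoritative method - the column layout of 'ceph pg dump' varies a bit between releases):

# Per-PG deep-scrub timestamps advance as each PG finishes. Once every PG
# hosted by the OSD shows a deep_scrub_stamp newer than the moment you ran
# 'ceph osd deep-scrub', the scrub is done.
ceph pg dump pgs | head -1            # locate the deep_scrub_stamp column
ceph pg dump pgs | grep '^[0-9]'      # then inspect the stamps for that OSD's PGs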


Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Nick Fisk
I've just made the same change (4 and 40 for now) on my cluster, which is a 
similar size to yours. I didn't see any merging happening, although most of the 
directories I looked at had more files in them than the new merge threshold, so I 
guess this is to be expected.

I'm currently splitting my PGs from 1024 to 2048 to see if that helps to bring 
things back into order.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Wang, Warren
> Sent: 04 September 2015 01:21
> To: Mark Nelson ; Ben Hines 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
> I'm about to change it on a big cluster too. It totals around 30 million, so 
> I'm a
> bit nervous on changing it. As far as I understood, it would indeed move
> them around, if you can get underneath the threshold, but it may be hard to
> do. Two more settings that I highly recommend changing on a big prod
> cluster. I'm in favor of bumping these two up in the defaults.
> 
> Warren
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mark Nelson
> Sent: Thursday, September 03, 2015 6:04 PM
> To: Ben Hines 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
> 
> Hrm, I think it will follow the merge/split rules if it's out of whack given 
> the
> new settings, but I don't know that I've ever tested it on an existing 
> cluster to
> see that it actually happens.  I guess let it sit for a while and then check 
> the
> OSD PG directories to see if the object counts make sense given the new
> settings? :D
> 
> Mark
> 
> On 09/03/2015 04:31 PM, Ben Hines wrote:
> > Hey Mark,
> >
> > I've just tweaked these filestore settings for my cluster -- after
> > changing this, is there a way to make ceph move existing objects
> > around to new filestore locations, or will this only apply to newly
> > created objects? (i would assume the latter..)
> >
> > thanks,
> >
> > -Ben
> >
> > On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
> wrote:
> >> Basically for each PG, there's a directory tree where only a certain
> >> number of objects are allowed in a given directory before it splits
> >> into new branches/leaves.  The problem is that this has a fair amount
> >> of overhead and also there's extra associated dentry lookups to get at any
> given object.
> >>
> >> You may want to try something like:
> >>
> >> "filestore merge threshold = 40"
> >> "filestore split multiple = 8"
> >>
> >> This will dramatically increase the number of objects per directory
> allowed.
> >>
> >> Another thing you may want to try is telling the kernel to greatly
> >> favor retaining dentries and inodes in cache:
> >>
> >> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
> >>
> >> Mark
> >>
> >>
> >> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
> >>>
> >>> If I create a new pool it is generally fast for a short amount of time.
> >>> Not as fast as if I had a blank cluster, but close to.
> >>>
> >>> Bryn
> 
>  On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:
> 
>  I think you're probably running into the internal PG/collection
>  splitting here; try searching for those terms and seeing what your
>  OSD folder structures look like. You could test by creating a new
>  pool and seeing if it's faster or slower than the one you've already 
>  filled
> up.
>  -Greg
> 
>  On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>   wrote:
> >
> > Hi All,
> >
> >
> > I’m perf testing a cluster again,
> > This time I have re-built the cluster and am filling it for testing.
> >
> > on a 10 min run I get the following results from 5 load
> > generators, each writing though 7 iocontexts, with a queue depth of
> 50 async writes.
> >
> >
> > Gen1
> > Percentile 100 = 0.729775905609
> > Max latencies = 0.729775905609, Min = 0.0320818424225, mean =
> > 0.0750389684542
> > Total objects writen = 113088 in time 604.259738207s gives
> > 187.151307376/s (748.605229503 MB/s)
> >
> > Gen2
> > Percentile 100 = 0.735981941223
> > Max latencies = 0.735981941223, Min = 0.0340068340302, mean =
> > 0.0745198070711
> > Total objects writen = 113822 in time 604.437897921s gives
> > 188.310495407/s (753.241981627 MB/s)
> >
> > Gen3
> > Percentile 100 = 0.828994989395
> > Max latencies = 0.828994989395, Min = 0.0349340438843, mean =
> > 0.0745455575197
> > Total objects writen = 113670 in time 604.352181911s gives
> > 188.085694736/s (752.342778944 MB/s)
> >
> > Gen4
> > Percentile 100 = 1.06834602356
> > Max latencies = 1.06834602356, Min = 0.0333499908447, mean =
> > 0.0752239764659
> > 

Re: [ceph-users] How to disable object-map and exclusive features ?

2015-09-04 Thread Jason Dillaman
> I have a core dump with a size of 1200M compressed.
> 
> Where shall I put the dump?
> 

I believe you can use the ceph-post-file utility [1] to upload the core and 
your current package list to ceph.com.

Jason

[1] http://ceph.com/docs/master/man/8/ceph-post-file/
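
The invocation is roughly this (a sketch - the description text is illustrative; see the man page [1] for the full option list):

ceph-post-file -d "librbd crash core dump, hammer 0.94.x" /path/to/core.tar.gz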
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Jan Schermer
Didn't you run out of space? Happened to me when a customer tried to create a 
1TB image...

Z.

> On 04 Sep 2015, at 15:15, Sebastien Han  wrote:
> 
> Just to take away a possible issue from infra (LBs etc).
> Did you try to download the image on the compute node? Something like rbd 
> export?
> 
>> On 04 Sep 2015, at 11:56, Vasiliy Angapov  wrote:
>> 
>> Hi all,
>> 
>> Not sure actually where does this bug belong to - OpenStack or Ceph -
>> but writing here in humble hope that anyone faced that issue also.
>> 
>> I configured test OpenStack instance with Glance images stored in Ceph
>> 0.94.3. Nova has local storage.
>> But when I'm trying to launch instance from large image stored in Ceph
>> - it fails to spawn with such an error in nova-conductor.log:
>> 
>> 2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
>> [req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
>> 82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
>> -] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
>> host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
>> last):\n', u'  File
>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
>> in _do_build_and_run_instance\nfilter_properties)\n', u'  File
>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
>> in _build_and_run_instance\ninstance_uuid=instance.uuid,
>> reason=six.text_type(e))\n', u'RescheduledException: Build of instance
>> 18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
>> Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
>> expected 4a7de2fbbd01be5c6a9e114df145b027\n']
>> 
>> So nova tries 3 different hosts with the same error messages on every
>> single one and then fails to spawn an instance.
>> I've tried Cirros little image and it works fine with it. Issue
>> happens with large images like 10Gb in size.
>> I also managed to look into /var/lib/nova/instances/_base folder and
>> found out that image is actually being downloaded but at some moment
>> the download process interrupts for some unknown reason and instance
>> gets deleted.
>> 
>> I looked at the syslog and found many messages like that:
>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>> (cutoff 2015-09-04 12:51:32.735011)
>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>> (cutoff 2015-09-04 12:51:32.735011)
>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>> (cutoff 2015-09-04 12:51:32.735011)
>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>> (cutoff 2015-09-04 12:51:32.735011)
>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>> (cutoff 2015-09-04 12:51:32.735011)
>> 
>> I've also tried to monitor nova-compute process file descriptors
>> number but it is never more than 102. ("echo
>> /proc/NOVA_COMPUTE_PID/fd/* | wc -w" like Jan advised in this ML).
>> It also seems like problem appeared only in 0.94.3, in 0.94.2
>> everything worked just fine!
>> 
>> Would be very grateful for any help!
>> 
>> Vasily.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> Cheers.
> 
> Sébastien Han
> Senior Cloud Architect
> 
> "Always give 100%. Unless you're giving blood."
> 
> Mail: s...@redhat.com
> Address: 11 bis, rue Roquépine - 75008 Paris
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum number of mapped rbds?

2015-09-04 Thread Ilya Dryomov
On Fri, Sep 4, 2015 at 4:44 PM, Ilya Dryomov  wrote:
> On Fri, Sep 4, 2015 at 4:30 PM, Sebastien Han  wrote:
>> Which Kernel are you running on?
>> These days, the theoretical limit is 65536 AFAIK.
>>
>> Ilya would know the kernel needed for that.
>
> 3.14 or later, and, if you are loading your kernel modules by hand or
> have your distro load them for you during boot, you'll need to make
> sure rbd.ko is loaded with single_major=T.  (rbd cli tool loads rbd.ko
> with single_major=T but won't reload if the module is already loaded.)

Sorry, I meant single_major=Y (or =1) - I always mix these up.
And yes, 248 devices is about as high as you can go on pre-3.14 or on
post-3.14 with single_major=N.
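
A minimal sketch of what that means in practice, assuming you load the module by hand rather than letting the rbd cli do it:

modprobe rbd single_major=Y
cat /sys/module/rbd/parameters/single_major   # verify: should print Y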

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Impact add PG

2015-09-04 Thread Jimmy Goffaux

Hello everyone,

Recently we increased the number of PGs in a pool. We had a big 
performance problem: the whole Ceph cluster was at 0 IOPS while 
production was running on top of it.


So we did this:

ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'

This changes the priority of the recovery actions, and we got a functional cluster back.
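
To confirm the injected values took effect, you can query an OSD's admin socket (a sketch; run it on the node hosting that OSD):

ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'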

Hope it can help you;)



--

Jimmy Goffaux
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] maximum number of mapped rbds?

2015-09-04 Thread Sebastien Han
Which Kernel are you running on?
These days, the theoretical limit is 65536 AFAIK.

Ilya would know the kernel needed for that.

> On 03 Sep 2015, at 15:05, Jeff Epstein  wrote:
> 
> Hello,
> 
> In response to an rbd map command, we are getting a "Device or resource busy".
> 
> $ rbd -p platform map ceph:pzejrbegg54hi-stage-4ac9303161243dc71c75--php
> 
> rbd: sysfs write failed
> 
> rbd: map failed: (16) Device or resource busy
> 
> 
> We currently have over 200 rbds mapped on a single host. Can this be the 
> source of the problem? If so, is there a workaround?
> 
> $  rbd -p platform showmapped|wc -l
> 248
> 
> Thanks.
> 
> Best,
> Jeff
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Cheers.

Sébastien Han
Senior Cloud Architect

"Always give 100%. Unless you're giving blood."

Mail: s...@redhat.com
Address: 11 bis, rue Roquépine - 75008 Paris



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Mark Nelson
There are a lot of factors that play into all of this.  The more PGs you 
have, the more total objects you can store before you hit the 
thresholds.  More PGs also means slightly better random distribution 
across OSDs (not really affected by the size of the OSD, assuming all 
OSDs are uniform).  You have to be careful increasing the PG count 
though.  I've tested about a million PGs and things more or less worked, 
but the mons were pretty laggy and I didn't test recovery.  For small 
clusters I personally like to use more PGs than our guidelines indicate, 
and for very large clusters I suspect you might have to under-allocate 
but then probably use larger directory splitting thresholds to at least 
balance that part of the equation out.
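
To make the directory-splitting side of that trade-off concrete (using the commonly documented filestore relationship - worth double-checking against your release): a PG subdirectory splits once it holds roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects.

# defaults (split 2, merge 10):
echo $(( 2 * 10 * 16 ))   # = 320 objects per directory before a split
# the values suggested earlier in this thread (split 8, merge 40):
echo $(( 8 * 40 * 16 ))   # = 5120 objects per directory before a split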


Mark

On 09/04/2015 07:18 AM, Nick Fisk wrote:

Actually just thinking about this some more, shouldn't the PG's per OSD "golden 
rule" also depend on the size of the OSD? If this Directory splitting is a big deal 
then an 8TB OSD is going to need a lot more PG's than say a 1TB OSD.

Any thoughts Mark?


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Nick Fisk
Sent: 04 September 2015 13:08
To: 'Wang, Warren' ; 'Mark Nelson'
; 'Ben Hines' 
Cc: 'ceph-users' 
Subject: Re: [ceph-users] Ceph performance, empty vs part full

I've just made the same change ( 4 and 40 for now) on my cluster which is a
similar size to yours. I didn't see any merging happening, although most of
the directory's I looked at had more files in than the new merge threshold, so
I guess this is to be expected

I'm currently splitting my PG's from 1024 to 2048 to see if that helps to bring
things back into order.


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Wang, Warren
Sent: 04 September 2015 01:21
To: Mark Nelson ; Ben Hines 
Cc: ceph-users 
Subject: Re: [ceph-users] Ceph performance, empty vs part full

I'm about to change it on a big cluster too. It totals around 30
million, so I'm a bit nervous on changing it. As far as I understood,
it would indeed move them around, if you can get underneath the
threshold, but it may be hard to do. Two more settings that I highly
recommend changing on a big prod cluster. I'm in favor of bumping these

two up in the defaults.


Warren

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Mark Nelson
Sent: Thursday, September 03, 2015 6:04 PM
To: Ben Hines 
Cc: ceph-users 
Subject: Re: [ceph-users] Ceph performance, empty vs part full

Hrm, I think it will follow the merge/split rules if it's out of whack
given the new settings, but I don't know that I've ever tested it on
an existing cluster to see that it actually happens.  I guess let it
sit for a while and then check the OSD PG directories to see if the
object counts make sense given the new settings? :D

Mark

On 09/03/2015 04:31 PM, Ben Hines wrote:

Hey Mark,

I've just tweaked these filestore settings for my cluster -- after
changing this, is there a way to make ceph move existing objects
around to new filestore locations, or will this only apply to newly
created objects? (i would assume the latter..)

thanks,

-Ben

On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 

wrote:

Basically for each PG, there's a directory tree where only a
certain number of objects are allowed in a given directory before
it splits into new branches/leaves.  The problem is that this has a
fair amount of overhead and also there's extra associated dentry
lookups to get at any

given object.


You may want to try something like:

"filestore merge threshold = 40"
"filestore split multiple = 8"

This will dramatically increase the number of objects per directory

allowed.


Another thing you may want to try is telling the kernel to greatly
favor retaining dentries and inodes in cache:

echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure

Mark


On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:


If I create a new pool it is generally fast for a short amount of time.
Not as fast as if I had a blank cluster, but close to.

Bryn


On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:

I think you're probably running into the internal PG/collection
splitting here; try searching for those terms and seeing what
your OSD folder structures look like. You could test by creating
a new pool and seeing if it's faster or slower than the one
you've already filled

up.

-Greg

On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
 wrote:


Hi All,


I’m perf testing a cluster again, This time I have re-built the
cluster and am filling it for testing.

on a 10 min run I get the following results from 5 load
generators, each writing though 7 

Re: [ceph-users] maximum number of mapped rbds?

2015-09-04 Thread Ilya Dryomov
On Fri, Sep 4, 2015 at 4:30 PM, Sebastien Han  wrote:
> Which Kernel are you running on?
> These days, the theoretical limit is 65536 AFAIK.
>
> Ilya would know the kernel needed for that.

3.14 or later, and, if you are loading your kernel modules by hand or
have your distro load them for you during boot, you'll need to make
sure rbd.ko is loaded with single_major=T.  (rbd cli tool loads rbd.ko
with single_major=T but won't reload if the module is already loaded.)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crash on rbd bench-write

2015-09-04 Thread Jason Dillaman
Any particular reason why you have the image mounted via the kernel client 
while performing the benchmark?  Not to say this is the reason for the crash, but 
it's strange, since 'rbd bench-write' uses the user-mode library and so won't test 
kernel IO speed at all.  Are you able to test bench-write with a Ceph 
Hammer-release client?

--

Jason

> Hiya. Playing with a small Ceph setup from the Quick Start documentation.
> 
> Seeing an issue running rbd bench-write. The initial trace is provided
> below; let me know if you need other information. FWIW, the rados bench
> works just fine.
> 
> Any idea what is causing this? Is it a parsing issue in the rbd command?
>
> Thanks
> --Glenn
> 
> root@ceph-client:~# uname -a
> Linux ceph-client 4.1.6-rh1-xenU #1 SMP Fri Sep 4 02:50:30 UTC 2015
> x86_64 x86_64 x86_64 GNU/Linux
> 
> root@ceph-client:~# rbd --version
> ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
> 
> root@ceph-client:~# rbd showmapped
> id pool image snap device
> 0  rbd  foo   -/dev/rbd0
> 
> root@ceph-client:/mnt/ceph-block-device# rbd info foo
> rbd image 'foo':
>  size 4096 MB in 1024 objects
>  order 22 (4096 kB objects)
>  block_name_prefix: rb.0.1073.238e1f29
>  format: 1
> root@ceph-client:/mnt/ceph-block-device# man rbd
> 
> root@ceph-client:/mnt/ceph-block-device# rbd bench-write foo
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>SEC   OPS   OPS/SEC   BYTES/SEC
> *** Error in `rbd': free(): invalid pointer: 0x56a727a8 ***
> *** Caught signal (Aborted) **
>   in thread f26feb40
>   ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
>   1: (()+0x24a20) [0x5664aa20]
>   2: [0xf77abe50]
>   3: [0xf77abe80]
>   4: (gsignal()+0x47) [0xf6a60607]
>   5: (abort()+0x143) [0xf6a63a33]
>   6: (()+0x68e53) [0xf6a9ae53]
>   7: (()+0x7333a) [0xf6aa533a]
>   8: (()+0x73fad) [0xf6aa5fad]
>   9: (operator delete(void*)+0x1f) [0xf6c4682f]
>   10: (librbd::C_AioWrite::~C_AioWrite()+0x26) [0xf76f90b6]
>   11: (Context::complete(int)+0x1f) [0xf76ca52f]
>   12: (librbd::rados_req_cb(void*, void*)+0x48) [0xf76d7f58]
>   13: (librados::C_AioSafe::finish(int)+0x2b) [0xf6eb719b]
>   14: (Context::complete(int)+0x17) [0xf6e8f6f7]
>   15: (Finisher::finisher_thread_entry()+0x1a8) [0xf6f5b3a8]
>   16: (Finisher::FinisherThread::entry()+0x1e) [0xf7746f7e]
>   17: (Thread::entry_wrapper()+0x4f) [0xf6f82ebf]
>   18: (Thread::_entry_func(void*)+0x1b) [0xf6f82efb]
>   19: (()+0x6f70) [0xf6d1ff70]
>   20: (clone()+0x5e) [0xf6b1dbee]
> 2015-09-04 04:30:46.755568 f26feb40 -1 *** Caught signal (Aborted) **
>   in thread f26feb40
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact add PG

2015-09-04 Thread Wang, Warren
Sadly, this is one of those things that people find out after running their 
first production Ceph cluster. Never run with the defaults. I know it's been 
recently reduced to 3 and 1 or 1 and 3, I forget, but I would advocate 1 and 1. 
Even that will cause a tremendous amount of traffic with any reasonable sized 
cluster.

Warren

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jimmy 
Goffaux
Sent: Friday, September 04, 2015 8:52 AM
To: Ceph users 
Subject: [ceph-users] Impact add PG

Hello everyone,

Recently we increased the number of PGs in a pool. We had a big performance 
problem: the whole Ceph cluster was at 0 IOPS while production was running 
on top of it.

So we did this:

ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'

This changes the priority of the recovery actions, and we got a functional cluster back.

Hope it can help you;)



-- 

Jimmy Goffaux
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Sebastien Han
Just to take away a possible issue from infra (LBs etc).
Did you try to download the image on the compute node? Something like rbd 
export?

> On 04 Sep 2015, at 11:56, Vasiliy Angapov  wrote:
> 
> Hi all,
> 
> Not sure actually where does this bug belong to - OpenStack or Ceph -
> but writing here in humble hope that anyone faced that issue also.
> 
> I configured test OpenStack instance with Glance images stored in Ceph
> 0.94.3. Nova has local storage.
> But when I'm trying to launch instance from large image stored in Ceph
> - it fails to spawn with such an error in nova-conductor.log:
> 
> 2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
> [req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
> 82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
> -] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
> host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
> last):\n', u'  File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
> in _do_build_and_run_instance\nfilter_properties)\n', u'  File
> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
> in _build_and_run_instance\ninstance_uuid=instance.uuid,
> reason=six.text_type(e))\n', u'RescheduledException: Build of instance
> 18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
> Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
> expected 4a7de2fbbd01be5c6a9e114df145b027\n']
> 
> So nova tries 3 different hosts with the same error messages on every
> single one and then fails to spawn an instance.
> I've tried Cirros little image and it works fine with it. Issue
> happens with large images like 10Gb in size.
> I also managed to look into /var/lib/nova/instances/_base folder and
> found out that image is actually being downloaded but at some moment
> the download process interrupts for some unknown reason and instance
> gets deleted.
> 
> I looked at the syslog and found many messages like that:
> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
> (cutoff 2015-09-04 12:51:32.735011)
> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
> (cutoff 2015-09-04 12:51:32.735011)
> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
> (cutoff 2015-09-04 12:51:32.735011)
> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
> (cutoff 2015-09-04 12:51:32.735011)
> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
> (cutoff 2015-09-04 12:51:32.735011)
> 
> I've also tried to monitor nova-compute process file descriptors
> number but it is never more than 102. ("echo
> /proc/NOVA_COMPUTE_PID/fd/* | wc -w" like Jan advised in this ML).
> It also seems like problem appeared only in 0.94.3, in 0.94.2
> everything worked just fine!
> 
> Would be very grateful for any help!
> 
> Vasily.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Cheers.

Sébastien Han
Senior Cloud Architect

"Always give 100%. Unless you're giving blood."

Mail: s...@redhat.com
Address: 11 bis, rue Roquépine - 75008 Paris



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Jan Schermer
Mark could you please elaborate on this?
"use larger directory splitting thresholds to at least balance that part of the 
equation out"

Thanks
Jan

> On 04 Sep 2015, at 15:31, Mark Nelson  wrote:
> 
> There's a lot of factors that play into all of this.  The more PGs you have, 
> the more total objects you can store before you hit the thresholds.  More PGs 
> also means slightly better random distribution across OSDs (Not really 
> affected by the size of the OSD assuming all OSDs are uniform).  You have to 
> be careful increasing the PG count though.  I've tested about a million PGs 
> and things more or less worked but the mons were pretty laggy and I didn't 
> test recovery.  For small clusters I personally like to use more PGs than our 
> guidelines indicate and for very large clusters I suspect you might have to 
> under-allocate but then probably use larger directory splitting thresholds to 
> at least balance that part of the equation out.
> 
> Mark
> 
> On 09/04/2015 07:18 AM, Nick Fisk wrote:
>> Actually just thinking about this some more, shouldn't the PG's per OSD 
>> "golden rule" also depend on the size of the OSD? If this Directory 
>> splitting is a big deal then an 8TB OSD is going to need a lot more PG's 
>> than say a 1TB OSD.
>> 
>> Any thoughts Mark?
>> 
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Nick Fisk
>>> Sent: 04 September 2015 13:08
>>> To: 'Wang, Warren' ; 'Mark Nelson'
>>> ; 'Ben Hines' 
>>> Cc: 'ceph-users' 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>> 
>>> I've just made the same change ( 4 and 40 for now) on my cluster which is a
>>> similar size to yours. I didn't see any merging happening, although most of
>>> the directory's I looked at had more files in than the new merge threshold, 
>>> so
>>> I guess this is to be expected
>>> 
>>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>>> bring
>>> things back into order.
>>> 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Wang, Warren
 Sent: 04 September 2015 01:21
 To: Mark Nelson ; Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full
 
 I'm about to change it on a big cluster too. It totals around 30
 million, so I'm a bit nervous on changing it. As far as I understood,
 it would indeed move them around, if you can get underneath the
 threshold, but it may be hard to do. Two more settings that I highly
 recommend changing on a big prod cluster. I'm in favor of bumping these
>>> two up in the defaults.
 
 Warren
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
 Of Mark Nelson
 Sent: Thursday, September 03, 2015 6:04 PM
 To: Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full
 
 Hrm, I think it will follow the merge/split rules if it's out of whack
 given the new settings, but I don't know that I've ever tested it on
 an existing cluster to see that it actually happens.  I guess let it
 sit for a while and then check the OSD PG directories to see if the
 object counts make sense given the new settings? :D
 
 Mark
 
 On 09/03/2015 04:31 PM, Ben Hines wrote:
> Hey Mark,
> 
> I've just tweaked these filestore settings for my cluster -- after
> changing this, is there a way to make ceph move existing objects
> around to new filestore locations, or will this only apply to newly
> created objects? (i would assume the latter..)
> 
> thanks,
> 
> -Ben
> 
> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
 wrote:
>> Basically for each PG, there's a directory tree where only a
>> certain number of objects are allowed in a given directory before
>> it splits into new branches/leaves.  The problem is that this has a
>> fair amount of overhead and also there's extra associated dentry
>> lookups to get at any
 given object.
>> 
>> You may want to try something like:
>> 
>> "filestore merge threshold = 40"
>> "filestore split multiple = 8"
>> 
>> This will dramatically increase the number of objects per directory
 allowed.
>> 
>> Another thing you may want to try is telling the kernel to greatly
>> favor retaining dentries and inodes in cache:
>> 
>> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>> 
>> Mark
>> 
>> 
>> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>> 
>>> If I create a new pool it is 

[ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread German Anders
Hi cephers,

   I have the following scheme:

7x OSD servers with:
4x 800GB SSD Intel DC S3510 (OSD-SSD)
3x 120GB SSD Intel DC S3500 (Journals)
5x 3TB SAS disks (OSD-SAS)

The OSD servers are located in two separate racks, with two power circuits
each.

   I would like to know the best way to implement this: use the 4x
800GB SSDs as an SSD pool, or use them as a cache pool? Or any other
suggestion? Also, any advice on the CRUSH design?

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread Nick Fisk
I wouldn't advise upgrading yet if this cluster is going into production. I 
think several people got bitten last time round when they upgraded to 
pre-release Hammer.

Here is a good example of how to create separate roots for SSDs and HDDs:

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

The rulesets then enable you to pin pools to certain CRUSH roots.

I highly recommend you use the "osd crush location hook =" config directive to 
run a script that auto-places the OSDs on startup.
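
The linked page boils down to something like this (a sketch with illustrative bucket/pool names; note that on Hammer the pool setting is still called crush_ruleset):

ceph osd crush add-bucket ssd root                  # a second root for the SSD OSDs
ceph osd crush move ssd-host1 root=ssd              # per-host SSD buckets go under it
ceph osd crush rule create-simple ssd_rule ssd host
ceph osd pool set ssd-pool crush_ruleset 1          # rule id from 'ceph osd crush rule dump'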

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> German Anders
> Sent: 04 September 2015 17:18
> To: Nick Fisk 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Best layout for SSD & SAS OSDs
> 
> Thanks a lot Nick. Regarding the power feeds, we only have two circuits for all
> the racks, so I'll create the "rack" bucket in the CRUSH map and spread the OSD
> servers across the rack buckets. Then regarding the SSD pools: I've installed the
> Hammer version and am wondering whether to upgrade to Infernalis v9.0.3 and apply the
> SSD cache, or stay on Hammer and do the SSD pools, and maybe leave two
> 800GB SSDs for later use as cache (1.6TB per OSD server). Do you have a
> crushmap example for this type of config?
> Thanks a lot,
> Best regards,
> 
> 
> German
> 
> 2015-09-04 13:10 GMT-03:00 Nick Fisk :
> Hi German,
> 
> Are the power feeds completely separate (ie 4 feeds in total), or just each
> rack has both feeds? If it’s the latter I don’t see any benefit from including
> this into the crushmap and would just create a “rack” bucket. Also assuming
> your servers have dual PSU’s, this also changes the power failure scenarios
> quite a bit as well.
> 
> In regards to the pools, unless you know your workload will easily fit into a
> cache pool with room to spare, I would suggest not going down that route
> currently. Performance in many cases can actually end up being worse if you
> end up doing a lot of promotions.
> 
> *However* I’ve been doing a bit of testing with the current master and
> there are a lot of improvements around cache tiering that are starting to
> have a massive improvement on performance. If you can get by with just the
> SAS disks for now and make a more informed decision about the cache
> tiering when Infernalis is released then that might be your best bet.
> 
> Otherwise you might just be best using them as a basic SSD only Pool.
> 
> Nick
> 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> German Anders
> Sent: 04 September 2015 16:30
> To: ceph-users 
> Subject: [ceph-users] Best layout for SSD & SAS OSDs
> 
> Hi cephers,
>I've the following scheme:
> 7x OSD servers with:
> 4x 800GB SSD Intel DC S3510 (OSD-SSD)
> 3x 120GB SSD Intel DC S3500 (Journals)
> 5x 3TB SAS disks (OSD-SAS)
> The OSD servers are located on two separate Racks with two power circuits
> each.
>I would like to know what is the best way to implement this.. use the 4x
> 800GB SSD like a SSD-pool, or used them us a Cache pool? or any other
> suggestion? Also any advice for the crush design?
> Thanks in advance,
> 
> 
> German
> 
> 






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Receiving "failed to parse date for auth header"

2015-09-04 Thread Ilya Dryomov
On Fri, Sep 4, 2015 at 12:42 PM, Ramon Marco Navarro
 wrote:
> Good day everyone!
>
> I'm having a problem using aws-java-sdk to connect to Ceph using radosgw. I
> am reading a " NOTICE: failed to parse date for auth header" message in the
> logs. HTTP_DATE is "Fri, 04 Sep 2015 09:25:33 +00:00", which is I think a
> valid rfc 1123 date...

Completely unfamiliar with rgw, but try "... +" (i.e. no colon)?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Ben Hines
Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
a ticket to request a way to tell an OSD to rebalance its directory
structure.

On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
> similar size to yours. I didn't see any merging happening, although most of 
> the directory's I looked at had more files in than the new merge threshold, 
> so I guess this is to be expected
>
> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
> bring things back into order.
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Wang, Warren
>> Sent: 04 September 2015 01:21
>> To: Mark Nelson ; Ben Hines 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> I'm about to change it on a big cluster too. It totals around 30 million, so 
>> I'm a
>> bit nervous on changing it. As far as I understood, it would indeed move
>> them around, if you can get underneath the threshold, but it may be hard to
>> do. Two more settings that I highly recommend changing on a big prod
>> cluster. I'm in favor of bumping these two up in the defaults.
>>
>> Warren
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Mark Nelson
>> Sent: Thursday, September 03, 2015 6:04 PM
>> To: Ben Hines 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Hrm, I think it will follow the merge/split rules if it's out of whack given 
>> the
>> new settings, but I don't know that I've ever tested it on an existing 
>> cluster to
>> see that it actually happens.  I guess let it sit for a while and then check 
>> the
>> OSD PG directories to see if the object counts make sense given the new
>> settings? :D
>>
>> Mark
>>
>> On 09/03/2015 04:31 PM, Ben Hines wrote:
>> > Hey Mark,
>> >
>> > I've just tweaked these filestore settings for my cluster -- after
>> > changing this, is there a way to make ceph move existing objects
>> > around to new filestore locations, or will this only apply to newly
>> > created objects? (i would assume the latter..)
>> >
>> > thanks,
>> >
>> > -Ben
>> >
>> > On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
>> wrote:
>> >> Basically for each PG, there's a directory tree where only a certain
>> >> number of objects are allowed in a given directory before it splits
>> >> into new branches/leaves.  The problem is that this has a fair amount
>> >> of overhead and also there's extra associated dentry lookups to get at any
>> given object.
>> >>
>> >> You may want to try something like:
>> >>
>> >> "filestore merge threshold = 40"
>> >> "filestore split multiple = 8"
>> >>
>> >> This will dramatically increase the number of objects per directory
>> allowed.
>> >>
>> >> Another thing you may want to try is telling the kernel to greatly
>> >> favor retaining dentries and inodes in cache:
>> >>
>> >> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>> >>
>> >> Mark
>> >>
>> >>
>> >> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>> >>>
>> >>> If I create a new pool it is generally fast for a short amount of time.
>> >>> Not as fast as if I had a blank cluster, but close to.
>> >>>
>> >>> Bryn
>> 
>>  On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:
>> 
>>  I think you're probably running into the internal PG/collection
>>  splitting here; try searching for those terms and seeing what your
>>  OSD folder structures look like. You could test by creating a new
>>  pool and seeing if it's faster or slower than the one you've already 
>>  filled
>> up.
>>  -Greg
>> 
>>  On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>   wrote:
>> >
>> > Hi All,
>> >
>> >
>> > I’m perf testing a cluster again,
>> > This time I have re-built the cluster and am filling it for testing.
>> >
>> > on a 10 min run I get the following results from 5 load
>> > generators, each writing though 7 iocontexts, with a queue depth of
>> 50 async writes.
>> >
>> >
>> > Gen1
>> > Percentile 100 = 0.729775905609
>> > Max latencies = 0.729775905609, Min = 0.0320818424225, mean =
>> > 0.0750389684542
>> > Total objects writen = 113088 in time 604.259738207s gives
>> > 187.151307376/s (748.605229503 MB/s)
>> >
>> > Gen2
>> > Percentile 100 = 0.735981941223
>> > Max latencies = 0.735981941223, Min = 0.0340068340302, mean =
>> > 0.0745198070711
>> > Total objects writen = 113822 in time 604.437897921s gives
>> > 188.310495407/s (753.241981627 MB/s)
>> >
>> > Gen3
>> > Percentile 100 = 0.828994989395
>> > Max latencies = 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread James (Fei) Liu-SSI
Hi Quentin and Andrija,
Thanks so much for reporting the problems with Samsung.

Would it be possible to share the configuration of your system? What kind 
of workload are you running? And you use the Samsung SSDs as separate journaling 
disks, right?

Thanks so much.

James

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Quentin Hartman
Sent: Thursday, September 03, 2015 1:06 PM
To: Andrija Panic
Cc: ceph-users
Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

Yeah, we've ordered some S3700's to replace them already. Should be here early 
next week. Hopefully they arrive before we have multiple nodes die at once and 
can no longer rebalance successfully.

Most of the drives I have are the 850 Pro 128GB (specifically MZ7KE128HMGA)
There are a couple 120GB 850 EVOs in there too, but ironically, none of them 
have pooped out yet.

On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:

I really advise removing the bastards before they die... no rebalancing happening, 
just a temporary OSD-down while replacing journals...

What size and model are your Samsungs?
On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
We also just started having our 850 Pros die one after the other after about 9 
months of service. 3 down, 11 to go... No warning at all, the drive is fine, 
and then it's not even visible to the machine. According to the stats in hdparm 
and the calcs I did they should have had years of life left, so it seems that 
ceph journals definitely do something they do not like, which is not reflected 
in their stats.

QH

On Wed, Aug 26, 2015 at 7:15 AM, 10 minus 
> wrote:
Hi,
We got a good deal on the 843T and we are using them in our OpenStack setup... as 
journals.
They have been running for the last six months... no issues.
When we compared them with Intel SSDs (I think it was the 3700), they were a shade 
slower for our workload and considerably cheaper.
We did not run any synthetic benchmarks since we had a specific use case.
The performance was better than our old setup, so it was good enough.
hth


On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
> wrote:

We have some 850 Pro 256GB SSDs if anyone is interested in buying :)

And also there was a new 850 Pro firmware that broke people's disks, which was 
revoked later, etc... I'm sticking with only vacuum cleaners from Samsung for 
now, maybe... :)
On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" 
> wrote:
To be honest, the Samsung 850 PRO is not a 24/7 series... it's more of a desktop+ 
series, but anyway - the results from these drives are very, very bad in any scenario 
acceptable in real life...

Possibly the 845 PRO is better, but we don't want to experiment anymore... So we 
chose the S3500 240G. Yes, it's cheaper than the S3700 (about 2x), and not as 
durable for writes, but we think it's better to replace 1 SSD per year than 
to pay double the price now.

2015-08-25 12:59 GMT+03:00 Andrija Panic 
>:

And should I mention that in another Ceph installation we had Samsung 850 Pro 
128GB, and all 6 SSDs died within a 2-month period - they simply disappeared from the 
system, so not wear-out...

Never again will we buy Samsung :)
On Aug 25, 2015 11:57 AM, "Andrija Panic" 
> wrote:

First read please:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

We are getting 200 IOPS in comparison to the Intel S3500's 18,000 IOPS - those are 
sustained performance numbers, meaning avoiding the drive's cache and running for a 
longer period of time...
Also, if checking with fio you will get better latencies on the Intel S3500 (the model 
tested in our case) along with 20x better IOPS results...

We observed the original issue as a high speed at the beginning of e.g. a file 
transfer inside a VM, which then halts to zero... We moved journals back to HDDs 
and performance was acceptable... now we are upgrading to the Intel S3500...

Best
any details on that ?

On Tue, 25 Aug 2015 11:42:47 +0200, Andrija Panic
> wrote:

> Make sure you test what ever you decide. We just learned this the hard way
> with samsung 850 pro, which is total crap, more than you could imagine...
>
> Andrija
> On Aug 25, 2015 11:25 AM, "Jan Schermer" 
> > wrote:
>
> > I would recommend Samsung 845 DC PRO (not EVO, not just PRO).
> > Very cheap, better than Intel 3610 for sure (and I think it beats even
> > 3700).
> >
> > Jan
> >
> > > On 25 Aug 2015, at 11:23, Christopher Kunz 
> > > >
> > wrote:
> > >
> > > On 25.08.15 at 11:18, Götz 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Andrija Panic
Quentin,

try fio or dd with the O_DIRECT and D_SYNC flags, and you will see less than
1MB/s - that is common for most "home" drives - check the post below to
understand why.

We removed all Samsung 850 Pro 256GB drives from our new CEPH installation and
replaced them with Intel S3500s (18,000 4Kb IOPS of constant write speed with
O_DIRECT and D_SYNC, in comparison to 200 IOPS for the Samsung 850 Pro - you can
imagine the difference...):

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
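
For example, a fio run along the lines of that post would be (a sketch - /dev/sdX
is a placeholder, and this writes directly to the device, destroying data on it):

# 4k sequential sync writes at queue depth 1 - the CEPH journal write pattern
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test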

Best

On 4 September 2015 at 21:09, Quentin Hartman 
wrote:

> Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
> there just because I couldn't find 14 pros at the time we were ordering
> hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
> as the boot drive  and the journal for 3 OSDs. And similarly, mine just
> started disappearing a few weeks ago. I've now had four fail (three 850
> Pro, one 840 Pro). I expect the rest to fail any day.
>
> As it turns out I had a phone conversation with the support rep who has
> been helping me with RMA's today and he's putting together a report with my
> pertinent information in it to forward on to someone.
>
> FWIW, I tried to get your 845's for this deploy, but couldn't find them
> anywhere, and since the 850's looked about as durable on paper I figured
> they would do ok. Seems not to be the case.
>
> QH
>
> On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
> wrote:
>
>> Hi James,
>>
>> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
>> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
>> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
>> 3-4 months of being in production (VMs/KVM/CloudStack)
>>
>> Mine were also Samsung 850 PRO 128GB.
>>
>> Best,
>> Andrija
>>
>> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
>> james@ssi.samsung.com> wrote:
>>
>>> Hi Quentin and Andrija,
>>>
>>> Thanks so much for reporting the problems with Samsung.
>>>
>>>
>>>
>>> Would be possible to get to know your configuration of your system?
>>> What kind of workload are you running?  Do you use Samsung SSD as separate
>>> journaling disk, right?
>>>
>>>
>>>
>>> Thanks so much.
>>>
>>>
>>>
>>> James
>>>
>>>
>>>
>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
>>> Behalf Of *Quentin Hartman
>>> *Sent:* Thursday, September 03, 2015 1:06 PM
>>> *To:* Andrija Panic
>>> *Cc:* ceph-users
>>> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T
>>> vs. Intel s3700
>>>
>>>
>>>
>>> Yeah, we've ordered some S3700's to replace them already. Should be here
>>> early next week. Hopefully they arrive before we have multiple nodes die at
>>> once and can no longer rebalance successfully.
>>>
>>>
>>>
>>> Most of the drives I have are the 850 Pro 128GB (specifically
>>> MZ7KE128HMGA)
>>>
>>> There are a couple 120GB 850 EVOs in there too, but ironically, none of
>>> them have pooped out yet.
>>>
>>>
>>>
>>> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
>>> wrote:
>>>
>>> I really advise removing the bastards becore they die...no rebalancing
>>> hapening just temp osd down while replacing journals...
>>>
>>> What size and model are yours Samsungs?
>>>
>>> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
>>> wrote:
>>>
>>> We also just started having our 850 Pros die one after the other after
>>> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
>>> is fine, and then it's not even visible to the machine. According to the
>>> stats in hdparm and the calcs I did they should have had years of life
>>> left, so it seems that ceph journals definitely do something they do not
>>> like, which is not reflected in their stats.
>>>
>>>
>>>
>>> QH
>>>
>>>
>>>
>>> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>>>
>>> Hi ,
>>>
>>> We got a good deal on 843T and we are using it in our Openstack setup
>>> ..as journals .
>>> They have been running for last six months ... No issues .
>>>
>>> When we compared with  Intel SSDs I think it was 3700 they  were shade
>>> slower for our workload and considerably cheaper.
>>>
>>> We did not run any synthetic benchmark since we had a specific use case.
>>>
>>> The performance was better than our old setup so it was good enough.
>>>
>>> hth
>>>
>>>
>>>
>>> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
>>> wrote:
>>>
>>> We have some 850 pro 256gb ssds if anyone interested to buy:)
>>>
>>> And also there was new 850 pro firmware that broke peoples disk which
>>> was revoked later etc... I'm sticking with only vacuum cleaners from
>>> Samsung for now, maybe... :)
>>>
>>> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" <
>>> igor.voloshane...@gmail.com> wrote:
>>>
>>> To be honest, Samsung 850 PRO not 24/7 series... it's something 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
Yeah, we've ordered some S3700's since we can't afford to have these sorts
of failures and haven't been able to find any of the DC-rated Samsung
drives anywhere.

FWIW, we didn't have any performance problems with the Samsungs; it's
exclusively this sudden failure that's making us look elsewhere.

QH

On Fri, Sep 4, 2015 at 1:20 PM, Andrija Panic 
wrote:

> Quentin,
>
> try fio or dd with O_DIRECT and D_SYNC flags, and you will see less than
> 1MB/s - that is common for most "home" drives - check the post down to
> understand
>
> We removed all Samsung 850 pro 256GB from our new CEPH installation and
> replaced with Intel S3500 (18.000 (4Kb) IOPS constant write speed with
> O_DIRECT, D_SYNC, in comparison to 200 IOPS for Samsun 850pro - you can
> imagine the difference...):
>
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Best
>
> On 4 September 2015 at 21:09, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
>> there just because I couldn't find 14 pros at the time we were ordering
>> hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
>> as the boot drive  and the journal for 3 OSDs. And similarly, mine just
>> started disappearing a few weeks ago. I've now had four fail (three 850
>> Pro, one 840 Pro). I expect the rest to fail any day.
>>
>> As it turns out I had a phone conversation with the support rep who has
>> been helping me with RMA's today and he's putting together a report with my
>> pertinent information in it to forward on to someone.
>>
>> FWIW, I tried to get your 845's for this deploy, but couldn't find them
>> anywhere, and since the 850's looked about as durable on paper I figured
>> they would do ok. Seems not to be the case.
>>
>> QH
>>
>> On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
>> wrote:
>>
>>> Hi James,
>>>
>>> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
>>> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
>>> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
>>> 3-4 months of being in production (VMs/KVM/CloudStack)
>>>
>>> Mine were also Samsung 850 PRO 128GB.
>>>
>>> Best,
>>> Andrija
>>>
>>> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
>>> james@ssi.samsung.com> wrote:
>>>
 Hi Quentin and Andrija,

 Thanks so much for reporting the problems with Samsung.



 Would be possible to get to know your configuration of your system?
 What kind of workload are you running?  Do you use Samsung SSD as separate
 journaling disk, right?



 Thanks so much.



 James



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
 Behalf Of *Quentin Hartman
 *Sent:* Thursday, September 03, 2015 1:06 PM
 *To:* Andrija Panic
 *Cc:* ceph-users
 *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T
 vs. Intel s3700



 Yeah, we've ordered some S3700's to replace them already. Should be
 here early next week. Hopefully they arrive before we have multiple nodes
 die at once and can no longer rebalance successfully.



 Most of the drives I have are the 850 Pro 128GB (specifically
 MZ7KE128HMGA)

 There are a couple 120GB 850 EVOs in there too, but ironically, none of
 them have pooped out yet.



 On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
 wrote:

 I really advise removing the bastards becore they die...no rebalancing
 hapening just temp osd down while replacing journals...

 What size and model are yours Samsungs?

 On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
 wrote:

 We also just started having our 850 Pros die one after the other after
 about 9 months of service. 3 down, 11 to go... No warning at all, the drive
 is fine, and then it's not even visible to the machine. According to the
 stats in hdparm and the calcs I did they should have had years of life
 left, so it seems that ceph journals definitely do something they do not
 like, which is not reflected in their stats.



 QH



 On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:

 Hi ,

 We got a good deal on 843T and we are using it in our Openstack setup
 ..as journals .
 They have been running for last six months ... No issues .

 When we compared with  Intel SSDs I think it was 3700 they  were shade
 slower for our workload and considerably cheaper.

 We did not run any synthetic benchmark since we had a specific use case.

 The performance was better than our old setup so it was good enough.


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Andrija Panic
Hi James,

Yes, CEPH with CloudStack. All 6 SSDs (2 SSDs in each of 3 nodes) vanished
in 2-3 weeks total time, and yes, brand new Samsung 850 Pro 128GB - I also
checked the wear_level attribute via smartctl prior to all the drives dying - no
indication that wear_level was low or anything... all other parameters seemed
fine too...
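
For reference, the check was roughly this (a sketch from memory - /dev/sdb is a
placeholder, and the attribute names are the ones these Samsung drives expose):

smartctl -A /dev/sdb | egrep 'Wear_Leveling_Count|Total_LBAs_Written'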

I can't reproduce the setup; we returned all the 850 Pros...

Hardware configuration: server model (
http://www.quantaqct.com/Product/Servers/Rackmount-Servers/2U/STRATOS-S210-X22RQ-p7c77c70c83c118?search=S210-X22RQ)
- 64GB RAM, 2x Intel 2620 v2 CPUs, 12 HDDs connected from the front of the
server to the main disk backplane (12 OSDs), and 2 SSDs connected to the embedded
Intel C601 controller on the back of the servers (6 partitions on each SSD for
journals + 1 partition used for the OS)...

As for workload, I don't think we had a very heavy workload at all, since not
too many VMs were running there, and it was mostly Linux web servers...

Best,
Andrija

On 4 September 2015 at 21:15, James (Fei) Liu-SSI  wrote:

> Hi Andrija,
>
> Thanks for your prompt response. Would it be possible to get a chance to
> know your hardware configuration, including your server information?
> Secondly, is there any way to duplicate your workload with fio-rbd, rbd
> bench or rados bench?
>
>
>
>   “so 2 SSDs in 3 servers vanished in...2-3 weeks, after a 3-4 months of
> being in production (VMs/KVM/CloudStack)”
>
>
>
>What you mean over here is that you deployed Ceph with CloudStack, am I
> correct? The 2 SSDs that vanished in 2~3 weeks were brand new Samsung 850 Pro
> 128GB, right?
>
>
>
> Thanks,
>
> James
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 11:53 AM
> *To:* James (Fei) Liu-SSI
> *Cc:* Quentin Hartman; ceph-users
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Hi James,
>
>
>
> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
>
>
> Best,
>
> Andrija
>
>
>
> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
> Hi Quentin and Andrija,
>
> Thanks so much for reporting the problems with Samsung.
>
>
>
> Would be possible to get to know your configuration of your system?  What
> kind of workload are you running?  Do you use Samsung SSD as separate
> journaling disk, right?
>
>
>
> Thanks so much.
>
>
>
> James
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Quentin Hartman
> *Sent:* Thursday, September 03, 2015 1:06 PM
> *To:* Andrija Panic
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Yeah, we've ordered some S3700's to replace them already. Should be here
> early next week. Hopefully they arrive before we have multiple nodes die at
> once and can no longer rebalance successfully.
>
>
>
> Most of the drives I have are the 850 Pro 128GB (specifically
> MZ7KE128HMGA)
>
> There are a couple 120GB 850 EVOs in there too, but ironically, none of
> them have pooped out yet.
>
>
>
> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:
>
> I really advise removing the bastards becore they die...no rebalancing
> hapening just temp osd down while replacing journals...
>
> What size and model are yours Samsungs?
>
> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
>
> We also just started having our 850 Pros die one after the other after
> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
> is fine, and then it's not even visible to the machine. According to the
> stats in hdparm and the calcs I did they should have had years of life
> left, so it seems that ceph journals definitely do something they do not
> like, which is not reflected in their stats.
>
>
>
> QH
>
>
>
> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>
> Hi ,
>
> We got a good deal on 843T and we are using it in our Openstack setup ..as
> journals .
> They have been running for last six months ... No issues .
>
> When we compared with  Intel SSDs I think it was 3700 they  were shade
> slower for our workload and considerably cheaper.
>
> We did not run any synthetic benchmark since we had a specific use case.
>
> The performance was better than our old setup so it was good enough.
>
> hth
>
>
>
> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
> wrote:
>
> We have some 850 pro 256gb ssds if anyone interested to buy:)
>
> And also there was new 850 pro firmware that broke peoples disk which was
> revoked later etc... I'm sticking with only vacuum cleaners from Samsung
> for now, maybe... :)
>
> On Aug 25, 2015 

Re: [ceph-users] ESXi/LIO/RBD repeatable problem, hang when cloning VM

2015-09-04 Thread Alex Gorbachev
On Thu, Sep 3, 2015 at 3:20 AM, Nicholas A. Bellinger
 wrote:
> (RESENDING)
>
> On Wed, 2015-09-02 at 21:14 -0400, Alex Gorbachev wrote:
>> We have experienced a repeatable issue when performing the following:
>>
>> Ceph backend with no issues, we can repeat any time at will in lab and
>> production.  Cloning an ESXi VM to another VM on the same datastore on
>> which the original VM resides.  Practically instantly, the LIO machine
>> becomes unresponsive, Pacemaker fails over to another LIO machine and
>> that too becomes unresponsive.
>>
>> Both running Ubuntu 14.04, kernel 4.1 (4.1.0-040100-generic x86_64),
>> Ceph Hammer 0.94.2, and have been able to take quite a workoad with no
>> issues.
>>
>> output of /var/log/syslog below.  I also have a screen dump of a
>> frozen system - attached.
>>
>> Thank you,
>> Alex
>>
>
> The bug-fix patch to address this NULL pointer dereference with >= v4.1
> sbc_check_prot() sanity checks + EXTENDED_COPY I/O emulation has been
> sent-out with your Reported-by.
>
> Please verify with your v4.1 environment that it resolves the original
> ESX VAAI CLONE regression with a proper Tested-by tag.
>
> For now, it has also been queued to target-pending.git/for-next with a
> stable CC'.
>
> Thanks for reporting!

Thank you for the patch.  I have compiled the kernel and tried the
cloning - it completed successfully this morning.  I will now try to
build a package and deploy it on the larger systems where the failures
occurred.  Once completed I will learn about the Tested-by tag (never
done it before) and submit the results.
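
(For reference, the tag is just a trailer line added when replying to the patch,
along the lines of "Tested-by: Alex Gorbachev <...>" with the address filled in -
a sketch of the kernel convention, not the exact line that will be sent.)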

Best regards,
Alex

>
> --nab
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread James (Fei) Liu-SSI
Andrija,
In your email thread, "18.000 (4Kb) IOPS constant write speed" stands for 18K 
IOPS with a 4k block size, right? However, you can only achieve 200 IOPS with 
the Samsung 850 Pro, right?

Theoretically, the Samsung 850 Pro can get up to 100,000 IOPS with 4k random 
reads under certain workloads. It is a little bit strange over here.

Regards,
James


From: Andrija Panic [mailto:andrija.pa...@gmail.com]
Sent: Friday, September 04, 2015 12:21 PM
To: Quentin Hartman
Cc: James (Fei) Liu-SSI; ceph-users
Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

Quentin,

try fio or dd with O_DIRECT and D_SYNC flags, and you will see less than 1MB/s 
- that is common for most "home" drives - check the post down to understand
We removed all Samsung 850 pro 256GB from our new CEPH installation and 
replaced with Intel S3500 (18.000 (4Kb) IOPS constant write speed with 
O_DIRECT, D_SYNC, in comparison to 200 IOPS for Samsun 850pro - you can imagine 
the difference...):
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Best

On 4 September 2015 at 21:09, Quentin Hartman 
> wrote:
Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in there 
just because I couldn't find 14 pros at the time we were ordering hardware. I 
have 14 nodes, each with a single 128 or 120GB SSD that serves as the boot 
drive  and the journal for 3 OSDs. And similarly, mine just started 
disappearing a few weeks ago. I've now had four fail (three 850 Pro, one 840 
Pro). I expect the rest to fail any day.

As it turns out I had a phone conversation with the support rep who has been 
helping me with RMA's today and he's putting together a report with my 
pertinent information in it to forward on to someone.

FWIW, I tried to get your 845's for this deploy, but couldn't find them 
anywhere, and since the 850's looked about as durable on paper I figured they 
would do ok. Seems not to be the case.

QH

On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
> wrote:
Hi James,

I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals 
partitions on each SSD) - SSDs just vanished with no warning, no smartctl 
errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a 3-4 
months of being in production (VMs/KVM/CloudStack)

Mine were also Samsung 850 PRO 128GB.

Best,
Andrija

On 4 September 2015 at 19:27, James (Fei) Liu-SSI 
> wrote:
Hi Quentin and Andrija,
Thanks so much for reporting the problems with Samsung.

Would be possible to get to know your configuration of your system?  What kind 
of workload are you running?  Do you use Samsung SSD as separate journaling 
disk, right?

Thanks so much.

James

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of Quentin Hartman
Sent: Thursday, September 03, 2015 1:06 PM
To: Andrija Panic
Cc: ceph-users
Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

Yeah, we've ordered some S3700's to replace them already. Should be here early 
next week. Hopefully they arrive before we have multiple nodes die at once and 
can no longer rebalance successfully.

Most of the drives I have are the 850 Pro 128GB (specifically MZ7KE128HMGA)
There are a couple 120GB 850 EVOs in there too, but ironically, none of them 
have pooped out yet.

On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:

I really advise removing the bastards becore they die...no rebalancing hapening 
just temp osd down while replacing journals...

What size and model are yours Samsungs?
On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
We also just started having our 850 Pros die one after the other after about 9 
months of service. 3 down, 11 to go... No warning at all, the drive is fine, 
and then it's not even visible to the machine. According to the stats in hdparm 
and the calcs I did they should have had years of life left, so it seems that 
ceph journals definitely do something they do not like, which is not reflected 
in their stats.

QH

On Wed, Aug 26, 2015 at 7:15 AM, 10 minus 
> wrote:
Hi ,
We got a good deal on 843T and we are using it in our Openstack setup ..as 
journals .
They have been running for last six months ... No issues .
When we compared with  Intel SSDs I think it was 3700 they  were shade slower 
for our workload and considerably cheaper.
We did not run any synthetic benchmark since we had a specific use case.
The performance was better than our old setup so it was good enough.
hth

On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 

Re: [ceph-users] OSD respawning -- FAILED assert(clone_size.count(clone))

2015-09-04 Thread David Zafman


Chris,

I see that you have stack traces that indicate some OSDs are running 
v0.94.2 (osd.23) and some running v0.94.3 (osd.30).  They should  be 
running the same release except briefly while upgrading.  I see some 
snapshot/cache tiering fixes went into 0.94.3.  So an OSD running 
v0.94.2 when you enabled cache tiering may have been the root cause of 
the SnapSet issue.  Once that had occurred, an OSD of any version could crash 
because the bad SnapSet gets replicated.


I'd love to see the SnapSet per my ceph-dencoder instructions in a prior 
e-mail.  This would help me verify the root cause.


See my inlined comments, but to bring it all together:

1. Fix osd.10 by removing extra clone as you did on osd.30
2. If you can, get me the ceph-dencoder output of the bad SnapSet (see the 
sketch after this list)
3. Verify cluster is stable and run scrub on pg 3.f9 and preferably all 
pool 3 PGs

4. Delete old rbd pool (pool 3) and create a new one
5. Restore RBD images from backup using new pool (make sure you have 
disk space as the pool delete removes objects asynchronously)
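
The decode step itself would look something like this (a sketch - it assumes the 
raw SnapSet attribute has already been extracted to a file, and the filename is a 
placeholder):

ceph-dencoder type SnapSet import /tmp/snapset.raw decode dump_json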


David

On 9/3/15 8:15 PM, Chris Taylor wrote:

On 09/03/2015 02:44 PM, David Zafman wrote:


Chris,

WARNING: Do this at your own risk.  You are deleting one of the 
snapshots of a specific portion of an rbd image.  I'm not sure how 
rbd will react.  Maybe you should repair the SnapSet instead of 
removing the inconsistency.  However, as far as I know there isn't a 
tool for it.



Would removing all the snapshots of an RBD image fix the SnapSet?
Now I'm not sure you can remove the images without causing a crash until 
you can scrub.  Fix osd.10 as indicated below.




If I remove the RBD image and re-import from backup with "rbd 
import-diff ..." will that fix it?
Once you have a stable cluster and can scrub this PG and probably all 
pool 3 PGs, then out of an abundance of caution, I would delete the 
pool, create a new one and restore the RBD images from backup.


If you are able to build from Ceph source, I happen to have an 
enhancement to ceph-objectstore-tool to output the SnapSet.


---

The message preceding the assert is in the same thread so " 
rb.0.8c2990.238e1f29.8cc0/23ed//3" has the object name in 
it.  The 23ed is the RADOS clone/snap ID.


First, get a backup by export the pg using the 
ceph-objectstore-tool.  Specify a --file somewhere with enough of 
disk space.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
export --pgid 3.f9 --file destination

Exporting 3.f9

Read 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

Export successful

I was able to export the PG.


Now you need the JSON of the object in question.  The 2nd line of 
output below has the snapid 9197, which is 23ed in hex.


$ ceph-objectstore-tool --data-path xx --journal-path xx --op 
list rb.0.8c2990.238e1f29.8cc0



["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9196,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid",9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9198,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 

["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":-2,"hash":###,"max":0,"pool":3,"namespace":"","max":0}] 



To remove it, cut and paste your output line with snapid 9197 inside 
single quotes like this:


$ ceph-objectstore-tool --data-path xx --journal-path xx 
'["3.f9",{"oid":"rb.0.8c2990.238e1f29.8cc0","key":"","snapid":9197,"hash":###,"max":0,"pool":3,"namespace":"","max":0}]' 
remove



remove 3/c55800f9/rb.0.8c2990.238e1f29.8cc0/23ed

I removed the object. After starting the OSD I'm now getting an error 
that a shard is missing and the OSD crashes.


   -4> 2015-09-03 20:11:52.471741 7fdc0d42c700  2 osd.30 pg_epoch: 
231748 pg[3.f9( v 231748'10799542 (222304'10796422,231748'10799542] 
local-les=231748 n=11658 ec=101 les/c 231748/231748 
231747/231747/231747) [30,10] r=0 lpr=231747 lua=231699'10799538 
crt=231699'10799538 lcod 231748'10799541 mlcod 0'0 
active+clean+scrubbing+deep] scrub_compare_maps   osd.30 has 25 items
-3> 2015-09-03 20:11:52.471772 7fdc0d42c700  2 osd.30 pg_epoch: 
231748 pg[3.f9( v 231748'10799542 (222304'10796422,231748'10799542] 
local-les=231748 n=11658 ec=101 les/c 231748/231748 
231747/231747/231747) [30,10] r=0 lpr=231747 lua=231699'10799538 
crt=231699'10799538 lcod 231748'10799541 mlcod 0'0 
active+clean+scrubbing+deep] scrub_compare_maps replica 10 has 26 items
-2> 2015-09-03 20:11:52.472015 7fdc0d42c700  2 osd.30 pg_epoch: 
231748 pg[3.f9( v 231748'10799542 (222304'10796422,231748'10799542] 
local-les=231748 n=11658 ec=101 les/c 231748/231748 
231747/231747/231747) [30,10] r=0 lpr=231747 lua=231699'10799538 
crt=231699'10799538 lcod 231748'10799541 mlcod 0'0 
active+clean+scrubbing+deep] be_compare_scrubmaps: 3.f9 shard 30 
missing 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
there just because I couldn't find 14 pros at the time we were ordering
hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
as the boot drive  and the journal for 3 OSDs. And similarly, mine just
started disappearing a few weeks ago. I've now had four fail (three 850
Pro, one 840 Pro). I expect the rest to fail any day.

As it turns out I had a phone conversation with the support rep who has
been helping me with RMA's today and he's putting together a report with my
pertinent information in it to forward on to someone.

FWIW, I tried to get your 845's for this deploy, but couldn't find them
anywhere, and since the 850's looked about as durable on paper I figured
they would do ok. Seems not to be the case.

QH

On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
wrote:

> Hi James,
>
> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
> Best,
> Andrija
>
> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
>> Hi Quentin and Andrija,
>>
>> Thanks so much for reporting the problems with Samsung.
>>
>>
>>
>> Would be possible to get to know your configuration of your system?  What
>> kind of workload are you running?  Do you use Samsung SSD as separate
>> journaling disk, right?
>>
>>
>>
>> Thanks so much.
>>
>>
>>
>> James
>>
>>
>>
>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
>> Of *Quentin Hartman
>> *Sent:* Thursday, September 03, 2015 1:06 PM
>> *To:* Andrija Panic
>> *Cc:* ceph-users
>> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T
>> vs. Intel s3700
>>
>>
>>
>> Yeah, we've ordered some S3700's to replace them already. Should be here
>> early next week. Hopefully they arrive before we have multiple nodes die at
>> once and can no longer rebalance successfully.
>>
>>
>>
>> Most of the drives I have are the 850 Pro 128GB (specifically
>> MZ7KE128HMGA)
>>
>> There are a couple 120GB 850 EVOs in there too, but ironically, none of
>> them have pooped out yet.
>>
>>
>>
>> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
>> wrote:
>>
>> I really advise removing the bastards becore they die...no rebalancing
>> hapening just temp osd down while replacing journals...
>>
>> What size and model are yours Samsungs?
>>
>> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
>> wrote:
>>
>> We also just started having our 850 Pros die one after the other after
>> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
>> is fine, and then it's not even visible to the machine. According to the
>> stats in hdparm and the calcs I did they should have had years of life
>> left, so it seems that ceph journals definitely do something they do not
>> like, which is not reflected in their stats.
>>
>>
>>
>> QH
>>
>>
>>
>> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>>
>> Hi ,
>>
>> We got a good deal on 843T and we are using it in our Openstack setup
>> ..as journals .
>> They have been running for last six months ... No issues .
>>
>> When we compared with  Intel SSDs I think it was 3700 they  were shade
>> slower for our workload and considerably cheaper.
>>
>> We did not run any synthetic benchmark since we had a specific use case.
>>
>> The performance was better than our old setup so it was good enough.
>>
>> hth
>>
>>
>>
>> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
>> wrote:
>>
>> We have some 850 pro 256gb ssds if anyone interested to buy:)
>>
>> And also there was new 850 pro firmware that broke peoples disk which was
>> revoked later etc... I'm sticking with only vacuum cleaners from Samsung
>> for now, maybe... :)
>>
>> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" <
>> igor.voloshane...@gmail.com> wrote:
>>
>> To be honest, Samsung 850 PRO not 24/7 series... it's something about
>> desktop+ series, but anyway - results from this drives - very very bad in
>> any scenario acceptable by real life...
>>
>>
>>
>> Possible 845 PRO more better, but we don't want to experiment anymore...
>> So we choose S3500 240G. Yes, it's cheaper than S3700 (about 2x times), and
>> no so durable for writes, but we think more better to replace 1 ssd per 1
>> year than to pay double price now.
>>
>>
>>
>> 2015-08-25 12:59 GMT+03:00 Andrija Panic :
>>
>> And should I mention that in another CEPH installation we had samsung 850
>> pro 128GB and all of 6 ssds died in 2 month period - simply disappear from
>> the system, so not wear out...
>>
>> Never again we buy Samsung :)
>>
>> On Aug 25, 2015 11:57 AM, "Andrija Panic" 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
Oh, I forgot to mention, these drives have been in service for about 9
months.

If it's useful / interesting at all, here is the smartctl -a output from
one of the 840's I installed about the same time as the ones that failed
recently, but it has not yet failed:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.0-33-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 840 PRO Series
Serial Number:S1ANNSAF800928M
LU WWN Device Id: 5 002538 5a028ebe1
Firmware Version: DXM06B0Q
User Capacity:128,035,676,160 bytes [128 GB]
Sector Size:  512 bytes logical/physical
Rotation Rate:Solid State Device
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:Fri Sep  4 19:18:22 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:  (   0) The previous self-test routine
completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (65476) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: (   2) minutes.
Extended self-test routine
recommended polling time: (  15) minutes.
SCT capabilities:   (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always   -           6768
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           6
177 Wear_Leveling_Count     0x0013   037   037   000    Pre-fail  Always   -           2275
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always   -           0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always   -           0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always   -           0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always   -           0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0032   072   064   000    Old_age   Always   -           28
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always   -           0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always   -           0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always   -           2
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always   -           68358879670

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


If I'm reading the total LBA right, and this whitepaper is correct (
http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/whitepaper/whitepaper07.html)
that correlates to about 32.5TB written to the drive. The last time I
checked all the drives in this cluster they were about evenly worn.
Assuming that's right, and the wear has been constant, we should have at
least another nine months from these drives based on the information at (
http://www.samsung.com/us/pdf/memory-storage/840PRO_25_SATA_III_Spec.pdf)
which 

[ceph-users] Cannot add/create new monitor on ceph v0.94.3

2015-09-04 Thread Chang, Fangzhe (Fangzhe)
Hi,
I’m trying to add a second monitor using ‘ceph-deploy mon new ’.  However, the log file shows the following error:
2015-09-04 16:13:54.863479 7f4cbc3f7700  0 cephx: verify_reply couldn't decrypt 
with error: error decoding block for decryption
2015-09-04 16:13:54.863491 7f4cbc3f7700  0 -- :6789/0 >> 
:6789/0 pipe(0x413 sd=12 :57954 s=1 pgs=0 cs=0 l=0 
c=0x3f29600).failed verifying authorize reply

Does anyone know how to resolve this?
Thanks

Fangzhe

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Richard Bade
Hi Everyone,

We have a Ceph pool that is entirely made up of Intel S3700/S3710
enterprise SSD's.

We are seeing some significant I/O delays on the disks causing a “SCSI Task
Abort” from the OS. This seems to be triggered by the drive receiving a
“Synchronize cache command”.

My current thinking is that setting nobarriers in XFS will stop the drive
receiving a sync command and therefore stop the I/O delay associated with
it.
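
For clarity, the change being considered is just the XFS mount option, e.g. (a
sketch - the OSD mount point and device are placeholders):

# remount a running OSD filesystem with write barriers off
mount -o remount,nobarrier /var/lib/ceph/osd/ceph-0

# or persistently via /etc/fstab
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nobarrier  0 0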

In the XFS FAQ it looks like the recommendation is that if you have a
Battery Backed raid controller you should set nobarriers for performance
reasons.

Our LSI card doesn’t have battery backed cache as it’s configured in HBA
mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor
backed cache though.

So is it recommended that barriers are turned off as the drive has a safe
cache (I am confident that the cache will write out to disk on power
failure)?

Has anyone else encountered this issue?

Any info or suggestions about this would be appreciated.

Regards,

Richard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread James (Fei) Liu-SSI
Hi Andrija,
Your feedback is greatly appreciated.

Regards,
James

From: Andrija Panic [mailto:andrija.pa...@gmail.com]
Sent: Friday, September 04, 2015 12:39 PM
To: James (Fei) Liu-SSI
Cc: Quentin Hartman; ceph-users
Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

James,

there are simple fio tests, or even a dd test, on Linux which you can run to see 
how well an SSD will perform as a CEPH journal device (CEPH writes to the journal 
SSDs with the O_DIRECT and D_SYNC flags) - the Samsung 850 performs extremely badly 
here, as do drives from many, many other vendors (D_SYNC kills their performance...)

If you are not using the D_SYNC flag, then the Samsung can achieve some nice numbers...
dd if=/dev/zero of=/dev/sda bs=4k count=10 oflag=direct,dsync (where 
/dev/sda is the raw drive, or replace that with a file on a mount point, e.g. /root/ddfile)

Check post for more info please: 
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Thanks

On 4 September 2015 at 21:31, James (Fei) Liu-SSI 
> wrote:
Andrija,
In your email thread, (18.000 (4Kb) IOPS constant write speed stands for 18K 
iops with 4k block size, right? However, you can only achieve 200IOPS with 
Samsung 850Pro, right?

Theoretically, Samsung 850 Pro can get up to 100,000 IOPS with 4k Random Read 
with certain workload.  It is a little bit strange over here.

Regards,
James


From: Andrija Panic 
[mailto:andrija.pa...@gmail.com]
Sent: Friday, September 04, 2015 12:21 PM
To: Quentin Hartman
Cc: James (Fei) Liu-SSI; ceph-users

Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

Quentin,

try fio or dd with O_DIRECT and D_SYNC flags, and you will see less than 1MB/s 
- that is common for most "home" drives - check the post down to understand
We removed all Samsung 850 pro 256GB from our new CEPH installation and 
replaced with Intel S3500 (18.000 (4Kb) IOPS constant write speed with 
O_DIRECT, D_SYNC, in comparison to 200 IOPS for Samsun 850pro - you can imagine 
the difference...):
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Best

On 4 September 2015 at 21:09, Quentin Hartman 
> wrote:
Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in there 
just because I couldn't find 14 pros at the time we were ordering hardware. I 
have 14 nodes, each with a single 128 or 120GB SSD that serves as the boot 
drive  and the journal for 3 OSDs. And similarly, mine just started 
disappearing a few weeks ago. I've now had four fail (three 850 Pro, one 840 
Pro). I expect the rest to fail any day.

As it turns out I had a phone conversation with the support rep who has been 
helping me with RMA's today and he's putting together a report with my 
pertinent information in it to forward on to someone.

FWIW, I tried to get your 845's for this deploy, but couldn't find them 
anywhere, and since the 850's looked about as durable on paper I figured they 
would do ok. Seems not to be the case.

QH

On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
> wrote:
Hi James,

I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals 
partitions on each SSD) - SSDs just vanished with no warning, no smartctl 
errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a 3-4 
months of being in production (VMs/KVM/CloudStack)

Mine were also Samsung 850 PRO 128GB.

Best,
Andrija

On 4 September 2015 at 19:27, James (Fei) Liu-SSI 
> wrote:
Hi Quentin and Andrija,
Thanks so much for reporting the problems with Samsung.

Would be possible to get to know your configuration of your system?  What kind 
of workload are you running?  Do you use Samsung SSD as separate journaling 
disk, right?

Thanks so much.

James

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of Quentin Hartman
Sent: Thursday, September 03, 2015 1:06 PM
To: Andrija Panic
Cc: ceph-users
Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
s3700

Yeah, we've ordered some S3700's to replace them already. Should be here early 
next week. Hopefully they arrive before we have multiple nodes die at once and 
can no longer rebalance successfully.

Most of the drives I have are the 850 Pro 128GB (specifically MZ7KE128HMGA)
There are a couple 120GB 850 EVOs in there too, but ironically, none of them 
have pooped out yet.

On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:

I really advise removing the bastards becore they die...no rebalancing hapening 
just temp osd down while replacing journals...

What size and model are yours 

Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Quentin Hartman
I just went through and ran this on all my currently running SSDs:

echo "$(smartctl -a /dev/sda | grep Total_LBAs_Written | awk '{ print $NF
}') * 512 /1025/1024/1024/1024" | bc

which is showing about 32TB written on the oldest nodes, about 20 on the
newer ones, and 1 on the first one I've RMA'd and replaced last week. So
the numbers are in-line with the test I did a few months ago in that they
are even, but looking back when I checked on them last my numbers were off
by 1024.

Note that this invocation of bc only outputs integers, so the results will
be rounded.
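
A variant that keeps the fractional part and loops over all drives might look
like this (a sketch - the /dev/sd? glob is an assumption about device naming):

for dev in /dev/sd?; do
  smartctl -a "$dev" | awk -v d="$dev" \
    '/Total_LBAs_Written/ { printf "%s: %.2f TiB written\n", d, $NF * 512 / 1024^4 }'
done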

On Fri, Sep 4, 2015 at 1:40 PM, James (Fei) Liu-SSI <
james@ssi.samsung.com> wrote:

> Hi Anrija,
>
> Your feedback is greatly appreciated.
>
>
>
> Regards,
>
> James
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 12:39 PM
> *To:* James (Fei) Liu-SSI
> *Cc:* Quentin Hartman; ceph-users
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> James,
>
>
>
> there are simple FIO tests or even DD test on Linux, which you can run to
> see how good SSD will perform as CEPH Journal device (CEPH does writes with
> O_DIRECT and D_SYNC flags to SSDs) - Samsung 850 perform here extremely
> bad, as many, many other vendors (D_SYNC kills performance for them...)
>
>
>
> If you are not using D_SYNC flag, then Samsung can achieve some nice
> numbers...
>
> dd if=/dev/zero of=/dev/sda bs=4k count=10 oflag=direct,dsync (where
> /dev/sda is raw drive, or replace that with mount point i.e. /root/ddfile)
>
>
>
> Check post for more info please:
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Thanks
>
>
>
> On 4 September 2015 at 21:31, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
> Andrija,
>
> In your email thread, (18.000 (4Kb) IOPS constant write speed stands for
> 18K iops with 4k block size, right? However, you can only achieve 200IOPS
> with Samsung 850Pro, right?
>
>
>
> Theoretically, Samsung 850 Pro can get up to 100,000 IOPS with 4k Random
> Read with certain workload.  It is a little bit strange over here.
>
>
>
> Regards,
>
> James
>
>
>
>
>
> *From:* Andrija Panic [mailto:andrija.pa...@gmail.com]
> *Sent:* Friday, September 04, 2015 12:21 PM
> *To:* Quentin Hartman
> *Cc:* James (Fei) Liu-SSI; ceph-users
>
>
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Quentin,
>
>
>
> try fio or dd with O_DIRECT and D_SYNC flags, and you will see less than
> 1MB/s - that is common for most "home" drives - check the post down to
> understand
>
> We removed all Samsung 850 pro 256GB from our new CEPH installation and
> replaced with Intel S3500 (18.000 (4Kb) IOPS constant write speed with
> O_DIRECT, D_SYNC, in comparison to 200 IOPS for Samsun 850pro - you can
> imagine the difference...):
>
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>
>
> Best
>
>
>
> On 4 September 2015 at 21:09, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
> Mine are also mostly 850 Pros. I have a few 840s, and a few 850 EVOs in
> there just because I couldn't find 14 pros at the time we were ordering
> hardware. I have 14 nodes, each with a single 128 or 120GB SSD that serves
> as the boot drive  and the journal for 3 OSDs. And similarly, mine just
> started disappearing a few weeks ago. I've now had four fail (three 850
> Pro, one 840 Pro). I expect the rest to fail any day.
>
>
>
> As it turns out I had a phone conversation with the support rep who has
> been helping me with RMA's today and he's putting together a report with my
> pertinent information in it to forward on to someone.
>
>
>
> FWIW, I tried to get your 845's for this deploy, but couldn't find them
> anywhere, and since the 850's looked about as durable on paper I figured
> they would do ok. Seems not to be the case.
>
>
>
> QH
>
>
>
> On Fri, Sep 4, 2015 at 12:53 PM, Andrija Panic 
> wrote:
>
> Hi James,
>
>
>
> I had 3 CEPH nodes as folowing: 12 OSDs(HDD) and 2 SSDs (2x 6 Journals
> partitions on each SSD) - SSDs just vanished with no warning, no smartctl
> errors nothing... so 2 SSDs in 3 servers vanished in...2-3 weeks, after a
> 3-4 months of being in production (VMs/KVM/CloudStack)
>
> Mine were also Samsung 850 PRO 128GB.
>
>
>
> Best,
>
> Andrija
>
>
>
> On 4 September 2015 at 19:27, James (Fei) Liu-SSI <
> james@ssi.samsung.com> wrote:
>
> Hi Quentin and Andrija,
>
> Thanks so much for reporting the problems with Samsung.
>
>
>
> Would be possible to get to know your configuration of your system?  What
> kind of workload are you running?  Do you use Samsung SSD as separate
> journaling disk, right?
>
>
>
> Thanks so much.
>
>
>
> James
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Quentin Hartman
> *Sent:* Thursday, September 03, 2015 

Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Jan Schermer
>> We are seeing some significant I/O delays on the disks causing a “SCSI Task 
>> Abort” from the OS. This seems to be triggered by the drive receiving a 
>> “Synchronize cache command”.
>> 
>> 


How exactly do you know this is the cause? This is usually just an effect of 
something going wrong, and part of the error recovery process.
Preceding this event there should be the real error/root cause...

It is _supposedly_ safe to disable barriers in this scenario, but IMO the 
assumptions behind that are deeply flawed, and from what I've seen it is not 
necessary with fast drives (such as S3700).

Take a look in the mailing list archives, I elaborated on this quite a bit in 
the past, including my experience with Kingston drives + XFS + LSI (and the 
effect is present even on Intels, but because they are much faster it shouldn't 
cause any real problems).

Jan


> On 04 Sep 2015, at 21:55, Richard Bade  wrote:
> 
> Hi Everyone,
> 
> We have a Ceph pool that is entirely made up of Intel S3700/S3710 enterprise 
> SSD's.
> 
> We are seeing some significant I/O delays on the disks causing a “SCSI Task 
> Abort” from the OS. This seems to be triggered by the drive receiving a 
> “Synchronize cache command”.
> 
> My current thinking is that setting nobarriers in XFS will stop the drive 
> receiving a sync command and therefore stop the I/O delay associated with it.
> 
> In the XFS FAQ it looks like the recommendation is that if you have a Battery 
> Backed raid controller you should set nobarriers for performance reasons.
> 
> Our LSI card doesn’t have battery backed cache as it’s configured in HBA mode 
> (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor backed 
> cache though.
> 
> So is it recommended that barriers are turned off as the drive has a safe 
> cache (I am confident that the cache will write out to disk on power failure)?
> 
> Has anyone else encountered this issue?
> 
> Any info or suggestions about this would be appreciated. 
> 
> Regards,
> 
> Richard
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nova fails to download image from Glance backed with Ceph

2015-09-04 Thread Vasiliy Angapov
Thanks for the response!

There is plenty of free space on /var/lib/nova/instances on every compute host.
Glance image-download works as expected.
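
The glance check was along these lines (a sketch - the image UUID is a
placeholder); the second line is the compute-node rbd export Sebastien
suggested, assuming the Glance pool is named "images":

glance image-download --file /tmp/test.img <image-uuid>
rbd -p images export <image-uuid> /tmp/test.raw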

2015-09-04 21:27 GMT+08:00 Jan Schermer :
> Didn't you run out of space? Happened to me when a customer tried to create a 
> 1TB image...
>
> Z.
>
>> On 04 Sep 2015, at 15:15, Sebastien Han  wrote:
>>
>> Just to take away a possible issue from infra (LBs etc).
>> Did you try to download the image on the compute node? Something like rbd 
>> export?
>>
>>> On 04 Sep 2015, at 11:56, Vasiliy Angapov  wrote:
>>>
>>> Hi all,
>>>
>>> Not sure actually where does this bug belong to - OpenStack or Ceph -
>>> but writing here in humble hope that anyone faced that issue also.
>>>
>>> I configured test OpenStack instance with Glance images stored in Ceph
>>> 0.94.3. Nova has local storage.
>>> But when I'm trying to launch instance from large image stored in Ceph
>>> - it fails to spawn with such an error in nova-conductor.log:
>>>
>>> 2015-09-04 11:52:35.076 3605449 ERROR nova.scheduler.utils
>>> [req-c6af3eca-f166-45bd-8edc-b8cfadeb0d0b
>>> 82c1f134605e4ee49f65015dda96c79a 448cc6119e514398ac2793d043d4fa02 - -
>>> -] [instance: 18c9f1d5-50e8-426f-94d5-167f43129ea6] Error from last
>>> host: slpeah005 (node slpeah005.cloud): [u'Traceback (most recent call
>>> last):\n', u'  File
>>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2220,
>>> in _do_build_and_run_instance\nfilter_properties)\n', u'  File
>>> "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2363,
>>> in _build_and_run_instance\ninstance_uuid=instance.uuid,
>>> reason=six.text_type(e))\n', u'RescheduledException: Build of instance
>>> 18c9f1d5-50e8-426f-94d5-167f43129ea6 was re-scheduled: [Errno 32]
>>> Corrupt image download. Checksum was 625d0686a50f6b64e57b1facbc042248
>>> expected 4a7de2fbbd01be5c6a9e114df145b027\n']
>>>
>>> So nova tries 3 different hosts with the same error messages on every
>>> single one and then fails to spawn an instance.
>>> I've tried Cirros little image and it works fine with it. Issue
>>> happens with large images like 10Gb in size.
>>> I also managed to look into /var/lib/nova/instances/_base folder and
>>> found out that image is actually being downloaded but at some moment
>>> the download process interrupts for some unknown reason and instance
>>> gets deleted.
>>>
>>> I looked at the syslog and found many messages like that:
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735094
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.22 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735099
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.23 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735104
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.24 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735108
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.26 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>> Sep  4 12:51:37 slpeah003 ceph-osd: 2015-09-04 12:51:37.735118
>>> 7f092dfd1700 -1 osd.3 3025 heartbeat_check: no reply from osd.27 since
>>> back 2015-09-04 12:51:31.834203 front 2015-09-04 12:51:31.834203
>>> (cutoff 2015-09-04 12:51:32.735011)
>>>
>>> I've also tried to monitor nova-compute process file descriptors
>>> number but it is never more than 102. ("echo
>>> /proc/NOVA_COMPUTE_PID/fd/* | wc -w" like Jan advised in this ML).
>>> It also seems like problem appeared only in 0.94.3, in 0.94.2
>>> everything worked just fine!
>>>
>>> Would be very grateful for any help!
>>>
>>> Vasily.
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> Cheers.
>> 
>> Sébastien Han
>> Senior Cloud Architect
>>
>> "Always give 100%. Unless you're giving blood."
>>
>> Mail: s...@redhat.com
>> Address: 11 bis, rue Roquépine - 75008 Paris
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread Christian Balzer

Hello,

On Fri, 4 Sep 2015 12:30:12 -0300 German Anders wrote:

> Hi cephers,
> 
>I've the following scheme:
> 
> 7x OSD servers with:
>
Is this a new cluster, total initial deployment?

What else are these nodes made of, CPU/RAM/network?
While uniform nodes have some appeal (interchangeability; one node down
impacts the cluster uniformly), they tend to be compromise solutions.
I personally would go with separately optimized HDD and SSD nodes.

> 4x 800GB SSD Intel DC S3510 (OSD-SSD)
Only 0.3 DWPD, 450TB total over 5 years.
If you can correctly predict your write volume and it is below that per
SSD, fine. Personally, I'd use 3610s, with internal journals.

> 3x 120GB SSD Intel DC S3500 (Journals)
In this case the S3500 is an even worse choice: 3x 135MB/s is
nowhere near your likely network speed of 10Gb/s.

You will get vastly superior performance and endurance with two 200GB S3610s
(2x 230MB/s) or S3700s (2x 365MB/s).

Why the uneven number of journal SSDs?
You want uniform utilization and wear. 2 journal SSDs for 6 HDDs would be a
good ratio.

> 5x 3TB SAS disks (OSD-SAS)
>
See above, even numbers make a lot more sense.

> 
> The OSD servers are located on two separate Racks with two power circuits
> each.
> 
>I would like to know what is the best way to implement this.. use the
> 4x 800GB SSD like a SSD-pool, or used them us a Cache pool? or any other
> suggestion? Also any advice for the crush design?
> 
Nick touched on that already; for right now, SSD pools would definitely be
better.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd prepare btrfs

2015-09-04 Thread German Anders
Trying to do a prepare on a osd with btrfs, and getting this error:

[cibosd04][INFO  ] Running command: sudo ceph-disk -v prepare --cluster
ceph --fs-type btrfs -- /dev/sdc
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[cibosd04][WARNIN] INFO:ceph-disk:Will colocate journal with data on
/dev/sdc
[cibosd04][WARNIN] DEBUG:ceph-disk:Creating journal partition num 2 size
5120 on /dev/sdc
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk
--new=2:0:5120M --change-name=2:ceph journal
--partition-guid=2:2d7cd194-6185-4515-ae32-40b88524d03a
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc
[cibosd04][WARNIN] Invalid partition data!
[cibosd04][WARNIN] ceph-disk: Error: Command '['/sbin/sgdisk',
'--new=2:0:5120M', '--change-name=2:ceph journal',
'--partition-guid=2:2d7cd194-6185-4515-ae32-40b88524d03a',
'--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106', '--mbrtogpt', '--',
'/dev/sdc']' returned non-zero exit status 2
[cibosd04][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare
--cluster ceph --fs-type btrfs -- /dev/sdc
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs


I tried to format the device but with no luck. Any ideas?
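
Not confirmed in this thread, but "Invalid partition data!" from sgdisk
usually points to stale GPT/MBR metadata on the disk. A possible fix, using
/dev/sdc from the log above (destructive - it wipes the partition table):

  sgdisk --zap-all /dev/sdc
  # or let ceph-deploy do the wipe:
  ceph-deploy disk zap cibosd04:sdc

and then retry the prepare.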

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Andrija Panic
Hi James,

I had 3 Ceph nodes as follows: 12 OSDs (HDD) and 2 SSDs (6 journal
partitions on each SSD) - the SSDs just vanished with no warning, no smartctl
errors, nothing... so the 2 SSDs in each of the 3 servers vanished within 2-3
weeks, after 3-4 months of being in production (VMs/KVM/CloudStack)

Mine were also Samsung 850 PRO 128GB.

Best,
Andrija

On 4 September 2015 at 19:27, James (Fei) Liu-SSI  wrote:

> Hi Quentin and Andrija,
>
> Thanks so much for reporting the problems with Samsung.
>
>
>
> Would be possible to get to know your configuration of your system?  What
> kind of workload are you running?  Do you use Samsung SSD as separate
> journaling disk, right?
>
>
>
> Thanks so much.
>
>
>
> James
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Quentin Hartman
> *Sent:* Thursday, September 03, 2015 1:06 PM
> *To:* Andrija Panic
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
>
>
> Yeah, we've ordered some S3700's to replace them already. Should be here
> early next week. Hopefully they arrive before we have multiple nodes die at
> once and can no longer rebalance successfully.
>
>
>
> Most of the drives I have are the 850 Pro 128GB (specifically
> MZ7KE128HMGA)
>
> There are a couple 120GB 850 EVOs in there too, but ironically, none of
> them have pooped out yet.
>
>
>
> On Thu, Sep 3, 2015 at 1:58 PM, Andrija Panic 
> wrote:
>
I really advise removing the bastards before they die... no rebalancing
happening, just a temporary OSD down while replacing journals...
>
What size and model are your Samsungs?
>
> On Sep 3, 2015 7:10 PM, "Quentin Hartman" 
> wrote:
>
> We also just started having our 850 Pros die one after the other after
> about 9 months of service. 3 down, 11 to go... No warning at all, the drive
> is fine, and then it's not even visible to the machine. According to the
> stats in hdparm and the calcs I did they should have had years of life
> left, so it seems that ceph journals definitely do something they do not
> like, which is not reflected in their stats.
>
>
>
> QH
>
>
>
> On Wed, Aug 26, 2015 at 7:15 AM, 10 minus  wrote:
>
Hi,

We got a good deal on the 843T and we are using them in our OpenStack setup...
as journals.
They have been running for the last six months... No issues.

When we compared with Intel SSDs - I think it was the 3700 - they were a shade
slower for our workload and considerably cheaper.

We did not run any synthetic benchmark since we had a specific use case.

The performance was better than our old setup, so it was good enough.
>
> hth
>
>
>
> On Tue, Aug 25, 2015 at 12:07 PM, Andrija Panic 
> wrote:
>
We have some 850 Pro 256GB SSDs if anyone is interested in buying :)
>
And also there was a new 850 Pro firmware that broke people's disks and was
revoked later, etc... I'm sticking with only vacuum cleaners from Samsung
for now, maybe... :)
>
> On Aug 25, 2015 12:02 PM, "Voloshanenko Igor" 
> wrote:
>
To be honest, the Samsung 850 PRO is not a 24/7 series drive... it's more of a
desktop+ series, but anyway - the results from these drives are very, very bad
in any scenario acceptable in real life...
>
>
>
Possibly the 845 PRO is better, but we don't want to experiment anymore...
So we chose the S3500 240G. Yes, it's cheaper than the S3700 (about 2x), and
not as durable for writes, but we think it's better to replace 1 SSD per
year than to pay double the price now.
>
>
>
> 2015-08-25 12:59 GMT+03:00 Andrija Panic :
>
And should I mention that in another Ceph installation we had Samsung 850
Pro 128GB drives and all 6 SSDs died within a 2-month period - they simply
disappeared from the system, so not wear-out...
>
> Never again we buy Samsung :)
>
> On Aug 25, 2015 11:57 AM, "Andrija Panic"  wrote:
>
> First read please:
>
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
We are getting 200 IOPS, compared to the Intel S3500's 18,000 IOPS - those
are sustained performance numbers, meaning the test avoids the drive's cache
and runs for a longer period of time...
Also, if checking with fio you will get better latencies on the Intel S3500
(the model tested in our case) along with 20x better IOPS results...
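
For reference, the test from that blog post boils down to timing small
synchronous direct writes - the same pattern a Ceph journal generates. A
minimal sketch, where /dev/sdX is a placeholder for an unused device (this
will destroy data on it):

  # 4k writes with O_DIRECT and O_DSYNC, as a journal would issue them
  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

Enterprise drives like the S3500/S3700 sustain thousands of IOPS here, while
many consumer drives collapse to a few hundred.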
>
We observed the original issue as high speed at the beginning of e.g. a file
transfer inside a VM, which then halts to zero... We moved the journals back to
HDDs and performance was acceptable... now we are upgrading to the Intel S3500...
>
> Best
>
> any details on that ?
>
> On Tue, 25 Aug 2015 11:42:47 +0200, Andrija Panic
>  wrote:
>
> > Make sure you test whatever you decide. We just learned this the hard
> > way with Samsung 850 Pro, which is total crap, more than you could imagine...
> >
> > Andrija
> > On Aug 25, 2015 

Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread Nick Fisk
Hi German,

 

Are the power feeds completely separate (ie 4 feeds in total), or just each 
rack has both feeds? If it’s the latter I don’t see any benefit from including 
this into the crushmap and would just create a “rack” bucket. Also assuming 
your servers have dual PSU’s, this also changes the power failure scenarios 
quite a bit as well.

 

In regards to the pools, unless you know your workload will easily fit into a 
cache pool with room to spare, I would suggest not going down that route 
currently. Performance in many cases can actually end up being worse if you end 
up doing a lot of promotions.

 

*However* I’ve been doing a bit of testing with the current master and there 
are a lot of improvements around cache tiering that are starting to have a 
massive improvement on performance. If you can get by with just the SAS disks 
for now and make a more informed decision about the cache tiering when 
Infernalis is released then that might be your best bet.

 

Otherwise you might just be best using them as a basic SSD only Pool.

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: 04 September 2015 16:30
To: ceph-users 
Subject: [ceph-users] Best layout for SSD & SAS OSDs

 

Hi cephers,

   I've the following scheme:

7x OSD servers with:

4x 800GB SSD Intel DC S3510 (OSD-SSD)

3x 120GB SSD Intel DC S3500 (Journals)

5x 3TB SAS disks (OSD-SAS)

The OSD servers are located on two separate Racks with two power circuits each.

   I would like to know what is the best way to implement this: use the 4x 
800GB SSDs as an SSD pool, or use them as a cache pool? Or any other 
suggestion? Also, any advice for the CRUSH design?

Thanks in advance,




German




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread German Anders
Thanks a lot Nick. Regarding the power feeds, we only have two circuits for
all the racks, so I'll create the "rack" bucket in the CRUSH map and separate
the OSD servers across the rack buckets. Then, regarding the SSD pools, I've
installed the Hammer version and am wondering whether to upgrade to Infernalis
v9.0.3 and apply the SSD cache, or stay on Hammer and do the SSD pools, maybe
leaving two 800GB SSDs for later use as cache (1.6TB per OSD server). Do you
have a crushmap example for this type of config?
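
A minimal sketch of the relevant crushmap fragments - host/rack names, ids and
weights are made up, and it assumes separate per-rack SSD buckets feeding an
"ssd" root (the usual pre-Luminous way to split device classes):

  # extract, edit and re-inject the map:
  #   ceph osd getcrushmap -o crush.bin
  #   crushtool -d crush.bin -o crush.txt
  #   (edit crush.txt)
  #   crushtool -c crush.txt -o crush.new
  #   ceph osd setcrushmap -i crush.new

  root ssd {
          id -20
          alg straw
          hash 0
          item rack1-ssd weight 1.600
          item rack2-ssd weight 1.600
  }

  rule ssd_pool {
          ruleset 2
          type replicated
          min_size 1
          max_size 10
          step take ssd
          step chooseleaf firstn 0 type rack
          step emit
  }

with matching "rack" buckets (rack1-ssd etc.) containing the per-host SSD
buckets, an equivalent "sas" root and rule, and pools pointed at a rule via
"ceph osd pool set <pool> crush_ruleset 2".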

Thanks a lot,

Best regards,


*German*

2015-09-04 13:10 GMT-03:00 Nick Fisk :

> Hi German,
>
>
>
> Are the power feeds completely separate (ie 4 feeds in total), or just
> each rack has both feeds? If it’s the latter I don’t see any benefit from
> including this into the crushmap and would just create a “rack” bucket.
> Also assuming your servers have dual PSU’s, this also changes the power
> failure scenarios quite a bit as well.
>
>
>
> In regards to the pools, unless you know your workload will easily fit
> into a cache pool with room to spare, I would suggest not going down that
> route currently. Performance in many cases can actually end up being worse
> if you end up doing a lot of promotions.
>
>
>
> **However** I’ve been doing a bit of testing with the current master and
> there are a lot of improvements around cache tiering that are starting to
> have a massive improvement on performance. If you can get by with just the
> SAS disks for now and make a more informed decision about the cache tiering
> when Infernalis is released then that might be your best bet.
>
>
>
> Otherwise you might just be best using them as a basic SSD only Pool.
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 04 September 2015 16:30
> *To:* ceph-users 
> *Subject:* [ceph-users] Best layout for SSD & SAS OSDs
>
>
>
> Hi cephers,
>
>I've the following scheme:
>
> 7x OSD servers with:
>
> 4x 800GB SSD Intel DC S3510 (OSD-SSD)
>
> 3x 120GB SSD Intel DC S3500 (Journals)
>
> 5x 3TB SAS disks (OSD-SAS)
>
> The OSD servers are located on two separate Racks with two power circuits
> each.
>
> I would like to know what is the best way to implement this: use the
> 4x 800GB SSDs as an SSD pool, or use them as a cache pool? Or any other
> suggestion? Also, any advice for the CRUSH design?
>
> Thanks in advance,
>
>
> *German*
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Shinobu Kinjo
> IIRC, it only triggers the move (merge or split) when that folder is hit by a 
> request, so most likely it happens gradually.

Do you know what causes this?
I would like to be clearer on what "gradually" means here.

Shinobu

- Original Message -
From: "GuangYang" 
To: "Ben Hines" , "Nick Fisk" 
Cc: "ceph-users" 
Sent: Saturday, September 5, 2015 9:27:31 AM
Subject: Re: [ceph-users] Ceph performance, empty vs part full

IIRC, it only triggers the move (merge or split) when that folder is hit by a 
request, so most likely it happens gradually.

Another thing that might be helpful (and that we have had good experience with) 
is to do the folder splitting at pool creation time, so that we avoid the 
performance impact of runtime splitting (which is high if you have a large 
cluster). In order to do that:

1. You will need to configure "filestore merge threshold" with a negative value 
so that it disables merging.
2. When creating the pool, there is a parameter named "expected_num_objects"; 
by specifying that number, the folders will be split to the right level at 
pool creation.

Hope that helps.

Thanks,
Guang



> From: bhi...@gmail.com
> Date: Fri, 4 Sep 2015 12:05:26 -0700
> To: n...@fisk.me.uk
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
> a ticket to request a way to tell an OSD to rebalance its directory
> structure.
>
> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>> similar size to yours. I didn't see any merging happening, although most of 
>> the directories I looked at had more files in them than the new merge threshold, 
>> so I guess this is to be expected
>>
>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>> bring things back into order.
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Wang, Warren
>>> Sent: 04 September 2015 01:21
>>> To: Mark Nelson ; Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> I'm about to change it on a big cluster too. It totals around 30 million, 
>>> so I'm a
>>> bit nervous on changing it. As far as I understood, it would indeed move
>>> them around, if you can get underneath the threshold, but it may be hard to
>>> do. Two more settings that I highly recommend changing on a big prod
>>> cluster. I'm in favor of bumping these two up in the defaults.
>>>
>>> Warren
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Mark Nelson
>>> Sent: Thursday, September 03, 2015 6:04 PM
>>> To: Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> Hrm, I think it will follow the merge/split rules if it's out of whack 
>>> given the
>>> new settings, but I don't know that I've ever tested it on an existing 
>>> cluster to
>>> see that it actually happens. I guess let it sit for a while and then check 
>>> the
>>> OSD PG directories to see if the object counts make sense given the new
>>> settings? :D
>>>
>>> Mark
>>>
>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
 Hey Mark,

 I've just tweaked these filestore settings for my cluster -- after
 changing this, is there a way to make ceph move existing objects
 around to new filestore locations, or will this only apply to newly
 created objects? (i would assume the latter..)

 thanks,

 -Ben

 On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
>>> wrote:
> Basically for each PG, there's a directory tree where only a certain
> number of objects are allowed in a given directory before it splits
> into new branches/leaves. The problem is that this has a fair amount
> of overhead and also there's extra associated dentry lookups to get at any
>>> given object.
>
> You may want to try something like:
>
> "filestore merge threshold = 40"
> "filestore split multiple = 8"
>
> This will dramatically increase the number of objects per directory
>>> allowed.
>
> Another thing you may want to try is telling the kernel to greatly
> favor retaining dentries and inodes in cache:
>
> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>
> Mark
>
>
> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>
>> If I create a new pool it is generally fast for a short amount of time.
>> Not as fast as if I had a blank cluster, but close to.
>>
>> Bryn
>>>
>>> On 8 Jul 2015, 

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang

> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: ski...@redhat.com
> To: yguan...@outlook.com
> CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by 
>> a request, so most likely it happens gradually.
>
> Do you know what causes this?
A request (read/write/setxattr, etc.) hitting objects in that folder.
> I would like to be clearer on what "gradually" means here.
>
> Shinobu
>
> - Original Message -
> From: "GuangYang" 
> To: "Ben Hines" , "Nick Fisk" 
> Cc: "ceph-users" 
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a 
> request, so most likely it happens gradually.
>
> Another thing that might be helpful (and that we have had good experience 
> with) is to do the folder splitting at pool creation time, so that we avoid 
> the performance impact of runtime splitting (which is high if you have a 
> large cluster). In order to do that:
>
> 1. You will need to configure "filestore merge threshold" with a negative 
> value so that it disables merging.
> 2. When creating the pool, there is a parameter named "expected_num_objects"; 
> by specifying that number, the folders will be split to the right level at 
> pool creation.
>
> Hope that helps.
>
> Thanks,
> Guang
>
>
> 
>> From: bhi...@gmail.com
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: n...@fisk.me.uk
>> CC: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
>>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>>> similar size to yours. I didn't see any merging happening, although most of 
>>> the directories I looked at had more files in them than the new merge threshold, 
>>> so I guess this is to be expected
>>>
>>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>>> bring things back into order.
>>>
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Wang, Warren
 Sent: 04 September 2015 01:21
 To: Mark Nelson ; Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full

 I'm about to change it on a big cluster too. It totals around 30 million, 
 so I'm a
 bit nervous on changing it. As far as I understood, it would indeed move
 them around, if you can get underneath the threshold, but it may be hard to
 do. Two more settings that I highly recommend changing on a big prod
 cluster. I'm in favor of bumping these two up in the defaults.

 Warren

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Mark Nelson
 Sent: Thursday, September 03, 2015 6:04 PM
 To: Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full

 Hrm, I think it will follow the merge/split rules if it's out of whack 
 given the
 new settings, but I don't know that I've ever tested it on an existing 
 cluster to
 see that it actually happens. I guess let it sit for a while and then 
 check the
 OSD PG directories to see if the object counts make sense given the new
 settings? :D

 Mark

 On 09/03/2015 04:31 PM, Ben Hines wrote:
> Hey Mark,
>
> I've just tweaked these filestore settings for my cluster -- after
> changing this, is there a way to make ceph move existing objects
> around to new filestore locations, or will this only apply to newly
> created objects? (i would assume the latter..)
>
> thanks,
>
> -Ben
>
> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
 wrote:
>> Basically for each PG, there's a directory tree where only a certain
>> number of objects are allowed in a given directory before it splits
>> into new branches/leaves. The problem is that this has a fair amount
>> of overhead and also there's extra associated dentry lookups to get at 
>> any
 given object.
>>
>> You may want to try something like:
>>
>> "filestore merge threshold = 40"
>> "filestore split multiple = 8"
>>
>> This will dramatically increase the number of objects per directory
 

Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread Shinobu Kinjo
Very nice.
You're my hero!

 Shinobu

- Original Message -
From: "GuangYang" 
To: "Shinobu Kinjo" 
Cc: "Ben Hines" , "Nick Fisk" , "ceph-users" 

Sent: Saturday, September 5, 2015 9:40:06 AM
Subject: RE: [ceph-users] Ceph performance, empty vs part full


> Date: Fri, 4 Sep 2015 20:31:59 -0400
> From: ski...@redhat.com
> To: yguan...@outlook.com
> CC: bhi...@gmail.com; n...@fisk.me.uk; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
>> IIRC, it only triggers the move (merge or split) when that folder is hit by 
>> a request, so most likely it happens gradually.
>
> Do you know what causes this?
A request (read/write/setxattr, etc.) hitting objects in that folder.
> I would like to be clearer on what "gradually" means here.
>
> Shinobu
>
> - Original Message -
> From: "GuangYang" 
> To: "Ben Hines" , "Nick Fisk" 
> Cc: "ceph-users" 
> Sent: Saturday, September 5, 2015 9:27:31 AM
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> IIRC, it only triggers the move (merge or split) when that folder is hit by a 
> request, so most likely it happens gradually.
>
> Another thing that might be helpful (and that we have had good experience 
> with) is to do the folder splitting at pool creation time, so that we avoid 
> the performance impact of runtime splitting (which is high if you have a 
> large cluster). In order to do that:
>
> 1. You will need to configure "filestore merge threshold" with a negative 
> value so that it disables merging.
> 2. When creating the pool, there is a parameter named "expected_num_objects"; 
> by specifying that number, the folders will be split to the right level at 
> pool creation.
>
> Hope that helps.
>
> Thanks,
> Guang
>
>
> 
>> From: bhi...@gmail.com
>> Date: Fri, 4 Sep 2015 12:05:26 -0700
>> To: n...@fisk.me.uk
>> CC: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>
>> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
>> a ticket to request a way to tell an OSD to rebalance its directory
>> structure.
>>
>> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
>>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>>> similar size to yours. I didn't see any merging happening, although most of 
>>> the directories I looked at had more files in them than the new merge threshold, 
>>> so I guess this is to be expected
>>>
>>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>>> bring things back into order.
>>>
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Wang, Warren
 Sent: 04 September 2015 01:21
 To: Mark Nelson ; Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full

 I'm about to change it on a big cluster too. It totals around 30 million, 
 so I'm a
 bit nervous on changing it. As far as I understood, it would indeed move
 them around, if you can get underneath the threshold, but it may be hard to
 do. Two more settings that I highly recommend changing on a big prod
 cluster. I'm in favor of bumping these two up in the defaults.

 Warren

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Mark Nelson
 Sent: Thursday, September 03, 2015 6:04 PM
 To: Ben Hines 
 Cc: ceph-users 
 Subject: Re: [ceph-users] Ceph performance, empty vs part full

 Hrm, I think it will follow the merge/split rules if it's out of whack 
 given the
 new settings, but I don't know that I've ever tested it on an existing 
 cluster to
 see that it actually happens. I guess let it sit for a while and then 
 check the
 OSD PG directories to see if the object counts make sense given the new
 settings? :D

 Mark

 On 09/03/2015 04:31 PM, Ben Hines wrote:
> Hey Mark,
>
> I've just tweaked these filestore settings for my cluster -- after
> changing this, is there a way to make ceph move existing objects
> around to new filestore locations, or will this only apply to newly
> created objects? (i would assume the latter..)
>
> thanks,
>
> -Ben
>
> On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
 wrote:
>> Basically for each PG, there's a directory tree where only a certain
>> number of objects are allowed in a given directory before it splits
>> into new branches/leaves. The 

Re: [ceph-users] Re: which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Christian Balzer

Hello,

On Fri, 4 Sep 2015 22:37:06 + Межов Игорь Александрович wrote:

> Hi!
> 
> 
> Have worked with the Intel DC S3700 200GB. Due to budget restrictions, one
> SSD hosts a system volume and 1:12 OSD journals. 6 nodes, 120TB raw space.
>
Meaning you're limited to 360MB/s writes per node at best.
But yes, I do understand budget constraints. ^o^
 
> Cluster serves as RBD storage for ~100 VMs.
> 
> Not a single failure in a year - all devices are healthy.
> The remaining resource (by SMART) is ~92%.
> 
I use 1:2 or 1:3 journals and haven't made any dent into my 200GB S3700
yet.

> 
> Now we're trying the DC S3710 for journals.

As I wrote a few days ago, unless you go for the 400GB version, the 200GB
S3710 is actually slower (for journal purposes) than the S3700, as
sequential write speed is the key factor here.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-04 Thread Межов Игорь Александрович
Hi!


Have worked with the Intel DC S3700 200GB. Due to budget restrictions, one
SSD hosts a system volume and 1:12 OSD journals. 6 nodes, 120TB raw space.

Cluster serves as RBD storage for ~100 VMs.

Not a single failure in a year - all devices are healthy.
The remaining resource (by SMART) is ~92%.

Now we're trying the DC S3710 for journals.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS and nobarriers on Intel SSD

2015-09-04 Thread Richard Bade
Hi Jan,
Thanks for your response.


> *How exactly do you know this is the cause? This is usually just an effect
> of something going wrong and part of error recovery process.**Preceding
> this event should be the real error/root cause...*

We have been working with LSI/Avago to resolve this. We get a bunch of
these type log events:

2015-09-04T14:58:59.169677+12:00  ceph-osd: - ceph-osd:
2015-09-04 14:58:59.168444 7fbc5ec71700  0 log [WRN] : slow request
30.894936 seconds old, received at 2015-09-04 14:58:28.272976:
osd_op(client.42319583.0:1185218039
rbd_data.1d8a5a92eb141f2.56a0 [read 3579392~8192] 4.f9f016cb
ack+read e66603) v4 currently no flag points reached

Followed by the task abort I mentioned:
 sd 11:0:4:0: attempting task abort! scmd(8804c07d0480)
 sd 11:0:4:0: [sdf] CDB:
 Write(10): 2a 00 24 6f 01 a8 00 00 08 00
 scsi target11:0:4: handle(0x000d), sas_address(0x443322110400), phy(4)
 scsi target11:0:4: enclosure_logical_id(0x50030480), slot(4)
 sd 11:0:4:0: task abort: SUCCESS scmd(8804c07d0480)

LSI had us enable debugging on our card and send them many logs and
debugging data. Their response was:

Please do not send in the Synchronize cache command (35h). That’s the one
> preventing the drive from responding to read/write commands quickly enough.

A Synchronize cache command instructs the ATA device to flush the cache
> contents to medium and so while the disk is in the process of doing it,
> it’s probably causing the read/write commands to take longer time to
> complete.

LSI/Avago believe this to be the root cause of the IO delay based on the
debugging info.

*and from what I've seen it is not necessary with fast drives (such as
> S3700).*

While I agree with you that it should not be necessary as the S3700's
should be very fast, our current experience does not show this to be the
case.

Just a little more about our setup. We're using Ceph Firefly (0.80.10) on
Ubuntu 14.04. We see this same thing on every S3700/10 on four hosts. We do
not see this happening on the spinning disks in the same cluster but
different pool on similar hardware.

If you know of any other reason this may be happening, we would appreciate
it. Otherwise we will need to continue investigating the possibility of
setting nobarriers.
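
For completeness, a sketch of what that would look like - mount options are
illustrative, and this should only ever be done with power-loss-protected
drives like the S3700/S3710:

  # one-off, per mounted OSD:
  mount -o remount,nobarrier /var/lib/ceph/osd/ceph-12
  # or persistently via ceph.conf, so OSDs get mounted with it:
  [osd]
  osd mount options xfs = rw,noatime,inode64,nobarrier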

Regards,
Richard

On 5 September 2015 at 09:32, Jan Schermer  wrote:

> We are seeing some significant I/O delays on the disks causing a “SCSI
> Task Abort” from the OS. This seems to be triggered by the drive receiving
> a “Synchronize cache command”.
>
>
> How exactly do you know this is the cause? This is usually just an effect
> of something going wrong and part of error recovery process.
> Preceding this event should be the real error/root cause...
>
> It is _supposedly_ safe to disable barriers in this scenario, but IMO the
> assumptions behind that are deeply flawed, and from what I've seen it is
> not necessary with fast drives (such as S3700).
>
> Take a look in the mailing list archives, I elaborated on this quite a bit
> in the past, including my experience with Kingston drives + XFS + LSI (and
> the effect is present even on Intels, but because they are much faster it
> shouldn't cause any real problems).
>
> Jan
>
>
> On 04 Sep 2015, at 21:55, Richard Bade  wrote:
>
> Hi Everyone,
>
> We have a Ceph pool that is entirely made up of Intel S3700/S3710
> enterprise SSD's.
>
> We are seeing some significant I/O delays on the disks causing a “SCSI
> Task Abort” from the OS. This seems to be triggered by the drive receiving
> a “Synchronize cache command”.
>
> My current thinking is that setting nobarriers in XFS will stop the drive
> receiving a sync command and therefore stop the I/O delay associated with
> it.
>
> In the XFS FAQ it looks like the recommendation is that if you have a
> Battery Backed raid controller you should set nobarriers for performance
> reasons.
>
> Our LSI card doesn’t have battery backed cache as it’s configured in HBA
> mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor
> backed cache though.
>
> So is it recommended that barriers are turned off as the drive has a safe
> cache (I am confident that the cache will write out to disk on power
> failure)?
>
> Has anyone else encountered this issue?
>
> Any info or suggestions about this would be appreciated.
>
> Regards,
>
> Richard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance, empty vs part full

2015-09-04 Thread GuangYang
IIRC, it only triggers the move (merge or split) when that folder is hit by a 
request, so most likely it happens gradually.

Another thing that might be helpful (and that we have had good experience with) 
is to do the folder splitting at pool creation time, so that we avoid the 
performance impact of runtime splitting (which is high if you have a large 
cluster). In order to do that:

1. You will need to configure "filestore merge threshold" with a negative value 
so that it disables merging.
2. When creating the pool, there is a parameter named "expected_num_objects"; 
by specifying that number, the folders will be split to the right level at 
pool creation.
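
A concrete sketch of those two steps - the pool name, PG counts and object
count below are illustrative:

  # ceph.conf, [osd] section - a negative value disables merging:
  filestore merge threshold = -10

  # create the pool with an expected object count so folders are pre-split:
  ceph osd pool create mypool 2048 2048 replicated replicated_ruleset 100000000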

Hope that helps.

Thanks,
Guang



> From: bhi...@gmail.com
> Date: Fri, 4 Sep 2015 12:05:26 -0700
> To: n...@fisk.me.uk
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>
> Yeah, I'm not seeing stuff being moved at all. Perhaps we should file
> a ticket to request a way to tell an OSD to rebalance its directory
> structure.
>
> On Fri, Sep 4, 2015 at 5:08 AM, Nick Fisk  wrote:
>> I've just made the same change ( 4 and 40 for now) on my cluster which is a 
>> similar size to yours. I didn't see any merging happening, although most of 
>> the directories I looked at had more files in them than the new merge threshold, 
>> so I guess this is to be expected
>>
>> I'm currently splitting my PG's from 1024 to 2048 to see if that helps to 
>> bring things back into order.
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Wang, Warren
>>> Sent: 04 September 2015 01:21
>>> To: Mark Nelson ; Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> I'm about to change it on a big cluster too. It totals around 30 million, 
>>> so I'm a
>>> bit nervous on changing it. As far as I understood, it would indeed move
>>> them around, if you can get underneath the threshold, but it may be hard to
>>> do. Two more settings that I highly recommend changing on a big prod
>>> cluster. I'm in favor of bumping these two up in the defaults.
>>>
>>> Warren
>>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Mark Nelson
>>> Sent: Thursday, September 03, 2015 6:04 PM
>>> To: Ben Hines 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Ceph performance, empty vs part full
>>>
>>> Hrm, I think it will follow the merge/split rules if it's out of whack 
>>> given the
>>> new settings, but I don't know that I've ever tested it on an existing 
>>> cluster to
>>> see that it actually happens. I guess let it sit for a while and then check 
>>> the
>>> OSD PG directories to see if the object counts make sense given the new
>>> settings? :D
>>>
>>> Mark
>>>
>>> On 09/03/2015 04:31 PM, Ben Hines wrote:
 Hey Mark,

 I've just tweaked these filestore settings for my cluster -- after
 changing this, is there a way to make ceph move existing objects
 around to new filestore locations, or will this only apply to newly
 created objects? (i would assume the latter..)

 thanks,

 -Ben

 On Wed, Jul 8, 2015 at 6:39 AM, Mark Nelson 
>>> wrote:
> Basically for each PG, there's a directory tree where only a certain
> number of objects are allowed in a given directory before it splits
> into new branches/leaves. The problem is that this has a fair amount
> of overhead and also there's extra associated dentry lookups to get at any
>>> given object.
>
> You may want to try something like:
>
> "filestore merge threshold = 40"
> "filestore split multiple = 8"
>
> This will dramatically increase the number of objects per directory
>>> allowed.
>
> Another thing you may want to try is telling the kernel to greatly
> favor retaining dentries and inodes in cache:
>
> echo 1 | sudo tee /proc/sys/vm/vfs_cache_pressure
>
> Mark
>
>
> On 07/08/2015 08:13 AM, MATHIAS, Bryn (Bryn) wrote:
>>
>> If I create a new pool it is generally fast for a short amount of time.
>> Not as fast as if I had a blank cluster, but close to.
>>
>> Bryn
>>>
>>> On 8 Jul 2015, at 13:55, Gregory Farnum  wrote:
>>>
>>> I think you're probably running into the internal PG/collection
>>> splitting here; try searching for those terms and seeing what your
>>> OSD folder structures look like. You could test by creating a new
>>> pool and seeing if it's faster or slower than the one you've already 
>>> filled
>>> up.
>>> -Greg
>>>
>>> On Wed, Jul 8, 2015 at 1:25 PM, MATHIAS, Bryn (Bryn)
>>> 

Re: [ceph-users] libvirt rbd issue

2015-09-04 Thread Rafael Lopez
We don't have thousands, but these RBDs are in a pool backed by ~600-ish OSDs.

I can see the fd count is up well past 10k, closer to 15k, when I use a
decent number of RBDs (e.g. 16 or 32), and it seems to increase more the bigger
the file I write. Procs are almost 30k when writing a 50GB file across that
number of OSDs.

The change in qemu.conf worked for me, using RHEL 7.1 with systemd.


On 3 September 2015 at 19:46, Jan Schermer  wrote:

> You're like the 5th person here (including me) that was hit by this.
>
> Could I get some input from someone using CEPH with RBD and thousands of
> OSDs? How high did you have to go?
>
> I only have ~200 OSDs and I had to bump the limit up to 1 for VMs that
> have multiple volumes attached, this doesn't seem right? I understand this
> is the effect of striping a volume accross multiple PGs, but shouldn't this
> be more limited or somehow garbage collected?
>
> And to get deeper - I suppose there will be one connection from QEMU to
> OSD for each NCQ queue? Or how does this work? blk-mq will likely be
> different again... Or is it decoupled from the virtio side of things by RBD
> cache if that's enabled?
>
> Anyway, out of the box, at least on OpenStack installations
> 1) anyone having more than a few OSDs should really bump this up by
> default.
> 2) librbd should handle this situation gracefully by recycling
> connections, instead of hanging
> 3) at least we should get a warning somewhere (in the libvirt/qemu log) -
> I don't think there's anything when the issue hits
>
> Should I make tickets for this?
>
> Jan
>
> On 03 Sep 2015, at 02:57, Rafael Lopez  wrote:
>
> Hi Jan,
>
> Thanks for the advice, hit the nail on the head.
>
> I checked the limits and watched the no. of fd's and as it reached the
> soft limit (1024) thats when the transfer came to a grinding halt and the
> vm started locking up.
>
> After your reply I also did some more googling and found another old
> thread:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-December/026187.html
>
> I increased the max_files in qemu.conf and restarted libvirtd and the VM
> (as per Dan's solution in thread above), and now it seems to be happy
> copying any size files to the rbd. Confirmed the fd count is going past the
> previous soft limit of 1024 also.
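
For anyone else hitting this, the setting lives in /etc/libvirt/qemu.conf
(the value below is illustrative - size it to your OSD count):

  # raise the open-file limit for qemu processes
  max_files = 32768

followed by a restart of libvirtd and a full stop/start of the guest.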
>
> Thanks again!!
> Raf
>
> On 2 September 2015 at 18:44, Jan Schermer  wrote:
>
>> 1) Take a look at the number of file descriptors the QEMU process is
>> using, I think you are over the limits
>>
>> pid=pid of qemu process
>>
>> cat /proc/$pid/limits
>> echo /proc/$pid/fd/* | wc -w
>>
>> 2) Jumbo frames may be the cause, are they enabled on the rest of the
>> network? In any case, get rid of NetworkManager ASAP and set it manually,
>> though it looks like your NIC might not support them.
>>
>> Jan
>>
>>
>>
>> > On 02 Sep 2015, at 01:44, Rafael Lopez  wrote:
>> >
>> > Hi ceph-users,
>> >
>> > Hoping to get some help with a tricky problem. I have a rhel7.1 VM
>> guest (host machine also rhel7.1) with root disk presented from ceph
>> 0.94.2-0 (rbd) using libvirt.
>> >
>> > The VM also has a second rbd for storage presented from the same ceph
>> cluster, also using libvirt.
>> >
>> > The VM boots fine, no apparent issues with the OS root rbd. I am able
>> to mount the storage disk in the VM, and create a file system. I can even
>> transfer small files to it. But when I try to transfer a moderate size
>> files, eg. greater than 1GB, it seems to slow to a grinding halt and
>> eventually it locks up the whole system, and generates the kernel messages
>> below.
>> >
>> > I have googled some *similar* issues around, but haven't come across
>> some solid advice/fix. So far I have tried modifying the libvirt disk cache
>> settings, tried using the latest mainline kernel (4.2+), different file
>> systems (ext4, xfs, zfs) all produce similar results. I suspect it may be
>> network related, as when I was using the mainline kernel I was transferring
>> some files to the storage disk and this message came up, and the transfer
>> seemed to stop at the same time:
>> >
>> > Sep  1 15:31:22 nas1-rds NetworkManager[724]: 
>> [1441085482.078646] [platform/nm-linux-platform.c:2133] sysctl_set():
>> sysctl: failed to set '/proc/sys/net/ipv6/conf/eth0/mtu' to '9000': (22)
>> Invalid argument
>> >
>> > I think maybe the key info to troubleshooting is that it seems to be OK
>> for files under 1GB.
>> >
>> > Any ideas would be appreciated.
>> >
>> > Cheers,
>> > Raf
>> >
>> >
>> > Sep  1 16:04:15 nas1-rds kernel: INFO: task kworker/u8:1:60 blocked for
>> more than 120 seconds.
>> > Sep  1 16:04:15 nas1-rds kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > Sep  1 16:04:15 nas1-rds kernel: kworker/u8:1D 88023fd93680
>>  060  2 0x
>> > Sep  1 16:04:15 nas1-rds kernel: Workqueue: writeback
>> bdi_writeback_workfn (flush-252:80)
>> > Sep  1 

Re: [ceph-users] high density machines

2015-09-04 Thread Gurvinder Singh
On 09/04/2015 02:31 AM, Wang, Warren wrote:
> In the minority on this one. We have a number of the big SM 72 drive units w/ 
> 40 Gbe. Definitely not as fast as even the 36 drive units, but it isn't awful 
> for our average mixed workload. We can exceed all available performance with 
> some workloads though.
> 
> So while we can't extract all the performance out of the box, as long as we 
> don't max out on performance, the cost is very appealing,
I am wondering how much of a cost difference you have seen with the SM
72-drive unit compared to, let's say,
http://www.supermicro.com/products/system/1U/6017/SYS-6017R-73THDP_.cfm
or any other smaller machine you have compared it with. From the
discussion on this thread it is clear that the 72-drive box is actually
4 x 18-drive boxes sharing power and cooling. Regarding performance, I
think the network might be the bottleneck (maybe the CPU too), as it is 40Gbit
for the whole box, so you get 10Gbit per node (18 drives each), which can be
maxed out.
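
Rough numbers behind that, assuming ~100MB/s sustained per spinner:

  echo "18 * 100" | bc        # ~1800 MB/s of raw disk bandwidth per 18-drive node
  echo "10 * 1000 / 8" | bc   # ~1250 MB/s usable on a 10Gbit link

so the link saturates before the disks do.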

Gurvinder
 and as far as filling a unit, I'm not sure how many folks have filled
big prod clusters, but you really don't want them even running into the
70+% range due to some inevitable uneven filling, and room for failure.
> 
> Also, I'm betting that Ceph will continue to optimize things like the 
> messenger, and reduce some of the massive CPU and TCP overhead, so we can 
> claw back performance. I would love to see a thread count reduction. These 
> can see over 130K threads per box.
> 
> Warren
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
> Nelson
> Sent: Thursday, September 03, 2015 3:58 PM
> To: Gurvinder Singh ; 
> ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] high density machines
> 
> 
> 
> On 09/03/2015 02:49 PM, Gurvinder Singh wrote:
>> Thanks everybody for the feedback.
>> On 09/03/2015 05:09 PM, Mark Nelson wrote:
>>> My take is that you really only want to do these kinds of systems if 
>>> you have massive deployments.  At least 10 of them, but probably more 
>>> like
>>> 20-30+.  You do get massive density with them, but I think if you are
>>> considering 5 of these, you'd be better off with 10 of the 36 drive 
>>> units.  An even better solution might be ~30-40 of these:
>>>
>>> http://www.supermicro.com/products/system/1U/6017/SYS-6017R-73THDP_.c
>>> fm
>>>
>> This one does look interesting.
>>> An extremely compelling solution would be if they took this system:
>>>
>>> http://www.supermicro.com/products/system/1U/5018/SSG-5018A-AR12L.cfm
>>> ?parts=SHOW
>>>
>> This one can be a really good solution for archiving purposes with a 
>> replaced CPU to get more juice into it.
>>>
>>> and replaced the C2750 with a Xeon-D 1540 (but keep the same number 
>>> of SATA ports).
>>>
>>> Potentially you could have:
>>>
>>> - 8x 2.0GHz Xeon Broadwell-DE Cores, 45W TDP
>>> - Up to 128GB RAM (32GB probably the sweet spot)
>>> - 2x 10GbE
>>> - 12x 3.5" spinning disks
>>> - single PCIe slot for PCIe SSD/NVMe
>> I am wondering whether a single PCIe SSD/NVMe device can support 12 OSD 
>> journals and still perform the same as 4 OSDs per SSD?
> 
> Basically the limiting factor is how fast the device can do O_DSYNC writes.  
> We've seen that some PCIe SSD and NVME devices can do 1-2GB/s depending on 
> the capacity which is enough to reasonably support 12-24 OSDs.  Whether or 
> not it's good to have a single PCIe card to be a point of failure is a 
> worthwhile topic (Probably only high write endurance cards should be 
> considered).  There are plenty of other things that can bring the node down 
> too though (motherboard, ram, cpu, etc) though.  A single node failure will 
> also have less impact if there are lots of small nodes vs a couple big ones.
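
A way to measure that O_DSYNC write ceiling on a candidate device - a sketch,
where the device name is a placeholder and the test is destructive:

  fio --name=journal-test --filename=/dev/nvme0n1 --rw=write --bs=4k \
      --direct=1 --sync=1 --iodepth=1 --numjobs=1 --runtime=60 --time_based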
> 
>>>
>>> The density would be higher than the 36 drive units but lower than 
>>> the
>>> 72 drive units (though with shorter rack depth afaik).
>> You mean the 1U solution with 12 disks is longer than the 72-disk 
>> 4U version?
> 
> Sorry, the other way around I believe.
> 
>>
>> - Gurvinder
>>Probably more
>>> CPU per OSD and far better distribution of OSDs across servers.  
>>> Given that the 10GbE and processor are embedded on the motherboard, 
>>> there's a decent chance these systems could be priced reasonably and 
>>> wouldn't have excessive power/cooling requirements.
>>>
>>> Mark
>>>
>>> On 09/03/2015 09:13 AM, Jan Schermer wrote:
 It's not exactly a single system

 SSG-F618H-OSD288P*
 4U-FatTwin, 4x 1U 72TB per node, Ceph-OSD-Storage Node

 This could actually be pretty good, it even has decent CPU power.

 I'm not a big fan of blades and blade-like systems - sooner or later 
 a backplane will die and you'll need to power off everything, which 
 is a huge PITA.
 But assuming you get 3 of these it could be pretty cool!
 It would be interesting to have a price comparison to a SC216 
 chassis or similar, I'm afraid it won't be