Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
 On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com wrote:
 Hi Cephers,

 Have anyone of you decided to put Giant into production instead of Firefly?

This is very interesting to me too: we are going to deploy a large
ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
the rbd module in Ubuntu Trusty doesn't seem compatible with giant:

feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
210

I tried different ceph osd tunables, but nothing seems to fix the issue.

However, this cluster will be mainly used for OpenStack, and qemu is
able to access the rbd volume, so this might not be a big problem for
me.

.a.

-- 
antonio.s.mess...@gmail.com
antonio.mess...@uzh.ch +41 (0)44 635 42 22
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] weird 'ceph-deploy disk list nodename' command output, Invalid partition data

2014-12-05 Thread 张帆

hi, all


When I run the command 'ceph-deploy disk list nodename', there are some warning 
messages indicating a partition table error, but the ceph cluster is working 
normally.
What is the problem? Should I run an sgdisk command to repair the partition table? 


Below are the warning messages:


ceph@controller-11:~$ ceph-deploy disk list c13
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.5): /usr/bin/ceph-deploy disk list c13
[c13][DEBUG ] connected to host: c13
[c13][DEBUG ] detect platform information from remote host
[c13][DEBUG ] detect machine type
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 13.04 raring
[ceph_deploy.osd][DEBUG ] Listing disks on c13...
[c13][INFO  ] Running command: sudo ceph-disk list
[c13][DEBUG ] /dev/sda :
[c13][DEBUG ]  /dev/sda1 other, ext4, mounted on /
[c13][DEBUG ]  /dev/sda2 other, ext3
[c13][DEBUG ] /dev/sdb :
[c13][DEBUG ]  /dev/sdb1 ceph journal, for /dev/sdc1
[c13][DEBUG ]  /dev/sdb2 ceph journal, for /dev/sdd1
[c13][DEBUG ]  /dev/sdb3 ceph journal, for /dev/sde1
[c13][DEBUG ]  /dev/sdb4 ceph journal, for /dev/sdf1
[c13][DEBUG ] /dev/sdc :
[c13][DEBUG ]  /dev/sdc1 ceph data, active, cluster ceph, osd.1, journal 
/dev/sdb1
[c13][DEBUG ] /dev/sdd :
[c13][DEBUG ]  /dev/sdd1 ceph data, active, cluster ceph, osd.6, journal 
/dev/sdb2
[c13][DEBUG ] /dev/sde :
[c13][DEBUG ]  /dev/sde1 ceph data, active, cluster ceph, osd.7, journal 
/dev/sdb3
[c13][DEBUG ] /dev/sdf :
[c13][DEBUG ]  /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal 
/dev/sdb4
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating 
main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 
'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating 
main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 
'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 2 /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating 
main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 
'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sda
[c13][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating 
main header
[c13][WARNIN] from backup!
[c13][WARNIN]
[c13][WARNIN] Warning! Main and backup partition tables differ! Use the 'c' and 
'e' options
[c13][WARNIN] on the recovery & transformation menu to examine the two tables.
[c13][WARNIN]
[c13][WARNIN] Warning! One or more CRCs don't match. You should repair the disk!
[c13][WARNIN]
[c13][WARNIN] Invalid partition data!
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 2 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 3 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 4 /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdb
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdc
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdc
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/blkid -s TYPE /dev/sdc1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o  -- 
/dev/sdc1 /var/lib/ceph/tmp/mnt.bNqfD1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/umount -- 
/var/lib/ceph/tmp/mnt.bNqfD1
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdd
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -p /dev/sdd
[c13][WARNIN] INFO:ceph-disk:Running command: /sbin/blkid -s TYPE /dev/sdd1
[c13][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t xfs -o  -- 
/dev/sdd1 /var/lib/ceph/tmp/mnt.k913Cm
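
In case it matters, the repair I am considering (only a rough sketch based on
the gdisk recovery menu the warning mentions - I have not run it yet, since
/dev/sda is the root disk) would be something like:

# save the current table and do a read-only verify first
sudo sgdisk --backup=/root/sda-gpt.bak /dev/sda
sudo sgdisk -v /dev/sda

# if the backup table really is the good one, rebuild the main table from it
# via gdisk's recovery & transformation menu (the 'c' option the warning
# refers to), and only write after verifying:
sudo gdisk /dev/sda
#   r  ->  c  ->  v  ->  w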

Re: [ceph-users] Giant osd problems - loss of IO

2014-12-05 Thread Andrei Mikhailovsky
Jake, 

Very useful indeed. 

It looks like I had a similar problem with the heartbeats and, as you have 
mentioned, I've not seen such issues on Firefly. However, I've not seen any osd 
crashes. 

Could you please let me know where you got the sysctl.conf tunings from? Were 
they recommended by the network vendor? 

Also, did you make similar sysctl.conf changes to your host servers? 

A while ago I read the tuning guide for IP over InfiniBand, and Mellanox 
recommends setting something like this: 

net.ipv4.tcp_timestamps = 0 
net.ipv4.tcp_sack = 1 
net.core.netdev_max_backlog = 25 
net.core.rmem_max = 4194304 
net.core.wmem_max = 4194304 
net.core.rmem_default = 4194304 
net.core.wmem_default = 4194304 
net.core.optmem_max = 4194304 
net.ipv4.tcp_rmem = 4096 87380 4194304 
net.ipv4.tcp_wmem = 4096 65536 4194304 
net.ipv4.tcp_mem =4194304 4194304 4194304 
net.ipv4.tcp_low_latency=1 

which is what I have. Not sure if these are optimal. 
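
For what it's worth, I apply them the usual way, roughly like this (the file
name under /etc/sysctl.d is just what I happen to use):

# /etc/sysctl.d/90-ipoib.conf (or appended to /etc/sysctl.conf)
net.ipv4.tcp_timestamps = 0
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
# ... and so on for the rest of the values above ...

# reload without rebooting
sudo sysctl -p /etc/sysctl.d/90-ipoib.conf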

I can see that the values are pretty conservative compared to yours. I guess my 
values should be different as I am running a 40gbit/s network with IPoIB. The 
actual throughput on IPoIB is about 20gbit/s according to iperf and the like. 

Andrei 

- Original Message -

 From: Jake Young jak3...@gmail.com
 To: Andrei Mikhailovsky and...@arhont.com
 Cc: ceph-users@lists.ceph.com
 Sent: Thursday, 4 December, 2014 4:57:47 PM
 Subject: Re: [ceph-users] Giant osd problems - loss of IO

 On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky 
 and...@arhont.com  wrote:
 
  Any other suggestions why several osds are going down on Giant and
  causing IO to stall? This was not happening on Firefly.
 
  Thanks
 
 

 I had a very similar problem to yours, which started after upgrading
 from Firefly to Giant and then later adding two new osd nodes, with
 7 osds on each.

 My cluster originally had 4 nodes, with 7 osds on each node, 28 osds
 total, running Giant. I did not have any problems at this time.

 My problems started after adding two new nodes, so I had 6 nodes and
 42 total osds. It would run fine on low load, but when the request
 load increased, osds started to fall over.

 I was able to set the debug_ms to 10 and capture the logs from a
 failed OSD. There were a few different reasons the osds were going
 down. This example shows it terminating normally for an unspecified
 reason a minute after it notices it is marked down in the map.
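
 (For reference, bumping the debug level can be done at runtime with
 injectargs, something along these lines, and dropped back down afterwards:)

 ceph tell osd.35 injectargs '--debug_ms 10'
 # then watch /var/log/ceph/ceph-osd.35.log and revert with:
 ceph tell osd.35 injectargs '--debug_ms 0'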

 Osd 25 actually marks this osd (osd 35) down. For some reason many
 osds cannot communicate with each other.

 There are other examples where I see the heartbeat_check: no reply
 from osd.blah message for long periods of time (hours) and neither
 osd crashes nor terminates.

 2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
 16:26:46.772608)
 2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
 16:26:47.772759)
 2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
 16:26:48.772982)
 2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
 16:26:49.559891)
 2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
 16:26:49.773173)
 2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
 16:26:50.773299)
 2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
 16:26:51.261554)
 2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
 16:26:51.773504)
 2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
 16:26:52.773733)
 2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
 16:26:53.773876)
 2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679
 heartbeat_check: no reply from osd.25 since back 2014-12-01
 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
 16:26:54.163366)
 2014-12-01 16:27:14.507632 7f8b4fb7f700 0 

[ceph-users] AWS SDK and MultiPart Problem

2014-12-05 Thread Georgios Dimitrakakis

Hi all!

I am using AWS SDK JS v.2.0.29 to perform a multipart upload into 
Radosgw with ceph version 0.80.7 
(6c0127fcb58008793d3c8b62d925bc91963672a3) and I am getting a 403 error.



I believe that the id which is sent with all the requests, and which has been 
urlencoded by the aws-sdk-js, doesn't match the one in rados because that one 
is not urlencoded.


Is that the case? Can you confirm it?

Is there something I can do?


Regards,

George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] AWS SDK and MultiPart Problem

2014-12-05 Thread Georgios Dimitrakakis
For example, if I try to perform the same multipart upload on an older 
version, ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60),



I can see the upload ID in the apache log as:

PUT 
/test/.dat?partNumber=25&uploadId=I3yihBFZmHx9CCqtcDjr8d-RhgfX8NW 
HTTP/1.1 200 - - aws-sdk-nodejs/2.0.29 linux/v0.10.33


but when I try the same at ceph version 0.80.7 
(6c0127fcb58008793d3c8b62d925bc91963672a3)


I get the following:

PUT 
/test/.dat?partNumber=12&uploadId=2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36 
HTTP/1.1 403 78 - aws-sdk-nodejs/2.0.29 linux/v0.10.33



and my guess is that the %2F in the latter is what is causing 
the problem and hence the 403 error.
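
Just to illustrate what I mean: presumably the uploadId radosgw handed back
contains a literal '/', which the SDK then percent-encodes. A quick round-trip
with Python 2's urllib shows the two forms:

$ python -c "import urllib; print urllib.quote('2/f9UgnHhdK0VCnMlpT-XA8ttia1HjK36', safe='')"
2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36
$ python -c "import urllib; print urllib.unquote('2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36')"
2/f9UgnHhdK0VCnMlpT-XA8ttia1HjK36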




What do you think???


Best,

George




Hi all!

I am using AWS SDK JS v.2.0.29 to perform a multipart upload into
Radosgw with ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3) and I am getting a 403
error.


I believe that the id which is send to all requests and has been
urlencoded by the aws-sdk-js doesn't match with the one in rados
because it's not urlencoded.

Is that the case? Can you confirm it?

Is there something I can do?


Regards,

George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Chinese translation of Ceph Documentation

2014-12-05 Thread Drunkard Zhang
Hi,

I have migrated my Chinese translation from PDF to the official Ceph
doc build system. Just replace doc/ in the ceph repository with this repo
and the build should work. The official doc build guide should work for
this too.

Old PDF: https://github.com/drunkard/docs_zh
New: https://github.com/drunkard/ceph-doc-zh_CN
The html output: https://github.com/drunkard/docs_zh/tree/master/output/html
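
Roughly, using the translated tree looks like this (assuming the usual
admin/build-doc script from the ceph repository):

git clone https://github.com/ceph/ceph.git
git clone https://github.com/drunkard/ceph-doc-zh_CN.git
rm -rf ceph/doc
cp -a ceph-doc-zh_CN ceph/doc
cd ceph && ./admin/build-doc    # needs python2 for now, see issue 1 below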

There are a lot of changes to sync with mainline since my last update;
I will do that in my spare time :)

There's some issues to resolve:

1. The build system doesn't support python3 yet, so if you are using
python3 as the default interpreter, you should switch to python2
temporarily.

2. Building the man pages will fail; changes are needed to support
non-ASCII encoding. The HTML version is fine.

Anyway, I'm improving it ;) Any help is welcome!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] AWS SDK and MultiPart Problem

2014-12-05 Thread Georgios Dimitrakakis

It would be nice to see where and how uploadId

is being calculated...


Thanks,


George



For example if I try to perform the same multipart upload at an older
version ceph version 0.72.2 
(a913ded2ff138aefb8cb84d347d72164099cfd60)



I can see the upload ID in the apache log as:

PUT
/test/.dat?partNumber=25&uploadId=I3yihBFZmHx9CCqtcDjr8d-RhgfX8NW
HTTP/1.1 200 - - aws-sdk-nodejs/2.0.29 linux/v0.10.33

but when I try the same at ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3)

I get the following:

PUT

/test/.dat?partNumber=12&uploadId=2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36
HTTP/1.1 403 78 - aws-sdk-nodejs/2.0.29 linux/v0.10.33


and my guess is that the %2F at the latter is the one that is
causing the problem and hence the 403 error.



What do you think???


Best,

George




Hi all!

I am using AWS SDK JS v.2.0.29 to perform a multipart upload into
Radosgw with ceph version 0.80.7
(6c0127fcb58008793d3c8b62d925bc91963672a3) and I am getting a 403
error.


I believe that the id which is send to all requests and has been
urlencoded by the aws-sdk-js doesn't match with the one in rados
because it's not urlencoded.

Is that the case? Can you confirm it?

Is there something I can do?


Regards,

George

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com wrote:
 Hi Cephers,

 Have anyone of you decided to put Giant into production instead of Firefly?

This is very interesting to me too: we are going to deploy a large
ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
the rbd module in Ubuntu Trusty doesn't seem compatible with giant:

feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
210

I tried different ceph osd tunables, but nothing seems to fix the issue.

However, this cluster will be mainly used for OpenStack, and qemu is
able to access the rbd volume, so this might not be a big problem for
me.

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Encoding Chunks

2014-12-05 Thread Nick Fisk
Hi All,

 

Does anybody have any input on the best ratio and total number of data +
coding chunks to choose?

 

For example I could create a pool with 7 data chunks and 3 coding chunks and
get an efficiency of 70%, or I could create a pool with 17 data chunks and 3
coding chunks and get an efficiency of 85% with a similar probability of
protecting against OSD failure.

 

What's the reason I would choose 10 total chunks over 20 chunks? Is it
purely down to the overhead of having potentially double the number of
chunks per object?

 

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Nick Fisk
This is probably due to the Kernel RBD client not being recent enough. Have
you tried upgrading your kernel to a newer version? 3.16 should contain all
the relevant features required by Giant. 


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Antonio Messina
Sent: 05 December 2014 09:37
To: Anthony Alba
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Giant or Firefly for production

On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com
wrote:
 Hi Cephers,

 Have anyone of you decided to put Giant into production instead of
Firefly?

This is very interesting to me too: we are going to deploy a large ceph
cluster on Ubuntu 14.04 LTS, and so far what I have found is that the rbd
module in Ubuntu Trusty doesn't seem compatible with giant:

feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
210

I tried with different ceph osd tunables but nothing seems to fix the
issue

However, this cluster will be mainly used for OpenStack, and qemu is able to
access the rbd volume, so this might not be a big problem for me.

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread David Moreau Simard
What are the kernel versions involved?

We have Ubuntu precise clients talking to an Ubuntu trusty cluster without 
issues - with tunables optimal.
0.88 (Giant) and 0.89 have been working well for us as far as the client and 
OpenStack are concerned.

This link provides some insight as to the possible problems:
http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client

Things to look for:
- Kernel versions
- Cache tiering
- Tunables
- hashpspool
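
A quick way to check most of these on a running cluster (client kernel aside):

ceph osd crush show-tunables        # tunables profile / feature requirements
ceph osd dump | grep ^pool          # pool flags (hashpspool) and cache tier settings
uname -r                            # kernel version, on clients and cluster nodes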

--
David Moreau Simard


 On Dec 5, 2014, at 4:36 AM, Antonio Messina antonio.mess...@s3it.uzh.ch 
 wrote:
 
 On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com wrote:
 Hi Cephers,
 
 Have anyone of you decided to put Giant into production instead of Firefly?
 
 This is very interesting to me too: we are going to deploy a large
 ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
 the rbd module in Ubuntu Trusty doesn't seem compatible with giant:
 
feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
 210
 
 I tried with different ceph osd tunables but nothing seems to fix the issue
 
 However, this cluster will be mainly used for OpenStack, and qemu is
 able to access the rbd volume, so this might not be a big problem for
 me.
 
 .a.
 
 -- 
 antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
 antonio.s.mess...@gmail.com
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Virtual machines using RBD remount read-only on OSD slow requests

2014-12-05 Thread Paulo Almeida
Hi,

I recently e-mailed ceph-users about a problem with virtual machine RBD
disks remounting read-only because of OSD slow requests[1]. I just
wanted to report that although I'm still seeing OSDs from one particular
machine going down sometimes (probably some hardware problem on that
node), the virtual machines haven't been remounting their disks
read-only. 

I can't be sure of the cause, because I didn't do any controlled tests
(or any tests at all), but one thing I changed was the
osd_recovery_op_priority, from the default 10 to 5. I had seen some
suggestions on this list regarding that parameter and they may well have
been useful in my case.
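
For reference, the change amounts to something like this (ceph.conf for
persistence, injectargs for the already running OSDs):

# ceph.conf, [osd] section
osd recovery op priority = 5

# applied to running daemons without a restart
ceph tell osd.* injectargs '--osd_recovery_op_priority 5'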

Cheers,
Paulo

[1]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044887.html

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
On Fri, Dec 5, 2014 at 4:25 PM, David Moreau Simard dmsim...@iweb.com wrote:
 What are the kernel versions involved ?

 We have Ubuntu precise clients talking to a Ubuntu trusty cluster without 
 issues - with tunables optimal.
 0.88 (Giant) and 0.89 has been working well for us as far the client and 
 Openstack are concerned.

 This link provides some insight as to the possible problems:

Both servers and clients are Ubuntu Trusty. Kernel versions are a bit different:

client: 3.13.0-39-generic #66
server: 3.13.0-32-generic #57
ceph version on both: 0.87

 http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client

 Things to look for:
 - Kernel versions
 - Cache tiering
 - Tunables
 - hashpspool

I have already read the blogpost, but I don't have much experience
with tunables.
From what I understood I am missing:

* CEPH_FEATURE_CRUSH_TUNABLES3
* CEPH_FEATURE_CRUSH_V2

but I don't know how to disable them, and I can't see them set in the
crushmap I get from ceph osd getcrushmap
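
(I assume the binary map has to be decompiled before the tunable lines become
visible, something like the following, but I may be missing something:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # the "tunable ..." lines are at the top
ceph osd crush show-tunables -f json-pretty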

.a.


-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
On Fri, Dec 5, 2014 at 4:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 This is probably due to the Kernel RBD client not being recent enough. Have
 you tried upgrading your kernel to a newer version? 3.16 should contain all
 the relevant features required by Giant.

I would rather tune the tunables, as upgrading the kernel would
require a reboot of the client.
Besides, Ubuntu Trusty does not provide a 3.16 kernel, so I would need
to recompile...

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread James Devine
http://kernel.ubuntu.com/~kernel-ppa/mainline/

I'm running 3.17 on my trusty clients without issue

On Fri, Dec 5, 2014 at 9:37 AM, Antonio Messina antonio.mess...@s3it.uzh.ch
 wrote:

 On Fri, Dec 5, 2014 at 4:25 PM, Nick Fisk n...@fisk.me.uk wrote:
  This is probably due to the Kernel RBD client not being recent enough.
 Have
  you tried upgrading your kernel to a newer version? 3.16 should contain
 all
  the relevant features required by Giant.

 I would rather tune the tunables, as upgrading the kernel would
 require a reboot of the client.
 Besides, Ubuntu Trusty does not provide a 3.16 kernel, so I would need
 to recompile...

 .a.

 --
 antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
 antonio.s.mess...@gmail.com
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Nick Fisk
Ok, sorry - I thought you had a need for some of the features in Giant; using
tunables is probably easier in that case.

However, if you do want to upgrade, there are debs available:

http://kernel.ubuntu.com/~kernel-ppa/mainline/

and I believe 3.16 should be available in the 14.04.2 release, which should
be released early next year.
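
Installing one of the mainline builds is roughly the following (the package
names are placeholders - pick the matching 3.16.x amd64 debs from that page):

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/<3.16.x-build>/linux-image-<version>-generic_<version>_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/<3.16.x-build>/linux-headers-<version>_<version>_all.deb
sudo dpkg -i linux-image-*.deb linux-headers-*.deb
sudo reboot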

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Antonio Messina
Sent: 05 December 2014 15:38
To: Nick Fisk
Cc: ceph-users@lists.ceph.com; Antonio Messina
Subject: Re: [ceph-users] Giant or Firefly for production

On Fri, Dec 5, 2014 at 4:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 This is probably due to the Kernel RBD client not being recent enough. 
 Have you tried upgrading your kernel to a newer version? 3.16 should 
 contain all the relevant features required by Giant.

I would rather tune the tunables, as upgrading the kernel would require a
reboot of the client.
Besides, Ubuntu Trusty does not provide a 3.16 kernel, so I would need to
recompile...

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
Thank you James and Nick,

On Fri, Dec 5, 2014 at 4:46 PM, Nick Fisk n...@fisk.me.uk wrote:
 Ok sorry, I thought you had a need for some of the features in Giant, using
 tunables is probably easier in that case.

I'm not sure :) I never played with the tunables before (still running
a testbed only)

I will test it again with 14.04.2 and the default kernel at the beginning of
next year. I prefer to use the official kernel for the production
cluster, but since it's going to be deployed in Q1-Q2 next year I should
be safe.

.a.

 However if you do want to upgrade there are debs available:-

 http://kernel.ubuntu.com/~kernel-ppa/mainline/

 and I believe 3.16 should be available in the 14.04.2 release, which should
 be released early next year.

 Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Antonio Messina
 Sent: 05 December 2014 15:38
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com; Antonio Messina
 Subject: Re: [ceph-users] Giant or Firefly for production

 On Fri, Dec 5, 2014 at 4:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 This is probably due to the Kernel RBD client not being recent enough.
 Have you tried upgrading your kernel to a newer version? 3.16 should
 contain all the relevant features required by Giant.

 I would rather tune the tunables, as upgrading the kernel would require a
 reboot of the client.
 Besides, Ubuntu Trusty does not provide a 3.16 kernel, so I would need to
 recompile...

 .a.

 --
 antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
 antonio.s.mess...@gmail.com
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Sage Weil
On Fri, 5 Dec 2014, Antonio Messina wrote:
 On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com wrote:
  Hi Cephers,
 
  Have anyone of you decided to put Giant into production instead of Firefly?
 
 This is very interesting to me too: we are going to deploy a large
 ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
 the rbd module in Ubuntu Trusty doesn't seem compatible with giant:
 
 feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
 210

Can you attach the output of 

 ceph osd crush show-tunables -f json-pretty
 ceph osd crush dump -f json-pretty

Thanks!
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-12-05 Thread David Moreau Simard
I've flushed everything - data, pools, configs and reconfigured the whole thing.

I was particularly careful with cache tiering configurations (almost leaving 
defaults when possible) and it's not locking anymore.
It looks like the cache tiering configuration I had was causing the problem? I 
can't put my finger on exactly what or why, and I don't have the luxury of time 
to do this lengthy testing again.

Here's what I dumped as far as config goes before wiping:

# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; 
do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type 
hit_set_period hit_set_count target_max_objects target_max_bytes 
cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age 
cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van


And now:

# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; 
do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type 
hit_set_period hit_set_count target_max_objects target_max_bytes 
cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age 
cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1500
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van


Crush map hasn't really changed before and after.

FWIW, the benchmarks I pulled out of the setup: 
https://gist.github.com/dmsimard/2737832d077cfc5eff34
Definite overhead going from krbd to krbd + LIO...
--
David Moreau Simard


 On Nov 20, 2014, at 4:14 PM, Nick Fisk n...@fisk.me.uk wrote:
 
 Here you go:-
 
 Erasure Profile
 k=2
 m=1
 plugin=jerasure
 ruleset-failure-domain=osd
 ruleset-root=hdd
 technique=reed_sol_van
 
 Cache Settings
 hit_set_type: bloom
 hit_set_period: 3600
 hit_set_count: 1
 target_max_objects
 target_max_objects: 0
 target_max_bytes: 10
 cache_target_dirty_ratio: 0.4
 cache_target_full_ratio: 0.8
 cache_min_flush_age: 0
 cache_min_evict_age: 0
 
 Crush Dump
 # begin crush map
 tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50
 tunable chooseleaf_descend_once 1
 
 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3
 
 # types
 type 0 osd
 type 1 host
 type 2 chassis
 type 3 rack
 type 4 row
 type 5 pdu
 type 6 pod
 type 7 room
 type 8 datacenter
 type 9 region
 type 10 root
 
 # buckets
 host ceph-test-hdd {
id -5   # do not change unnecessarily
# weight 2.730
alg straw
hash 0  # rjenkins1
item osd.1 weight 0.910
item osd.2 weight 0.910
item osd.0 weight 0.910
 }
 root hdd {
id -3   # do not change unnecessarily
# weight 2.730
alg straw
hash 0  # rjenkins1
item ceph-test-hdd weight 2.730
 }
 host ceph-test-ssd {
id -6   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item osd.3 weight 1.000
 }
 root ssd {
id -4   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item ceph-test-ssd weight 1.000
 }
 
 # rules
 rule hdd {
ruleset 0
type replicated
min_size 0
max_size 10
step take hdd
step chooseleaf firstn 0 type osd
step emit
 }
 rule ssd {
ruleset 1
type replicated
min_size 0
max_size 4
step take ssd
step chooseleaf firstn 0 type osd
step emit
 }
 rule ecpool {
ruleset 2
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take hdd
step chooseleaf indep 0 type osd
step emit
 }
 
 
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 David Moreau Simard
 Sent: 20 November 2014 20:03
 

Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
On Fri, Dec 5, 2014 at 4:59 PM, Sage Weil s...@newdream.net wrote:
 On Fri, 5 Dec 2014, Antonio Messina wrote:
 On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com wrote:
  Hi Cephers,
 
  Have anyone of you decided to put Giant into production instead of Firefly?

 This is very interesting to me too: we are going to deploy a large
 ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
 the rbd module in Ubuntu Trusty doesn't seem compatible with giant:

 feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
 210

 Can you attach the output of


I modified the crushmap and set:

tunable chooseleaf_vary_r 0

(it was 1 before)
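
(Roughly, via the usual dump / decompile / edit / recompile / inject cycle,
something like:)

ceph osd getcrushmap -o cm.bin
crushtool -d cm.bin -o cm.txt
# edit cm.txt:  tunable chooseleaf_vary_r 1  ->  tunable chooseleaf_vary_r 0
crushtool -c cm.txt -o cm.new
ceph osd setcrushmap -i cm.new
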
Now the cluster is rebalancing, and since it's on crappy hardware it is
taking some time.

I'm pasting the output of the two commands, but please keep in mind
that this is the output *after* I've updated the chooseleaf_vary_r
tunable.

  ceph osd crush show-tunables -f json-pretty

{ choose_local_tries: 0,
  choose_local_fallback_tries: 0,
  choose_total_tries: 50,
  chooseleaf_descend_once: 1,
  profile: bobtail,
  optimal_tunables: 0,
  legacy_tunables: 0,
  require_feature_tunables: 1,
  require_feature_tunables2: 1,
  require_feature_tunables3: 0,
  has_v2_rules: 1,
  has_v3_rules: 0}

  ceph osd crush dump -f json-pretty

I'm attaching it as a text file, as it is quite big and unreadable.
However, from the output I see the following tunables:

  tunables: { choose_local_tries: 0,
  choose_local_fallback_tries: 0,
  choose_total_tries: 50,
  chooseleaf_descend_once: 1,
  profile: bobtail,
  optimal_tunables: 0,
  legacy_tunables: 0,
  require_feature_tunables: 1,
  require_feature_tunables2: 1,
  require_feature_tunables3: 0,
  has_v2_rules: 1,
  has_v3_rules: 0}}

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland


crushmap.json
Description: application/json
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
Hi all, just an update

After setting chooseleaf_vary_r to 0 _and_ removing a pool with
erasure coding, I was able to run rbd map.

Thank you all for the help

.a.

On Fri, Dec 5, 2014 at 5:07 PM, Antonio Messina
antonio.mess...@s3it.uzh.ch wrote:
 On Fri, Dec 5, 2014 at 4:59 PM, Sage Weil s...@newdream.net wrote:
 On Fri, 5 Dec 2014, Antonio Messina wrote:
 On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com 
 wrote:
  Hi Cephers,
 
  Have anyone of you decided to put Giant into production instead of 
  Firefly?

 This is very interesting to me too: we are going to deploy a large
 ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
 the rbd module in Ubuntu Trusty doesn't seem compatible with giant:

 feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
 210

 Can you attach the output of


 I modified the crushmap and set:

 tunable chooseleaf_vary_r 0

 (it was 1 before)
 Now the cluster is rebalancing, and since it's on crappy hardware is
 taking some time.

 I'm pasting the output of the two commands, but please keep in mind
 that this is the output *after* I've updated the chooseleaf_vary_r
 tunable.

  ceph osd crush show-tunables -f json-pretty

 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: bobtail,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1,
   require_feature_tunables3: 0,
   has_v2_rules: 1,
   has_v3_rules: 0}

  ceph osd crush dump -f json-pretty

 I'm attaching it as a text file, as it is quite big and unreadable.
 However, from the output I see the following tunables:

   tunables: { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: bobtail,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1,
   require_feature_tunables3: 0,
   has_v2_rules: 1,
   has_v3_rules: 0}}

 .a.

 --
 antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
 antonio.s.mess...@gmail.com
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland



-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Encoding Chunks

2014-12-05 Thread Loic Dachary


On 05/12/2014 16:21, Nick Fisk wrote: Hi All,
 
  
 
 Does anybody have any input on what the best ratio + total numbers of Data + 
 Coding chunks you would choose?
 
  
 
 For example I could create a pool with 7 data chunks and 3 coding chunks and 
 get an efficiency of 70%, or I could create a pool with 17 data chunks and 3 
 coding chunks and get an efficiency of 85% with a similar probability of 
 protecting against OSD failure.
 
  
 
 What’s the reason I would choose 10 total chunks over 20 chunks, is it purely 
 down to the overhead of having potentially double the number of chunks per 
 object?

Hi Nick,

Assuming you have a large number of OSDs (a thousand or more) with cold data, 20 
is probably better. When you try to read the data it involves 20 OSDs instead 
of 10, but you probably don't care if reads are rare. 

Disclaimer : I'm a developer not an architect ;-) It would help to know the 
target use case, the size of the data set and the expected read/write rate.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Sage Weil
On Fri, 5 Dec 2014, Antonio Messina wrote:
 On Fri, Dec 5, 2014 at 4:59 PM, Sage Weil s...@newdream.net wrote:
  On Fri, 5 Dec 2014, Antonio Messina wrote:
  On Fri, Dec 5, 2014 at 2:24 AM, Anthony Alba ascanio.al...@gmail.com 
  wrote:
   Hi Cephers,
  
   Have anyone of you decided to put Giant into production instead of 
   Firefly?
 
  This is very interesting to me too: we are going to deploy a large
  ceph cluster on Ubuntu 14.04 LTS, and so far what I have found is that
  the rbd module in Ubuntu Trusty doesn't seem compatible with giant:
 
  feature set mismatch, my 4a042a42 < server's 2104a042a42, missing
  210
 
  Can you attach the output of
 
 
 I modified the crushmap and set:
 
 tunable chooseleaf_vary_r 0
 
 (it was 1 before)
 Now the cluster is rebalancing, and since it's on crappy hardware is
 taking some time.
 
 I'm pasting the output of the two commands, but please keep in mind
 that this is the output *after* I've updated the chooseleaf_vary_r
 tunable.
 
   ceph osd crush show-tunables -f json-pretty
 
 { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: bobtail,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1,
   require_feature_tunables3: 0,
   has_v2_rules: 1,
   has_v3_rules: 0}

The v2 rule means you have a crush rule for erasure coding.  Do you have 
an EC pool in your cluster?

The tunables3 feature bit is set because you set the vary_r parameter.

If you want older kernels to talk to the cluster, you need to avoid the 
new tunables and features!

sage


 
   ceph osd crush dump -f json-pretty
 
 I'm attaching it as a text file, as it is quite big and unreadable.
 However, from the output I see the following tunables:
 
   tunables: { choose_local_tries: 0,
   choose_local_fallback_tries: 0,
   choose_total_tries: 50,
   chooseleaf_descend_once: 1,
   profile: bobtail,
   optimal_tunables: 0,
   legacy_tunables: 0,
   require_feature_tunables: 1,
   require_feature_tunables2: 1,
   require_feature_tunables3: 0,
   has_v2_rules: 1,
   has_v3_rules: 0}}
 
 .a.
 
 -- 
 antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
 antonio.s.mess...@gmail.com
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant or Firefly for production

2014-12-05 Thread Antonio Messina
On Fri, Dec 5, 2014 at 5:24 PM, Sage Weil s...@newdream.net wrote:
 The v2 rule means you have a crush rule for erasure coding.  Do you have
 an EC pool in your cluster?

Yes indeed. I didn't know an EC pool was incompatible with the current
kernel; I had only tested it with rados bench and VMs, I guess.

 The tunables3 feature bit is set because you set the vary_r parameter.

I don't really know where this comes from. I think at a certain
point I ran ceph osd crush tunables optimal, which probably added
vary_r, but then I ran ceph osd crush tunables firefly and it
didn't remove it... is that normal?

 If you want older kernels to talk to the cluster, you need to avoid the
 new tunables and features!

Well, as I said, I'm not a ceph expert; I didn't even know I had enabled
features that the distribution's kernel does not support.

I guess the problem is that I am using packages from the ceph.com
repo, while the kernel comes from Ubuntu.

However, it's at least curious that when I was running firefly from the
ubuntu repositories I could create an EC pool, even though the kernel was not
compatible with EC pools...

.a.

-- 
antonio.mess...@s3it.uzh.ch +41 (0)44 635 42 22
antonio.s.mess...@gmail.com
S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Virtual machines using RBD remount read-only on OSD slow requests

2014-12-05 Thread Haomai Wang
I hope you can provide more runtime info, such as logs.

On Fri, Dec 5, 2014 at 11:32 PM, Paulo Almeida
palme...@igc.gulbenkian.pt wrote:
 Hi,

 I recently e-mailed ceph-users about a problem with virtual machine RBD
 disks remounting read-only because of OSD slow requests[1]. I just
 wanted to report that although I'm still seeing OSDs from one particular
 machine going down sometimes (probably some hardware problem on that
 node), the virtual machines haven't been remounting their disks
 read-only.

 I can't be sure of the cause, because I didn't do any controlled tests
 (or any tests at all), but one thing I changed was the
 osd_recovery_op_priority, from the default 10 to 5. I had seen some
 suggestions on this list regarding that parameter and they may well have
 been useful in my case.

 Cheers,
 Paulo

 [1]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044887.html

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Encoding Chunks

2014-12-05 Thread Loic Dachary


On 05/12/2014 17:41, Nick Fisk wrote:
 Hi Loic,
 
 Thanks for your response.
 
 The idea for this cluster will be for our VM Replica storage in our
 secondary site. Initially we are planning to have a 40 disk EC pool sitting
 behind a cache pool of around 1TB post replica size.
 
 This storage will be presented as RBD's and then exported as a HA iSCSI
 target to ESX hosts. The VM's will be replicated from our primary site via a
 software product called Veeam.
 
 I'm hoping that the 1TB cache layer should be big enough to hold most of the
 hot data meaning that the EC pool shouldn't see a large amount of IO, just
 the trickle of the cache layer flushing back to disk. We can switch back to
 a 3 way replica pool if the EC pool doesn't work out for us, but we are
 interested in testing out the EC technology.
 
 I hope that provides an insight to what I am trying to achieve.

When the erasure coded object has to be promoted back to the replicated pool, 
you want that to happen as fast as possible. The read will return when all 6 
OSDs give their data chunk to the primary OSD (holding the 7th chunk). The 6 
reads happen in parallel and will complete when the slowest OSD returns. If you 
have 16 OSDs instead of 6 you increase the odds of slowing the whole read down 
because one of them is significantly slower than the others. If you have 40 
OSDs you probably don't need a sophisticated monitoring system detecting hard 
drive misbehavior, so a slow disk could go unnoticed and degrade your 
performance significantly because more than a third of the objects use it 
(each object uses 20 OSDs in total, 17 of which hold data you need to 
promote to the replicated pool). If you had over 1000 OSDs, you would probably 
need to monitor the hard drives accurately, detect slow OSDs sooner and move 
them out of the cluster. And only a fraction of the objects would be impacted 
by a slow OSD. 

I would love to hear what an architect would advise.

Cheers


 
 Thanks,
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Loic Dachary
 Sent: 05 December 2014 16:23
 To: Nick Fisk; 'Ceph Users'
 Subject: Re: [ceph-users] Erasure Encoding Chunks
 
 
 
 On 05/12/2014 16:21, Nick Fisk wrote: Hi All,

  

 Does anybody have any input on what the best ratio + total numbers of Data
 + Coding chunks you would choose?

  

 For example I could create a pool with 7 data chunks and 3 coding chunks
 and get an efficiency of 70%, or I could create a pool with 17 data chunks
 and 3 coding chunks and get an efficiency of 85% with a similar probability
 of protecting against OSD failure.

  

 What’s the reason I would choose 10 total chunks over 20 chunks, is it
 purely down to the overhead of having potentially double the number of
 chunks per object?
 
 Hi Nick,
 
 Assuming you have a large number of OSD (a thousand or more) with cold data,
 20 is probably better. When you try to read the data it involves 20 OSDs
 instead of 10 but you probably don't care if reads are rare. 
 
 Disclaimer : I'm a developer not an architect ;-) It would help to know the
 target use case, the size of the data set and the expected read/write rate.
 
 Cheers
 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] AWS SDK and MultiPart Problem

2014-12-05 Thread Yehuda Sadeh
It looks like a bug. Can you open an issue on tracker.ceph.com,
describing what you see?

Thanks,
Yehuda

On Fri, Dec 5, 2014 at 7:17 AM, Georgios Dimitrakakis
gior...@acmac.uoc.gr wrote:
 It would be nice to see where and how uploadId

 is being calculated...


 Thanks,


 George



 For example if I try to perform the same multipart upload at an older
 version ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)


 I can see the upload ID in the apache log as:

 PUT
 /test/.dat?partNumber=25&uploadId=I3yihBFZmHx9CCqtcDjr8d-RhgfX8NW
 HTTP/1.1 200 - - aws-sdk-nodejs/2.0.29 linux/v0.10.33

 but when I try the same at ceph version 0.80.7
 (6c0127fcb58008793d3c8b62d925bc91963672a3)

 I get the following:

 PUT

 /test/.dat?partNumber=12&uploadId=2%2Ff9UgnHhdK0VCnMlpT-XA8ttia1HjK36
 HTTP/1.1 403 78 - aws-sdk-nodejs/2.0.29 linux/v0.10.33


 and my guess is that the %2F at the latter is the one that is
 causing the problem and hence the 403 error.



 What do you think???


 Best,

 George



 Hi all!

 I am using AWS SDK JS v.2.0.29 to perform a multipart upload into
 Radosgw with ceph version 0.80.7
 (6c0127fcb58008793d3c8b62d925bc91963672a3) and I am getting a 403
 error.


 I believe that the id which is send to all requests and has been
 urlencoded by the aws-sdk-js doesn't match with the one in rados
 because it's not urlencoded.

 Is that the case? Can you confirm it?

 Is there something I can do?


 Regards,

 George

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] experimental features

2014-12-05 Thread Sage Weil
A while back we merged Haomai's experimental OSD backend KeyValueStore.  
We named the config option 'keyvaluestore_dev', hoping to make it clear to 
users that it was still under development, not fully tested, and not yet 
ready for production.  In retrospect, I don't think '_dev' was 
sufficiently scary because many users tried it and ran into 
unexpected trouble.

There are several other features we've recently added or are considering 
adding that fall into this category.  Having them in the tree is great 
because it streamlines QA and testing, but I want to make sure that 
users are not able to enable the features without being aware of the 
risks.

A few possible suggestions:

- scarier option names, like

  osd objectstore = keyvaluestore_experimental_danger_danger
  ms type = async_experimental_danger_danger
  ms type = xio_experimental_danger_danger

  Once the feature becomes stable, they'll have to adjust their 
config, or we'll need to support both names going forward.

- a separate config option that allows any experimental option

  allow experimental features danger danger = true
  osd objectstore = keyvaluestore
  ms type = xio

  This runs the risk that the user will enable experimental features to 
get X, and later start using Y without realizing Y is also 
experimental.

- enumerate experimental options we want to enable

  allow experimental features danger danger = keyvaluestore, xio
  ms type = xio
  osd objectstore = keyvaluestore

  This has the property that no config change is necessary when the 
feature drops its experimental status.

In all of these cases, we can also make a point of sending something to 
the log on daemon startup.  I don't think too many people will notice 
this, but it is better than nothing.

Other ideas?
sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] experimental features

2014-12-05 Thread David Champion
* On 05 Dec 2014, Sage Weil wrote: 
 adding that fall into this category.  Having them in the tree is great 
 because it streamlines QA and testing, but I want to make sure that 
 users are not able to enable the features without being aware of the 
 risks.
 
 A few possible suggestions:
 
 - scarier option names, like
 - a separate config option that allows any experimental option
 - enumerate experiemntal options we want to enable
 Other ideas?

A separate config file for experimental options:
/etc/ceph/danger-danger.conf

-- 
   David Champion • d...@uchicago.edu • University of Chicago
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] experimental features

2014-12-05 Thread Mark Nelson



On 12/05/2014 11:39 AM, Gregory Farnum wrote:

On Fri, Dec 5, 2014 at 9:36 AM, Sage Weil sw...@redhat.com wrote:

A while back we merged Haomai's experimental OSD backend KeyValueStore.
We named the config option 'keyvaluestore_dev', hoping to make it clear to
users that it was still under development, not fully tested, and not yet
ready for production.  In retrospect, I don't think '_dev' was
sufficiently scary because many users tried it and ran into
unexpectd trouble.

There are several other features we've recently added or are considering
adding that fall into this category.  Having them in the tree is great
because it streamlines QA and testing, but I want to make sure that
users are not able to enable the features without being aware of the
risks.

A few possible suggestions:

- scarier option names, like

   osd objectstore = keyvaluestore_experimental_danger_danger
   ms type = async_experimental_danger_danger
   ms type = xio_experimental_danger_danger

   Once the feature becomes stable, they'll have to adjust their
config, or we'll need to support both names going forward.

- a separate config option that allows any experimental option

   allow experimental features danger danger = true
   osd objectstore = keyvaluestore
   ms type = xio

   This runs the risk that the user will enable experimental features to
get X, and later start using Y without realizing Y is also
experiemental.

- enumerate experiemntal options we want to enable

   allow experimental features danger danger = keyvaluestore, xio
   ms type = xio
   osd objectstore = keyvaluestore

   This has the property that no config change is necessary when the
feature drops its experimental status.

In all of these cases, we can also make a point of sending something to
the log on daemon startup.  I don't think too many people will notice
this, but it is better than nothing.

Other ideas?


I don't think these should even be going into release packages for
users to work with. We can build them on the dev gitbuilders for QA
and testing without them ever reaching the hands of users grabbing our
production packages. ;)
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



I'm in favor of the 'allow experimental features' approach, but instead call it:

ALLOW UNRECOVERABLE DATA CORRUPTING FEATURES which makes things a 
little more explicit. With great power comes great responsibility.


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
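
Combining Mark's wording with the enumerated form from Sage's third option, a hypothetical ceph.conf fragment might read as follows; the option name here is illustrative only, not an existing setting:

   [global]
   allow unrecoverable data corrupting features = keyvaluestore, xio
   ms type = xio
   osd objectstore = keyvaluestore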


Re: [ceph-users] experimental features

2014-12-05 Thread Mark Nelson

On 12/05/2014 11:47 AM, David Champion wrote:

* On 05 Dec 2014, Sage Weil wrote:

adding that fall into this category.  Having them in the tree is great
because it streamlines QA and testing, but I want to make sure that
users are not able to enable the features without being aware of the
risks.

A few possible suggestions:

- scarier option names, like
- a separate config option that allows any experimental option
- enumerate experimental options we want to enable
Other ideas?


A separate config file for experimental options:
/etc/ceph/danger-danger.conf



One of the questions I have in this is once you've enabled experimental 
features, should the cluster be considered experimental forever, even 
after the feature has become stable?  Maybe some kind of subtle 
corruption has worked its way in that will take a while to manifest.  It 
seems to me like if you've enabled experimental features on a cluster 
that all bets are off.


It seems to me like having the features in a separate ceph.conf file 
would imply that you just get rid of the danger.conf file and things are 
back to normal, but that's not really how it is imho.


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Radosgw with SSL enabled

2014-12-05 Thread lakshmi k s
Hello -

I have rados gateway setup working with http. But when I enable SSL on gateway 
node, I am having trouble making successful swift requests over https.

root@hrados:~# swift -V 1.0 -A https://hrados1.ex.com/auth/v1.0 -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list
[Errno bad handshake] [('SSL routines', 'SSL3_GET_SERVER_CERTIFICATE', 'certificate verify failed')]

Output of CURL command is as follows.

root@hrados:~# curl --insecure -X GET -i -H X-Auth-Key:8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR -H X-Auth-User:s3User:swiftUser https://hrados1.ex.com/auth/v1.0
HTTP/1.1 204 No Content
Date: Fri, 05 Dec 2014 17:53:58 GMT
Server: Apache/2.4.10 (Debian)
X-Storage-Url: https://hrados1.ex.com/swift/v1
X-Storage-Token: AUTH_rgwtk10007333557365723a737769667455736572961633914ab868f0b6428354483a6b08fc254e33b1283ed9f428c61436aa05c0f44069d8
X-Auth-Token: AUTH_rgwtk10007333557365723a737769667455736572961633914ab868f0b6428354483a6b08fc254e33b1283ed9f428c61436aa05c0f44069d8
Content-Type: application/json

Appreciate your help.

Thanks,
Lakshmi.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
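
The 'certificate verify failed' error above usually means the swift client does not trust the gateway's certificate (for example, a self-signed one), which matches curl only succeeding with --insecure. A hedged sketch of two common workarounds, assuming python-swiftclient; the CA bundle path is a placeholder:

   # skip certificate verification (testing only)
   swift --insecure -V 1.0 -A https://hrados1.ex.com/auth/v1.0 \
       -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list

   # or point the client at the CA that signed the gateway certificate
   swift --os-cacert /etc/ssl/certs/hrados1-ca.pem -V 1.0 \
       -A https://hrados1.ex.com/auth/v1.0 \
       -U s3User:swiftUser -K 8fJfd6YW2poqhvBI+uUYJZE1uscnmrDncRXrkjHR list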


Re: [ceph-users] experimental features

2014-12-05 Thread Gregory Farnum
On Fri, Dec 5, 2014 at 9:36 AM, Sage Weil sw...@redhat.com wrote:
 A while back we merged Haomai's experimental OSD backend KeyValueStore.
 We named the config option 'keyvaluestore_dev', hoping to make it clear to
 users that it was still under development, not fully tested, and not yet
 ready for production.  In retrospect, I don't think '_dev' was
 sufficiently scary because many users tried it and ran into
 unexpected trouble.

 There are several other features we've recently added or are considering
 adding that fall into this category.  Having them in the tree is great
 because it streamlines QA and testing, but I want to make sure that
 users are not able to enable the features without being aware of the
 risks.

 A few possible suggestions:

 - scarier option names, like

   osd objectstore = keyvaluestore_experimental_danger_danger
   ms type = async_experimental_danger_danger
   ms type = xio_experimental_danger_danger

   Once the feature becomes stable, they'll have to adjust their
 config, or we'll need to support both names going forward.

 - a separate config option that allows any experimental option

   allow experimental features danger danger = true
   osd objectstore = keyvaluestore
   ms type = xio

   This runs the risk that the user will enable experimental features to
 get X, and later start using Y without realizing Y is also
 experimental.

 - enumerate experimental options we want to enable

   allow experimental features danger danger = keyvaluestore, xio
   ms type = xio
   osd objectstore = keyvaluestore

   This has the property that no config change is necessary when the
 feature drops its experimental status.

 In all of these cases, we can also make a point of sending something to
 the log on daemon startup.  I don't think too many people will notice
 this, but it is better than nothing.

 Other ideas?

I don't think these should even be going into release packages for
users to work with. We can build them on the dev gitbuilders for QA
and testing without them ever reaching the hands of users grabbing our
production packages. ;)
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2014-12-05 Thread Gregory Farnum
On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:

 Hello,

 This morning I decided to reboot a storage node (Debian Jessie, thus 3.16
 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after applying some
 changes.

 It came back up one OSD short, the last log lines before the reboot are:
 ---
 2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >> 
 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 
 c=0x7f881f469020).fault (0) Success
 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 
 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) 
 [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] 
 cancel_copy_ops
 ---

 Quite obviously it didn't complete its shutdown, so unsurprisingly we get:
 ---
 2014-12-05 09:37:40.278128 7f218a7037c0  1 journal _open 
 /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes, block size 4096 
 bytes, directio = 1, aio = 1
 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding 
 journal header
 2014-12-05 09:37:40.278479 7f218a7037c0 -1 
 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal 
 /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount 
 object store
 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ESC[0;31m ** ERROR: osd init 
 failed: (22) Invalid argument
 ESC[0m
 ---

 Thankfully this isn't production yet and I was eventually able to recover
 the OSD by re-creating the journal (ceph-osd -i 4 --mkjournal), but it
 leaves me with a rather bad taste in my mouth.

 So the pertinent questions would be:

 1. What caused this?
 My bet is on the evil systemd just pulling the plug before the poor OSD
 had finished its shutdown job.

 2. How to prevent it from happening again?
 Is there something the Ceph developers can do with regards to init scripts?
 Or is this something to be brought up with the Debian maintainer?
 Debian is transitioning from sysv-init to systemd (booo!) with Jessie, but
 the OSDs still have a sysvinit magic file in their top directory. Could
 this have an effect on things?

 3. Is it really that easy to trash your OSDs?
 In the case a storage node crashes, am I to expect most if not all OSDs or
 at least their journals to require manual loving?

So this can't happen. Being force killed definitely can't kill the
OSD's disk state; that's the whole point of the journaling. The error
message indicates that the header written on disk is nonsense to the
OSD, which means that the local filesystem or disk lost something
somehow (assuming you haven't done something silly like downgrading
the software version it's running) and doesn't know it (if there had
been a read error the output would be different). I'd double-check
your disk settings etc just to be sure, and check for known issues
with xfs on Jessie.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
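
For reference, a minimal sketch of the recovery Christian describes (re-creating the journal for osd.4); the init-script invocation is an assumption for a sysvinit-managed Firefly node, and this only rebuilds the journal header, it does not explain why the header was lost:

   /etc/init.d/ceph stop osd.4     # make sure the OSD is not running
   ceph-osd -i 4 --mkjournal       # re-initialize the journal on the existing partition
   /etc/init.d/ceph start osd.4    # the OSD rejoins and recovers as needed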


Re: [ceph-users] experimental features

2014-12-05 Thread Robert LeBlanc
I prefer the third option (enumeration). I don't see a point where we would
enable experimental features on our production clusters, but it would be
nice to have the same bits and procedures between our dev/beta and
production clusters.

On Fri, Dec 5, 2014 at 10:36 AM, Sage Weil sw...@redhat.com wrote:

 A while back we merged Haomai's experimental OSD backend KeyValueStore.
 We named the config option 'keyvaluestore_dev', hoping to make it clear to
 users that it was still under development, not fully tested, and not yet
 ready for production.  In retrospect, I don't think '_dev' was
 sufficiently scary because many users tried it and ran into
 unexpected trouble.

 There are several other features we've recently added or are considering
 adding that fall into this category.  Having them in the tree is great
 because it streamlines QA and testing, but I want to make sure that
 users are not able to enable the features without being aware of the
 risks.

 A few possible suggestions:

 - scarier option names, like

   osd objectstore = keyvaluestore_experimental_danger_danger
   ms type = async_experimental_danger_danger
   ms type = xio_experimental_danger_danger

   Once the feature becomes stable, they'll have to adjust their
 config, or we'll need to support both names going forward.

 - a separate config option that allows any experimental option

   allow experimental features danger danger = true
   osd objectstore = keyvaluestore
   ms type = xio

   This runs the risk that the user will enable experimental features to
 get X, and later start using Y without realizing Y is also
 experimental.

 - enumerate experimental options we want to enable

   allow experimental features danger danger = keyvaluestore, xio
   ms type = xio
   osd objectstore = keyvaluestore

   This has the property that no config change is necessary when the
 feature drops its experimental status.

 In all of these cases, we can also make a point of sending something to
 the log on daemon startup.  I don't think too many people will notice
 this, but it is better than nothing.

 Other ideas?
 sage

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] experimental features

2014-12-05 Thread Nigel Williams
On Sat, Dec 6, 2014 at 4:36 AM, Sage Weil sw...@redhat.com wrote:
 - enumerate experimental options we want to enable
...
   This has the property that no config change is necessary when the
 feature drops its experimental status.

It keeps the risky options in one place too, so they are easier to spot.

 In all of these cases, we can also make a point of sending something to
 the log on daemon startup.  I don't think too many people will notice
 this, but it is better than nothing.

Perhaps change the cluster health status to FRAGILE? or AT_RISK?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Old OSDs on new host, treated as new?

2014-12-05 Thread Udo Lembke
Hi,
perhaps a stupid question, but why do you change the hostname?

Not tried, but I guess if you boot the node with a new hostname, the
old hostname is still in the crush map, but without any OSDs - because they
are now under the new host.
I don't know (I guess not) whether the degradation level also stays at 5% if you
delete the empty host from the crush map.

I would simply use the same host config on a rebuilt host.

Udo

On 03.12.2014 05:06, Indra Pramana wrote:
 Dear all,

 We have a Ceph cluster with several nodes, each node contains 4-6
 OSDs. We are running the OS off USB drive to maximise the use of the
 drive bays for the OSDs and so far everything is running fine.

 Occasionally, the OS running on the USB drive would fail, and we would
 normally replace the drive with a pre-configured similar OS and Ceph
 running, so when the new OS boots up, it will automatically detect all
 the OSDs and start them. It works fine without any issues.

 However, the issue is in recovery. When one node goes down, all the
 OSDs would be down and recovery will start to move the pg replicas on
 the affected OSDs to other available OSDs, and cause the Ceph to be
 degraded, say 5%, which is expected. However, when we boot up the
 failed node with a new OS, and bring back the OSDs up, more PGs are
 being scheduled for backfilling and instead of reducing, the
 degradation level will shoot up again to, for example, 10%, and in
 some occasion, it goes up to 19%.

 We had experience when one node is down, it will degraded to 5% and
 recovery will start, but when we manage to bring back up the node
 (still the same OS), the degradation level will reduce to below 1% and
 eventually recovery will be completed faster.

 Why doesn't the same behaviour apply in the above situation? The OSD
 numbers are the same when the node boots up, the crush map weight
 values are also the same. Only the hostname is different.

 Any advice / suggestions?

 Looking forward to your reply, thank you.

 Cheers.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
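
For Udo's suggestion of deleting the empty host from the crush map, a hedged sketch; the old hostname is a placeholder, and the bucket should only be removed once it really is empty:

   ceph osd tree                          # the old hostname should appear with no OSDs under it
   ceph osd crush remove <old-hostname>   # remove the now-empty host bucket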


Re: [ceph-users] cephfs survey results

2014-12-05 Thread Lorieri
Hi,


if I have a situation where each node in a cluster writes its own
files in cephfs, is it safe to use multiple MDSes?
I mean, is the problem with using multiple MDSes related to nodes writing the same files?

thanks,

-lorieri



On Tue, Nov 4, 2014 at 9:47 PM, Shain Miley smi...@npr.org wrote:
 +1 for fsck and snapshots, being able to have snapshot backups and protect
 against accidental deletion, etc is something we are really looking forward
 to.

 Thanks,

 Shain



 On 11/04/2014 04:02 AM, Sage Weil wrote:

 On Tue, 4 Nov 2014, Blair Bethwaite wrote:

 On 4 November 2014 01:50, Sage Weil s...@newdream.net wrote:

 In the Ceph session at the OpenStack summit someone asked what the
 CephFS
 survey results looked like.

 Thanks Sage, that was me!

   Here's the link:

  https://www.surveymonkey.com/results/SM-L5JV7WXL/

 In short, people want

 fsck
 multimds
 snapshots
 quotas

 TBH I'm a bit surprised by a couple of these and hope maybe you guys
 will apply a certain amount of filtering on this...

 fsck and quotas were there for me, but multimds and snapshots are what
 I'd consider icing features - they're nice to have but not on the
 critical path to using cephfs instead of e.g. nfs in a production
 setting. I'd have thought stuff like small file performance and
 gateway support was much more relevant to uptake and
 positive/pain-free UX. Interested to hear others rationale here.

 Yeah, I agree, and am taking the results with a grain of salt.  I
 think the results are heavily influenced by the order they were
 originally listed (I wish surveymonkey would randomize it for each
 person or something).

 fsck is a clear #1.  Everybody wants multimds, but I think very few
 actually need it at this point.  We'll be merging a soft quota patch
 shortly, and things like performance (adding the inline data support to
 the kernel client, for instance) will probably compete with getting
 snapshots working (as part of a larger subvolume infrastructure).  That's
 my guess at least; for now, we're really focused on fsck and hard
 usability edges and haven't set priorities beyond that.

 We're definitely interested in hearing feedback on this strategy, and on
 peoples' experiences with giant so far...

 sage
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 --
 Shain Miley | Manager of Systems and Infrastructure, Digital Media |
 smi...@npr.org | 202.513.3649

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

2014-12-05 Thread Christian Balzer
On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:

 On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer ch...@gol.com wrote:
 
  Hello,
 
  This morning I decided to reboot a storage node (Debian Jessie, thus
  3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals) after
  applying some changes.
 
  It came back up one OSD short, the last log lines before the reboot
  are: ---
  2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >>
  10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1
  c=0x7f881f469020).fault (0) Success 2014-12-05 09:35:27.700350
  7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347
  (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288)
  [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346
  active] cancel_copy_ops ---
 
  Quite obviously it didn't complete its shutdown, so unsurprisingly we
  get: ---
  2014-12-05 09:37:40.278128 7f218a7037c0  1 journal
  _open /var/lib/ceph/osd/ceph-4/journal fd 24: 1269312 bytes, block
  size 4096 bytes, directio = 1, aio = 1 2014-12-05 09:37:40.278427
  7f218a7037c0 -1 journal read_header error decoding journal header
  2014-12-05 09:37:40.278479 7f218a7037c0 -1
  filestore(/var/lib/ceph/osd/ceph-4) mount failed to open
  journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
  2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to
  mount object store 2014-12-05 09:37:40.776223 7f218a7037c0 -1
  ESC[0;31m ** ERROR: osd init failed: (22) Invalid argument ESC[0m ---
 
  Thankfully this isn't production yet and I was eventually able to
  recover the OSD by re-creating the journal (ceph-osd -i 4
  --mkjournal), but it leaves me with a rather bad taste in my mouth.
 
  So the pertinent questions would be:
 
  1. What caused this?
  My bet is on the evil systemd just pulling the plug before the poor OSD
  had finished its shutdown job.
 
  2. How to prevent it from happening again?
  Is there something the Ceph developers can do with regards to init
  scripts? Or is this something to be brought up with the Debian
  maintainer? Debian is transitioning from sysv-init to systemd (booo!)
  with Jessie, but the OSDs still have a sysvinit magic file in their
  top directory. Could this have an effect on things?
 
  3. Is it really that easy to trash your OSDs?
  In the case a storage node crashes, am I to expect most if not all
  OSDs or at least their journals to require manual loving?
 
 So this can't happen. 

Good thing you quoted that, as it clearly did. ^o^

Now the question of how exactly remains to be answered.

 Being force killed definitely can't kill the
 OSD's disk state; that's the whole point of the journaling. 

The other OSDs got to the point where they logged "journal flush done",
this one didn't. Coincidence? I think not.

Totally agree about the point of journaling being to prevent this kind of
situation of course.

 The error
 message indicates that the header written on disk is nonsense to the
 OSD, which means that the local filesystem or disk lost something
 somehow (assuming you haven't done something silly like downgrading
 the software version it's running) and doesn't know it (if there had
 been a read error the output would be different). 

The journal is on an SSD, as stated. 
And before you ask, it's on an Intel DC S3700.

This was created on 0.80.7 just a day before, so no version games.

 I'd double-check
 your disk settings etc just to be sure, and check for known issues
 with xfs on Jessie.
 
I'm using ext4, but that shouldn't be an issue here to begin with, as the
journal is a raw SSD partition.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
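
As a follow-up to Greg's advice to double-check the disk settings, a few generic checks of the kind that might apply here; the device name is a placeholder:

   ls -l /var/lib/ceph/osd/ceph-4/journal   # confirm the journal symlink points at the intended SSD partition
   hdparm -W /dev/sdX                       # show whether the drive's volatile write cache is enabled
   smartctl -a /dev/sdX                     # look for media or transport errors reported by the drive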