Re: [ceph-users] Ceph expansion/deploy via ansible
Hi,

+1 for ceph-ansible too. ;)

--
François (flaf)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] radosgw in Nautilus: message "client_io->complete_request() returned Broken pipe"
Hi @ll,

I have a Nautilus Ceph cluster up with radosgw in a zonegroup. I'm using the Beast web frontend (the default in Nautilus). Everything seems to work fine, but in the radosgw log I have this message:

Apr 17 14:02:56 rgw-m-1 ceph-m-rgw.rgw-m-1.rgw0[888]: 2019-04-17 14:02:56.410 7fe659803700 0 ERROR: client_io->complete_request() returned Broken pipe

approximately every 2-3 minutes (it's an average; it's random, not exactly every 2 minutes). I think the code which generates this message is here:

https://github.com/ceph/ceph/blob/master/src/rgw/rgw_process.cc#L283-L287

but I'm completely unqualified to understand the code. What is the meaning of this error message? Should I worry about it?

François (flaf)

PS: just in case, here is my conf:

~$ cat /etc/ceph/ceph-m.conf
[client.rgw.rgw-m-1.rgw0]
host = rgw-m-1
keyring = /var/lib/ceph/radosgw/ceph-m-rgw.rgw-m-1.rgw0/keyring
log file = /var/log/ceph/ceph-m-rgw-rgw-m-1.rgw0.log
rgw frontends = beast endpoint=192.168.222.1:80
rgw thread pool size = 512

[client.rgw.rgw-m-2.rgw0]
host = rgw-m-2
keyring = /var/lib/ceph/radosgw/ceph-m-rgw.rgw-m-2.rgw0/keyring
log file = /var/log/ceph/ceph-m-rgw-rgw-m-2.rgw0.log
rgw frontends = beast endpoint=192.168.222.2:80
rgw thread pool size = 512

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
cluster network = 10.90.90.0/24
debug_rgw = 0/5
fsid = bb27079f-f116-4440-8a64-9ed430dc17be
log file = /dev/null
mon cluster log file = /dev/null
mon host = [v2:192.168.221.31:3300,v1:192.168.221.31:6789],[v2:192.168.221.32:3300,v1:192.168.221.32:6789],[v2:192.168.221.33:3300,v1:192.168.221.33:6789]
mon_osd_down_out_subtree_limit = host
mon_osd_min_down_reporters = 4
osd_crush_chooseleaf_type = 1
osd_crush_update_on_start = true
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 8
osd_pool_default_pgp_num = 8
osd_pool_default_size = 3
public network = 192.168.221.0/25
rgw_enable_ops_log = true
rgw_log_http_headers = http_x_forwarded_for
rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok
rgw_realm = denmark
rgw_zone = zone-m
rgw_zonegroup = copenhagen

Installation via ceph-ansible (docker deployment, version stable-4.0):
ceph_docker_image: v4.0.0-stable-4.0-nautilus-centos-7-x86_64
ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
Hi Matt,

On 4/17/19 1:08 AM, Matt Benjamin wrote:
> Why is using an explicit unix socket problematic for you? For what it
> does, that decision has always seemed sensible.

In fact, I don't understand why the "ops" logs take a different path from the logs of the radosgw process itself. Personally, if radosgw is launched without a foreground option, it seems logical to me that the "ops" logs go to "log_file" (i.e. /var/log/ceph/$cluster-$name.log by default), and if radosgw is launched with a foreground option (i.e. -d or -f), it seems logical to me that the "ops" logs go to stdout/stderr too.

Is there a specific reason to put the "ops" logs in a different location from the logs of the radosgw process itself? The "ops" logs are logs of the "radosgw" process, no?

In my case, I use ceph-ansible with docker containers (it works fine by the way ;)):

1. a systemd unit launches a docker container;
2. the docker container launches a radosgw process with the -d option (i.e. "run in foreground, log to stderr");
3. systemd logs stdout/stderr of the radosgw process to syslog.

It would be handy for me if the "ops" logs were written directly to stdout/stderr. No?

--
François (flaf)
Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
Hi @all,

On 4/9/19 12:43 PM, Francois Lafont wrote:

> I have tried this config:
>
> rgw enable ops log = true
> rgw ops log socket path = /tmp/opslog
> rgw log http headers = http_x_forwarded_for
>
> and I have logs in the socket /tmp/opslog like this:
>
> {"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/&delimiter=%2F HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com )","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
>
> I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, that's cool. But I don't understand why there is a specific socket to log that. I'm using radosgw in a Docker container (installed via ceph-ansible) and I have the logs of the "radosgw" daemon in the "/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).
>
> 1. Why is there a _separate_ log source for that? Indeed, in "/var/log/syslog" I already have some logs from civetweb. For instance:
>
> 2019-04-09 12:33:45.926 7f02e021c700 1 civetweb: 0x55876dc9c000: 10.111.222.51 - - [09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/&delimiter=%2F HTTP/1.1" 200 1014 - DragonDisk 1.05 ( http://www.dragondisk.com )

The fact that radosgw uses a separate log source for the "ops log" (i.e. a specific Unix socket) is still very mysterious to me.

> 2. In my Docker container context, is it possible to put the logs above in the file "/var/log/syslog" of my host? In other words, is it possible to log this to stdout of the "radosgw" daemon?

It seems impossible to put the ops log in the stdout of the "radosgw" process (or, if it is possible, I have not found how). So I have made a workaround.

I have set:

rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok

in my ceph.conf and I have created a daemon (via a systemd unit file) which runs this loop:

while true; do
    netcat -U "/var/run/ceph/rgw-opslog.asok" | logger -t "rgwops" -p "local5.notice"
done

to forward the logs to syslog. It's not very satisfying but it works.

--
François (flaf)
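For anyone wanting to reproduce this workaround, the loop above can be wrapped in a systemd unit instead of an ad-hoc shell loop. This is only a sketch of the approach described in the mail: the unit name is hypothetical, and the netcat/logger pipeline is taken verbatim from the message above.

```ini
# /etc/systemd/system/rgw-opslog-bridge.service  (hypothetical name)
[Unit]
Description=Forward radosgw ops log from its Unix socket to syslog
After=network.target

[Service]
# netcat exits if radosgw recreates the socket, so let systemd do the
# restarting instead of the "while true" shell loop from the mail.
ExecStart=/bin/sh -c 'netcat -U /var/run/ceph/rgw-opslog.asok | logger -t rgwops -p local5.notice'
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
```

Then something like `sudo systemctl enable --now rgw-opslog-bridge.service` would start it (assuming the unit name above).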
Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
On 4/9/19 12:43 PM, Francois Lafont wrote:
> 2. In my Docker container context, is it possible to put the logs above in
> the file "/var/log/syslog" of my host? In other words, is it possible to
> log this to stdout of the "radosgw" daemon?

In brief, is it possible to log "operations" to a regular file or, better for me, to stdout?

--
flaf
Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
Hi,

On 4/9/19 5:02 AM, Pavan Rallabhandi wrote:
> Refer "rgw log http headers" under
> http://docs.ceph.com/docs/nautilus/radosgw/config-ref/
>
> Or even better in the code https://github.com/ceph/ceph/pull/7639

Ok, thx for your help Pavan. I have made progress but I still have some questions. With the help of this comment:

https://github.com/ceph/ceph/pull/7639#issuecomment-266893208

I have tried this config:

rgw enable ops log = true
rgw ops log socket path = /tmp/opslog
rgw log http headers = http_x_forwarded_for

and I have logs in the socket /tmp/opslog like this:

{"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/&delimiter=%2F HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com )","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},

I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, that's cool. But I don't understand why there is a specific socket to log that. I'm using radosgw in a Docker container (installed via ceph-ansible) and I have the logs of the "radosgw" daemon in the "/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).

1. Why is there a _separate_ log source for that? Indeed, in "/var/log/syslog" I already have some logs from civetweb. For instance:

2019-04-09 12:33:45.926 7f02e021c700 1 civetweb: 0x55876dc9c000: 10.111.222.51 - - [09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/&delimiter=%2F HTTP/1.1" 200 1014 - DragonDisk 1.05 ( http://www.dragondisk.com )

2. In my Docker container context, is it possible to put the logs above in the file "/var/log/syslog" of my host? In other words, is it possible to log this to stdout of the "radosgw" daemon?

--
flaf
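As a quick sanity check that the header really lands in the ops-log record, the client IP can be pulled out of such a JSON line with a sed one-liner. This is only a sketch (the sample record below is a shortened version of the one above, same structure); real parsing should use a proper JSON tool.

```shell
# Sample ops-log record (shortened field set, same structure as above).
line='{"bucket":"test1","remote_addr":"10.111.222.51","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]}'

# Extract the value of HTTP_X_FORWARDED_FOR from the record.
ip=$(printf '%s' "$line" | sed -n 's/.*"HTTP_X_FORWARDED_FOR":"\([^"]*\)".*/\1/p')
echo "$ip"   # -> 10.111.222.55
```

Here "remote_addr" stays the haproxy address while HTTP_X_FORWARDED_FOR carries the real client, which is exactly the distinction discussed in this thread.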
[ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
Hi @all,

I'm using the Ceph rados gateway, installed via ceph-ansible, with the Nautilus version. The radosgw instances are behind a haproxy which adds these headers (checked via tcpdump):

X-Forwarded-Proto: http
X-Forwarded-For: 10.111.222.55

where 10.111.222.55 is the IP address of the client. The radosgw instances use the civetweb http frontend. Currently, it is the IP address of the haproxy itself which appears in the logs. I would like to log the IP address from the X-Forwarded-For HTTP header instead. How can I do that?

I have tried this option in ceph.conf:

rgw_remote_addr_param = X-Forwarded-For

It doesn't work, but maybe I have misread the doc.

Thx in advance for your help.

PS: I have also tried the "beast" http frontend but, in this case, no HTTP request seems to be logged.

--
François
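For reference, the haproxy side that injects the header described above typically looks like the following sketch. The frontend/backend names and addresses are made up; whether radosgw then picks the header up depends on the rgw options discussed later in this thread.

```
# haproxy.cfg fragment (hypothetical names and addresses)
frontend rgw_front
    bind *:80
    option forwardfor            # adds "X-Forwarded-For: <client ip>"
    default_backend rgw_back

backend rgw_back
    server rgw1 192.168.221.41:8080 check
```

`option forwardfor` is the standard haproxy directive that appends the client address as X-Forwarded-For on forwarded requests.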
Re: [ceph-users] ceph-mon memory issue jewel 10.2.5 kernel 4.4
Hi @all,

On 02/08/2017 08:45 PM, Jim Kilborn wrote:
> I have had two ceph monitor nodes generate swap space alerts this week.
> Looking at the memory, I see ceph-mon using a lot of memory and most of the
> swap space. My ceph nodes have 128GB mem, with 2GB swap (I know the
> memory/swap ratio is odd)

I had exactly the same problem here in my little ceph cluster:

- 5 nodes ceph01,02,03,04,05 on Ubuntu Trusty, kernel 3.13 (kernel from the distribution)
- Ceph version Jewel 10.2.9
- 4 OSDs per node
- 3 monitors on ceph01,02,03
- 1 active and 2 standby mds on ceph01,02,03

Yesterday, I had on _ceph02_:

1. Swap and RAM at 100%.
2. A kswapd0 process which took 100% of 1 CPU.
3. A simple "ceph status" or "ceph --version" on this node (ceph02) failed with "ImportError: librados.so.2 cannot map zero-fill pages: Cannot allocate memory". However, on the other nodes, a "ceph status" gave me a fully HEALTH_OK cluster.

Maybe an important point: ceph01, ceph02, ceph03 are identical servers (hardware and conf via Puppet, 4 osd + 1 mon + 1 mds each), but the _active_ mds was hosted on ceph02 (for approximately 2 months).

The ceph-mon process on ceph02 was oom-killed by the kernel last night and the memory usage is normal now. The data in the monitor working dir is really small, as you can see:

Filesystem            Size  Used  Avail  Use%  Mounted on
ceph01 => /dev/sda5    30G  126M    30G    1%  /var/lib/ceph/mon/ceph-ceph01
ceph02 => /dev/sda5    30G  121M    30G    1%  /var/lib/ceph/mon/ceph-ceph02
ceph03 => /dev/sda5    30G   78M    30G    1%  /var/lib/ceph/mon/ceph-ceph03

It seems to me that the problem builds up gradually over approximately 2 months; it is not sudden. Is it a known issue?

Thanks for your help.

--
François Lafont
Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)
On 12/20/2016 10:02 AM, Wido den Hollander wrote:
> I think it is commit 0cdf3bc875447c87fdc0fed29831554277a3774b:
> https://github.com/ceph/ceph/commit/0cdf3bc875447c87fdc0fed29831554277a3774b

Thanks Wido, but in fact I have doubts...

> It invokes a start after the package install/upgrade. Since you have manually
> stopped the daemons they will be started again.

Yes, that seems logical, but I have checked with the 10.2.1 version for instance and I have these lines too in the ceph-osd postinst:

case "$1" in
    configure)
        [ -x /sbin/start ] && start ceph-osd-all || :

and I'm pretty sure that during the 10.2.0 => 10.2.1 upgrade the osd daemons weren't started again (after I had manually stopped them). I'm pretty sure because my process is manual (it's a note I follow "idiotically") and I'm sure that I noticed the change with the 10.2.5 version. However, between 10.2.1 and 10.2.5 I have noticed this diff in the postinst:

# Automatically added by dh_systemd_enable
# This will only remove masks created by d-s-h on package removal.
deb-systemd-helper unmask ceph-osd.target >/dev/null || true

# was-enabled defaults to true, so new installations run enable.
if deb-systemd-helper --quiet was-enabled ceph-osd.target; then
    # Enables the unit on first installation, creates new
    # symlinks on upgrades if the unit file has changed.
    deb-systemd-helper enable ceph-osd.target >/dev/null || true
else
    # Update the statefile to add new symlinks (if any), which need to be
    # cleaned up on purge. Also remove old symlinks.
    deb-systemd-helper update-state ceph-osd.target >/dev/null || true
fi
# End automatically added section
# Automatically added by dh_systemd_start
if [ -d /run/systemd/system ]; then
    systemctl --system daemon-reload >/dev/null || true
    deb-systemd-invoke start ceph-osd.target >/dev/null || true
fi
# End automatically added section

I don't know if it can explain the change I have noticed... Currently I'm lost. As you said Wido, the line "... start ceph-osd-all ..." should start the osd daemons again after I manually stopped them, but this line has been present since 10.2.1 at least, and I'm pretty sure that I didn't have this behavior with 10.2.1.
Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)
On 12/19/2016 09:58 PM, Ken Dreyer wrote:
> I looked into this again on a Trusty VM today. I set up a single
> mon+osd cluster on v10.2.3, with the following:
>
> # status ceph-osd id=0
> ceph-osd (ceph/0) start/running, process 1301
>
> # ceph daemon osd.0 version
> {"version":"10.2.3"}
>
> I ran "apt-get upgrade" to go from 10.2.3 to 10.2.5, and the OSD PID
> (1301) and version from the admin socket (v10.2.3) remained the same.

From which repository did you retrieve the 10.2.3 version of ceph? I could make a test too.

> Could something else be restarting the daemons in your case?

I use Puppet to manage my hosts, but the "ceph" services are all *un*managed by Puppet, I'm sure (and the Puppet run is weekly only, and I have noticed the behavior on all 5 of my nodes). Management of the "ceph" services is completely manual in my case.
Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)
Hi,

On 12/19/2016 09:58 PM, Ken Dreyer wrote:
> I looked into this again on a Trusty VM today. I set up a single
> mon+osd cluster on v10.2.3, with the following:
>
> # status ceph-osd id=0
> ceph-osd (ceph/0) start/running, process 1301
>
> # ceph daemon osd.0 version
> {"version":"10.2.3"}
>
> I ran "apt-get upgrade" to go from 10.2.3 to 10.2.5, and the OSD PID
> (1301) and version from the admin socket (v10.2.3) remained the same.
>
> Could something else be restarting the daemons in your case?

As Christian said, this is not _exactly_ the "problem" I described in my first message. You can read it again; I give _verbatim_ the commands I launch on the host during an upgrade. Personally, I stop the daemons manually before the "ceph" upgrade (which is not the case in your example above):

1. I manually stop all OSD daemons on the host.
2. I make the "ceph" upgrade (sudo apt-get update && sudo apt-get upgrade).

Then...

3(i). Before the 10.2.5 version, the ceph daemons are still stopped.
3(ii). With the 10.2.5 version, the ceph daemons have been started automatically.

Personally, I would prefer the 3(i) scenario (all the details are in my first message). I don't know what exactly, but something has changed with the 10.2.5 version.

Regards.
Re: [ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)
On 12/13/2016 12:42 PM, Francois Lafont wrote:
> But, _by_ _principle_, in the specific case of ceph (I know it's not the
> usual case for packages which provide daemons), I think it would be safer
> and more practical if the ceph packages didn't manage the restart of
> daemons.

And I say (even if I think it was relatively clear in my first post) that *this was the case* before the 10.2.5 version, so I was surprised by this change.
[ceph-users] Unwanted automatic restart of daemons during an upgrade since 10.2.5 (on Trusty)
Hi @all,

I have a little remark concerning at least the Trusty ceph packages (maybe it concerns other distributions too, I don't know). I'm pretty sure that before the 10.2.5 version, the restart of the daemons wasn't managed during the package upgrade, and that with the 10.2.5 version it is. I explain below.

Personally, during a "ceph" upgrade, I prefer to manage the "ceph" daemons _myself_. For instance, during a "ceph" upgrade of an Ubuntu Trusty OSD server, I usually do something like this:

# I stop all the OSD daemons (here, it's an upstart command but it's
# an implementation detail, the idea is just "I stop all OSDs"):
sudo stop ceph-osd-all

# And after that, I launch the "ceph" upgrade with something like:
sudo apt-get update && sudo apt-get upgrade

# (*) Before the 10.2.5 version, the daemons weren't automatically
# restarted by the upgrade and personally, it was a _good_ thing
# for me. Now, with the 10.2.5 version, the daemons seem to be
# automatically restarted.

# Personally, after a "ceph" upgrade, I always prefer to launch a
# _reboot_ of the server.
sudo reboot

So, now with the 10.2.5 version, in my process, OSD daemons are stopped, then automatically restarted by the upgrade, and then stopped again by the reboot. This is not an optimal process of course. ;)

I know perfectly well that there are workarounds to avoid an automatic restart of the daemons during "ceph" upgrades (for instance, in the case of Trusty, I could temporarily remove the files /var/lib/ceph/osd/ceph-$id/upstart). But, _by_ _principle_, in the specific case of ceph (I know it's not the usual case for packages which provide daemons), I think it would be safer and more practical if the ceph packages didn't manage the restart of daemons.

What do you think about that? Maybe I'm wrong... ;)

François Lafont
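One generic Debian/Ubuntu mechanism worth mentioning for this situation: while an executable /usr/sbin/policy-rc.d exists and exits 101, invoke-rc.d and deb-systemd-invoke refuse to start or restart services from maintainer scripts. This is only a sketch; it may NOT cover a postinst that calls upstart's `start` command directly, which is what the older Trusty ceph packages appear to do, so treat it as an assumption to verify.

```shell
# Create the deny-all policy script in a temp dir first, for illustration;
# in real use it must be installed as /usr/sbin/policy-rc.d (root required).
tmp=$(mktemp -d)
cat > "$tmp/policy-rc.d" <<'EOF'
#!/bin/sh
# 101 = "action forbidden by policy" for invoke-rc.d / deb-systemd-invoke
exit 101
EOF
chmod +x "$tmp/policy-rc.d"

# Real usage would be (commented out here):
#   sudo install -m 755 "$tmp/policy-rc.d" /usr/sbin/policy-rc.d
#   sudo apt-get update && sudo apt-get upgrade
#   sudo rm /usr/sbin/policy-rc.d

# The script itself simply returns the "forbidden" code:
rc=0; "$tmp/policy-rc.d" || rc=$?
echo "policy rc=$rc"   # -> policy rc=101
```

Removing the file after the upgrade restores normal service handling.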
Re: [ceph-users] 10.2.4 Jewel released
On 12/09/2016 06:39 PM, Alex Evonosky wrote:
> Sounds great. May I ask what procedure you did to upgrade?

Of course. ;) It's here:

https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/

(I think this link was pointed out by Greg Farnum or Sage Weil in a previous message.)

Personally, I use Ubuntu Trusty, so the page above leads me to use this line in my "sources.list":

deb http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/ trusty main

And after that, "apt-get update && apt-get upgrade" etc.
Re: [ceph-users] 10.2.4 Jewel released
Hi,

Just for information: since the upgrade of my whole cluster (osd, mon and mds) to version 10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e) about 30 hours ago, I have had no problem. My cluster is a small one with 5 nodes, 4 osds per node and 3 monitors, and I just use cephfs.

Bye.
Re: [ceph-users] 10.2.4 Jewel released
On 12/08/2016 11:24 AM, Ruben Kerkhof wrote:
> I've been running this on one of my servers now for half an hour, and
> it fixes the issue.

It's the same for me. ;)

~$ ceph -v
ceph version 10.2.4-1-g5d3c76c (5d3c76c1c6e991649f0beedb80e6823606176d9e)

Thanks for the help. Bye.
Re: [ceph-users] 10.2.4 Jewel released -- IMPORTANT
On 12/08/2016 12:38 AM, Gregory Farnum wrote:
> Yep!

Ok, thanks for the confirmations Greg. Bye.
Re: [ceph-users] 10.2.4 Jewel released -- IMPORTANT
On 12/08/2016 12:06 AM, Sage Weil wrote:
> Please hold off on upgrading to this release. It triggers a bug in
> SimpleMessenger that causes threads for broken connections to spin, eating
> CPU.
>
> We're making sure we understand the root cause and preparing a fix.

While waiting for the fix and its release, can you confirm that restarting the osd daemons every 15 minutes is a possible workaround? In my case, I have a little cluster (5 nodes with 4 osds each) and it's possible for me to restart the daemons every 15 minutes without the cluster being completely down. ;)
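For the record, the stopgap asked about above could be automated with a cron entry like the following. This is purely a sketch: the file name is hypothetical, the upstart job name is assumed from the `stop ceph-osd-all` / `start ceph-osd-all` commands shown elsewhere in this digest, and restarting every OSD on a node at once is disruptive, so in practice the nodes should be staggered.

```
# /etc/cron.d/ceph-osd-spin-workaround  (hypothetical file)
# m    h  dom mon dow  user  command
*/15   *  *   *   *    root  restart ceph-osd-all >/dev/null 2>&1
```

Once the fixed release is installed, this file should of course be removed.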
Re: [ceph-users] 10.2.4 Jewel released
On 12/07/2016 11:33 PM, Ruben Kerkhof wrote:
> Thanks, I'll check how long it takes for this to happen on my cluster.
>
> I did just pause scrub and deep-scrub. Are there scrubs running on
> your cluster now by any chance?

Yes, but normally none currently, because I have:

osd scrub begin hour = 3
osd scrub end hour = 5

in the ceph.conf of all my cluster nodes, so normally there is no scrubbing at the moment. Why do you think it's related to scrubbing?
Re: [ceph-users] 10.2.4 Jewel released
On 12/07/2016 11:16 PM, Steve Taylor wrote:
> I'm seeing the same behavior with very similar perf top output. One server
> with 32 OSDs has a load average approaching 800. No excessive memory usage
> and no iowait at all.

Exactly! And another interesting piece of information (maybe): I have ceph-osd processes with a high cpu load (as Steve said, no iowait and no excessive memory usage). If I restart the ceph-osd daemon, the cpu load becomes OK for exactly 15 minutes. After 15 minutes, the high cpu load comes back. It's curious, this number of 15 minutes, isn't it?
Re: [ceph-users] 10.2.4 Jewel released
Hi,

On 12/07/2016 01:21 PM, Abhishek L wrote:
> This point release fixes several important bugs in RBD mirroring, RGW
> multi-site, CephFS, and RADOS.
>
> We recommend that all v10.2.x users upgrade. Also note the following when
> upgrading from hammer

Well... little warning: after the upgrade from 10.2.3 to 10.2.4, I have a high cpu load on osd and mds. Something like this:

top - 18:53:40 up 2:11, 1 user, load average: 32.14, 29.49, 27.36
Tasks: 192 total, 2 running, 190 sleeping, 0 stopped, 0 zombie
%Cpu(s): 19.4 us, 80.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem:  32908088 total,  1876820 used, 31031268 free,    31464 buffers
KiB Swap:  8388604 total,        0 used,  8388604 free.   412340 cached Mem

 PID USER PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
2174 ceph 20  0  492408  79260  8688 S 169.7  0.2 139:49.77 ceph-mds
2318 ceph 20  0 1081428 166700 25832 S 160.4  0.5 178:32.18 ceph-osd
2288 ceph 20  0 1256604 241796 22896 S 159.4  0.7 189:25.19 ceph-osd
2301 ceph 20  0 1261172 261040 23664 S 156.1  0.8 197:11.24 ceph-osd
2337 ceph 20  0 1247904 260048 19084 S 154.8  0.8 191:01.90 ceph-osd
2171 ceph 20  0  466160  58292 10992 S   0.3  0.2   0:29.89 ceph-mon

On IRC, two other persons have the same behavior after the upgrade. The cluster is HEALTH_OK. I don't see I/O on disk. If I restart the daemons, all is ok, but after a few minutes the cpu load starts again. I have currently no idea about the problem.
[ceph-users] Keep previous versions of ceph in the APT repository
Hi @all,

Ceph team, could it be possible to keep the previous versions of the ceph* packages in the APT repository? Indeed, for instance for Ubuntu Trusty, currently we have:

~$ curl -s http://download.ceph.com/debian-jewel/dists/trusty/main/binary-amd64/Packages | grep -A 1 '^Package: ceph$'
Package: ceph
Version: 10.2.3-1trusty

Only the latest version 10.2.3 is available; versions 10.2.2 and 10.2.1, for instance, have been removed from the APT repository. It would be handy to keep the previous versions too. Personally, it's useful for me to test an upgrade in a lab: for instance, when I want to set up a lab in 10.2.2 and then test an upgrade to 10.2.3 (and then make the upgrade in production if all is ok).

It seems to me a good thing to keep the old versions in the APT repository, but maybe it's complicated for the Ceph team...

Thanks for your help. Regards.

François Lafont
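A partial workaround for the lab-matches-production concern, as long as the wanted version is still in the repository: pin the exact version with an apt preferences file, so both environments install the same build. This is only a sketch (the file name and version string are examples), and it obviously cannot help once a version has been removed from the repository, which is the actual request above.

```
# /etc/apt/preferences.d/ceph-pin  (hypothetical file)
Package: ceph*
Pin: version 10.2.2*
Pin-Priority: 1001
```

With a priority above 1000, apt will even downgrade to the pinned version on `apt-get upgrade` if a newer one is published.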
Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2
Hi,

On 08/29/2016 08:30 PM, Gregory Farnum wrote:
> Ha, yep, that's one of the bugs Goncalo found:
>
> ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
> 1: (()+0x299152) [0x7f91398dc152]
> 2: (()+0x10330) [0x7f9138bbb330]
> 3: (Client::get_root_ino()+0x10) [0x7f91397df6c0]
> 4: (CephFuse::Handle::make_fake_ino(inodeno_t, snapid_t)+0x175) [0x7f91397dd3d5]
> 5: (()+0x19ac09) [0x7f91397ddc09]
> 6: (()+0x14b45) [0x7f91391f7b45]
> 7: (()+0x1522b) [0x7f91391f822b]
> 8: (()+0x11e49) [0x7f91391f4e49]
> 9: (()+0x8184) [0x7f9138bb3184]
> 10: (clone()+0x6d) [0x7f913752237d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> So that'll be in the next Jewel release if it's not already fixed in 10.2.2.

If I look at Goncalo's previous message in this thread, the bug still exists in Jewel 10.2.2, so I deduce that it will be fixed in 10.2.3. Can you tell me where the report of this specific bug is in http://tracker.ceph.com? I have not found it.

Thanks.

François Lafont
Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2
On 08/27/2016 12:01 PM, Francois Lafont wrote:
> I had exactly the same error on my production ceph client node, with
> Jewel 10.2.1 in my case.

I forgot to say that the ceph cluster was perfectly HEALTH_OK before, during and after the error on the client side.

Regards.
Re: [ceph-users] ceph-fuse "Transport endpoint is not connected" on Jewel 10.2.2
Hi,

I had exactly the same error on my production ceph client node, with Jewel 10.2.1 in my case.

On the client node:
- Ubuntu 14.04
- kernel 3.13.0-92-generic
- ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
- cephfs via _ceph-fuse_

On the cluster nodes:
- Ubuntu 14.04
- kernel 3.13.0-92-generic
- ceph 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

It happened during the execution of a very basic Python (2.7.6) script which makes some os.makedirs(...) and os.chown(...) calls. Just in case, the logs are below. I'm sorry they are not verbose at all and so probably useless for you... Which settings should I put in my client and cluster configuration to have relevant logs if the same error happens again?

Regards.

François Lafont

Here are the logs:

1. On the client node:
http://francois-lafont.ac-versailles.fr/misc/ceph-client.cephfs.log.1.gz

2. On the (active) mds node:

%<%<%<%<%<%<%<%<
~$ sudo zcat /var/log/ceph/ceph-mds.ceph02.log.1.gz
2016-08-22 15:02:03.799037 7f3f9adc1700 0 -- 10.0.2.102:6800/2186 >> 192.168.23.11:0/431481110 pipe(0x7f3fb3a87400 sd=22 :6800 s=2 pgs=64 cs=1 l=0 c=0x7f3fb5f10900).fault with nothing to send, going to standby
2016-08-22 15:02:40.236001 7f3f9f7d3700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 34.503993 secs
2016-08-22 15:02:40.236026 7f3f9f7d3700 0 log_channel(cluster) log [WRN] : slow request 34.503993 seconds old, received at 2016-08-22 15:02:05.731897: client_request(client.1442720:650326 getattr pAsLsXsFs #101b6d0 2016-08-22 15:02:05.731515) currently failed to rdlock, waiting
2016-08-22 15:07:00.245269 7f3f9f7d3700 0 log_channel(cluster) log [INF] : closing stale session client.1433176 192.168.23.11:0/431481110 after 304.132797
2016-08-22 15:23:07.970215 7f3f9adc1700 0 -- 10.0.2.102:6800/2186 >> 192.168.23.11:0/2607326748 pipe(0x7f3fff365400 sd=22 :6800 s=2 pgs=8 cs=1 l=0 c=0x7f3fb5f10a80).fault, server, going to standby
2016-08-22 15:28:05.281489 7f3f9f7d3700 0 log_channel(cluster) log [INF] : closing stale session client.1537178 192.168.23.11:0/2607326748 after 300.588323
%<%<%<%<%<%<%<%<
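Regarding the question about which settings would give more relevant logs next time, a hedged sketch of standard Ceph debug options that increase client-side and mds-side verbosity. The levels shown are on the verbose end (20 is the maximum) and will grow log files quickly, so they are for a reproduction window, not for permanent use:

```
# ceph.conf on the client node (ceph-fuse side)
[client]
debug client = 20
debug ms = 1

# ceph.conf on the mds nodes
[mds]
debug mds = 20
debug ms = 1
```

The same levels can also be applied at runtime through the daemons' admin sockets instead of a restart, if that is more convenient.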
Re: [ceph-users] ceph-fuse, fio largely better after migration Infernalis to Jewel, is my bench relevant?
On 06/06/2016 18:41, Gregory Farnum wrote:
> We had several metadata caching improvements in ceph-fuse recently which I
> think went in after Infernalis. That could explain it.

Ok, in that case it could be good news. ;) I had doubts concerning my fio bench. I know that benchmarks can be tricky, especially with distributed filesystems.

Thanks for your answer Greg.

--
François Lafont
[ceph-users] ceph-fuse, fio largely better after migration Infernalis to Jewel, is my bench relevant?
Hi,

I have a little Ceph cluster in production with 5 cluster nodes and 2 client nodes. The clients use cephfs via ceph-fuse. Recently, I upgraded my cluster from Infernalis to Jewel (servers _and_ clients).

When the cluster was on the Infernalis version, the fio command below gave me approximately 1100-1300 iops:

fio --directory=/mnt/moodle/test/ --name=rwjob --readwrite=randrw \
    --rwmixread=50 --gtod_reduce=1 --bs=4k --size=100MB \
    --ioengine=sync --direct=0 --numjobs=4 --group_reporting

I have run exactly the same fio command after the migration, with all nodes on the Jewel version, and I get ~2500-3000 iops.

I know that benchmarks can be very tricky, so here is my question: is this significant improvement due to the Infernalis => Jewel migration, or is my test not relevant?

Thanks in advance for your help.

--
François Lafont
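The same benchmark can be expressed as a fio job file, which makes it easier to rerun identically before and after an upgrade. This is a direct transcription of the command line above into fio's job-file format (the file name is arbitrary):

```
; rwjob.fio - same parameters as the command line above
[global]
directory=/mnt/moodle/test/
readwrite=randrw
rwmixread=50
gtod_reduce=1
bs=4k
size=100MB
ioengine=sync
direct=0
numjobs=4
group_reporting

[rwjob]
```

Run with `fio rwjob.fio`. Keeping the job file under version control alongside the recorded iops makes before/after comparisons like this one reproducible.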
[ceph-users] A radosgw keyring with the minimal rights, which pools have I to create?
Hi, In a from scratch Jewel cluster, I'm searching the exact list of pools I have to create and the minimal rights that I can set for the keyring used by the radosgw instance. This is for the default zone. I intend to just use the S3 API of the radosgw. a) I have read the doc here http://docs.ceph.com/docs/jewel/radosgw/config-ref/#pools, but, according to me, it doesn't seem to be updated, am I wrong? Indeed, I have used a keyring with these rights: [client.radosgw.gateway] key = xx== caps mon = "allow rwx" caps osd = "allow rwx" so that the pools are created automatically after the starting of radosgw. I have created a S3 account with "radosgw-admin" and I have created a bucket with this S3 account. After that, here is the list of created pools: .rgw.root default.rgw.control default.rgw.data.root default.rgw.gc default.rgw.log default.rgw.users.uid default.rgw.users.email default.rgw.users.keys default.rgw.meta default.rgw.buckets.index It doesn't seem to match with the doc. Am I wrong anywhere? b) By the way, can you confirm me there are modifications on this point between Infernalis and Jewel. Indeed if I do exactly the same "test" with a from scratch Infernalis cluster, here is the list of created pools: .rgw.root .rgw.control .rgw .rgw.gc .log .users.uid .users.email .users .rgw.buckets.index .rgw.buckets Why is it different between Infernalis and Jewel? To me, it seems curious and I have probably missed something, haven't I? c) Can you confirm me that the minimal rights for a radosgw keyring is something like that: [client.radosgw.gateway] key = xx== caps mon = "allow r" caps osd = "allow rwx pool=,..., rwx=" and can you tell me the exact list of pools I have to create, ie the list , ..., because this is not clear for me? 
Just in case, here is the typical conf of my radosgw instance:

[client.radosgw.gateway]
    host = ceph-rgw
    keyring = /etc/ceph/ceph.client.radosgw.gateway.keyring
    rgw socket path = ""
    log file = /var/log/ceph/ceph.client.radosgw.gateway.log
    rgw frontends = civetweb port=8080
    rgw print continue = false
    rgw dns name = store.domain.tld

Thanks in advance for your help.

--
François Lafont
Re: [ceph-users] jewel upgrade and sortbitwise
Hi,

On 03/06/2016 16:29, Samuel Just wrote:
> Sorry, I should have been more clear. The bug actually is due to a
> difference in an on disk encoding from hammer. An infernalis cluster would
> never have had such encodings and is fine.

Ah ok, fine. ;) Thanks for the answer.

Bye.

--
François Lafont
Re: [ceph-users] jewel upgrade and sortbitwise
Hi,

On 03/06/2016 05:39, Samuel Just wrote:
> Due to http://tracker.ceph.com/issues/16113, it would be best to avoid
> setting the sortbitwise flag on jewel clusters upgraded from previous
> versions until we get a point release out with a fix.
>
> The symptom is that setting the sortbitwise flag on a jewel cluster
> upgraded from a previous version can result in some pgs reporting
> spurious unfound objects. Unsetting sortbitwise should cause the PGs
> to go back to normal. Clusters created at jewel don't need to worry
> about this.

Now, I have an Infernalis cluster in production. It's an Infernalis cluster installed from scratch (not from an upgrade), and I intend to upgrade it to Jewel. I have noticed that the "sortbitwise" flag is set by default in my Infernalis cluster. By the way, I don't know exactly what this flag means, but the cluster is HEALTH_OK with it set by default, so I have not changed it.

If I have understood correctly, to upgrade my Infernalis cluster, I have 2 options:

a) I unset the "sortbitwise" flag via "ceph osd unset sortbitwise", then I upgrade the cluster to Jewel 10.2.1, and in the next Jewel release (I guess 10.2.2) I could set the flag again via "ceph osd set sortbitwise".

b) Or I just wait for the next Jewel release (10.2.2) without worrying about the "sortbitwise" flag.

1. Is this correct?
2. Can toggling the "sortbitwise" flag trigger data movement?

--
François Lafont
Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?
Hi,

On 02/06/2016 04:44, Francois Lafont wrote:
> ~# grep ceph /etc/fstab
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ /mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0
[...]
> And I have rebooted. After the reboot, big surprise with this:
>
> ~# cat /tmp/mount.fuse.ceph.log
> arguments are id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint= /mnt -o rw,_netdev,noatime,nonempty
> ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring --client_mountpoint= /mnt -o rw,noatime,nonempty
>
> Yes, this is not a misprint, there is no "/" after "client_mountpoint=".
[...]
> Now, my question is: which program gives the arguments to /sbin/mount.fuse.ceph?
> Is it the init program (upstart in my case)? Or does it concern a Ceph program?

I have definitely found the culprit. In fact, it is not Upstart. It's "/sbin/mountall" (from the "mountall" package), which is used by Upstart to mount the filesystems in fstab. In the source code "src/mountall.c", there is a line which wrongly removes the trailing "/" from my valid fstab line:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ /mnt/ fuse.ceph ...

I have filed a bug report here, where everything is explained:

https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/1588594

Good to know (I lost half a day on this bug ;)).

--
François Lafont
Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?
Now, I have an explanation and it's _very_ strange, absolutely not related to a problem of Unix rights. For the record, my client node is an updated Ubuntu Trusty and I use ceph-fuse. Here is my fstab line:

~# grep ceph /etc/fstab
id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ /mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0

My VM is in the Infernalis state where cephfs is correctly mounted automatically at boot. I have just modified the file /sbin/mount.fuse.ceph (it's a shell script) to add these 2 lines:

echo arguments are "$@" >/tmp/mount.fuse.ceph.log
[...]
# The command launched by /sbin/mount.fuse.ceph via an "exec".
echo ceph-fuse $cephargs $2 $3 $opts >>/tmp/mount.fuse.ceph.log

And I have rebooted. After the reboot, big surprise with this:

~# cat /tmp/mount.fuse.ceph.log
arguments are id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint= /mnt -o rw,_netdev,noatime,nonempty
ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring --client_mountpoint= /mnt -o rw,noatime,nonempty

Yes, this is not a misprint: there is no "/" after "client_mountpoint=". But, with Infernalis, it works even without the "/".
~# ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring --client_mountpoint= /mnt -o rw,noatime,nonempty && echo OK
ceph-fuse[1380]: starting ceph client
2016-06-02 04:09:37.340050 7f69590e9780 -1 init, newargv = 0x7f695b7ae0b0 newargc=13
ceph-fuse[1380]: starting fuse
OK

And with Jewel, it's simple: I have exactly the same thing, except that, without the "/", ceph-fuse fails:

~# ceph -v
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)

~# ceph-fuse --id=cephfs --keyring=/etc/ceph/ceph.client.cephfs.keyring --client_mountpoint= /mnt -o rw,noatime,nonempty && echo OK
ceph-fuse[1302]: starting ceph client
2016-06-02 04:30:25.840514 7f9ec24b2e80 -1 init, newargv = 0x7f9ecba9ffd0 newargc=13
ceph-fuse[1302]: ceph mount failed with (1) Operation not permitted
ceph-fuse[1300]: mount failed: (1) Operation not permitted

By the way, failing on a malformed option seems to me a sane behavior. So this is not an Infernalis => Jewel regression at all. The problem is: the arguments which are given to /sbin/mount.fuse.ceph are bad.

A possible workaround is simply to change the position of "client_mountpoint=/" in the fstab line. For instance, there is no problem with:

id=cephfs,client_mountpoint=/,keyring=/etc/ceph/ceph.client.cephfs.keyring /mnt ...
           ^^^

It's definitely curious that a manual mount works well but not the mount at boot. My conclusion is that the mechanism (ie the code) which passes the fstab arguments to ceph-fuse is different in these 2 cases (manual mount vs mount at boot).

Now, my question is: which program gives the arguments to /sbin/mount.fuse.ceph? Is it the init program (upstart in my case)? Or is it a Ceph program?

--
François Lafont
Re: [ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?
Hi,

On 01/06/2016 23:16, Florent B wrote:
> Don't have this problem on Debian migration from Infernalis to Jewel,
> check all permissions...

Ok, that is probably the reason (I hope), but so far I haven't found the right Unix permissions. I have this (which doesn't work):

~# ll -d /etc/ceph
drwxr-xr-x 2 root root 4096 Jun 2 00:17 /etc/ceph/

~# tree -pug /etc/ceph
/etc/ceph
|-- [-rw-rw ceph ceph] ceph.client.cephfs.keyring
|-- [-rw-rw ceph ceph] ceph.client.cephfs.secret
`-- [-rw-r--r-- root root] ceph.conf

Can you share your Unix permissions so that I can compare, please?

--
François Lafont
[ceph-users] Infernalis => Jewel: ceph-fuse regression concerning the automatic mount at boot?
Hi,

I have a Jewel Ceph cluster in OK state and a "ceph-fuse" Ubuntu Trusty client with ceph Infernalis. The cephfs is mounted automatically and perfectly during the boot via ceph-fuse and this line in /etc/fstab:

~# grep ceph /etc/fstab
id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ /mnt/ fuse.ceph noatime,nonempty,defaults,_netdev 0 0

I change my sources.list to install the Jewel version and install the Jewel packages via a simple "apt-get update && apt-get upgrade". Now the Jewel version is installed, so I reboot the machine. But now the automatic mount of cephfs at boot no longer works. After the reboot, I have:

~# mountpoint /mnt/
/mnt/ is not a mountpoint

~# tail /var/log/upstart/mountall.log
[...]
2016-06-01 19:00:55.234594 7f29301dbe80 -1 init, newargv = 0x7f29397b8fd0 newargc=13
ceph-fuse[362]: starting ceph client
ceph-fuse[362]: ceph mount failed with (1) Operation not permitted
ceph-fuse[319]: mount failed: (1) Operation not permitted <== Here!
mountall: mount /mnt [306] terminated with status 255
mountall: Disconnected from Plymouth

The error is very curious because I have absolutely no problem mounting the cephfs manually:

~# mount /mnt/
ceph-fuse[1279]: starting ceph client
2016-06-01 19:06:23.660419 7f174e336e80 -1 init, newargv = 0x7f1758c4afd0 newargc=13
ceph-fuse[1279]: starting fuse

~# mountpoint /mnt/
/mnt/ is a mountpoint

~# df /mnt/
Filesystem 1K-blocks   Used Available Use% Mounted on
ceph-fuse   21983232 172032  21811200   1% /mnt

The machine is an updated and basic Ubuntu Trusty. I can reproduce the problem *systematically*. Indeed, the machine is a VM with a snapshot in the Infernalis state where all is OK, and after the upgrade the problem happens systematically. I have tried several reboots and the cephfs is *never* mounted automatically (but the manual mount is completely OK).

Is it a little "Infernalis => Jewel" regression concerning ceph-fuse, or have I forgotten a new mount option or something like that?
I can reproduce the problem and provide any log if needed. Thanks in advance for your help.

--
François Lafont
Re: [ceph-users] Meaning of the "host" parameter in the section [client.radosgw.{instance-name}] in ceph.conf?
Hi,

On 26/05/2016 23:46, Francois Lafont wrote:
> a) My first question is perfectly summarized in the title. ;)
> Indeed, here is a typical section [client.radosgw.{instance-name}] in
> the ceph.conf of a radosgw server "rgw-01":
>
> --
> # The instance-name is "gateway" here.
> [client.radosgw.gateway]
>     host = rgw-01
>     keyring = /etc/ceph/ceph.client.radosgw.gateway.keyring
>     rgw socket path = ""
>     log file = /var/log/radosgw/ceph.client.radosgw.gateway.log
>     rgw frontends = civetweb port=8080
>     rgw print continue = false
>     rgw dns name = rgw-01.domain.tld
> --
>
> I have tried without the "host" parameter and it seems to work perfectly.
> So what is the meaning of this parameter and what is it for?
>
> I have found no answer in the documentation, but perhaps I searched badly...

Can you confirm these 2 points for me?

i) In fact, the "host" parameter is needed only if ceph.conf contains [client.radosgw.{instance-name}] sections for several different radosgw servers. If the only [client.radosgw.{instance-name}] sections are the ones which concern the current radosgw server, the "host" parameter is useless. In other words, everything happens as if the default value of the "host" parameter in [client.radosgw.{instance-name}] were $(hostname).

ii) The {instance-name} in [client.radosgw.{instance-name}] must be unique _in_ the cluster, _not_ merely unique per radosgw server.

Is this correct?

> b) Is it a bad idea if I use the same keyring (and so the same ceph account)
> in the 2 radosgw servers "rgw-01" and "rgw-02"?

I'm still interested in this question. I know it's possible to use the same keyring (ie the same ceph account) in multiple radosgw servers, but I don't know whether it's recommended or not.

Thanks in advance.

--
François Lafont
[ceph-users] Meaning of the "host" parameter in the section [client.radosgw.{instance-name}] in ceph.conf?
Hi,

a) My first question is perfectly summarized in the title. ;) Indeed, here is a typical section [client.radosgw.{instance-name}] in the ceph.conf of a radosgw server "rgw-01":

--
# The instance-name is "gateway" here.
[client.radosgw.gateway]
    host = rgw-01
    keyring = /etc/ceph/ceph.client.radosgw.gateway.keyring
    rgw socket path = ""
    log file = /var/log/radosgw/ceph.client.radosgw.gateway.log
    rgw frontends = civetweb port=8080
    rgw print continue = false
    rgw dns name = rgw-01.domain.tld
--

I have tried without the "host" parameter and it seems to work perfectly. So what is the meaning of this parameter and what is it for? I have found no answer in the documentation, but perhaps I searched badly...

b) Is it a bad idea if I use the same keyring (and so the same ceph account) in the 2 radosgw servers "rgw-01" and "rgw-02"?

Thanks in advance.

--
François Lafont
Re: [ceph-users] Deprecating ext4 support
Hello,

On 11/04/2016 23:39, Sage Weil wrote:
> [...] Is this reasonable? [...]

Warning: I'm just a ceph user, and definitely a non-expert user.

1. Personally, if you read the documentation, the mailing list and/or IRC a little, it seems _clear_ to me that ext4 is not recommended, even if the opposite is sometimes mentioned (personally I don't use ext4 in my ceph cluster, I use xfs as the doc says).

2. I'm not a ceph expert, but I can imagine the monstrous amount of work that the development of software such as ceph represents, and I think it can be reasonable sometimes to limit that work when possible. So deprecating ext4 seems reasonable to me. The comfort of the users is important but, in the _long_ term, it seems to me more important that the developers can concentrate their work on the important things.

--
François Lafont
Re: [ceph-users] ZFS or BTRFS for performance?
Hello,

On 20/03/2016 04:47, Christian Balzer wrote:
> That's not protection, that's an "uh-oh, something is wrong, you better
> check it out" notification, after which you get to spend a lot of time
> figuring out which is the good replica

In fact, I have never been confronted with this case so far and I have a couple of questions.

1. When it happens (ie a deep scrub fails), is it mentioned in the output of the "ceph status" command, and in that case can you confirm that the health of the cluster in the output is different from "HEALTH_OK"?

2. For instance, suppose it happens with the PG id == 19.10 and I have 3 OSDs for this PG (because my pool has replica size == 3). I suppose that the concerned OSDs are OSD id == 1, 6 and 12. Can you tell me if this "naive" method is valid to solve the problem (and, if not, why)?

a) I ssh into the node which hosts osd-1 and launch this command:

~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
055b0fd18cee4b158a8d336979de74d25fadc1a3 -

b) I ssh into the node which hosts osd-6 and launch this command:

~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
055b0fd18cee4b158a8d336979de74d25fadc1a3 -

c) I ssh into the node which hosts osd-12 and launch this command:

~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
3f786850e387550fdab836ed7e6dc881de23001b -

I notice that the result is different for osd-12, so it's the "bad" osd. So, in the node which hosts osd-12, I launch this command:

id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

And now I can safely launch this command:

ceph pg repair 19.10

Is there a problem with this "naive" method?

--
François Lafont
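For what it's worth, the comparison step of this naive method can be sketched as a small helper (my own illustration, not a Ceph tool; the ids and digests below are hypothetical): collect one digest per replica as in steps a-c, then print the OSD whose digest disagrees with the majority.

```shell
# Sketch: given "osd_id:digest" pairs (one per replica, computed as in
# steps a-c above), print the id(s) whose digest differs from the majority.
odd_one_out() {
  printf '%s\n' "$@" | awk -F: '
    { count[$2]++; id[NR] = $1; dig[NR] = $2 }
    END {
      best = ""; bestn = 0
      for (d in count) if (count[d] > bestn) { best = d; bestn = count[d] }
      for (i = 1; i <= NR; i++) if (dig[i] != best) print id[i]
    }'
}

odd_one_out 1:055b0f 6:055b0f 12:3f7868   # prints: 12
```

Note that this only automates the "compare the digests" step; whether deleting the divergent replica before "ceph pg repair" is safe is exactly the open question of this thread.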
Re: [ceph-users] Change Unix rights of /var/lib/ceph/{osd, mon}/$cluster-$id/ directories on Infernalis?
Hi David,

On 14/03/2016 18:33, David Casier wrote:
> "usermod -aG ceph snmp" is better ;)

After thinking about it, the solution of adding "snmp" to the "ceph" group seems better to me too... _if_ the "ceph" group never has the "w" right in /var/lib/ceph/ (which seems to be the case). So thanks for reinforcing my choice.

PS: by the way, the "usermod" command always seems complicated to me for adding a user to a group. I prefer the more readable command below. ;)

gpasswd --add snmp ceph

--
François Lafont
[ceph-users] Change Unix rights of /var/lib/ceph/{osd, mon}/$cluster-$id/ directories on Infernalis?
Hi,

I have a ceph cluster on Infernalis and I'm using an snmp agent to retrieve data and generate generic graphs concerning each cluster node. Currently, I can see in the syslog of each node this kind of line (every 5 minutes):

Mar 11 03:15:26 ceph01 snmpd[16824]: Cannot statfs /var/lib/ceph/mon/ceph-ceph01#012: Permission denied
Mar 11 03:15:26 ceph01 snmpd[16824]: Cannot statfs /var/lib/ceph/osd/ceph-16#012: Permission denied

Of course, it's a basic problem of Unix rights. The snmp agent uses the account "snmp" and the Unix rights of the ceph home directory are:

~# ll -d /var/lib/ceph
drwxr-x--- 9 ceph ceph 4096 Nov 4 06:34 /var/lib/ceph/

So, of course, currently the snmp account can't access /var/lib/ceph/{osd,mon}/$cluster-$id/.

1. Is there a problem (a possible side effect) if I just do this?

chmod o+rx /var/lib/ceph/

Could that create a security problem?

2. Or do you think it's a better idea to just add "snmp" to the Unix group "ceph"? Maybe better than 1. because I don't change the permissions of the directory _and_ it seems that a member of the "ceph" group never has the "w" right in /var/lib/ceph/.

Thanks in advance for your help.

--
François Lafont
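Just to make the arithmetic of option 1. vs option 2. explicit, here is a throwaway helper (my own sketch, not a real tool): with mode 750 (drwxr-x---), the "other" class has neither "r" nor "x", which is exactly snmpd's "Permission denied"; adding snmp to the ceph group makes the group triplet apply instead, while chmod o+rx fixes the "other" triplet.

```shell
# Can a user class (owner/group/other) list and enter a directory with the
# given octal mode? Listing a directory needs both 'r' and 'x'.
may_read_dir() {  # usage: may_read_dir OCTAL_MODE owner|group|other
  case $2 in
    owner) bits=$(( 0$1 / 64 % 8 )) ;;
    group) bits=$(( 0$1 / 8 % 8 )) ;;
    other) bits=$(( 0$1 % 8 )) ;;
  esac
  [ $(( bits & 4 )) -ne 0 ] && [ $(( bits & 1 )) -ne 0 ]
}

may_read_dir 750 other || echo denied     # snmp today: denied
may_read_dir 750 group && echo via-group  # option 2: gpasswd --add snmp ceph
may_read_dir 755 other && echo via-other  # option 1: chmod o+rx /var/lib/ceph
```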
Re: [ceph-users] Cache tier operation clarifications
Hello,

On 04/03/2016 09:17, Christian Balzer wrote:
> Unlike the subject may suggest, I'm mostly going to try and explain how
> things work with cache tiers, as far as I understand them.
> Something of a reference to point to.
[...]

I'm currently unqualified concerning cache tiering, but I'm pretty sure your post is very relevant, and I think you should make a pull request on the Ceph documentation where you could bring all this insight. Here, your explanations will be lost in the depths of the mailing list. ;)

Regards.

--
François Lafont
Re: [ceph-users] Cannot mount cephfs after some disaster recovery
On 01/03/2016 18:14, John Spray wrote:
>> And what is the meaning of the first and the second number below?
>>
>> mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>>                ^ ^
>
> Your whitespace got lost here I think, but I guess you're talking
> about the 1/1 part.

Yes indeed.

> The shorthand MDS status is up/in/max_mds
> (https://github.com/ceph/ceph/blob/master/src/mds/MDSMap.cc#L248)
>
> up: how many daemons are up and holding a rank (they may be active or
> replaying, etc)
> in: how many ranks exist in the MDS cluster
> max_mds: if there are this many MDSs already, new daemons will be made
> standbys instead of having ranks created for them.
>
> On single-active-daemon systems, this is really just going to be 1/1/1
> or 0/1/1 for whether you have an up MDS or not.

Ok, thx John for the explanations.

--
François Lafont
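Restating John's explanation mechanically, the shorthand can be unpacked with a one-liner (nothing Ceph-specific, just splitting the string):

```shell
# Unpack the "up/in/max_mds" shorthand that appears in a line such as
# "mdsmap e21038: 1/1/0 up {0=...=up:active}".
parse_mds_shorthand() {
  echo "$1" | awk -F/ '{ printf "up=%s in=%s max_mds=%s\n", $1, $2, $3 }'
}

parse_mds_shorthand 1/1/0   # prints: up=1 in=1 max_mds=0
```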
Re: [ceph-users] Upgrade to INFERNALIS
Hi,

On 02/03/2016 00:12, Garg, Pankaj wrote:
> I have upgraded my cluster from 0.94.4 as recommended to the just released
> Infernalis (9.2.1) update directly (skipped 9.2.0).
> I installed the packages on each system, manually (.deb files that I built).
>
> After that I followed the steps:
>
> stop ceph-all
> chown -R ceph:ceph /var/lib/ceph
> start ceph-all

Ok, and the journals?

> I am still getting errors on starting OSDs.
>
> 2016-03-01 22:44:45.991043 7fa185f000 -1 filestore(/var/lib/ceph/osd/ceph-69) mount failed to open journal /var/lib/ceph/osd/ceph-69/journal: (13) Permission denied

I suppose your journal is a symlink which points to a raw partition, correct? In this case, the ceph Unix account currently seems to be unable to read and write in this partition. If this partition is /dev/sdb2 (for instance), you have to set the Unix rights of this "file" /dev/sdb2 (manually or via a udev rule).

> 2016-03-01 22:44:46.001112 7fa185f000 -1 osd.69 0 OSD:init: unable to mount object store
> 2016-03-01 22:44:46.001128 7fa185f000 -1 ** ERROR: osd init failed: (13) Permission denied
>
> What am I missing?

I think you forgot to set the Unix rights of the journal partitions. The ceph account must be able to read/write in /var/lib/ceph/osd/$cluster-$id/ _and_ in the journal partitions too.

Regards.

--
François Lafont
Re: [ceph-users] Cannot mount cephfs after some disaster recovery
Hi,

On 01/03/2016 10:32, John Spray wrote:
> As Zheng has said, that last number is the "max_mds" setting.

And what is the meaning of the first and the second number below?

mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
               ^ ^

--
François Lafont
Re: [ceph-users] Infernalis, cephfs: difference between df and du
On 21/01/2016 03:40, Francois Lafont wrote:
> Ah ok, interesting. I have tested and I have noticed however that size
> of a directory is not updated immediately. For instance, if I change
> the size of the regular file in a directory (of cephfs) the size of the
> size doesn't change immediately after.

Misprint: the "size of the directory", of course.

--
François Lafont
Re: [ceph-users] Infernalis, cephfs: difference between df and du
Hi,

On 19/01/2016 07:24, Adam Tygart wrote:
> It appears that with --apparent-size, du adds the "size" of the
> directories to the total as well. On most filesystems this is the
> block size, or the amount of metadata space the directory is using. On
> CephFS, this size is fabricated to be the size sum of all sub-files.
> i.e. a cheap/free 'du -sh $folder'

Ah ok, interesting. I have tested and I have noticed however that size of a directory is not updated immediately. For instance, if I change the size of the regular file in a directory (of cephfs) the size of the size doesn't change immediately after.

> $ stat /homes/mozes/tmp/sbatten
>   File: '/homes/mozes/tmp/sbatten'
>   Size: 138286    Blocks: 0          IO Block: 65536  directory
> Device: 0h/0d     Inode: 1099523094368  Links: 1
> Access: (0755/drwxr-xr-x)  Uid: (163587/ mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2015-10-14 13:38:01.098843320 -0500
>  Birth: -
>
> $ stat /tmp/sbatten/
>   File: '/tmp/sbatten/'
>   Size: 4096      Blocks: 8          IO Block: 4096   directory
> Device: 803h/2051d  Inode: 9568257   Links: 2
> Access: (0755/drwxr-xr-x)  Uid: (163587/ mozes)   Gid: (163587/mozes_users)
> Access: 2016-01-19 00:12:23.331201000 -0600
> Modify: 2015-10-14 13:38:01.098843320 -0500
> Change: 2016-01-19 00:17:29.658902081 -0600
>  Birth: -
>
> $ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
> 276572  /homes/mozes/tmp/sbatten
> $ du -s -B1 /homes/mozes/tmp/sbatten
> 147456  /homes/mozes/tmp/sbatten
>
> $ du -s -B1 /tmp/sbatten
> 225280  /tmp/sbatten
> $ du -s --apparent-size -B1 /tmp/sbatten
> 142382  /tmp/sbatten
>
> Notice how the apparent-size version is *exactly* the Size from the
> stat + the size from the "proper" du?

Err... exactly? Are you sure? 138286 + 147456 = 285742, which is != 276572, no?

Anyway, thx for your help Adam.

--
François Lafont
Re: [ceph-users] Infernalis, cephfs: difference between df and du
On 19/01/2016 05:19, Francois Lafont wrote:
> However, I still have a question. Since my previous message, supplementary
> data have been put in the cephfs and the values have changed as you can see:
>
> ~# du -sh /mnt/cephfs/
> 1.2G    /mnt/cephfs/
>
> ~# du --apparent-size -sh /mnt/cephfs/
> 6.4G    /mnt/cephfs/
>
> You can see that the difference between "disk usage" and "apparent size"
> has really increased, and it seems curious to me that sparse files alone can
> explain this difference (in my mind, sparse files are very specific files,
> and here the files are essentially images, which don't seem to me to be
> likely sparse files). I'm not completely sure, but I think the same files
> are put several times in the cephfs directory.
>
> Do you think it's possible that the same files, present in different
> directories of the cephfs, are stored in only one object in the cephfs pool?
>
> This is my feeling when I see that the difference between "apparent size"
> and "disk usage" has increased. Am I wrong?

In fact, I'm not so sure. Here is another piece of information, where /backups is an XFS partition:

~# du --apparent-size -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
2.8G    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# du -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
701M    /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/

~# cp -r /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/ /backups/test

~# du -sh /backups/test
701M    /backups/test

~# du --apparent-size -sh /backups/test
701M    /backups/test

So I definitely don't understand du --apparent-size -sh...

--
François Lafont
Re: [ceph-users] Infernalis, cephfs: difference between df and du
Hi,

On 18/01/2016 05:00, Adam Tygart wrote:
> As I understand it:

I think you understand it well. ;)

> 4.2G is used by ceph (all replication, metadata, et al); it is a sum of
> all the space "used" on the osds.

I confirm that.

> 958M is the actual space the data in cephfs is using (without replication).
> 3.8G means you have some sparse files in cephfs.
>
> 'ceph df detail' should return something close to 958MB used for your
> cephfs "data" pool. "RAW USED" should be close to 4.2GB

Yes, your predictions are correct. ;)

However, I still have a question. Since my previous message, supplementary data have been put in the cephfs and the values have changed, as you can see:

~# du -sh /mnt/cephfs/
1.2G    /mnt/cephfs/

~# du --apparent-size -sh /mnt/cephfs/
6.4G    /mnt/cephfs/

You can see that the difference between "disk usage" and "apparent size" has really increased, and it seems curious to me that sparse files alone can explain this difference (in my mind, sparse files are very specific files, and here the files are essentially images, which don't seem to me to be likely sparse files). I'm not completely sure, but I think the same files are put several times in the cephfs directory.

Do you think it's possible that the same files, present in different directories of the cephfs, are stored in only one object in the cephfs pool?

This is my feeling when I see that the difference between "apparent size" and "disk usage" has increased. Am I wrong?

Anyway, thanks a lot for the explanations Adam.

--
François Lafont
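For the record, the sparse-file effect Adam describes is easy to reproduce outside Ceph (my own check, assuming GNU coreutils and a filesystem that supports sparse files, e.g. ext4/XFS/tmpfs): plain du sums the allocated blocks, while --apparent-size sums st_size, so a file that is mostly a hole inflates only the second number.

```shell
f=$(mktemp)
truncate -s 10M "$f"           # a 10 MiB hole: no data block is written
du -B1 --apparent-size "$f"    # first column 10485760: sums st_size
du -B1 "$f"                    # first column (almost) 0: sums allocated blocks
rm -f "$f"
```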
Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition
Hi,

I have not followed this thread closely, so sorry in advance if I'm a little off topic.

Personally, I'm using this udev rule and it works well (the servers are Ubuntu Trusty):

~# cat /etc/udev/rules.d/90-ceph.rules
ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="osd-?*-journal", OWNER="ceph"

Indeed, I'm using GPT and all my journal partitions have this partname pattern: /osd-[0-9]+-journal/

If you currently don't use GPT (but msdos partitions), I think you can do the same thing by using _explicit_ "by-id" entries. For instance, something like this (not tested!):

ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="xxx", OWNER="ceph"
ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="yyy", OWNER="ceph"
# etc.

where xxx, yyy, etc. are the names of your journal partitions in /dev/disk/by-id/.

HTH. ;)

--
François Lafont
Re: [ceph-users] Infernalis, cephfs: difference between df and du
On 18/01/2016 04:19, Francois Lafont wrote:
> ~# du -sh /mnt/cephfs
> 958M    /mnt/cephfs
>
> ~# df -h /mnt/cephfs/
> Filesystem  Size  Used Avail Use% Mounted on
> ceph-fuse    55T  4.2G   55T   1% /mnt/cephfs

Even with the --apparent-size option, the sizes are different (but closer indeed):

~# du -sh --apparent-size /mnt/cephfs
3.8G    /mnt/cephfs

--
François Lafont
[ceph-users] Infernalis, cephfs: difference between df and du
Hello,

Can someone explain to me the difference between the df and du commands concerning the data used in my cephfs? And which is the correct value, 958M or 4.2G?

~# du -sh /mnt/cephfs
958M    /mnt/cephfs

~# df -h /mnt/cephfs/
Filesystem  Size  Used Avail Use% Mounted on
ceph-fuse    55T  4.2G   55T   1% /mnt/cephfs

My client node is a "classical" Ubuntu Trusty, kernel 3.13, but as you can see I'm using ceph-fuse. The cluster nodes are "classical" Ubuntu Trusty nodes too.

Regards.

--
François Lafont
Re: [ceph-users] cephfs (ceph-fuse) and file-layout: "operation not supported" in a client Ubuntu Trusty
Hi,

Some news...

On 08/01/2016 12:42, Francois Lafont wrote:
> ~# mkdir /mnt/cephfs/ssd
>
> ~# setfattr -n ceph.dir.layout.pool -v poolssd /mnt/cephfs/ssd/
> setfattr: /mnt/cephfs/ssd/: Operation not supported
>
> ~# getfattr -n ceph.dir.layout /mnt/cephfs/
> /mnt/cephfs/: ceph.dir.layout: Operation not supported
>
> Here is my fstab line which mounts the cephfs:
>
> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/data1 /mnt/cephfs fuse.ceph noatime,defaults,_netdev 0 0

In fact, I have retried the same thing without the "noatime" mount option and after that it worked. Then I have retried _with_ "noatime" to be sure and... it worked too. Now, it just works with or without the option. So I have 2 possible explanations:

1. Removing noatime and mounting just once unblocked something...

2. Or I have another explanation, embarrassing for me. Maybe during my first attempt the cephfs was simply not mounted. Indeed, I now have a doubt on this point, because a few minutes after the attempt I saw that the cephfs was not mounted (and I don't know why).

--
François Lafont
[ceph-users] cephfs (ceph-fuse) and file-layout: "operation not supported" in a client Ubuntu Trusty
Hi @all,

I'm using ceph Infernalis (9.2.0) on the client and cluster side. I have an Ubuntu Trusty client where cephfs is mounted via ceph-fuse, and I would like to put a sub-directory of cephfs in a specific pool (a ssd pool). In the cluster, I have:

~# ceph auth get client.cephfs
exported keyring for client.cephfs
[client.cephfs]
    key = XX==
    caps mds = "allow"
    caps mon = "allow r"
    caps osd = "allow class-read object_prefix rbd_children, allow rwx pool=cephfsdata, allow rwx pool=poolssd"

~# ceph fs ls
name: cephfs, metadata pool: cephfsmetadata, data pools: [cephfsdata poolssd ]

Now, in the Ubuntu Trusty client, I have installed the "attr" package and I try this:

~# mkdir /mnt/cephfs/ssd

~# setfattr -n ceph.dir.layout.pool -v poolssd /mnt/cephfs/ssd/
setfattr: /mnt/cephfs/ssd/: Operation not supported

~# getfattr -n ceph.dir.layout /mnt/cephfs/
/mnt/cephfs/: ceph.dir.layout: Operation not supported

Here is my fstab line which mounts the cephfs:

id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/data1 /mnt/cephfs fuse.ceph noatime,defaults,_netdev 0 0

Where is my mistake?

Thanks in advance for your help. ;)

--
François Lafont
Re: [ceph-users] cephfs, low performances
olc, I think you haven't posted to the ceph-users list. On 31/12/2015 15:39, olc wrote: > Same model _and_ same firmware (`smartctl -i /dev/sdX | grep Firmware`)? As > far as I've been told, this can make huge differences. Good idea indeed. I have checked: the versions are the same. Finally, after some tests, I think I probably made a mistake, because now I have identical performances on the disks (~192 iops SYNC IO O_DIRECT). > Don't know how important it and if it is relevant in your case is but > transfer rate is supposed better when data are located at the periphery of > the platters than when they are located at the core of the platters. Yes indeed, but in my case the spinning hard drives have only one partition. Only the SSDs have several partitions. Regards. -- François Lafont
Re: [ceph-users] cephfs, low performances
Hi, On 31/12/2015 15:30, Robert LeBlanc wrote: > Because Ceph is not perfectly distributed there will be more PGs/objects in > one drive than others. That drive will become a bottleneck for the entire > cluster. The current IO scheduler poses some challenges in this regard. > I've implemented a new scheduler which I've seen much better drive > utilization across the cluster as well as 3-17% performance increase and a > substantial reduction in client performance deviation (all clients are > getting the same amount of performance). Hopefully we will be able to get > that into Jewel. Ok, thanks for the information. I too hope it will be ready for Jewel. If I have understood correctly, Jewel will bring many improvements. I'm following that with attention... ;) -- François Lafont
Re: [ceph-users] In production - Change osd config
Hi, On 03/01/2016 02:16, Sam Huracan wrote: > I try restart all osd but not efficient. > Is there anyway to apply this change transparently to client? You can use this command (it's an example): # On a cluster node where the admin account is available. ceph tell 'osd.*' injectargs '--osd_disk_threads 2' Afterwards, you can check the config of a specific osd. For instance: ceph daemon osd.5 config show | grep 'osd_disk_threads' But you must launch this command on the node which hosts the osd.5 daemon. Furthermore, with "ceph tell osd.\* injectargs ..." it's possible to set a parameter on all osds from a single cluster node with just one command, but I don't know if it's possible to just _get_ (not set) the value of a parameter of all osds with just one command. Does such a command exist? Personally, I don't know of one, and currently I have to launch "ceph daemon osd.$id config show" for each osd hosted by the server where I'm connected, and then repeat the commands on the other cluster nodes. Regards. -- François Lafont
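For what it's worth, the "get from every osd" part can be scripted from a single admin node. This is only a sketch under assumptions: password-less SSH from the admin node to each OSD host, and a `ceph osd find` output containing a "host" field (verify the JSON layout on your release); `get_all_osd_option` is just a hypothetical helper name:

```shell
# Print one config option for every OSD in the cluster, from one node.
# "ceph osd ls" lists OSD ids; "ceph osd find <id>" is assumed to report
# the hosting node; the admin-socket query must run there, hence the ssh.
get_all_osd_option() {
    local opt="$1"
    for id in $(ceph osd ls); do
        host=$(ceph osd find "$id" | awk -F'"' '/"host"/ {print $4; exit}')
        printf 'osd.%s (%s): ' "$id" "$host"
        ssh "$host" "ceph daemon osd.$id config show" | grep "\"$opt\""
    done
}
# Example: get_all_osd_option osd_disk_threads
```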
Re: [ceph-users] cephfs, low performances
Hi, On 30/12/2015 10:23, Yan, Zheng wrote: >> And it seems to me that I can see the bottleneck of my little cluster (only >> 5 OSD servers with each 4 osds daemons). According to the "atop" command, I >> can see that some disks (4TB SATA 7200rpm Western digital WD4000FYYZ) are >> very busy. It's curious because during the bench I have some disks very busy >> and some other disks not so busy. But I think the reason is that is a little >> cluster and with just 15 osds (the 5 other osds are full SSD osds >> cephfsmetadata >> dedicated), I can have a perfect repartition of data, especially when the >> bench concern just a specific file of few hundred MB. > > do these disks have same size and performance? large disks (with > higher wights) or slow disks are likely busy. The disks are exactly the same model with the same size (4TB SATA 7200rpm Western Digital WD4000FYYZ). I'm not completely sure, but it seems to me that one specific node has a disk which is a little slower than the others (maybe ~50-75 iops slower), and it seems to me that it's the busiest disk during a bench. Is it possible (or common) to see performance differences between disks of exactly the same model? >> That being said, when you talk about "using buffered IO" I'm not sure to >> understand the option of fio which is concerns by that. Is it the --buffered >> option ? Because with this option I have noticed no change concerning iops. >> Personally, I was able to increase global iops only with the --numjobs >> option. >> > > I didn't make it clear. I actually meant buffered write (add > --rwmixread=0 option to fio) . But with fio, if I set "--readwrite=randrw --rwmixread=0", it's completely equivalent to just setting "--readwrite=randwrite", no? > In your test case, writes mix with reads. Yes indeed. > read is synchronous when cache miss. You mean that I have SYNC IO for reads if I set --direct=0, is that correct? Is it valid for any file system or just for cephfs? Regards. 
-- François Lafont
Re: [ceph-users] cephfs, low performances
Hi, On 28/12/2015 09:04, Yan, Zheng wrote: >> Ok, so in a client node, I have mounted cephfs (via ceph-fuse) and a rados >> block device formatted in XFS. If I have well understood, cephfs uses sync >> IO (not async IO) and, with ceph-fuse, cephfs can't make O_DIRECT IO. So, I >> have tested this fio command in cephfs _and_ in rbd: >> >> fio --randrepeat=1 --ioengine=sync --direct=0 --gtod_reduce=1 >> --name=readwrite \ >> --filename=rw.data --bs=4k --iodepth=1 --size=300MB >> --readwrite=randrw \ >> >> >> and indeed with cephfs _and_ rbd, I have approximatively the same result: >> - cephfs => ~516 iops >> - rbd=> ~587 iops >> >> Is it consistent? >> > yes Ok, cool. ;) >> That being said, I'm unable to know if it's good performance as regard my >> hardware >> configuration. I'm curious to know the result in other clusters with the >> same fio >> command. > > This fio command is check performance of single thread SYNC IO. If you > want to check overall throughput, you can try using buffered IO or > increasing thread number. Ok, I have increased the thread number via the --numjobs option of fio and indeed, if I add up the iops of each job, I can reach something like ~1000 iops with ~5 jobs. This result seems more in line with my hardware configuration, doesn't it? And it seems to me that I can see the bottleneck of my little cluster (only 5 OSD servers, each with 4 osd daemons). According to the "atop" command, I can see that some disks (4TB SATA 7200rpm Western Digital WD4000FYYZ) are very busy. It's curious because during the bench some disks are very busy and some other disks are not so busy. But I think the reason is that it is a little cluster and, with just 15 osds (the 5 other osds are full-SSD osds dedicated to cephfsmetadata), I can't have a perfect repartition of data, especially when the bench concerns just a specific file of a few hundred MB. 
That being said, when you talk about "using buffered IO" I'm not sure I understand which fio option this refers to. Is it the --buffered option? Because with this option I have noticed no change in iops. Personally, I was able to increase global iops only with the --numjobs option. > FYI, I have written a patch to add AIO support to cephfs kernel client: > https://github.com/ceph/ceph-client/commits/testing Ok, thanks for the information, but I'm afraid I can't test it immediately. >> * --direct=1 => ~1400 iops >> * --direct=0 => ~570 iops >> >> Why I have this behavior? I thought it will be the opposite (better perfs >> with >> --direct=0). Is it normal? >> > linux kernel only supports AIO for fd opened in O_DIRECT mode, when > file is not opened in O_DIRECT mode, AIO is actually SYNC IO. Ok, so this is not ceph specific; this is a behavior of the Linux kernel. Good to know. Anyway, thanks _a_ _lot_ Yan for your very efficient help. I have learned a lot of very interesting things. Regards. -- François Lafont
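For anyone wanting to reproduce the comparison discussed above, here is a sketch that only assembles (echoes, rather than runs) the two command lines side by side, with the flags taken from the earlier message in this thread; --direct is the single difference between the runs:

```shell
# Build the two fio invocations that differ only in --direct. With
# ioengine=libaio, --direct=1 gives true async I/O; with --direct=0 the
# kernel silently degrades AIO to synchronous I/O, as explained above.
common="--randrepeat=1 --ioengine=libaio --gtod_reduce=1 --name=readwrite \
--filename=rw.data --bs=4k --iodepth=16 --size=300MB \
--readwrite=randrw --rwmixread=50"
for direct in 1 0; do
    echo "fio --direct=$direct $common"
done
```

Run the printed commands against a file on the mounted rbd to compare the two modes.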
Re: [ceph-users] cephfs, low performances
Hi, Sorry for my late answer. On 23/12/2015 03:49, Yan, Zheng wrote: >>> fio tests AIO performance in this case. cephfs does not handle AIO >>> properly, AIO is actually SYNC IO. that's why cephfs is so slow in >>> this case. >> >> Ah ok, thanks for this very interesting information. >> >> So, in fact, the question I ask myself is: how to test my cephfs >> to know if I have correct (or not) perfs as regard my hardware >> configuration? >> >> Because currently, in fact, I'm unable to say if I have correct perf >> (not incredible but in line with my hardware configuration) or if I >> have a problem. ;) >> > > It's hard to tell. basically data IO performance on cephfs should be > similar to data IO performance on rbd. Ok, so on a client node, I have mounted cephfs (via ceph-fuse) and a rados block device formatted in XFS. If I have understood correctly, cephfs uses sync IO (not async IO) and, with ceph-fuse, cephfs can't do O_DIRECT IO. So, I have tested this fio command on cephfs _and_ on rbd: fio --randrepeat=1 --ioengine=sync --direct=0 --gtod_reduce=1 --name=readwrite \ --filename=rw.data --bs=4k --iodepth=1 --size=300MB --readwrite=randrw \ --rwmixread=50 and indeed with cephfs _and_ rbd, I have approximately the same result: - cephfs => ~516 iops - rbd => ~587 iops Is it consistent? That being said, I'm unable to tell whether this is good performance given my hardware configuration. I'm curious to know the result on other clusters with the same fio command. Another point: I have noticed something which is very strange to me. It's about the rados block device and this fio command: # In this case, I use libaio and (direct == 0) fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=readwrite \ --filename=rw.data --bs=4k --iodepth=16 --size=300MB --readwrite=randrw \ --rwmixread=50 This command on the rados block device gives me ~570 iops. But the curious thing is that I have better iops if I just change "--direct=0" to "--direct=1" in the command above. 
In this case, I have ~1400 iops. I don't understand this difference. So, I have better perfs with "--direct=1": * --direct=1 => ~1400 iops * --direct=0 => ~570 iops Why do I have this behavior? I thought it would be the opposite (better perfs with --direct=0). Is it normal? -- François Lafont
Re: [ceph-users] cephfs, low performances
Hello, On 21/12/2015 04:47, Yan, Zheng wrote: > fio tests AIO performance in this case. cephfs does not handle AIO > properly, AIO is actually SYNC IO. that's why cephfs is so slow in > this case. Ah ok, thanks for this very interesting information. So, in fact, the question I ask myself is: how to test my cephfs to know whether my perfs are correct given my hardware configuration? Because currently I'm unable to say if I have correct perf (not incredible, but in line with my hardware configuration) or if I have a problem. ;) -- François Lafont
Re: [ceph-users] cephfs, low performances
On 20/12/2015 22:51, Don Waterloo wrote: > All nodes have 10Gbps to each other Even the link client node <---> cluster nodes? > OSD: > $ ceph osd tree > ID WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 5.48996 root default > -2 0.8 host nubo-1 > 0 0.8 osd.0 up 1.0 1.0 > -3 0.8 host nubo-2 > 1 0.8 osd.1 up 1.0 1.0 > -4 0.8 host nubo-3 > 2 0.8 osd.2 up 1.0 1.0 > -5 0.92999 host nubo-19 > 3 0.92999 osd.3 up 1.0 1.0 > -6 0.92999 host nubo-20 > 4 0.92999 osd.4 up 1.0 1.0 > -7 0.92999 host nubo-21 > 5 0.92999 osd.5 up 1.0 1.0 > > Each contains 1 x Samsung 850 Pro 1TB SSD (on sata) > > Each are Ubuntu 15.10 running 4.3.0-040300-generic kernel. > Each are running ceph 0.94.5-0ubuntu0.15.10.1 > > nubo-1/nubo-2/nubo-3 are 2x X5650 @ 2.67GHz w/ 96GB ram. > nubo-19/nubo-20/nubo-21 are 2x E5-2699 v3 @ 2.30GHz, w/ 576GB ram. > > the connections are to the chipset sata in each case. > The fio test to the underlying xfs disk > (e.g. cd /var/lib/ceph/osd/ceph-1; fio --randrepeat=1 --ioengine=libaio > --direct=1 --gtod_reduce=1 --name=readwrite --filename=rw.data --bs=4k > --iodepth=64 --size=5000MB --readwrite=randrw --rwmixread=50) > shows ~22K IOPS on each disk. > > nubo-1/2/3 are also the mon and the mds: > $ ceph status > cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded > health HEALTH_OK > monmap e1: 3 mons at {nubo-1= > 10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0} > election epoch 1104, quorum 0,1,2 nubo-1,nubo-2,nubo-3 > mdsmap e621: 1/1/1 up {0=nubo-3=up:active}, 2 up:standby > osdmap e2459: 6 osds: 6 up, 6 in > pgmap v127331: 840 pgs, 6 pools, 144 GB data, 107 kobjects > 289 GB used, 5332 GB / 5622 GB avail > 840 active+clean > client io 0 B/s rd, 183 kB/s wr, 54 op/s And you have "replica size == 3" in your cluster, correct? Do you have specific mount options or specific options in ceph.conf concerning ceph-fuse? 
So the hardware configuration of your cluster seems to me globally much better than my cluster (config given in my first message), because you have 10Gb links (between the client and the cluster I have just 1Gb) and you have full SSD OSDs. I have tried to put _all_ of cephfs on my SSDs: i.e. the pools "cephfsdata" _and_ "cephfsmetadata" are on the SSDs. The performance is slightly improved, since I have ~670 iops now (with the fio command of my first message again), but it still seems bad to me. In fact, I'm curious to have the opinion of cephfs experts on what iops we can expect. If anything, maybe ~700 iops is a correct figure for our hardware configuration and we are searching for a problem which doesn't exist... -- François Lafont
Re: [ceph-users] cephfs, low performances
On 20/12/2015 21:06, Francois Lafont wrote: > Ok. Please, can you give us your configuration? > How many nodes, osds, ceph version, disks (SSD or not, HBA/controller), RAM, > CPU, network (1Gb/10Gb) etc.? And I add this: with cephfs-fuse, did you have some specific conf on the client side? Specific mount options? Specific parameters in ceph.conf? -- François Lafont
Re: [ceph-users] cephfs, low performances
Hi, On 20/12/2015 19:47, Don Waterloo wrote: > I did a bit more work on this. > > On cephfs-fuse, I get ~700 iops. > On cephfs kernel, I get ~120 iops. > These were both on 4.3 kernel > > So i backed up to 3.16 kernel on the client. And observed the same results. > > So ~20K iops w/ rbd, ~120iops w/ cephfs. Ok. Please, can you give us your configuration? How many nodes, osds, ceph version, disks (SSD or not, HBA/controller), RAM, CPU, network (1Gb/10Gb) etc.? -- François Lafont
Re: [ceph-users] cephfs, low performances
Hello, On 18/12/2015 23:26, Don Waterloo wrote: > rbd -p mypool create speed-test-image --size 1000 > rbd -p mypool bench-write speed-test-image > > I get > > bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq > SEC OPS OPS/SEC BYTES/SEC > 1 79053 79070.82 323874082.50 > 2 144340 72178.81 295644410.60 > 3 221975 73997.57 303094057.34 > elapsed:10 ops: 262144 ops/sec: 26129.32 bytes/sec: 107025708.32 > > which is *much* faster than the cephfs. Me too: I have better performance with rbd (~1400 iops with the fio command in my first message, instead of ~575 iops with the same fio command and cephfs). The question is: is it normal to get ~575 iops with cephfs and my config? Indeed, I imagine that rbd has better performance than cephfs and, after all, maybe my iops value is normal. I don't know... I have tried to edit the crushmap to put the cephfsmetadata pool only on the 5 SSDs. It seems to improve the performance slightly and, with the fio command of my first message, I have ~650 iops now, but it still seems bad to me, no? Currently I'm searching for any option in ceph.conf or any mount option to improve performance with cephfs via ceph-fuse. In the archive of "ceph-users", I have seen the options "client cache size" and "client oc size" which would be used by ceph-fuse. Is it correct? I don't see anything in the documentation. Where should I put these parameters? In the ceph.conf of the client which mounts the cephfs via fuse? In the [global] section? I have tried that but it seems to be ignored. Indeed, I have tried to put these parameters in the [global] section of ceph.conf (on the client node) and I have set very, very small values like this: [global] client cache size = 1024 client oc size = 1024 and I expected it to highly decrease the performance, but there is absolutely no effect and I have the same result (i.e. ~650 iops), so I think the parameters are just ignored. Is it the right place to put these parameters? 
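As far as I know, clients do read [global], but client-side options are conventionally placed in a [client] section (or a [client.<id>] section matching the id= of the mount) of the ceph.conf on the client node. A hedged sketch of that layout, with purely illustrative values, not recommendations:

```ini
; /etc/ceph/ceph.conf on the client node -- the section name is the point
[client]
client cache size = 16384        ; illustrative value, not a recommendation
client oc size = 209715200       ; illustrative value, in bytes
```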
Furthermore, do you know mount options which can improve perf (for cephfs mounted via ceph-fuse)? It seems to me that a mount option noacl existed, but ceph-fuse doesn't know this mount option (I have no need for ACLs). I haven't found the list of mount options on the web; I can just display a short list with the command "ceph-fuse -h". I have tried to change the max_* options but without effect. Thanks in advance for your help. -- François Lafont
Re: [ceph-users] cephfs, low performances
Hi Christian, On 18/12/2015 04:16, Christian Balzer wrote: >> It seems to me very bad. > Indeed. > Firstly let me state that I don't use CephFS and have no clues how this > influences things and can/should be tuned. Ok, no problem. Anyway, thanks for your answer. ;) > That being said, the fio above running in VM (RBD) gives me 440 IOPS > against a single OSD storage server (replica 1) with 4 crappy HDDs and > on-disk journals on my test cluster (1Gb/s links). > So yeah, given your configuration that's bad. I have tried a quick test with a rados block device (size = 4GB, with an EXT4 filesystem) mounted on the same client node (the client node where I'm testing cephfs), and the same "fio" command gives me read/write iops equal to ~1400. So my problem could be cephfs specific, no? That being said, I don't know if it can be a symptom, but during the bench the iops are displayed in real time and the value seems not very constant to me. I can sometimes see peaks at 1800 iops, then suddenly the value is 800 iops, then it returns to ~1400, etc. > In comparison I get 3000 IOPS against a production cluster (so not idle) > with 4 storage nodes. Each with 4 100GB DC S3700 for journals and OS and 8 > SATA HDDs, Infiniband (IPoIB) connectivity for everything. > > All of this is with .80.x (Firefly) on Debian Jessie. Ok, interesting. My cluster is idle, but I have about half as many disks as your cluster and my SATA disks are directly connected to the motherboard. So, it seems logical to me that I have ~1400 and you ~3000, no? > You want to use atop on all your nodes and look for everything from disks > to network utilization. > There might be nothing obvious going on, but it needs to be ruled out. It's a detail, but I have noticed that atop (on Ubuntu Trusty) doesn't display the % of bandwidth of my 10GbE interface. Anyway, I have tried to inspect the cluster nodes during the cephfs bench, but I have seen no bottleneck concerning CPU, network or disks. 
>> I use Ubuntu 14.04 on each server with the 3.13 kernel (it's the same >> for the client ceph where I run my bench) and I use Ceph 9.2.0 >> (Infernalis). > > I seem to recall that this particular kernel has issues, you might want to > scour the archives here. But, in my case, I use cephfs-fuse on the client node, so the kernel version is not relevant, I think. And I thought that the kernel version was not very important on the cluster node side. Am I wrong? >> On the client, cephfs is mounted via cephfs-fuse with this >> in /etc/fstab: >> >> id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ >> /mnt/cephfs >> fuse.cephnoatime,defaults,_netdev0 0 >> >> I have 5 cluster node servers "Supermicro Motherboard X10SLM+-LN4 S1150" >> with one 1GbE port for the ceph public network and one 10GbE port for >> the ceph private network: >> > For the sake of latency (which becomes the biggest issues when you're not > exhausting CPU/DISK), you'd be better off with everything on 10GbE, unless > you need the 1GbE to connect to clients that have no 10Gb/s ports. Yes, exactly. My client is 1Gb/s only. >> - 1 x Intel Xeon E3-1265Lv3 >> - 1 SSD DC3710 Series 200GB (with partitions for the OS, the 3 >> OSD-journals and, just for ceph01, ceph02 and ceph03, the SSD contains >> too a partition for the workdir of a monitor > The 200GB DC S3700 would have been faster, but that's a moot point and not > your bottleneck for sure. > >> - 3 HD 4TB Western Digital (WD) SATA 7200rpm >> - RAM 32GB >> - NO RAID controlleur > > Which controller are you using? No controller: the 3 SATA disks of each node are directly connected to the SATA ports of the motherboard. > I recently came across an Adaptec SATA3 HBA that delivered only 176 MB/s > writes with 200GB DC S3700s as opposed to 280MB/s when used with Intel > onboard SATA-3 ports or a LSI 9211-4i HBA. Thanks for your help Christian. 
-- François Lafont
[ceph-users] cephfs, low performances
Hi, I have a ceph cluster, currently unused, and I have (to my mind) very low performances. I'm not an expert in benchmarks; here is an example of a quick bench: --- # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=readwrite --filename=rw.data --bs=4k --iodepth=64 --size=300MB --readwrite=randrw --rwmixread=50 readwrite: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64 fio-2.1.3 Starting 1 process readwrite: Laying out IO file(s) (1 file(s) / 300MB) Jobs: 1 (f=1): [m] [100.0% done] [2264KB/2128KB/0KB /s] [566/532/0 iops] [eta 00m:00s] readwrite: (groupid=0, jobs=1): err= 0: pid=3783: Fri Dec 18 02:01:13 2015 read : io=153640KB, bw=2302.9KB/s, iops=575, runt= 66719msec write: io=153560KB, bw=2301.7KB/s, iops=575, runt= 66719msec cpu : usr=0.77%, sys=3.07%, ctx=115432, majf=0, minf=604 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued: total=r=38410/w=38390/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=153640KB, aggrb=2302KB/s, minb=2302KB/s, maxb=2302KB/s, mint=66719msec, maxt=66719msec WRITE: io=153560KB, aggrb=2301KB/s, minb=2301KB/s, maxb=2301KB/s, mint=66719msec, maxt=66719msec --- It seems very bad to me. Can I hope for better results with my setup (explained below)? During the bench, I don't see particular symptoms (no CPU blocked at 100%, etc.). If you have advice to improve the perf and/or to make smarter benchmarks, I'm really interested. Thanks in advance for your help. Here is my conf... I use Ubuntu 14.04 on each server with the 3.13 kernel (it's the same for the ceph client where I run my bench) and I use Ceph 9.2.0 (Infernalis). 
On the client, cephfs is mounted via cephfs-fuse with this in /etc/fstab: id=cephfs,keyring=/etc/ceph/ceph.client.cephfs.keyring,client_mountpoint=/ /mnt/cephfs fuse.ceph noatime,defaults,_netdev 0 0 I have 5 cluster node servers "Supermicro Motherboard X10SLM+-LN4 S1150" with one 1GbE port for the ceph public network and one 10GbE port for the ceph private network: - 1 x Intel Xeon E3-1265Lv3 - 1 SSD DC3710 Series 200GB (with partitions for the OS, the 3 OSD-journals and, just for ceph01, ceph02 and ceph03, the SSD also contains a partition for the workdir of a monitor) - 3 HD 4TB Western Digital (WD) SATA 7200rpm - RAM 32GB - NO RAID controller - Each partition uses XFS with the noatime option, except the OS partition in EXT4. Here is my ceph.conf: --- [global] fsid = cluster network = 192.168.22.0/24 public network = 10.0.2.0/24 auth cluster required = cephx auth service required = cephx auth client required = cephx filestore xattr use omap = true osd pool default size = 3 osd pool default min size = 1 osd pool default pg num = 64 osd pool default pgp num = 64 osd crush chooseleaf type = 1 osd journal size = 0 osd max backfills = 1 osd recovery max active = 1 osd client op priority = 63 osd recovery op priority = 1 osd op threads = 4 mds cache size = 100 osd scrub begin hour = 3 osd scrub end hour = 5 mon allow pool delete = false mon osd down out subtree limit = host mon osd min down reporters = 4 [mon.ceph01] host = ceph01 mon addr = 10.0.2.101 [mon.ceph02] host = ceph02 mon addr = 10.0.2.102 [mon.ceph03] host = ceph03 mon addr = 10.0.2.103 --- mds are in active/standby mode. -- François Lafont
Re: [ceph-users] about PG_Number
Hi, On 13/11/2015 09:13, Vickie ch wrote: > If you have a large amount of OSDs but less pg number. You will find your > data write unevenly. > Some OSD have no change to write data. > In the other side, pg number too large but OSD number too small that have a > chance to cause data lost. Data lost, are you sure? Personally, I would have said:

    few PGs per OSD                     lots of PGs per OSD
    * data distributed less evenly      * well-balanced data distribution
    * uses less CPU and RAM             * uses more CPU and RAM

No? François Lafont
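As a side note, the rule of thumb commonly quoted in the community for sizing pg_num is roughly (number of OSDs x 100) / replica count, rounded up to the next power of two (hedged: targets in the range of 50-200 PGs per OSD are the usual advice, not a hard rule). A quick sketch of the arithmetic with example values:

```shell
# Rule-of-thumb PG count: (num_osds * 100) / pool_size, rounded up to
# the next power of two. The values below are only an example.
num_osds=15
pool_size=3
target=$(( num_osds * 100 / pool_size ))   # 500
pg_num=1
while [ "$pg_num" -lt "$target" ]; do pg_num=$(( pg_num * 2 )); done
echo "$pg_num"   # -> 512
```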
Re: [ceph-users] v9.2.0 Infernalis released
Oops, sorry Dan, I meant to send my message to the list. Sorry. > On Mon, Nov 9, 2015 at 11:55 AM, Francois Lafont >> >> 1. Ok, so, the rank of my monitors are 0, 1, 2 but the its ID are 1, 2, 3 >> (ID chosen by himself because the hosts are called ceph01, ceph02 and >> ceph03 and these ID seemed to me a good idea). Is it correct ? >> >> 2. And, if I understand well, with this command `ceph tell mon.$thing >> version` >> $thing is in fact the rank of the monitor, correct? >> >> 3. But with `ceph tell osd.$thing version`, $thing is the ID of the osd, >> correct? >> >> 4. Why not. But in this case, why with the command `ceph tell mon.* version`, >> "*" is expanded to the ID of my monitors (ie 1, 2, 3) and not to the ranks ? >> It seems to me not logical? Am I wrong? >> >> But Dan, in your case (monitors have the ID `hostname -s`), the command >> `ceph tell mon.* version` doesn't work at all, no? Because "*" is expanded >> to `hostname -s` which doesn't match any rank value, no? >> >> Sorry for all these questions, I understand the difference between ID >> and rank for monitors, but currently I don't understand: >> >> - which is $thing (rank or ID?) in the command `ceph tell mon.$thing >> version`? >> - in what "*" is expanded (ranks or IDs?) in the command `ceph tell mon.* >> version`? >> > > Here's the behaviour on hammer. I don't know if this changed in infernalis: > > # ceph mon dump > ... > 0: 128.142.xxx:6790/0 mon.p01001532077xxx > 1: 128.142.yyy:6790/0 mon.p01001532149yyy > 2: 128.142.zzz:6790/0 mon.p01001532184zzz > > > # ceph tell mon.* version > mon.p01001532077xxx: ceph version 0.94.5 > (9764da52395923e0b32908d83a9f7304401fee43) > mon.p01001532149yyy: ceph version 0.94.5 > (9764da52395923e0b32908d83a9f7304401fee43) > mon.p01001532184zzz: ceph version 0.94.5 > (9764da52395923e0b32908d83a9f7304401fee43) > > So, mon.* resolves to the IDs. 
You can tell directly to the IDs: > > # ceph tell mon.p01001532077xxx version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > # ceph tell mon.p01001532149yyy version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > # ceph tell mon.p01001532184zzz version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > > And you can also tell directly to the ranks: > > # ceph tell mon.0 version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > # ceph tell mon.1 version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) > # ceph tell mon.2 version > ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) Ok, thanks Dan for your answer. If I understand correctly: 1. with `ceph tell mon.* version`, "*" is expanded to the IDs of the monitors. 2. But with `ceph tell mon.$thing version`, if $thing is an integer, it is interpreted as a rank, not as an ID; otherwise it is interpreted as an ID. Is it correct? If yes, in conclusion: for the monitor ID, it's better to choose an ID which is not an integer (even if it's not very dramatic). -- François Lafont
Re: [ceph-users] v9.2.0 Infernalis released
On 09/11/2015 06:28, Francois Lafont wrote: > I have just upgraded a cluster to 9.2.0 from 9.1.0. > All seems to be well except I have this little error > message : > > ~# ceph tell mon.* version --format plain > mon.1: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) > mon.2: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) > mon.3: Error ENOENT: problem getting command descriptions from mon.3 < > Here. ;) > mon.3: problem getting command descriptions from mon.3 > > Except this little message, all seems to be fine. > > ~# ceph -s > cluster f875b4c1-535a-4f17-9883-2793079d410a > health HEALTH_OK > monmap e3: 3 mons at > {1=10.0.2.101:6789/0,2=10.0.2.102:6789/0,3=10.0.2.103:6789/0} > election epoch 104, quorum 0,1,2 1,2,3 > mdsmap e66: 1/1/1 up {0=3=up:active}, 2 up:standby > osdmap e256: 15 osds: 15 up, 15 in > flags sortbitwise > pgmap v1094: 192 pgs, 3 pools, 31798 bytes data, 20 objects > 560 MB used, 55862 GB / 55863 GB avail > 192 active+clean > > I have tried to restart mon.3 but no success. Should I ignore the > message? In fact, it's curious: ~# ceph mon dump dumped monmap epoch 3 epoch 3 fsid f875b4c1-535a-4f17-9883-2793079d410a last_changed 2015-11-04 08:25:37.700420 created 2015-11-04 07:31:38.790832 0: 10.0.2.101:6789/0 mon.1 1: 10.0.2.102:6789/0 mon.2 2: 10.0.2.103:6789/0 mon.3 ~# ceph tell mon.1 version ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) ~# ceph tell mon.2 version ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) ~# ceph tell mon.3 version Error ENOENT: problem getting command descriptions from mon.3 [2] root@ceph03 06:35 ~ ~# ceph tell mon.0 version ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) Concerning monitors, I have this in my ceph.conf: [mon.1] host = ceph01 mon addr = 10.0.2.101 [mon.2] host = ceph02 mon addr = 10.0.2.102 [mon.3] host = ceph03 mon addr = 10.0.2.103 So the ID of my monitors are 1, 2, 3. 
But there is a little problem because I have: ~# ceph tell mon.0 version ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) So what is this mon.0?? -- François Lafont
Re: [ceph-users] v9.2.0 Infernalis released
Hi, I have just upgraded a cluster to 9.2.0 from 9.1.0. All seems to be well except that I have this little error message: ~# ceph tell mon.* version --format plain mon.1: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) mon.2: ceph version 9.2.0 (17df5d2948d929e997b9d320b228caffc8314e58) mon.3: Error ENOENT: problem getting command descriptions from mon.3 < Here. ;) mon.3: problem getting command descriptions from mon.3 Except for this little message, all seems to be fine. ~# ceph -s cluster f875b4c1-535a-4f17-9883-2793079d410a health HEALTH_OK monmap e3: 3 mons at {1=10.0.2.101:6789/0,2=10.0.2.102:6789/0,3=10.0.2.103:6789/0} election epoch 104, quorum 0,1,2 1,2,3 mdsmap e66: 1/1/1 up {0=3=up:active}, 2 up:standby osdmap e256: 15 osds: 15 up, 15 in flags sortbitwise pgmap v1094: 192 pgs, 3 pools, 31798 bytes data, 20 objects 560 MB used, 55862 GB / 55863 GB avail 192 active+clean I have tried to restart mon.3 but without success. Should I ignore the message? -- François Lafont
Re: [ceph-users] v0.94.4 Hammer released
Hi, On 20/10/2015 20:11, Stefan Eriksson wrote: > A change like this below, where we have to change ownership was not add to a > point release for hammer right? Right. ;) I have upgraded my ceph cluster from 0.94.3 to 0.94.4 today without any problem. The daemons ran under the root account in 0.94.3 and still do in 0.94.4. I have not changed the ownership of /var/lib/ceph/ at all for this upgrade. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS file to rados object mapping
Hi, On 14/10/2015 06:45, Gregory Farnum wrote: >> Ok, however during my tests I had been careful to replace the correct >> file by a bad file with *exactly* the same size (the content of the >> file was just a little string and I have changed it by a string with >> exactly the same size). I had been careful to undo the mtime update >> too (I had restore the mtime of the file before the change). Despite >> this, the "repair" command worked well. Tested twice: 1. with the change >> on the primary OSD and 2. on the secondary OSD. And I was surprised >> because I though the test 1. (in primary OSD) will fail. > > Hm. I'm a little confused by that, actually. Exactly what was the path > to the files you changed, and do you have before-and-after comparisons > on the content and metadata? I didn't remember exactly the process I have made so I have just retried today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu Trusty) and I have ceph Hammer (version 0.94.3). I have mounted cephfs on /mnt on one of the nodes. ~# cat /mnt/file.txt # yes it's a little file. ;) 123456 ~# ls -i /mnt/file.txt 1099511627776 /mnt/file.txt ~# printf "%x\n" 1099511627776 100 ~# rados -p data ls - | grep 100 100. I have the name of the object mapped to my "file.txt". ~# ceph osd map data 100. osdmap e76 pool 'data' (3) object '100.' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1) So my object is in the primary OSD OSD-1 and in the secondary OSD OSD-2. 
So I open a terminal on the node which hosts the primary OSD OSD-1 and then: ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 123456 ~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 -rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 Now, I change the content with this script, called "change_content.sh", which preserves the mtime after the change:

#!/bin/sh
f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"

So, let's go: I replace the content with new content of exactly the same size (ie "ABCDEF" in this example): ~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 ABCDEF ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 ABCDEF ~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 -rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 Now the secondary OSD contains the good version of the object and the primary a bad version. I launch a "ceph pg repair": ~# ceph pg repair 3.30 instructing pg 3.30 on osd.1 to repair # I'm on the primary OSD and the file below has been repaired correctly. ~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/100.__head_F0B56F30__3 123456 As you can see, the repair command has worked well. Maybe my little test is too trivial? >> Greg, if I understand you well, I shouldn't have too much confidence in >> the "ceph pg repair" command, is it correct? >> >> But, if yes, what is the good way to repair a PG? > > Usually what we recommend is for those with 3 copies to find the > differing copy, delete it, and run a repair — then you know it'll > repair from a good version. But yeah, it's not as reliable as we'd > like it to be on its own. I would like to be sure I understand correctly.
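For what it's worth, the change_content.sh trick above (replace the bytes, then restore the original mtime) can be mirrored in Python; a hedged sketch, not part of the original test:

```python
# Rough Python equivalent (my sketch, not the original script) of
# change_content.sh: replace a file's content while restoring its
# timestamps afterwards, so only the bytes differ.
import os

def change_content_preserving_mtime(path: str, content: str) -> None:
    st = os.stat(path)              # remember atime/mtime before the change
    with open(path, "w") as f:
        f.write(content + "\n")     # the script used `echo`, hence the newline
    os.utime(path, (st.st_atime, st.st_mtime))  # restore original timestamps
```

Same net effect as the cp/echo/touch -r/rm sequence, without the temporary file.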
The process could be (in the case where size == 3): 1. In each of the 3 OSDs where my object is put: md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}* 2. Normally, I will have the same result in 2 OSDs, and in the other OSD, let's call it OSD-X, the result will be different. So, in the OSD-X, I run: rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}* 3. And now I can run the "ceph pg repair" command without risk: ceph pg repair $pg_id Is it the correct process? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
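Step 2 of the process in the message above (spotting the single differing checksum among three replicas) is just a majority vote; a small illustration of that logic (not a Ceph tool):

```python
# Illustration of step 2 (not a Ceph tool): with size == 3, the replica whose
# checksum differs from the other two is the one to delete before `pg repair`.
from collections import Counter

def odd_one_out(checksums):
    """Map of {osd_id: md5hex} -> OSD holding the minority checksum,
    or None when all replicas agree (nothing to fix)."""
    counts = Counter(checksums.values())
    if len(counts) == 1:
        return None
    bad_sum = min(counts, key=counts.get)   # the checksum seen only once
    return next(o for o, s in checksums.items() if s == bad_sum)

print(odd_one_out({"osd.1": "aaa", "osd.2": "bbb", "osd.3": "aaa"}))  # osd.2
```

With only 2 replicas there is no majority, which is exactly why Greg's recommendation assumes 3 copies.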
Re: [ceph-users] v9.1.0 Infernalis release candidate released
Sorry, another remark. On 13/10/2015 23:01, Sage Weil wrote: > The v9.1.0 packages are pushed to the development release repositories:: > > http://download.ceph.com/rpm-testing > http://download.ceph.com/debian-testing I don't see 9.1.0 available for Ubuntu Trusty: http://download.ceph.com/debian-testing/dists/trusty/main/binary-amd64/Packages (the string "9.1" is not present on this page currently). 9.0.3 is available but, after a quick test, that version of the package doesn't create the ceph unix account. Have I forgotten something? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v9.1.0 Infernalis release candidate released
Hi, and thanks to all for this good news. ;) On 13/10/2015 23:01, Sage Weil wrote: >#. Fix the data ownership during the upgrade. This is the preferred > option, > but is more work. The process for each host would be to: > > #. Upgrade the ceph package. This creates the ceph user and group. For >example:: > > ceph-deploy install --stable infernalis HOST > > #. Stop the daemon(s).:: > > service ceph stop # fedora, centos, rhel, debian > stop ceph-all # ubuntu > > #. Fix the ownership:: > > chown -R ceph:ceph /var/lib/ceph > > #. Restart the daemon(s).:: > > start ceph-all # ubuntu > systemctl start ceph.target # debian, centos, fedora, rhel With this (preferred) option, if I understand correctly, I should repeat the commands above host by host. Personally, my monitors are hosted on the OSD servers (I have no dedicated monitor server). So, with this option, the OSD daemons will be upgraded before the monitor daemons. Is that a problem? I ask because, during a migration to a new release, it's generally recommended to upgrade _all_ the monitors before upgrading the first OSD daemon. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS file to rados object mapping
Hi, Thanks for your answer Greg. On 09/10/2015 04:11, Gregory Farnum wrote: > The size of the on-disk file didn't match the OSD's record of the > object size, so it rejected it. This works for that kind of gross > change, but it won't catch stuff like a partial overwrite or loss of > data within a file. Ok, however during my tests I had been careful to replace the correct file with a bad file of *exactly* the same size (the content of the file was just a little string and I changed it to a string of exactly the same length). I had also been careful to undo the mtime update (I restored the file's original mtime after the change). Despite this, the "repair" command worked well. Tested twice: 1. with the change on the primary OSD and 2. on the secondary OSD. And I was surprised, because I thought test 1 (on the primary OSD) would fail. Greg, if I understand you correctly, I shouldn't have too much confidence in the "ceph pg repair" command, is that right? And if so, what is the proper way to repair a PG? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS file to rados object mapping
Hi, On 08/10/2015 22:25, Gregory Farnum wrote: > So that means there's no automated way to guarantee the right copy of > an object when scrubbing. If you have 3+ copies I'd recommend checking > each of them and picking the one that's duplicated... It's curious, because I have already tried with cephfs to "corrupt" a file in the OSD backend. I had a little text file in cephfs mapped to the object "$inode.$num", and this object was in the PG $pg_id, on the primary OSD $primary and on the secondary OSD $secondary (I had indeed size == 2). I thought that the primary OSD was always taken as the reference by the "ceph pg repair" command, so I tried this: # Test A echo "foo blabla..." >/var/lib/ceph/osd/ceph-$primary/current/${pg_id}_head/$inode.$num ceph pg repair $pg_id and the "repair" command worked and my file was repaired correctly. I also tried to change the file on the secondary OSD with: # Test B echo "foo blabla..." >/var/lib/ceph/osd/ceph-$secondary/current/${pg_id}_head/$inode.$num ceph pg repair $pg_id and it was the same: the file was repaired correctly too. In these 2 cases, the good OSD was taken as the reference (the secondary for test A and the primary for test B). So, in this case, how did ceph know which copy of the object was correct? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS: 'ls -alR' performance terrible unless Linux cache flushed
Hi, On 16/06/2015 18:46, negillen negillen wrote: > Fixed! At least looks like fixed. That's cool for you. ;) > It seems that after migrating every node (both servers and clients) from > kernel 3.10.80-1 to 4.0.4-1 the issue disappeared. > Now I get decent speeds both for reading files and for getting stats from > every node. It seems to me that an interesting test could be to let the old kernel in your client nodes (ie 3.10.80-1), use ceph-fuse instead of the ceph kernel module and test if you have decent speeds too. Bye. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.94.2 Hammer released
Hi, On 11/06/2015 19:34, Sage Weil wrote: > Bug #11442 introduced a change that made rgw objects that start with > underscore incompatible with previous versions. The fix to that bug > reverts to the previous behavior. In order to be able to access objects > that start with an underscore and were created in prior Hammer releases, > following the upgrade it is required to run (for each affected bucket):: > > $ radosgw-admin bucket check --check-head-obj-locator \ > --bucket= [--fix] > > You can get a list of buckets with > > $ radosgw-admin bucket list After the upgrade of my radosgw, I can't fix the problem of rgw objects that start with underscore. The command with the --fix option displays some errors which I don't understand. Here is a (troncated) paste of my shell below. Have I done something wrong? Thx in advance for the help. François Lafont -- ~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket { "bucket": "moodles-poc-registry", "check_objects": [ { "key": { "name": "_multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "instance": "" }, "oid": "default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "locator": "default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "needs_fixing": true, "status": "needs_fixing" }, [snip] { "key": { "name": "_multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta", "instance": "" }, "oid": "default.763616.1___multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta", "locator": 
"default.763616.1__multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta", "needs_fixing": true, "status": "needs_fixing" } ] } ~# radosgw-admin --id=radosgw.gw2 bucket check --check-head-obj-locator --bucket=$bucket --fix 2015-06-12 03:01:33.197984 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta) returned ret=-2 ERROR: fix_head_object_locator() returned ret=-2 2015-06-12 03:01:33.200428 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909/layer.2~poMH-PQKCLstUWpMQpji7JuGaBT53Th.meta) returned ret=-2 ERROR: fix_head_object_locator() returned ret=-2 ERROR: fix_head_object_locator() returned ret=-2 2015-06-12 03:01:33.206875 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.1) returned ret=-2 2015-06-12 03:01:33.209293 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/c5a7fc74211188aabf3429539674275645b07717d003c390a943acc44f35c6d0/layer.2~Bg6bkbSOE8GCtV4Mxr0t56vSfTQTCx9.2) returned ret=-2 ERROR: fix_head_object_locator() returned ret=-2 ERROR: fix_head_object_locator() returned ret=-2 [snip] 2015-06-12 03:01:33.301101 7f3c9130d840 -1 ERROR: ioctx.operate(oid=default.763616.1___multipart_registry/images/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b/layer.2~TSdIpafsfGXJ7kKMOVqJ-hn8Aog4ETF.meta) returned ret=-2 { "bucket": "moodles-poc-registry", "check_objects": [ { "key": { "name": "_multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "instance": "" }, "oid": 
"default.763616.1___multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "locator": "default.763616.1__multipart_registry\/images\/1483a2ea4c3f5865d4d583fb484bbe11afe709a6f3d1baef102904d4d9127909\/layer.2~QorD8QaGiDc4HPUP7VVpx4LS-e_7f0u.meta", "needs_fixing": true, "status": "needs_fixing" }, [snip] { "key": { "name": "_multipart_registry\/images\/fa4fd76b09ce9b87bfdc96515f9a5dd5121c01cc996cf5379050d8e13d4a864b\/layer.2~TS
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
Hi, On 27/05/2015 22:34, Gregory Farnum wrote: > Sorry for the delay; I've been traveling. No problem; me too, I'm not really fast to answer. ;) >> Ok, I see. According to the online documentation, the way to close >> a cephfs client session is: >> >> ceph daemon mds.$id session ls # to get the $session_id and the >> $address >> ceph osd blacklist add $address >> ceph osd dump # to get the $epoch >> ceph daemon mds.$id osdmap barrier $epoch >> ceph daemon mds.$id session evict $session_id >> >> Is it correct? >> >> With the commands above, could I reproduce the client freeze in my testing >> cluster? > > Yes, I believe so. In fact, after some tests, the commands above correctly evict the client (`ceph daemon mds.1 session ls` returns an empty array), but on the client side a new connection is automatically established as soon as the cephfs mountpoint is accessed. In short, I haven't succeeded in reproducing the freeze. ;) I tried stopping the network on the client side (ifdown -a) and after a few minutes (more than 60 seconds though), I saw "closing stale session client" in the mds log. But after an `ifup -a`, I got back a cephfs connection and a mountpoint in good health. >> And could it be conceivable one day (for instance with an option) to be >> able to change the behavior of cephfs to be *not*-strictly-consistent, >> like NFS for instance? It seems to me it could improve performances of >> cephfs and cephfs could be more flexible concerning short network failure >> (not really sure for this second point). Ok it's just a remark of a simple >> and unqualified ceph-user ;) but it seems to me that NFS isn't strictly >> consistent and generally this not a problem in many use cases. Am I wrong? > > Mmm, this is something we're pretty resistant to. Ah ok, so I won't insist. ;) > In particular NFS > just doesn't make any efforts to be consistent when there are multiple > writers, and CephFS works *really hard* to behave properly in that > case.
For many use cases it's not a big deal, but for others it is, > and we target some of them. Ok. Thanks Greg for your answer. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: one ceph account per directory?
Hi, Gregory Farnum wrote: >> 1. Can you confirm to me that currently it's impossible to restrict the read >> and write access of a ceph account to a specific directory of a cephfs? > > It's sadly impossible to restrict access to the filesystem hierarchy > at this time, yes. By making use of the file layouts and assigning > each user their own pool you can restrict access to the actual file > data. In fact, according to my tests and with the precious help of John Spray on IRC (thanks to him), it seems that the file-layouts feature can't protect a cephfs directory against deletion by a specific ceph account. Let me be more precise. On a client node, if I mount the cephfs with a specific ceph account, with the file-layouts feature it's possible to configure a cephfs directory so that "root" (on the node) will not be able to *read* or *modify* the files contained in the directory, but "root" will always be able to *remove* the files, because "root" always has the capability "to send unlink operations to the MDS and the MDS will purge the files" (I take the liberty of quoting John Spray from IRC ;) and I have indeed observed this behaviour). >> 2. Is it planned to implement a such feature in a next release of Ceph? > > There are a couple students working on these features this summer, and > many discussions amongst the core team about how to enable secure > multi-tenancy in CephFS. Ok, cool. I'll be glad to test this feature when it is released (I have a knack for falling into bugs by accident ;)). > Just the file layout/multiple-pool one, right now. Or you could do > something like set up an NFS export that each user mounts of the > CephFS, but then you lose all the CephFS goodness on the clients... Ok, I see. Many thanks Greg for your answer. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mount options nodcache and nofsc
Hi, Yan, Zheng wrote: > fsc means fs-cache. it's a kernel facility by which a network > filesystem can cache data locally, trading disk space to gain > performance improvements for access to slow networks and media. cephfs > does not use fs-cache by default. So enabling this option can improve performance, correct? Is there a downside in return? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to backup hundreds or thousands of TB
Hi, Wido den Hollander wrote: > Aren't snapshots something that should protect you against removal? IF > snapshots work properly in CephFS you could create a snapshot every hour. Are you talking about the .snap/ directory in a cephfs directory? If yes, does it work well? Because, with Hammer, if I want to enable this feature: ~# ceph mds set allow_new_snaps true Error EPERM: Snapshots are unstable and will probably break your FS! Set to --yes-i-really-mean-it if you are sure you want to enable them I have never tried with the --yes-i-really-mean-it option. The warning is not very encouraging. ;) > With the recursive statistics [0] of CephFS you could "easily" backup > all your data to a different Ceph system or anything not Ceph. What is the link between this (very interesting) recursive statistics feature and the backup? I'm not sure to understand. Can you explain me? Maybe you test if the size of a directory has changed? > I've done this with a ~700TB CephFS cluster and that is still working > properly. > > Wido > > [0]: > http://blog.widodh.nl/2015/04/playing-with-cephfs-recursive-statistics/ Thanks Wido for this very interesting (and very simple) feature. But does it work well? Because, I use Hammer in a Ubuntu Trusty cluster nodes, and in a Ubuntu Trusty client with 3.16 kernel and cephfs mounted with the kernel module client, I have this: ~# mount | grep cephfs # /mnt is my mounted cephfs 10.0.2.150,10.0.2.151,10.0.2.152:/ on /mnt type ceph (noacl,name=cephfs,key=client.cephfs) ~# ls -lah /mnt/dir1/ total 0 drwxr-xr-x 1 root root 96M May 12 21:06 . drwxr-xr-x 1 root root 103M May 17 23:56 .. drwxr-xr-x 1 root root 96M May 12 21:06 8 drwxr-xr-x 1 root root 4.0M May 17 23:57 test As you can see: /mnt/dir1/8/ => 96M /mnt/dir1/test/ => 4.0M But: /mnt/dir1/ (ie .) => 96M I should have: size("/mnt/dir1/") = size("/mnt/dir1/8/") + size("/mnt/dir1/test/") and this is not the case. Is it normal? 
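Regarding the expectation above: yes, with recursive statistics a directory's size should equal the sum of its subtrees. A rough userland equivalent of that accounting (my sketch of the idea, not how CephFS computes its recursive sizes internally):

```python
# Rough userland equivalent (my sketch) of what CephFS's recursive
# statistics expose as a directory's "size": the total size of every
# file underneath it, computed here with a plain os.walk().
import os

def rbytes(root: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

By this definition, size("/mnt/dir1/") should indeed equal size("/mnt/dir1/8/") + size("/mnt/dir1/test/"), which supports the poster's expectation.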
-- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
John Spray wrote: > Greg's response is pretty comprehensive, but for completeness I'll add that > the specific case of shutdown blocking is http://tracker.ceph.com/issues/9477 Yes indeed, during the freeze, "INFO: task sync:3132 blocked for more than 120 seconds..." was exactly the message I have seen in the VNC console of the client (it was a Openstack VM). -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
Hi, Sorry for my late answer. Gregory Farnum wrote: >> 1. Is this kind of freeze normal? Can I avoid these freezes with a >> more recent version of the kernel in the client? > > Yes, it's normal. Although you should have been able to do a lazy > and/or force umount. :) Ah, I haven't tried it. Maybe I'm wrong but I think a "lazy" or a "force" umount wouldn't succeed. I'll try to test if I can reproduce the freeze. > You can't avoid the freeze with a newer client. :( > > If you notice the problem quickly enough, you should be able to > reconnect everything by rebooting the MDS — although if the MDS hasn't > failed the client then things shouldn't be blocking, so actually that > probably won't help you. Yes, the mds was completely ok and after the hard-reboot of the client, the client had access again to the cephfs with the exactly same mds service in the cluster side (no restart etc). >> 2. Can I avoid these freezes with ceph-fuse instead of the kernel >> cephfs module? But in this case, the cephfs performance will be >> worse. Am I wrong? > > No, ceph-fuse will suffer the same blockage, although obviously in > userspace it's a bit easier to clean up. Yes, I suppose that after "kill" commands, I would be able to remount the cephfs without any reboot etc., isn't it? > Depending on your workload it > will be slightly faster to a lot slower. Though you'll also get > updates faster/more easily. ;) Yes, I imagine that with "ceph-fuse" I have a completely updated cephfs-client (in user-space) whereas with the cephfs-client kernel I have just the version available in the current kernel of my client node (3.16 in my case). >> 3. Is there a parameter in ceph.conf to tell mds to be more patient >> before closing the "stale session" of a client? > > Yes. You'll need to increase the "mds session timeout" value on the > MDS; it currently defaults to 60 seconds. You can increase that to > whatever values you like. 
> The tradeoff here is that if you have a > client die, anything it had "capabilities' on (for read/write access) > will be unavailable for anybody who's doing something that might > conflict with those capabilities. Ok, thanks for the warning, it seems logical. > If you've got a new enough MDS (Hammer, probably, but you can check) Yes, I use Hammer. > then you can use the admin socket to boot specific sessions, so it may > suit you to set very large timeouts and manually zap any client which > actually goes away badly (rather than getting disconnected by the > network). Ok, I see. According to the online documentation, the way to close a cephfs client session is:

ceph daemon mds.$id session ls    # to get the $session_id and the $address
ceph osd blacklist add $address
ceph osd dump                     # to get the $epoch
ceph daemon mds.$id osdmap barrier $epoch
ceph daemon mds.$id session evict $session_id

Is that correct? With the commands above, could I reproduce the client freeze in my testing cluster? I'll try, because it is convenient to be able to reproduce the problem just with command lines (without really stopping the network on the client, etc.). I would like to test whether, with ceph-fuse, I can easily restore the situation of my client.
Ok, it's just a remark from a simple and unqualified ceph-user ;) but it seems to me that NFS isn't strictly consistent and generally this is not a problem in many use cases. Am I wrong? > So while we hope to make this less painful in the future, the network > dying that badly is a failure case that you need to be aware of > meaning that the client might have conflicting information. If it > *does* have conflicting info, the best we can do about it is be > polite, return a bunch of error codes, and unmount gracefully. We'll > get there eventually but it's a lot of work. Yes, I can imagine the amount of work... Thanks a lot Greg for your answer. ;) -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Complete freeze of a cephfs client (unavoidable hard reboot)
Hi, I had a problem with a cephfs freeze in a client. Impossible to re-enable the mountpoint. A simple "ls /mnt" command totally blocked (of course impossible to umount-remount etc.) and I had to reboot the host. But even a "normal" reboot didn't work, the host didn't stop. I had to do a hard reboot of the host. In brief, it was like a big "NFS" freeze. ;) In the logs, nothing relevant in the client side and just this line in the cluster side: ~# cat /var/log/ceph/ceph-mds.1.log [...] 2015-05-14 17:07:17.259866 7f3b5cffc700 0 log_channel(cluster) log [INF] : closing stale session client.1342358 192.168.21.207:0/519924348 after 301.329013 [...] And indeed, the freeze was probably triggered by a little network interruption. Here is my configuration: - OS: Ubuntu 14.04 in the client and in the cluster nodes. - Kernel: 3.16.0-36-generic in the client and in the cluster nodes. (apt-get install linux-image-generic-lts-utopic). - Ceph version: Hammer in the client and in cluster nodes (0.94.1-1trusty). In the client, I use the cephfs kernel module (not ceph-fuse). Here is the fstab line in the client node: 10.0.2.150,10.0.2.151,10.0.2.152:/ /mnt ceph noatime,noacl,name=cephfs,secretfile=/etc/ceph/secret,_netdev 0 0 My only configuration concerning mds in ceph.conf is just: mds cache size = 100 That's all. Here are my questions: 1. Is this kind of freeze normal? Can I avoid these freezes with a more recent version of the kernel in the client? 2. Can I avoid these freezes with ceph-fuse instead of the kernel cephfs module? But in this case, the cephfs performance will be worse. Am I wrong? 3. Is there a parameter in ceph.conf to tell mds to be more patient before closing the "stale session" of a client? I'm in a testing period and a hard reboot of my cephfs clients would be quite annoying for me. Thanks in advance for your help. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Find out the location of OSD Journal
Hi, Patrik Plank wrote: > i cant remember on which drive I install which OSD journal :-|| > Is there any command to show this? It's probably not the answer you hoped for, but why not use a simple: ls -l /var/lib/ceph/osd/ceph-$id/journal ? -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
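For several OSDs at once, the same idea can be scripted; a hypothetical helper (my sketch, not a stock Ceph command) that resolves each journal symlink:

```python
# Hypothetical helper (my sketch, not a Ceph tool): list where each OSD's
# journal symlink points, assuming the usual FileStore layout in which
# /var/lib/ceph/osd/ceph-<id>/journal is a symlink to the journal device/file.
import glob
import os

def osd_journals(root="/var/lib/ceph/osd"):
    """Return {journal symlink path: resolved target} for every OSD under root."""
    return {j: os.path.realpath(j)
            for j in glob.glob(os.path.join(root, "ceph-*", "journal"))
            if os.path.exists(j)}

for link, target in osd_journals().items():
    print(link, "->", target)
```

On a real OSD node this prints one line per OSD; on anything else it simply prints nothing.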
Re: [ceph-users] Some more numbers - CPU/Memory suggestions for OSDs and Monitors
Mark Nelson wrote: > I'm not sure who came up with the 1GB for each 1TB of OSD daemons rule, but > frankly I don't think it scales well at the extremes. You can't get by with > 256MB of ram for OSDs backed by 256GB SSDs, nor do you need 6GB of ram per > OSD for 6TB spinning disks. > > 2-4GB of RAM per OSD is reasonable depending on how much page cache you need. > I wouldn't stray outside of that range myself. Ok. It's recorded. > What it really comes down to is that your CPU needs to be fast enough to > process your workload. Small IOs tend to be more CPU intensive than large > IOs. Some processors have higher IPC than others so it's all just kind of a > vague guessing game. With modern Intel XEON processors, 1GHz of 1 core is a > good general estimate. If you are doing lots of small IO with SSD backed > OSDs you may need more. If you are doing high performance erasure coding you > may need more. If you have slow disks with journals on disk, 3x replication, > and a mostly read workload, you may be able to get away with less. > > As always, the recommendations above are just recommendations. It's best if > you can test yourself. Yes, sure. Thx for the explanations Mark. :) Bye. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Cephfs: proportion of data between data pool and metadata pool
Hi, When I want to have an estimation of the pg_num of a new pool, I use this very useful page: http://ceph.com/pgcalc/. In the table, I must give the %data of a pool. For instance, for a "rados gateway only" use case, I can see that, by default, the page gives: - .rgw.buckets => 96.90% of data - .rgw.control => 0.10% of data - etc. But in the menu, the use case "cephfs only" doesn't exist and I have no idea of the %data for the metadata and data pools. So, what is the approximate %data split between the "data" pool and the "metadata" pool of cephfs in a cephfs-only cluster? Is it rather metadata=20%, data=80%? metadata=10%, data=90%? metadata=5%, data=95%? etc. Thanks in advance. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
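In the meantime, the arithmetic pgcalc appears to apply can be done by hand for any assumed %data split (hedged: verify against http://ceph.com/pgcalc/ itself; the function below is my sketch): target PGs per OSD, times OSD count, times the pool's share of the data, divided by the replica size, rounded up to the next power of two.

```python
# Sketch of a pgcalc-style estimate (my reading of the formula; check the
# pgcalc page before sizing a real pool): pgs = target_per_osd * osd_count
# * %data / replica_size, rounded up to the next power of two.
import math

def suggest_pg_num(osds, pct_data, size=3, target_pgs_per_osd=100):
    """pct_data is a percentage in 0-100."""
    raw = target_pgs_per_osd * osds * (pct_data / 100.0) / size
    return 2 ** max(0, math.ceil(math.log2(raw)))

# e.g. 15 OSDs, assuming a data pool at ~95% and a metadata pool at ~5%:
print(suggest_pg_num(15, 95.0), suggest_pg_num(15, 5.0))  # 512 32
```

Whatever metadata/data split turns out to be realistic can simply be plugged into `pct_data`.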
[ceph-users] Radosgw and mds hardware configuration
Hi Cephers, :) I would like to know if there are some rules to estimate (approximately) the CPU and RAM needs of: 1. a radosgw server (for instance with Hammer and civetweb). 2. an mds server. If I am not mistaken, these 2 types of server have no particular storage requirements. For an mds server, I wonder if this page is up to date: http://ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations 1GB per mds daemon seems very low to me. Thanks for your help. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] decrease pg number
Hi,

Pavel V. Kaygorodov wrote:

> I have updated my cluster to Hammer and got a warning "too many PGs
> per OSD (2240 > max 300)". I know, that there is no way to decrease
> number of page groups, so I want to re-create my pools with less pg
> number, move all my data to them, delete old pools and rename new
> pools as the old ones. Also I want to preserve the user rights on new
> pools. I have several pools with RBD images, some of them with
> snapshots.
>
> Which is the proper way to do this?

I'm not a ceph expert and I can just tell you my (little but happy)
experience. ;) I had the same problem with my radosgw pools, ie:

- the .rgw.* pools except ".rgw.buckets", and
- the .users.* pools

So, **warning**, it was for very tiny pools. The version of Ceph was
Hammer 0.94.1, nodes were Ubuntu 14.04 with a 3.16 kernel. These commands
worked well for me:

---
# /!\ Before, I have stopped my radosgws (ie the ceph clients of the pools).
old_pool=foo
new_pool=foo.new
ceph osd pool create $new_pool 64
rados cppool $old_pool $new_pool
ceph osd pool delete $old_pool $old_pool --yes-i-really-really-mean-it
ceph osd pool rename $new_pool $old_pool
# And I have restarted my radosgws.
---

That's all. In my case, it was very fast because the pools didn't contain
much data.

And I prolong your question: is it possible to do the same process but with
a pool of the cephfs? For instance, the pool metadata? If I try the
commands above, I have an error with the delete command:

~# ceph osd pool delete metadata metadata --yes-i-really-really-mean-it
Error EBUSY: pool 'metadata' is in use by CephFS

However, I'm sure no client uses the cephfs (it's a cluster for test).

--
François Lafont
Re: [ceph-users] Some more numbers - CPU/Memory suggestions for OSDs and Monitors
Hi,

Christian Balzer wrote:

>> thanks for the feedback regarding the network questions. Currently I try
>> to solve the question of how much memory, cores and GHz for OSD nodes
>> and Monitors.
>>
>> My research so far:
>>
>> OSD nodes: 2 GB RAM, 2 GHz, 1 Core (?) per OSD
>>
> RAM is enough, but more helps (page cache on the storage node makes the
> reads of hot objects quite fast and prevents concurrent access to the
> disks).

Personally, I have seen a different rule for the RAM: "1GB for each 1TB of
OSD daemons". This is what I understand from this doc:
http://ceph.com/docs/master/start/hardware-recommendations/#minimum-hardware-recommendations

So, for instance, with (it's just a stupid example):

- 4 OSD daemons of 6TB and
- 5 OSD daemons of 1TB

the needed RAM would be:

1GB x (4 x 6) + 1GB x (5 x 1) = 29GB of RAM

Is it correct? Because if I follow the "2GB RAM per OSD" rule, I just need:
2GB x 9 = 18GB. Which rule is correct?

> 1GHz or so per pure HDD based OSD, at least 2GHz for HDD OSDs with SSD
> journals, as much as you can afford for entirely SSD based OSDs.

Are there links about the "at least 2GHz per OSD with SSD journal", because
I have never seen that except in this mailing list. For instance in the
"HARDWARE CONFIGURATION GUIDE" of Inktank, it is just indicated: "one GHz
per OSD" (https://ceph.com/category/resources/). Why should SSD journals
increase the needed CPU?

--
François Lafont
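The two rules being compared can be written down directly (the figures come straight from the example above; this just makes the arithmetic explicit):

```python
# Comparing the two RAM rules of thumb for the example node in this thread:
# 4 x 6TB OSDs + 5 x 1TB OSDs.

def ram_gb_per_tb(osd_sizes_tb):
    """The '1GB of RAM per 1TB of OSD' rule: RAM scales with capacity."""
    return sum(osd_sizes_tb)

def ram_gb_per_osd(n_osds, gb_per_osd=2):
    """The '2GB of RAM per OSD daemon' rule: RAM scales with daemon count."""
    return n_osds * gb_per_osd

osds_tb = [6, 6, 6, 6, 1, 1, 1, 1, 1]
print(ram_gb_per_tb(osds_tb))        # capacity-based rule
print(ram_gb_per_osd(len(osds_tb)))  # per-daemon rule
```

The gap (29GB vs 18GB) is exactly the disagreement raised in the question: the two rules diverge as soon as disk sizes are far from 1-2TB per OSD.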
Re: [ceph-users] What is a "dirty" object
Hi,

John Spray wrote:

> As far as I can see, this is only meaningful for cache pools, and an object
> is "dirty" in the sense of having been created or modified since its last
> flush. For a non-cache-tier pool, everything is logically dirty since it is
> never flushed.
>
> I hadn't noticed that we presented this as nonzero for regular pools before,
> it is a bit weird. Perhaps we should show zero here instead for
> non-cache-tier pools.

Ok, in this case, maybe something like "Not_Relevant" or "NR" could be more
suitable.

Thank you John.

--
François Lafont
Re: [ceph-users] Questions about an example of ceph infrastructure
Hi,

Christian Balzer wrote:

> For starters, make that 5 MONs.
> It won't really help you with your problem of keeping a quorum when
> losing a DC, but being able to lose more than 1 monitor will come in
> handy.
> Note that MONs don't really need to be dedicated nodes, if you know what
> you're doing and have enough resources (most importantly fast I/O aka SSD
> for the leveldb) on another machine.

Ok, I'll keep that in mind.

>> In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk.
>> Journals in SSD, there are 2 SSD so 3 journals per SSD.
>> In DC2: the same config.
>>
> Out of curiosity, is that a 1U case with 8 2.5" bays, or why that
> (relatively low) density per node?

Sorry, I have no idea because, in fact, it was just an example to be
concrete. So I have taken an (imaginary) server with 8 disks and 2 SSDs
(among the 8 disks, 2 for the OS in software RAID1). Currently, I can't be
precise about hardware because we are absolutely not fixed about the budget
(if we get it!), there are a lot of uncertainties.

> 4 nodes make a pretty small cluster, if you lose a SSD or a whole node
> your cluster will get rather busy and may run out of space if you filled
> it more than 50%.

Yes indeed, it's a relevant remark. If the cluster is ~50% filled and if a
node crashes in a DC, the other node in the same DC will be 100% filled and
the cluster will be blocked. Indeed, the cluster is probably too small.

> Unless your OSDs are RAID1s, a replica of 2 is basically asking Murphy to
> "bless" you with a double disk failure. A very distinct probability with
> 24 HDDs.

The probability of a *simultaneous* disk failure in DC1 and in DC2 seems to
me relatively low. For instance, if a disk fails in DC1 and if the
rebalancing of data takes ~ 1 or 2 hours, it seems to me acceptable. But
maybe I'm too optimistic... ;)

> With OSDs backed by plain HDDs you really want a replica size of 3.

But the "2-DCs" topology isn't really suitable for a replica size of 3, no?
Is the replica size of 2 so risky?

> Normally you'd configure Ceph to NOT set OSDs out automatically if a DC
> fails (mon_osd_down_out_subtree_limit)

I didn't know about this option. In the online doc, the explanations are
not clear enough for me and I'm not sure I understand its meaning. If I
set:

mon_osd_down_out_subtree_limit = datacenter

what are the consequences?

- If all OSDs in DC2 are unreachable, these OSDs will not be marked out,
- and if only several OSDs in DC2 are unreachable but not all in DC2,
  these OSDs will be marked out.

Am I correct?

> but in the case of a prolonged DC
> outage you'll want to restore redundancy and set those OSDs out.
> Which means you will need 3 times the actual data capacity on your
> surviving 2 nodes.
> In other words, if your 24 OSDs are 2TB each you can "safely" only store
> 8TB in your cluster (48TB / 3 (replica) / 2 (DCs)).

I see, but my idea was just to have a disaster in DC1 long enough that I
must restart the cluster in degraded mode in DC2, but not long enough that
I must restore a total redundancy in DC2. Personally I didn't consider this
case and, unfortunately, I think we will never have a budget to be able to
restore a total redundancy in just one datacenter. I'm afraid that it's an
unaffordable expense for us.

> Fiber isn't magical FTL (faster than light) communications and the latency
> depends (mostly) on the length (which you may or may not control) and the
> protocol used.
> A 2m long GbE link has a much worse latency than the same length in
> Infiniband.

In our case, if we can implement this infrastructure (if we have the budget
etc.), the connection would probably be 2 dark fibers of 10km between DC1
and DC2. And we'll use Ethernet switches with SFP transceivers (if you have
good references of switches, I'm interested). I suppose it could be
possible to have low latencies in this case, no?
> You will of course need "enough" bandwidth, but what is going to kill
> (making it rather slow) your cluster will be the latency between those DCs.
>
> Each write will have to be acknowledged and this is where every ms less of
> latency will make a huge difference.

Yes indeed, I understand.

>> For instance, I suppose the OSD disks in DC1 (and in DC2) has
>> a throughput equal to 150 MB/s, so with 12 OSD disk in each DC,
>> I have:
>>
>> 12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps
>>
>> So, in the fiber, I need to have 14.4 Mbs. Is it correct?
>
> How do you get from 1.8 GigaByte/s to 14.4 Megabit/s?

Sorry, it was a misprint, I wanted to write 14.4 Gb/s of course. ;)

> You have to multiply, not divide.
> And assuming 10 bits (not 8) for a Byte when serialized never hurts.
> So that's 18 Gb/s.

Yes, indeed. So the "naive" estimation gives 18 Gb/s (Ok for 10 bits
instead of 8).

>> Maybe is it too naive reasoning?
>
> Very much so. Your disks (even with SSD journals) will not write 150MB/s,
> because Ceph doesn't do long sequential writes (though 4MB blobs are
> better than
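For reference, the corrected "naive" estimate from this exchange works out as follows. This is a sketch of the theoretical upper bound only; as Christian notes, real disks will not sustain 150MB/s under Ceph's write patterns.

```python
# Naive inter-DC bandwidth estimate from this thread:
# 12 disks x 150 MB/s per DC, ~10 bits on the wire per byte of payload.

DISKS_PER_DC = 12
MB_PER_S_PER_DISK = 150
BITS_PER_BYTE_ON_WIRE = 10  # serialization overhead, as suggested above

aggregate_mb_s = DISKS_PER_DC * MB_PER_S_PER_DISK   # aggregate disk throughput
link_gbit_s = aggregate_mb_s * BITS_PER_BYTE_ON_WIRE / 1000
print(aggregate_mb_s, "MB/s ->", link_gbit_s, "Gb/s")
```

The point of the thread stands: even this upper bound is only a sizing sanity check, and latency between the DCs, not raw bandwidth, is what dominates write performance.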
[ceph-users] What is a "dirty" object
Hi,

With my testing cluster (Hammer on Ubuntu 14.04), I have this:

--
~# ceph df detail
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED     OBJECTS
    4073G     3897G     176G         4.33          23506
POOLS:
    NAME                   ID     CATEGORY     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ     WRITE
    data                   0      -            20579M     0.49      1934G         6973        6973      597k     2898k
    metadata               1      -            81447k     0         1934G         53          53        243      135k
    volumes                3      -            56090M     1.34      1934G         14393       14393     208k     2416k
    images                 4      -            12194M     0.29      1934G         1551        1551      6263     5912
    .rgw.buckets           13     -            362M       0         1934G         445         445       9244     14954
    .users                 25     -            26         0         1934G         3           3         0        3
    .users.email           26     -            26         0         1934G         3           3         0        3
    .users.uid             27     -            1059       0         1934G         6           6         12       6
    .rgw.root              28     -            840        0         1934G         3           3         63       3
    .rgw.control           29     -            0          0         1934G         8           8         0        8
    .rgw.buckets.extra     30     -            0          0         1934G         8           8         0        8
    .rgw.buckets.index     31     -            0          0         1934G         11          11        0        11
    .rgw.gc                32     -            0          0         1934G         32          32        0        32
    .rgw                   33     -            3064       0         1934G         17          17        0        17
--

If I understand well, all objects in the cluster are "dirty". Is it normal?
What is a "dirty" object?

Thanks for your help.

--
François Lafont
[ceph-users] Questions about an example of ceph infrastructure
Hi,

We are thinking about a ceph infrastructure and I have questions. Here is
the conceived (but not yet implemented) infrastructure:
(please, be careful to read the schema with a monospace font ;))

        +---------+
        |  users  |
        |(browser)|
        +----+----+
             |
    +--------+--------+
    |       WAN       |
    +--+-----------+--+
       |           |
+------+----+ +----+------+
|           | |           |
| monitor-1 | | monitor-3 |
| monitor-2 | |           |
|           | |           |
| OSD-1     | | OSD-13    |
| OSD-2     | | OSD-14    |
| ...       | | ...       |
| OSD-12    | | OSD-24    |
|           | |           |
| client-a1 | | client-a2 |
| client-b1 | | client-b2 |
|           | |           |
+-----+-----+ +-----+-----+
      |             |
      +-----Fiber---+
        connection

 Datacenter1   Datacenter2
    (DC1)         (DC2)

In DC1: 2 "OSD" nodes each with 6 OSDs daemons, one per disk. Journals in
SSD, there are 2 SSD so 3 journals per SSD. In DC2: the same config.

You can imagine for instance that:

- client-a1 and client-a2 are radosgw
- client-b1 and client-b2 are web servers which use the Cephfs of the
  cluster.

And of course, the principle is to have data dispatched in DC1 and DC2
(size == 2, one copy of the object in DC1, the other in DC2).

1. If I suppose that the latency between DC1 and DC2 (via the fiber
connection) is ok, I would like to know which throughput do I need to avoid
a network bottleneck? Is there a rule to compute the needed throughput? I
suppose it depends on the disk throughputs? For instance, I suppose the OSD
disks in DC1 (and in DC2) has a throughput equal to 150 MB/s, so with 12
OSD disk in each DC, I have:

12 x 150 = 1800 MB/s ie 1.8 GB/s, ie 14.4 Mbps

So, in the fiber, I need to have 14.4 Mbs. Is it correct? Maybe is it too
naive reasoning? Furthermore I have not taken into account the SSD. How
evaluate the needed throughput more precisely?

2. I'm thinking about disaster recoveries too. For instance, if there is a
disaster in DC2, DC1 will work (fine). But if there is a disaster in DC1,
DC2 will not work (no quorum). But now, I suppose there is a long and big
disaster in DC1. So I suppose DC1 is totally unreachable. In this case, I
want to start (manually) my ceph cluster in DC2.
No problem with that, I have seen explanations in the documentation to do
that:

- I stop monitor-3
- I extract the monmap
- I remove monitor-1 and monitor-2 from this monmap
- I inject the new monmap in monitor-3
- I restart monitor-3

After that, I have a DC1 unreachable but DC2 is working (with only one
monitor). But what happens if DC1 becomes reachable again? What will be the
behavior of monitor-1 and monitor-2 in this case? Do monitor-1 and
monitor-2 understand that they no longer belong to the ceph cluster?

And now I imagine the worst scenario: DC1 becomes reachable again but the
switch in DC1 which is connected to the fiber takes a long time to restart,
so that, during a short period, DC1 is reachable but the connection with
DC2 is not yet operational. What happens during this period? client-a1 and
client-b1 could write data in the cluster in this case, right? And the data
in the cluster could be compromised because DC1 is not aware of writes in
DC2. Am I wrong?

My conclusion about that is: in case of a long disaster in DC1, I can
restart the ceph cluster in DC2 with the method described above (removing
monitor-1 and monitor-2 from the monmap in monitor-3 etc.) but *only* *if*
I can definitively stop monitor-1 and monitor-2 in DC1 before (and if I
can't, I do nothing and I wait). Is it correct?

Thanks in advance for your explanations.

--
François Lafont
Re: [ceph-users] Upgrade from Firefly to Hammer
Hi,

Garg, Pankaj wrote:

> I have a small cluster of 7 machines. Can I just individually upgrade each of
> them (using apt-get upgrade) from Firefly to Hammer release, or is there more
> to it than that?

Not exactly: it's the "individually" part which is not correct. ;) You
should indeed "apt-get upgrade" on each node 1,..., 7, but afterwards you
should restart the daemons in this order:

1. restart the monitor daemons on each node
2. then, restart the osd daemons on each node
3. then, restart the mds daemons on each node
4. then, restart the radosgw daemon on each node

Regards.

--
François Lafont
Re: [ceph-users] norecover and nobackfill
Robert LeBlanc wrote:

> Hmmm, I've been deleting the OSD (ceph osd rm X; ceph osd crush rm osd.X)
> along with removing the auth key. This has caused data movement,

Maybe, but if the flag "noout" is set, removing an OSD from the cluster
doesn't trigger any data movement at all (I have tested with Firefly).

> I'd still like to know the difference between norecover and nobackfill if
> anyone knows.

If I read this page, http://ceph.com/docs/master/rados/operations/pg-states/,
I understand that backfilling is just a special, more "detailed", case of
recovery (but I'm not a ceph expert).

--
François Lafont
Re: [ceph-users] How to dispatch monitors in a multi-site cluster (ie in 2 datacenters)
Joao Eduardo wrote:

> To be more precise, it's the lowest IP:PORT combination:
>
> 10.0.1.2:6789 = rank 0
> 10.0.1.2:6790 = rank 1
> 10.0.1.3:6789 = rank 2
>
> and so on.

Ok, so if there are 2 possible quorums, the quorum with the lowest IP:PORT
will be chosen. But what happens if, between the 2 possible quorums, quorum
A and quorum B, the monitor which has the lowest IP:PORT belongs to both
quorum A and quorum B?

--
François Lafont
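If it helps, the ranking rule Joao describes can be sketched in a few lines. This is a toy illustration only, assuming ranks are simply assigned by sorting monitors on their (IP, port) pair; the addresses are made up.

```python
# Toy sketch of monitor ranking by lowest IP:PORT combination.
from ipaddress import ip_address

def mon_ranks(addrs):
    """Map 'ip:port' strings to ranks; lowest (IP, port) pair gets rank 0."""
    def key(addr):
        ip, port = addr.rsplit(":", 1)
        return (ip_address(ip), int(port))
    return {addr: rank for rank, addr in enumerate(sorted(addrs, key=key))}

ranks = mon_ranks(["10.0.1.3:6789", "10.0.1.2:6790", "10.0.1.2:6789"])
print(ranks)
# {'10.0.1.2:6789': 0, '10.0.1.2:6790': 1, '10.0.1.3:6789': 2}
```

Note that sorting on `ip_address` objects (rather than raw strings) compares addresses numerically, so e.g. 10.0.1.9 correctly ranks below 10.0.1.10.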
Re: [ceph-users] Radosgw: upgrade Firefly to Hammer, impossible to create bucket
Hi,

Yehuda Sadeh-Weinraub wrote:

> The 405 in this case usually means that rgw failed to translate the http
> hostname header into a bucket name. Do you have 'rgw dns name' set
> correctly?

Ah, I have found it and indeed it concerned "rgw dns name", as Karan also
thought. ;) But it's a little curious. Explanations:

My s3cmd client uses these hostnames (which are well resolved with the IP
address of the radosgw host):

<bucket>.ostore.athome.priv

And in the configuration of my radosgw, I had:

---
[client.radosgw.gw1]
host         = ceph-radosgw1
rgw dns name = ostore
...
---

ie just the *short* name of the radosgw's fqdn (its fqdn is
ostore.athome.priv). And with Firefly, it worked well, I never had a
problem with this configuration! But with Hammer, it doesn't work anymore
(I don't know why). Now, with Hammer, I just notice that I have to put the
fqdn in "rgw dns name", not the short name:

---
[client.radosgw.gw1]
host         = ceph-radosgw1
rgw dns name = ostore.athome.priv
...
---

And with this configuration, it works. Is it normal? In fact, maybe my
configuration with the short name (instead of the fqdn) was not valid and I
was just lucky that it worked well so far. Is that the right conclusion of
the story?

In fact, I think I have never really understood the meaning of the "rgw dns
name" parameter. Can you confirm to me (or not) this: this parameter is
*only* used when a S3 client accesses a bucket with the method
http://<bucket>.<rgw dns name>/. If we don't set this parameter, such
access will not work and a S3 client could access a bucket only with the
method http://<radosgw host>/<bucket>/. Is it correct?

Thx Yehuda and thx to Karan (who had pointed out the real problem in
fact ;)).

--
François Lafont
Re: [ceph-users] norecover and nobackfill
Hi,

Robert LeBlanc wrote:

> What I'm trying to achieve is minimal data movement when I have to service
> a node to replace a failed drive. [...]

I will perhaps say something stupid, but it seems to me that this is the
goal of the "noout" flag, isn't it?

1. ceph osd set noout
2. an old OSD disk fails; no rebalancing of data because noout is set, the
   cluster is just degraded.
3. You remove from the cluster the OSD daemon which used the old disk.
4. You power off the host, replace the old disk by a new disk and restart
   the host.
5. You create a new OSD on the new disk.

With these steps, there will be no data movement, except during step 5,
where the data will be recreated on the new disk (but that's normal and
desired).

Sorry in advance if there is something I'm missing in your problem.

Regards.

--
François Lafont