Re: [ceph-users] Rugged data distribution on OSDs
Hello Greg,

Output of 'ceph osd tree':

# id    weight  type name       up/down reweight
-1      27.3    root default
-2      9.1             host stor1
0       3.64                    osd.0   up      1
1       3.64                    osd.1   up      1
2       1.82                    osd.2   up      1
-3      9.1             host stor2
3       3.64                    osd.3   up      1
4       1.82                    osd.4   up      1
6       3.64                    osd.6   up      1
-4      9.1             host stor3
7       3.64                    osd.7   up      1
8       3.64                    osd.8   up      1
9       1.82                    osd.9   up      1

(The missing osd.5 comes from an earlier test in which I removed an HDD from a working cluster, but I think that is not relevant now.)

root@stor3:~# ceph osd pool get .rgw.buckets pg_num
pg_num: 250
root@stor3:~# ceph osd pool get .rgw.buckets pgp_num
pgp_num: 250

pgmap v129814: 514 pgs: 514 active; 818 GB data, 1682 GB used

Thank you, Mihaly

2013/9/16 Gregory Farnum g...@inktank.com:
What is your PG count and what's the output of ceph osd tree? It's possible that you've just got a slightly off distribution since there still isn't much data in the cluster (probabilistic placement and all that), but let's cover the basics first. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Sep 16, 2013 at 2:08 AM, Mihály Árva-Tóth mihaly.arva-t...@virtual-call-center.eu wrote:
Hello, I made some tests on a 3-node Ceph cluster: uploading 3 million 50 KiB objects to a single container. Speed and performance were okay, but the data is not distributed evenly. Every node has two 4 TB HDDs and one 2 TB HDD.

osd.0 41 GB (4 TB)
osd.1 47 GB (4 TB)
osd.3 16 GB (2 TB)
osd.4 40 GB (4 TB)
osd.5 49 GB (4 TB)
osd.6 17 GB (2 TB)
osd.7 48 GB (4 TB)
osd.8 42 GB (4 TB)
osd.9 18 GB (2 TB)

All the 4 TB and 2 TB HDDs are from the same vendor and the same model (WD RE SATA). I monitored IOPS with Zabbix during the test, which you can see here: http://ctrlv.in/237368 (sda and sdb are system HDDs). The graph looks the same on all three nodes. Is there any idea what's wrong, or what I should be seeing? I'm using ceph-0.67.3 on Ubuntu 12.04.3 x86_64.

Thank you, Mihaly ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph instead of RAID
On Tue, Aug 13, 2013 at 10:41:53AM -0500, Mark Nelson wrote: Hi Mark, On 08/13/2013 02:56 AM, Dmitry Postrigan wrote: I am currently installing some backup servers with 6x3TB drives in them. I played with RAID-10 but I was not impressed at all with how it performs during a recovery. Anyway, I thought what if instead of RAID-10 I use ceph? All 6 disks will be local, so I could simply create 6 local OSDs + a monitor, right? Is there anything I need to watch out for in such configuration? You can do that. Although it's nice to play with and everything, I wouldn't recommend doing it. It will give you more pain than pleasure. Any specific reason? I just got it up and running, an after simulating some failures, I like it much better than mdraid. Again, this only applies to large arrays (6x3TB in my case). I would not use ceph to replace a RAID-1 array of course, but it looks like a good idea to replace a large RAID10 array with a local ceph installation. The only thing I do not enjoy about ceph is performance. Probably need to do more tweaking, but so far numbers are not very impressive. I have two exactly same servers running same OS, kernel, etc. Each server has 6x 3TB drives (same model and firmware #). Server 1 runs ceph (2 replicas) Server 2 runs mdraid (raid-10) I ran some very basic benchmarks on both servers: dd if=/dev/zero of=/storage/test.bin bs=1M count=10 Ceph: 113 MB/s mdraid: 467 MB/s dd if=/storage/test.bin of=/dev/null bs=1M Ceph: 114 MB/s mdraid: 550 MB/s As you can see, mdraid is by far faster than ceph. It could be by design, or perhaps I am not doing it right. Even despite such difference in speed, I would still go with ceph because *I think* it is more reliable. couple of things: 1) Ceph is doing full data journal writes so is going to eat (at least) half of your write performance right there. 2) Ceph tends to like lots of concurrency. You'll probably see higher numbers with multiple dd reads/writes going at once. 3) Ceph is a lot more complex than something like mdraid. It gives you a lot more power and flexibility but the cost is greater complexity. There are probably things you can tune to get your numbers up, but it could take some work. Having said all of this, my primary test box is a single server and I can get 90MB/s+ per drive out of Ceph (with 24 drives!), but if I Could you share the configurations and parameters you have modified, or where I could find the associate documents? was building a production box and never planned to expand to multiple servers, I'd certainly be looking into zfs or btrfs RAID. Mark Dmitry ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best regards, Guangliang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
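Regarding Mark's second point above (Ceph likes concurrency), a quick way to see the difference is to run several direct-I/O dd streams at once instead of a single one. A minimal sketch, assuming the same /storage mount point as in the test above; file names and sizes are only examples:

    for i in 1 2 3 4; do
        dd if=/dev/zero of=/storage/test$i.bin bs=1M count=4096 oflag=direct &
    done
    wait

Comparing the aggregate throughput of the four streams against the single-stream number gives a rough idea of how much of the gap is per-request latency rather than raw bandwidth.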
[ceph-users] Help with radosGW
Hello to all, I've a big issue with Ceph RadosGW. I did a PoC some days ago with radosgw and it worked well. Ceph version 0.67.3 under CentOS 6.4. Now I'm installing a new cluster but I can't get it to work, and I do not understand why. Here are some elements:

ceph.conf:

[global]
filestore_xattr_use_omap = true
mon_host = 192.168.0.1,192.168.0.2,192.168.0.3
fsid = f261d4c5-2a93-43dc-85a9-85211ec7100f
mon_initial_members = mon-1, mon-2, mon-3
auth_supported = cephx
osd_journal_size = 10240

[osd]
cluster_network = 192.168.0.0/24
public_network = 192.168.1.0/24

[client.radosgw.gateway]
host = gw-1
keyring = /etc/ceph/keyring.radosgw.gateway
rgw socket path = /tmp/radosgw.sock
log file = /var/log/ceph/radosgw.log
rgw print continue = false

I followed this doc to install radosgw: http://ceph.com/docs/next/install/rpm/#installing-ceph-object-storage

I start httpd: /etc/init.d/httpd start

I start radosgw:

[root@gw-1]# /etc/init.d/ceph-radosgw start
Starting radosgw instance(s)...
2013-09-17 08:07:11.954248 7f835d7fb820 -1 WARNING: libcurl doesn't support curl_multi_wait()
2013-09-17 08:07:11.954253 7f835d7fb820 -1 WARNING: cross zone / region transfer performance may be affected

I create a user: radosgw-admin user create --uid=alexis
It works. Fine.

So now I connect to the gateway via a client (CyberDuck). I can create a bucket: test. Then I try to upload a file = it does not work. I get a timeout after about 30 seconds and, of course, the file is not uploaded. A rados df on .rgw.buckets shows that there are no objects inside.

Here are some logs. radosgw.log: http://pastebin.com/6NNuczC5 (the last lines are because I stopped radosgw, so as not to pollute the logs) and httpd.log:

[Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Tue Sep 17 08:08:42 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Tue Sep 17 08:08:46 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Tue Sep 17 08:13:02 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi

I'm really disappointed because I can't understand where the issue is. Thanks A LOT for your help. Alexis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
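The "idle timeout" and "incomplete headers" errors above mean Apache is reaching the s3gw.fcgi wrapper but radosgw never answers on its socket. For reference, a minimal wrapper and FastCGI declaration along the lines of the documented setup would look roughly like this (paths assumed to match the ceph.conf above; adjust to your layout):

    # /var/www/s3gw.fcgi  (must be executable: chmod +x /var/www/s3gw.fcgi)
    #!/bin/sh
    exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

    # in the Apache virtual host
    FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

The key points are that the -socket argument matches "rgw socket path" in ceph.conf and that the radosgw process actually has permission to create that socket.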
[ceph-users] VM storage and OSD Ceph failures
Hi to all. Let's assume a Ceph cluster used to store VM disk images, with the VMs booted directly from RBD. What will happen in the case of an OSD failure, if the failed OSD is the primary the VM is reading from? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rbd cp empty block
Yeah,rbd clone works well, thanks a lot! 2013/9/16 Sage Weil s...@inktank.com On Mon, 16 Sep 2013, Chris Dunlop wrote: On Mon, Sep 16, 2013 at 09:20:29AM +0800, ??? wrote: Hi all: I have a 30G rbd block device as virtual machine disk, Aleady installed ubuntu 12.04. About 1G space used. When I want to deploy vm, I made a rbd cp. Then problem came, it copy 30G data instead of 1G. And this action take lots of time. Any ideal? I just want make it faster to deploy vm. It's a bug: http://tracker.ceph.com/issues/6257 Instead of cp, you can use rbd clone; this is copy-on-write and will always be faster than rbd cp. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- OPS 王根意 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
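For reference, the copy-on-write clone workflow Sage points to looks roughly like this; pool and image names are examples only, and clones require a format 2 image (on older releases the flag may be spelled --format 2 instead of --image-format 2):

    rbd create --image-format 2 --size 30720 rbd/ubuntu-golden
    # install the OS into ubuntu-golden once, then:
    rbd snap create rbd/ubuntu-golden@base
    rbd snap protect rbd/ubuntu-golden@base
    rbd clone rbd/ubuntu-golden@base rbd/vm-disk-01

Each clone is created almost instantly and only stores the blocks that diverge from the protected snapshot, which is why it is much faster than rbd cp for deploying many VMs from one image.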
[ceph-users] how to use the admin api (get user info)
Hi, I am following the admin API document http://ceph.com/docs/master/radosgw/adminops/. When I try to get user info it returns 405 Not Allowed. My command is:

curl -XGET http://kp/admin/user?format=json -d'{uid:user1}' -H'Authorization:AWS **:**' -H'Date:**' -i -v

The result is 405 Method Not Allowed, {'code':MethodNotAllowed}. The same command works when I get usage:

curl -XGET http://kp/admin/usage?format=json -d'{uid:user1}' -H'Authorization:AWS **:**' -H'Date:**' -i -v

It returns 200 OK, {entries:[], summary}. My ceph version is 0.56.3. Thank you for your patience! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
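Two things worth checking here, offered as a sketch rather than a confirmed fix for 0.56.x (parts of the admin API and the caps subcommand may simply not exist on that release): the uid belongs in the query string rather than in a request body, and the requesting user needs admin capabilities granted with radosgw-admin. On dumpling and later the flow would look roughly like:

    radosgw-admin caps add --uid=admin --caps="users=read; usage=read"
    curl -XGET 'http://kp/admin/user?format=json&uid=user1' -H 'Authorization: AWS **:**' -H 'Date: **' -i -v

The gateway host, uid and keys above are placeholders taken from the original post.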
[ceph-users] Objects get via s3api FastCGI incomplete headers and hanging up
Hello, I'm trying to download objects from one container (which contains 3 million objects, file sizes between 16K and 1024K) with 10 parallel threads. I'm using the s3 binary that comes with libs3. I'm monitoring download times; 80% of responses take less than 50-80 ms. But sometimes a download hangs for up to 17 seconds and Apache returns error code 500.

apache error log (lots of these):

[Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] FastCGI: comm with server /var/www/radosgw.fcgi aborted: idle timeout (30 sec)
[Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] FastCGI: incomplete headers (0 bytes) received from server /var/www/radosgw.fcgi
[Tue Sep 17 11:33:11 2013] [error] [client 194.38.106.67] Handler for fastcgi-script returned invalid result code 1

I tried both the native apache2/fastcgi Ubuntu packages and the Ceph-built apache2/fastcgi. When I turn on rgw print continue = true with the modified build, the result is slightly better (fewer hangs). FastCgiWrapper is Off, of course. And if I issue only 3 parallel GET requests (instead of 10), the result is much better; the longest hang is only 1500 ms. So I think this depends on some resource management, but I have no idea which. Using ceph-0.67.4 with Ubuntu 12.04 x86_64. I found the following issue (more than a year old): http://tracker.ceph.com/issues/2027 but it was closed as unable to reproduce. I can reproduce it every time.

Thank you, Mihaly ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
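Two knobs that may be worth experimenting with for the symptoms above, offered as suggestions rather than a known fix: the mod_fastcgi idle timeout for the external server, and the radosgw request thread pool size. Illustrative values only; the socket path must match your ceph.conf:

    # Apache virtual host
    FastCgiExternalServer /var/www/radosgw.fcgi -socket /tmp/radosgw.sock -idle-timeout 120

    # ceph.conf, [client.radosgw.gateway] section
    rgw thread pool size = 200

Raising the idle timeout stops Apache aborting requests that radosgw is still working on, and a larger thread pool lets the gateway serve more concurrent GETs before requests start queueing.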
Re: [ceph-users] Decrease radosgw logging level
On 09/13/2013 01:02 PM, Mihály Árva-Tóth wrote: Hello, How can I decrease logging level of radosgw? I uploaded 400k pieces of objects and my radosgw log raises to 2 GiB. Current settings: rgw_enable_usage_log = true rgw_usage_log_tick_interval = 30 rgw_usage_log_flush_threshold = 1024 rgw_usage_max_shards = 32 rgw_usage_max_user_shards = 1 rgw_print_continue = false rgw_enable_ops_log = false rgw_ops_log_rados = false log_file = log_to_syslog = true If you mean output from rgw itself to its own log, try adjusting 'debug rgw'. Default is 1, so check if you have it set to some higher value. You can always set it to 0 too (debug rgw = 0) -Joao -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
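In ceph.conf terms, the adjustment Joao describes would look something like the following under the gateway's own section (section name assumed to be the usual client.radosgw.gateway):

    [client.radosgw.gateway]
        debug rgw = 0
        rgw enable ops log = false

With debug rgw at 0 the gateway only logs errors, which keeps the log file from growing by gigabytes during large uploads.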
Re: [ceph-users] problem with ceph-deploy hanging
On Mon, Sep 16, 2013 at 8:30 PM, Gruher, Joseph R joseph.r.gru...@intel.com wrote: -Original Message- From: Alfredo Deza [mailto:alfredo.d...@inktank.com] Subject: Re: [ceph-users] problem with ceph-deploy hanging ceph-deploy will use the user as you are currently executing. That is why, if you are calling ceph-deploy as root, it will log in remotely as root. So by a different user, I mean, something like, user `ceph` executing ceph- deploy (yes, that same user needs to exist remotely too with correct permissions) This is interesting. Since the preflight has us set up passwordless SSH with a default ceph user I assumed it didn't really matter what user I was logged in as on the admin system. Good to know. Well, it is (for now) a crappy work around. We have fixed this in the upcoming release :) Unfortunately, logging in as my ceph user on the admin system (with a matching user on the target system) does not affect my result. The ceph-deploy install still hangs here: [cephtest02][INFO ] Running command: wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add - It has been suggested that this could be due to our firewall. I have the proxies configured in /etc/environment and when I run a wget myself (as the ceph user, either directly on cephtest02 or via SSH command to cephtest02 from the admin system) it resolves the proxy and succeeds. Is there any reason the wget might behave differently when run by ceph-deploy and fail to resolve the proxy? Is there anywhere I might need to set proxy information besides /etc/environment? I was about to ask if you had tried running that command through SSH, but you did and had correct behavior. This is puzzling for me because that is exactly what ceph-deploy does :/ When you say 'via SSH command' you mean something like: ssh cephtest02 sudo wget -q -O- 'https://ceph.com/git/?p=ceph.git,a=blob_plain;f=keys/release.asc' | apt-key add - Right? The firewall might have something to do with it. How do you have your proxies configured in /etc/environment ? Again, in this next coming release, you will be able to tell ceph-deploy to just install the packages without mangling your repos (or installing keys) Or, any other thoughts on how to debug this further? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd stuck creating a block device
On 09/16/2013 11:29 AM, Nico Massenberg wrote: Am 16.09.2013 um 11:25 schrieb Wido den Hollander w...@42on.com: On 09/16/2013 11:18 AM, Nico Massenberg wrote: Hi there, I have successfully setup a ceph cluster with a healthy status. When trying to create a rbd block device image I am stuck with an error which I have to ctrl+c: ceph@vl0181:~/konkluster$ rbd create imagefoo --size 5120 --pool kontrastpool 2013-09-16 10:59:06.838235 7f3bcb9eb700 0 -- 192.168.111.109:0/1013698 192.168.111.10:6806/3750 pipe(0x1fdfb00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x1fdfd60).fault Any ideas anyone? Is the Ceph cluster healthy? Yes it is. What does 'ceph -s' say? ceph@vl0181:~/konkluster$ ceph -s cluster 3dad736b-a9fc-42bf-a2fb-399cb8cbb880 health HEALTH_OK monmap e3: 3 mons at {ceph01=192.168.111.10:6789/0,ceph02=192.168.111.11:6789/0,ceph03=192.168.111.12:6789/0}, election epoch 52, quorum 0,1,2 ceph01,ceph02,ceph03 osdmap e230: 12 osds: 12 up, 12 in pgmap v3963: 292 pgs: 292 active+clean; 0 bytes data, 450 MB used, 6847 GB / 6847 GB avail mdsmap e1: 0/0/1 up If the cluster is healthy it seems like this client can't contact the Ceph cluster. I have no problems contacting any node/monitor from the admin machine via ping or telnet. It seems like the first monitor (ceph01) is not responding properly, is that one reachable? And if you leave the rbd command running for some time, will it work eventually? Wido Thanks, Nico ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] problem with ceph-deploy hanging
Le 17/09/2013 14:48, Alfredo Deza a écrit : On Mon, Sep 16, 2013 at 8:30 PM, Gruher, Joseph R joseph.r.gru...@intel.com wrote: [...] Unfortunately, logging in as my ceph user on the admin system (with a matching user on the target system) does not affect my result. The ceph-deploy install still hangs here: [cephtest02][INFO ] Running command: wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add - It has been suggested that this could be due to our firewall. I have the proxies configured in /etc/environment and when I run a wget myself (as the ceph user, either directly on cephtest02 or via SSH command to cephtest02 from the admin system) it resolves the proxy and succeeds. Is there any reason the wget might behave differently when run by ceph-deploy and fail to resolve the proxy? Is there anywhere I might need to set proxy information besides /etc/environment? [...] Just a thought, as it concern a proxy server. On Debian, so perhaps also on Ubuntu, sudo does reset almost all environment variables, and it does for sure for http_proxy ones. As ceph-deploy runs sudo on the other end, Perhaps /etc/environment (deprecated) is loaded for the normal user and reset by sudo. I don't know the good way of solving this. Perhaps, just add in the doc that while creating a user with sudo rights, to add the options not to reset http_proxy variables... Extract of sudoers' man : By default, the env_reset option is enabled. This causes commands to be executed with a new, minimal environment. On AIX (and Linux systems without PAM), the environment is initialized with the contents of the /etc/environment file. The new environment contains the TERM, PATH, HOME, MAIL, SHELL, LOGNAME, USER, USERNAME and SUDO_* variables in addition to variables from the invoking process permitted by the env_check and env_keep options. This is effectively a whitelist for environment variables. So you can add something like this in all ceph nodes' /etc/sudoers (use visudo) : Defaults env_keep += http_proxy https_proxy ftp_proxy no_proxy Hope it can help. -- Gilles Mocellin Nuage Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph performance with 8K blocks.
Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency:0.0348977 Stddev Latency: 0.0349212 Max latency:0.326429 Min latency:0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size:8192 Bandwidth (MB/sec):52.573 Average Latency: 0.00237483 Max latency: 0.006783 Min latency: 0.000521 So are these performance correct or is this something I missed with the testing procedure? The RADOS bench number with 8K block size are the same we see when testing performance in an VM with SQLIO. Does anyone know of any configure changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. -- -- *Jason Villalta* Co-founder [image: Inline image 1] 800.799.4407x1230 | www.RubixTechnology.comhttp://www.rubixtechnology.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
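One caveat about the 8K numbers above: rados bench defaults to 16 concurrent operations, so with small objects the result is dominated by per-request latency rather than throughput. Raising the queue depth with -t (mirroring the commands from the post; the value 64 is only an example) usually changes the picture considerably:

    rados --no-cleanup bench -b 8192 -t 64 -p pbench 30 write
    rados bench -t 64 -p pbench 30 seq

This does not change the latency of any individual 8K write, but it shows how much aggregate small-block work the cluster can sustain when many requests are in flight, which is closer to what a busy VM workload looks like.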
Re: [ceph-users] VM storage and OSD Ceph failures
The VM read will hang until a replica gets promoted and the VM resends the read. In a healthy cluster with default settings this will take about 15 seconds. -Greg On Tuesday, September 17, 2013, Gandalf Corvotempesta wrote: Hi to all. Let's assume a Ceph cluster used to store VM disk images. VMs will be booted directly from the RBD. What will happens in case of OSD failure if the failed OSD is the primary where VM is reading from ? ___ ceph-users mailing list ceph-users@lists.ceph.com javascript:; http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
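The window Greg describes is governed mainly by the OSD heartbeat settings; a rough sketch of where they live (the values shown are approximately the defaults, and lowering the grace period trades faster failover for a higher risk of marking a merely busy OSD down):

    [osd]
        osd heartbeat interval = 6
        osd heartbeat grace = 20

These only affect how quickly a dead primary is detected; once the OSD map updates, the client retries the read against the newly promoted replica automatically.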
Re: [ceph-users] Ceph performance with 8K blocks.
Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency:0.0348977 Stddev Latency: 0.0349212 Max latency:0.326429 Min latency:0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size:8192 Bandwidth (MB/sec):52.573 Average Latency: 0.00237483 Max latency: 0.006783 Min latency: 0.000521 So are these performance correct or is this something I missed with the testing procedure? The RADOS bench number with 8K block size are the same we see when testing performance in an VM with SQLIO. Does anyone know of any configure changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. 
-- -- *Jason Villalta* Co-founder [image: Inline image 1] 800.799.4407x1230 | www.RubixTechnology.comhttp://www.rubixtechnology.com/ -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
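On the logging question Greg raises above, a common first step when chasing small-I/O latency on SSD-backed OSDs is to make sure debug logging is dialled down. These settings simply disable most debug output; they are not magic tuning values:

    [osd]
        debug osd = 0
        debug filestore = 0
        debug journal = 0
        debug ms = 0

With verbose logging enabled, each 8K write can generate many log lines, which adds measurable latency on an all-SSD setup.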
Re: [ceph-users] Ceph performance with 8K blocks.
Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? - Original Message - From: Gregory Farnum g...@inktank.com To: Jason Villalta ja...@rubixnet.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency: 0.234199 Stddev Latency: 0.130874 Max latency: 0.867119 Min latency: 0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size: 4194304 Bandwidth (MB/sec): 956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency: 0.0348977 Stddev Latency: 0.0349212 Max latency: 0.326429 Min latency: 0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size: 8192 Bandwidth (MB/sec): 52.573 Average Latency: 0.00237483 Max latency: 0.006783 Min latency: 0.000521 So are these performance correct or is this something I missed with the testing procedure? The RADOS bench number with 8K block size are the same we see when testing performance in an VM with SQLIO. 
Does anyone know of any configure changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. -- -- Jason Villalta Co-founder 800.799.4407x1230 | www.RubixTechnology.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com NOTICE: Protect the information in this message in accordance with the company's security policies. If you received this message in error, immediately notify the sender and destroy all copies.___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph performance with 8K blocks.
Oh, and you should run some local sync benchmarks against these drives to figure out what sort of performance they can deliver with two write streams going on, too. Sometimes the drives don't behave the way one would expect. -Greg On Tuesday, September 17, 2013, Gregory Farnum wrote: Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency:0.0348977 Stddev Latency: 0.0349212 Max latency:0.326429 Min latency:0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size:8192 Bandwidth (MB/sec):52.573 Average Latency: 0.00237483 Max latency: 0.006783 Min latency: 0.000521 So are these performance correct or is this something I missed with the testing procedure? The RADOS bench number with 8K block size are the same we see when testing performance in an VM with SQLIO. 
Does anyone know of any configure changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. -- -- *Jason Villalta* Co-founder [image: Inline image 1] 800.799.4407x1230 | www.RubixTechnology.comhttp://www.rubixtechnology.com/ -- Software Engineer #42 @ http://inktank.com | http://ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
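For the local sync benchmark Greg suggests, something along these lines gives a rough idea of what a single SSD does with two synchronous 8K write streams. The path is a placeholder; run it against the filesystem backing one OSD rather than the raw device:

    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest1 bs=8k count=20000 oflag=dsync &
    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/ddtest2 bs=8k count=20000 oflag=dsync &
    wait

Because each OSD writes both a journal and data stream to the same SSD in this setup, the two-stream sync number is a more realistic ceiling than a single buffered dd.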
Re: [ceph-users] VM storage and OSD Ceph failures
2013/9/17 Gregory Farnum g...@inktank.com: The VM read will hang until a replica gets promoted and the VM resends the read. In a healthy cluster with default settings this will take about 15 seconds. Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Pause i/o from time to time
You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub and deep-scrub cause periodic stalls in RBD. You can try to disable scrub and deep-scrub with: # ceph osd set noscrub # ceph osd set nodeep-scrub If your problem stops, Issue #6278 is likely the cause. To re-enable scrub and deep-scrub: # ceph osd unset noscrub # ceph osd unset nodeep-scrub Because you seem to only have two OSDs, you may also be saturating your disks even without scrub or deep-scrub. http://tracker.ceph.com/issues/6278 Cheers, Mike Dawson On 9/16/2013 12:30 PM, Timofey wrote: I use ceph for HA-cluster. Some time ceph rbd go to have pause in work (stop i/o operations). Sometime it can be when one of OSD slow response to requests. Sometime it can be my mistake (xfs_freeze -f for one of OSD-drive). I have 2 storage servers with one osd on each. This pauses can be few minutes. 1. Is any settings for fast change primary osd if current osd work bad (slow, don't response). 2. Can I use ceph-rbd in software raid-array with local drive, for use local drive instead of ceph if ceph cluster fail? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
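If disabling scrub confirms the diagnosis but you still want scrubbing enabled, the scheduling options below are where the pressure is tuned; the values shown are roughly the defaults, and lengthening the deep-scrub interval or relying on the load threshold spreads the work out over time:

    [osd]
        osd max scrubs = 1
        osd scrub load threshold = 0.5
        osd deep scrub interval = 604800    # one week, in seconds

These reduce how often scrub competes with client I/O, but on a two-OSD cluster the underlying spindle contention issue remains.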
Re: [ceph-users] Ceph performance with 8K blocks.
Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? -- *From: *Gregory Farnum g...@inktank.com *To: *Jason Villalta ja...@rubixnet.com *Cc: *ceph-users@lists.ceph.com *Sent: *Tuesday, September 17, 2013 10:40:09 AM *Subject: *Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. 
RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency:0.0348977 Stddev Latency: 0.0349212 Max latency:0.326429 Min latency:0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size:8192 Bandwidth (MB/sec):52.573 Average Latency: 0.00237483 Max latency: 0.006783 Min latency: 0.000521 So are these performance correct or is this something I missed with the testing procedure? The RADOS bench number with 8K block size are the same we see when testing performance in an VM with SQLIO. Does anyone know of any configure changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. -- -- *Jason Villalta* Co-founder [image: Inline image 1] 800.799.4407x1230 | www.RubixTechnology.comhttp://www.rubixtechnology.com/ -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com *NOTICE: Protect the information in this message in accordance with the company's security policies. If you received this message in error, immediately notify the sender and destroy all copies.* -- -- *Jason
Re: [ceph-users] Ceph performance with 8K blocks.
Ahh thanks I will try the test again with that flag and post the results. On Sep 17, 2013 11:38 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? -- *From: *Jason Villalta ja...@rubixnet.com *To: *Bill Campbell bcampb...@axcess-financial.com *Cc: *Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com *Sent: *Tuesday, September 17, 2013 11:31:43 AM *Subject: *Re: [ceph-users] Ceph performance with 8K blocks. Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? -- *From: *Gregory Farnum g...@inktank.com *To: *Jason Villalta ja...@rubixnet.com *Cc: *ceph-users@lists.ceph.com *Sent: *Tuesday, September 17, 2013 10:40:09 AM *Subject: *Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. 
dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405 Max bandwidth (MB/sec): 6.19531 Min bandwidth (MB/sec): 0 Average Latency:0.0348977 Stddev Latency: 0.0349212 Max latency:0.326429 Min latency:0.0019 -- rados bench -b 8192 -p pbench 30 seq Total reads made: 13770 Read size:8192 Bandwidth (MB/sec):52.573 Average Latency: 0.00237483 Max latency:
Re: [ceph-users] Rugged data distribution on OSDs
Well, that all looks good to me. I'd just keep writing and see if the distribution evens out some. You could also double or triple the number of PGs you're using in that pool; it's not atrocious but it's a little low for 9 OSDs. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 17, 2013 at 12:06 AM, Mihály Árva-Tóth mihaly.arva-t...@virtual-call-center.eu wrote: Hello Greg, Output of 'ceph osd tree': # idweight type name up/down reweight -1 27.3root default -2 9.1 host stor1 0 3.64osd.0 up 1 1 3.64osd.1 up 1 2 1.82osd.2 up 1 -3 9.1 host stor2 3 3.64osd.3 up 1 4 1.82osd.4 up 1 6 3.64osd.6 up 1 -4 9.1 host stor3 7 3.64osd.7 up 1 8 3.64osd.8 up 1 9 1.82osd.9 up 1 (missing of osd.5 comes from previous test when I remove HDD from a working cluster, but I think this is not relevant now) root@stor3:~# ceph osd pool get .rgw.buckets pg_num pg_num: 250 root@stor3:~# ceph osd pool get .rgw.buckets pgp_num pgp_num: 250 pgmap v129814: 514 pgs: 514 active; 818 GB data, 1682 GB used Thank you, Mihaly 2013/9/16 Gregory Farnum g...@inktank.com What is your PG count and what's the output of ceph osd tree? It's possible that you've just got a slightly off distribution since there still isn't much data in the cluster (probabilistic placement and all that), but let's cover the basics first. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Sep 16, 2013 at 2:08 AM, Mihály Árva-Tóth mihaly.arva-t...@virtual-call-center.eu wrote: Hello, I made some tests on 3 node Ceph cluster: upload 3 million 50 KiB object to single container. Speed and performance were okay. But data does not distributed correctly. Every node has got 2 pcs. 4 TB and 1 pc. 2 TB HDD. osd.0 41 GB (4 TB) osd.1 47 GB (4 TB) osd.3 16 GB (2 TB) osd.4 40 GB (4 TB) osd.5 49 GB (4 TB) osd.6 17 GB (2 TB) osd.7 48 GB (4 TB) osd.8 42 GB (4 TB) osd.9 18 GB (2 TB) Every 4 TB and 2 TB HDDs are from same vendor and same type. (WD RE SATA) I monitored iops with Zabbix under test, you can see here: http://ctrlv.in/237368 (sda and sdb are system HDDs) This graph are same on every three nodes. Is there any idea what's wrong or what should I see? I'm using ceph-0.67.3 on Ubuntu 12.04.3 x86_64. Thank you, Mihaly ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
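For reference, the PG increase Greg suggests is applied like this; note that pg_num can only be raised, never lowered, and pgp_num should be raised to match so the new placement groups are actually used for data placement:

    ceph osd pool set .rgw.buckets pg_num 512
    ceph osd pool set .rgw.buckets pgp_num 512

Expect some data movement while the pool rebalances onto the new placement groups.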
Re: [ceph-users] Help with radosGW
On Tue, Sep 17, 2013 at 1:29 AM, Alexis GÜNST HORN alexis.gunsth...@outscale.com wrote: Hello to all, I've a big issue with Ceph RadosGW. I did a PoC some days ago with radosgw. It worked well. Ceph version 0.67.3 under CentOS 6.4 Now, I'm installing a new cluster but, I can't succeed. I do not understand why. Here is some elements : ceph.conf: [global] filestore_xattr_use_omap = true mon_host = 192.168.0.1,192.168.0.2,192.168.0.3 fsid = f261d4c5-2a93-43dc-85a9-85211ec7100f mon_initial_members = mon-1, mon-2, mon-3 auth_supported = cephx osd_journal_size = 10240 [osd] cluster_network = 192.168.0.0/24 public_network = 192.168.1.0/24 [client.radosgw.gateway] host = gw-1 keyring = /etc/ceph/keyring.radosgw.gateway rgw socket path = /tmp/radosgw.sock log file = /var/log/ceph/radosgw.log rgw print continue = false I followed this doc to install radosgw : http://ceph.com/docs/next/install/rpm/#installing-ceph-object-storage I start httpd : /etc/init.d/httpd start I start radosgw : [root@gw-1]# /etc/init.d/ceph-radosgw start Starting radosgw instance(s)... 2013-09-17 08:07:11.954248 7f835d7fb820 -1 WARNING: libcurl doesn't support curl_multi_wait() 2013-09-17 08:07:11.954253 7f835d7fb820 -1 WARNING: cross zone / region transfer performance may be affected I create a user : radosgw-admin user create --uid=alexis It works. Fine. So now, I connect to the gateway via a client (CyberDuck). I can create a bucket : test. Then, I try to upload a file = does not work. I have a time out after about 30 secs. And, of course, the file is not uploaded. A rados df on .rgw.buckets show that there is no objects inside. Here are some logs. radosgw.log: http://pastebin.com/6NNuczC5 (the last lines are because I stop radosgw, not to pollute the logs) and httpd.log : [Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:08:42 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:08:46 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:13:02 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi Are you using the correct fastcgi apache module? Yehuda I'm really diapointed because i can't understand where is the issue. Thanks A LOT for your help. Alexis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Disk partition and replicas
Hi! I have a remote server with a single drive where Ubuntu is installed. I can't create another partition on the disk for an OSD because it is already mounted. Is there another way to install an OSD? Maybe in a folder? And another question... Could I configure Ceph to place a particular replica on a particular OSD? For example, imagine I want a replica of a file kept on a server that runs faster than the others. Thanks!! - Jordi ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
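On the first question: an OSD can be backed by a plain directory instead of a dedicated disk; with ceph-deploy the host:path form does this (hostname and path below are only examples), with the caveat that the OSD then shares I/O with the operating system:

    ceph-deploy osd prepare myserver:/var/local/osd0
    ceph-deploy osd activate myserver:/var/local/osd0

On the second question: data placement is controlled by the CRUSH map, so steering replicas toward a particular host means writing a custom CRUSH rule and assigning it to the pool; there is no per-file or per-object placement override.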
Re: [ceph-users] Help with radosGW
I see that you added your public and cluster networks under an [osd] section. All daemons use the public network, and OSDs use the cluster network. Consider moving those settings to [global]. http://ceph.com/docs/master/rados/configuration/network-config-ref/#ceph-networks Also, I do believe I had a doc bug to fix here. http://tracker.ceph.com/issues/6182 It is now resolved. The s3gw.fcgi file should be in /var/www as suggested. However, my chmod instruction pointed to an incorrect directory. Can you take a look at that and see if that helps? On Tue, Sep 17, 2013 at 1:29 AM, Alexis GÜNST HORN alexis.gunsth...@outscale.com wrote: Hello to all, I've a big issue with Ceph RadosGW. I did a PoC some days ago with radosgw. It worked well. Ceph version 0.67.3 under CentOS 6.4 Now, I'm installing a new cluster but, I can't succeed. I do not understand why. Here is some elements : ceph.conf: [global] filestore_xattr_use_omap = true mon_host = 192.168.0.1,192.168.0.2,192.168.0.3 fsid = f261d4c5-2a93-43dc-85a9-85211ec7100f mon_initial_members = mon-1, mon-2, mon-3 auth_supported = cephx osd_journal_size = 10240 [osd] cluster_network = 192.168.0.0/24 public_network = 192.168.1.0/24 [client.radosgw.gateway] host = gw-1 keyring = /etc/ceph/keyring.radosgw.gateway rgw socket path = /tmp/radosgw.sock log file = /var/log/ceph/radosgw.log rgw print continue = false I followed this doc to install radosgw : http://ceph.com/docs/next/install/rpm/#installing-ceph-object-storage I start httpd : /etc/init.d/httpd start I start radosgw : [root@gw-1]# /etc/init.d/ceph-radosgw start Starting radosgw instance(s)... 2013-09-17 08:07:11.954248 7f835d7fb820 -1 WARNING: libcurl doesn't support curl_multi_wait() 2013-09-17 08:07:11.954253 7f835d7fb820 -1 WARNING: cross zone / region transfer performance may be affected I create a user : radosgw-admin user create --uid=alexis It works. Fine. So now, I connect to the gateway via a client (CyberDuck). I can create a bucket : test. Then, I try to upload a file = does not work. I have a time out after about 30 secs. And, of course, the file is not uploaded. A rados df on .rgw.buckets show that there is no objects inside. Here are some logs. 
radosgw.log: http://pastebin.com/6NNuczC5 (the last lines are because I stop radosgw, not to pollute the logs) and httpd.log : [Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:02:15 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:02:45 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:08:42 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:08:46 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec) [Tue Sep 17 08:12:35 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi [Tue Sep 17 08:13:02 2013] [error] [client 46.231.147.8] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi I'm really diapointed because i can't understand where is the issue. Thanks A LOT for your help. Alexis ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- John Wilkins Senior Technical Writer Intank john.wilk...@inktank.com (415) 425-9599 http://inktank.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
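Concretely, the change John suggests would make the ceph.conf look like this (keeping the same subnets as in the original configuration), so that the gateway and monitors also know which network is public:

    [global]
        public_network = 192.168.1.0/24
        cluster_network = 192.168.0.0/24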
[ceph-users] OpenStack user survey
If you use OpenStack, you should fill out the user survey: https://www.openstack.org/user-survey/Login In particular, it helps us to know how openstack users consume their storage, and it helps the larger community to know what kind of storage systems are being deployed. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph performance with 8K blocks.
As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool?From: "Jason Villalta" ja...@rubixnet.comTo: "Bill Campbell" bcampb...@axcess-financial.comCc: "Gregory Farnum" g...@inktank.com, "ceph-users" ceph-users@lists.ceph.comSent: Tuesday, September 17, 2013 11:31:43 AMSubject: Re: [ceph-users] Ceph performance with 8K blocks.Thanks for you feed back it is helpful.I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? From: "Gregory Farnum" g...@inktank.com To: "Jason Villalta" ja...@rubixnet.comCc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AMSubject: Re: [ceph-users] Ceph performance with 8K blocks.Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even273MB/s with sync 8k writesregardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemonshould be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :)-GregOn Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list.I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. 
DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of Windows.
dd of=ddbenchfile if=/dev/zero bs=8K count=100
819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s
dd if=ddbenchfile of=/dev/null bs=8K
819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s
RADOS bench test with 3 SSD disks and 4MB object size (default):
rados --no-cleanup bench -p pbench 30 write
Total writes made: 2061
Write size: 4194304
Bandwidth (MB/sec): 273.004
Stddev Bandwidth: 67.5237
Max bandwidth (MB/sec): 352
Min bandwidth (MB/sec): 0
Average Latency: 0.234199
Stddev Latency: 0.130874
Max latency: 0.867119
Min latency: 0.039318
rados bench -p pbench 30 seq
Total reads made: 2061
Read size: 4194304
Bandwidth (MB/sec): 956.466
Average Latency: 0.0666347
Max latency: 0.208986
Min latency: 0.011625
This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size.
RADOS bench test with 3 SSD disks and 8K object size (8K blocks):
rados --no-cleanup bench -b 8192 -p pbench 30 write
Total writes made: 13770
Write size: 8192
Bandwidth (MB/sec): 3.581
Stddev Bandwidth: 1.04405
Max bandwidth (MB/sec): 6.19531
Min bandwidth (MB/sec): 0
Average Latency: 0.0348977
Stddev Latency: 0.0349212
Max latency: 0.326429
Min latency: 0.0019
rados bench -b 8192 -p pbench 30 seq
Total reads made: 13770
Read size: 8192
Bandwidth (MB/sec): 52.573
Average Latency: 0.00237483
Max latency: 0.006783
Min latency: 0.000521
So are these performance numbers correct, or is there something I missed in the testing procedure? The RADOS bench numbers with 8K block size are the same we see when testing performance in a VM with SQLIO. Does anyone know of any configuration changes that are needed to get the Ceph performance closer to native performance with 8K blocks? Thanks in advance. -- Jason Villalta Co-founder 800.799.4407x1230 | www.RubixTechnology.com
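Bill's and Greg's point about page-cache effects can be checked directly: the same dd test with the cache taken out of the picture on both the write and the read side gives numbers that are comparable to rados bench. A small sketch; the file name and count here are illustrative, not Jason's exact values:

# Write test that bypasses the page cache (each 8K write goes to the disk):
dd of=ddbenchfile if=/dev/zero bs=8K count=1000000 oflag=direct

# For sync-per-write behaviour closer to what rados bench and SQL do:
dd of=ddbenchfile if=/dev/zero bs=8K count=1000000 oflag=dsync

# Read test that bypasses the page cache:
dd if=ddbenchfile of=/dev/null bs=8K iflag=direct

# Alternatively, drop the caches (as root) before an uncached read test:
sync && echo 3 > /proc/sys/vm/drop_caches
dd if=ddbenchfile of=/dev/null bs=8K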
Re: [ceph-users] problem with ceph-deploy hanging
-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gilles Mocellin
So you can add something like this in all Ceph nodes' /etc/sudoers (use visudo):
Defaults env_keep += "http_proxy https_proxy ftp_proxy no_proxy"
Hope it can help.
Thanks for the suggestion! However, it had no effect on the problem. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
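Since the env_keep change had no effect, it may be worth verifying whether sudo on the target nodes actually preserves the proxy variables, and whether the hang is on package downloads at all. A small sketch, assuming the proxy variables are exported in the shell that runs ceph-deploy; the node name and URL are illustrative:

# On the admin node: are the proxy variables set in the calling shell?
env | grep -i proxy

# On a target node: does sudo keep them after the sudoers change?
ssh ceph-node1 'sudo env | grep -i proxy'

# Quick check that the node can reach the package repo through the proxy at all:
ssh ceph-node1 'sudo wget -q -O /dev/null http://ceph.com/ && echo repo reachable'

If the proxy variables survive sudo and the repo is reachable, the hang is probably not proxy-related and the next step would be running ceph-deploy with its verbose output to see which remote command stalls.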
Re: [ceph-users] Ceph performance with 8K blocks.
I will try both suggestions, Thank you for your input. On Tue, Sep 17, 2013 at 5:06 PM, Josh Durgin josh.dur...@inktank.comwrote: Also enabling rbd writeback caching will allow requests to be merged, which will help a lot for small sequential I/O. On 09/17/2013 02:03 PM, Gregory Farnum wrote: Try it with oflag=dsync instead? I'm curious what kind of variation these disks will provide. Anyway, you're not going to get the same kind of performance with RADOS on 8k sync IO that you will with a local FS. It needs to traverse the network and go through work queues in the daemon; your primary limiter here is probably the per-request latency that you're seeing (average ~30 ms, looking at the rados bench results). The good news is that means you should be able to scale out to a lot of clients, and if you don't force those 8k sync IOs (which RBD won't, unless the application asks for them by itself using directIO or frequent fsync or whatever) your performance will go way up. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 17, 2013 at 1:47 PM, Jason Villalta ja...@rubixnet.com wrote: Here are the stats with direct io. dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct 819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s These numbers are still over all much faster than when using RADOS bench. The replica is set to 2. The Journals are on the same disk but separate partitions. I kept the block size the same 8K. On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill bcampbell@axcess-financial.**com bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? __**__ From: Jason Villalta ja...@rubixnet.com To: Bill Campbell bcampbell@axcess-financial.**combcampb...@axcess-financial.com Cc: Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 11:31:43 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampbell@axcess-financial.**com bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? __**__ From: Gregory Farnum g...@inktank.com To: Jason Villalta ja...@rubixnet.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. 
However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default
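Josh's writeback-caching suggestion in this thread is a client-side librbd setting. A minimal sketch of enabling it, assuming the VM host reads /etc/ceph/ceph.conf and accesses images through librbd; the cache sizes shown are the usual defaults, not tuned values:

# Enable RBD writeback caching for librbd clients (run on the client/hypervisor host)
cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    rbd cache = true
    rbd cache size = 33554432          # 32 MB writeback cache per image
    rbd cache max dirty = 25165824     # start flushing once roughly 24 MB is dirty
EOF

With QEMU/KVM the drive should also be configured with cache=writeback so guest flushes are honoured; note this applies to librbd only, the kernel rbd module does not use these options.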
Re: [ceph-users] Ceph performance with 8K blocks.
Try it with oflag=dsync instead? I'm curious what kind of variation these disks will provide. Anyway, you're not going to get the same kind of performance with RADOS on 8k sync IO that you will with a local FS. It needs to traverse the network and go through work queues in the daemon; your primary limiter here is probably the per-request latency that you're seeing (average ~30 ms, looking at the rados bench results). The good news is that means you should be able to scale out to a lot of clients, and if you don't force those 8k sync IOs (which RBD won't, unless the application asks for them by itself using directIO or frequent fsync or whatever) your performance will go way up. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 17, 2013 at 1:47 PM, Jason Villalta ja...@rubixnet.com wrote: Here are the stats with direct io. dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct 819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s These numbers are still over all much faster than when using RADOS bench. The replica is set to 2. The Journals are on the same disk but separate partitions. I kept the block size the same 8K. On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? From: Jason Villalta ja...@rubixnet.com To: Bill Campbell bcampb...@axcess-financial.com Cc: Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 11:31:43 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? From: Gregory Farnum g...@inktank.com To: Jason Villalta ja...@rubixnet.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. 
:) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304
Re: [ceph-users] Pause i/o from time to time
I have examined the logs. Yes, the first time it could have been scrubbing; it repaired itself. I had 2 servers before the first problem: one dedicated to an OSD (osd.0), and a second with an OSD and websites (osd.1). After the problem I added a third server dedicated to an OSD (osd.2) and ran 'ceph osd out osd.1' to move its data elsewhere. In ceph -s I saw a normal rebalancing process and everything worked well for about 5-7 hours. Then I got many misdirected records (a few hundred per second): osd.0 [WRN] client.359671 misdirected client.359671.1:220843 pg 2.3ae744c0 to osd.0 not [2,0] in e1040/1040 and errors in I/O operations. Now I have about 20GB of ceph logs with these errors. (I am not working with the cluster now; I copied all data out onto an HDD and work from the HDD.) Is there any way to have a local software RAID1 made of a ceph RBD and a local image (to keep working when ceph fails or is slow for any reason)? I tried mdadm but it worked badly; the server hung up every few hours. You could be suffering from a known, but unfixed issue [1] where spindle contention from scrub and deep-scrub cause periodic stalls in RBD. You can try to disable scrub and deep-scrub with: # ceph osd set noscrub # ceph osd set nodeep-scrub If your problem stops, Issue #6278 is likely the cause. To re-enable scrub and deep-scrub: # ceph osd unset noscrub # ceph osd unset nodeep-scrub Because you seem to only have two OSDs, you may also be saturating your disks even without scrub or deep-scrub. [1] http://tracker.ceph.com/issues/6278 Cheers, Mike Dawson On 9/16/2013 12:30 PM, Timofey wrote: I use ceph for an HA cluster. Sometimes ceph rbd pauses in its work (I/O operations stop). Sometimes it happens when one of the OSDs is slow to respond to requests. Sometimes it is my own mistake (xfs_freeze -f on one of the OSD drives). I have 2 storage servers with one osd on each. These pauses can last a few minutes. 1. Is there any setting to quickly switch the primary osd if the current osd is behaving badly (slow, not responding)? 2. Can I use ceph-rbd in a software raid array with a local drive, so the local drive is used instead of ceph if the ceph cluster fails? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
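Mike's noscrub/nodeep-scrub suggestion can be verified end to end. A short sketch using standard ceph CLI commands; nothing here is specific to Timofey's two-OSD cluster:

# Disable scrubbing temporarily
ceph osd set noscrub
ceph osd set nodeep-scrub

# Confirm the flags are active (they appear in the osdmap flags line)
ceph osd dump | grep flags
ceph -s

# Watch for slow or blocked requests while scrubbing is off
ceph -w

# Re-enable once the test is done
ceph osd unset noscrub
ceph osd unset nodeep-scrub

If the stalls disappear while the flags are set, spindle contention from scrubbing (issue #6278) is the likely culprit; remember to unset the flags afterwards, since leaving scrubbing off long term hides data consistency problems.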
Re: [ceph-users] Ceph performance with 8K blocks.
Also enabling rbd writeback caching will allow requests to be merged, which will help a lot for small sequential I/O. On 09/17/2013 02:03 PM, Gregory Farnum wrote: Try it with oflag=dsync instead? I'm curious what kind of variation these disks will provide. Anyway, you're not going to get the same kind of performance with RADOS on 8k sync IO that you will with a local FS. It needs to traverse the network and go through work queues in the daemon; your primary limiter here is probably the per-request latency that you're seeing (average ~30 ms, looking at the rados bench results). The good news is that means you should be able to scale out to a lot of clients, and if you don't force those 8k sync IOs (which RBD won't, unless the application asks for them by itself using directIO or frequent fsync or whatever) your performance will go way up. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 17, 2013 at 1:47 PM, Jason Villalta ja...@rubixnet.com wrote: Here are the stats with direct io. dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct 819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s These numbers are still over all much faster than when using RADOS bench. The replica is set to 2. The Journals are on the same disk but separate partitions. I kept the block size the same 8K. On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? From: Jason Villalta ja...@rubixnet.com To: Bill Campbell bcampb...@axcess-financial.com Cc: Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 11:31:43 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? From: Gregory Farnum g...@inktank.com To: Jason Villalta ja...@rubixnet.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. 
What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados
Re: [ceph-users] Ceph performance with 8K blocks.
Here are the stats with direct io. dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct 819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s These numbers are still over all much faster than when using RADOS bench. The replica is set to 2. The Journals are on the same disk but separate partitions. I kept the block size the same 8K. On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? -- *From: *Jason Villalta ja...@rubixnet.com *To: *Bill Campbell bcampb...@axcess-financial.com *Cc: *Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com *Sent: *Tuesday, September 17, 2013 11:31:43 AM *Subject: *Re: [ceph-users] Ceph performance with 8K blocks. Thanks for you feed back it is helpful. I may have been wrong about the default windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems their is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? -- *From: *Gregory Farnum g...@inktank.com *To: *Jason Villalta ja...@rubixnet.com *Cc: *ceph-users@lists.ceph.com *Sent: *Tuesday, September 17, 2013 10:40:09 AM *Subject: *Re: [ceph-users] Ceph performance with 8K blocks. Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly windows. I would also like to try to understand how it will scale IO by removing one disk of the three and doing the benchmark tests. But that is secondary. So far here are my results. I am aware this is all sequential, I just want to know how fast it can go. 
DD IO test of SSD disks: I am testing 8K blocks since that is the default block size of windows. dd of=ddbenchfile if=/dev/zero bs=8K count=100 819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s RADOS bench test with 3 SSD disks and 4MB object size(Default): rados --no-cleanup bench -p pbench 30 write Total writes made: 2061 Write size: 4194304 Bandwidth (MB/sec): 273.004 Stddev Bandwidth: 67.5237 Max bandwidth (MB/sec): 352 Min bandwidth (MB/sec): 0 Average Latency:0.234199 Stddev Latency: 0.130874 Max latency:0.867119 Min latency:0.039318 - rados bench -p pbench 30 seq Total reads made: 2061 Read size:4194304 Bandwidth (MB/sec):956.466 Average Latency: 0.0666347 Max latency: 0.208986 Min latency: 0.011625 This all looks like I would expect from using three disks. The problems appear to come with the 8K blocks/object size. RADOS bench test with 3 SSD disks and 8K object size(8K blocks): rados --no-cleanup bench -b 8192 -p pbench 30 write Total writes made: 13770 Write size: 8192 Bandwidth (MB/sec): 3.581 Stddev Bandwidth: 1.04405
Re: [ceph-users] Ceph performance with 8K blocks.
So what I am gleaning from this is that it is better to have more than 3 OSDs, since the OSD seems to add additional processing overhead when using small blocks. I will try to do some more testing by using the same three disks but with 6 or more OSDs. If the OSD is limited by processing, is it safe to say it would make sense to just use SSD for the journal and a spindle disk for data and reads? On Tue, Sep 17, 2013 at 5:12 PM, Jason Villalta ja...@rubixnet.com wrote: Here are the results: dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=dsync 819200 bytes (8.2 GB) copied, 266.873 s, 30.7 MB/s On Tue, Sep 17, 2013 at 5:03 PM, Gregory Farnum g...@inktank.com wrote: Try it with oflag=dsync instead? I'm curious what kind of variation these disks will provide. Anyway, you're not going to get the same kind of performance with RADOS on 8k sync IO that you will with a local FS. It needs to traverse the network and go through work queues in the daemon; your primary limiter here is probably the per-request latency that you're seeing (average ~30 ms, looking at the rados bench results). The good news is that means you should be able to scale out to a lot of clients, and if you don't force those 8k sync IOs (which RBD won't, unless the application asks for them by itself using directIO or frequent fsync or whatever) your performance will go way up. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 17, 2013 at 1:47 PM, Jason Villalta ja...@rubixnet.com wrote: Here are the stats with direct io. dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct 819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s dd if=ddbenchfile of=/dev/null bs=8K 819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s These numbers are still overall much faster than when using RADOS bench. The replica is set to 2. The journals are on the same disk but separate partitions. I kept the block size the same 8K. On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: As Gregory mentioned, your 'dd' test looks to be reading from the cache (you are writing 8GB in, and then reading that 8GB out, so the reads are all cached reads) so the performance is going to seem good. You can add the 'oflag=direct' to your dd test to try and get a more accurate reading from that. RADOS performance from what I've seen is largely going to hinge on replica size and journal location. Are your journals on separate disks or on the same disk as the OSD? What is the replica size of your pool? From: Jason Villalta ja...@rubixnet.com To: Bill Campbell bcampb...@axcess-financial.com Cc: Gregory Farnum g...@inktank.com, ceph-users ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 11:31:43 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. Thanks for your feedback, it is helpful. I may have been wrong about the default Windows block size. What would be the best tests to compare native performance of the SSD disks at 4K blocks vs Ceph performance with 4K blocks? It just seems there is a huge difference in the results. On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill bcampb...@axcess-financial.com wrote: Windows default (NTFS) is a 4k block. Are you changing the allocation unit to 8k as a default for your configuration? From: Gregory Farnum g...@inktank.com To: Jason Villalta ja...@rubixnet.com Cc: ceph-users@lists.ceph.com Sent: Tuesday, September 17, 2013 10:40:09 AM Subject: Re: [ceph-users] Ceph performance with 8K blocks. 
Your 8k-block dd test is not nearly the same as your 8k-block rados bench or SQL tests. Both rados bench and SQL require the write to be committed to disk before moving on to the next one; dd is simply writing into the page cache. So you're not going to get 460 or even 273MB/s with sync 8k writes regardless of your settings. However, I think you should be able to tune your OSDs into somewhat better numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small pipeline!), and an SSD-based daemon should be going faster. What kind of logging are you running with and what configs have you set? Hopefully you can get Mark or Sam or somebody who's done some performance tuning to offer some tips as well. :) -Greg On Tuesday, September 17, 2013, Jason Villalta wrote: Hello all, I am new to the list. I have a single machines setup for testing Ceph. It has a dual proc 6 cores(12core total) for CPU and 128GB of RAM. I also have 3 Intel 520 240GB SSDs and an OSD setup on each disk with the OSD and Journal in separate partitions formatted with ext4. My goal here is to prove just how fast Ceph can go and what kind of performance to expect when using it as a back-end storage for virtual machines mostly
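Following up on the SSD-journal-plus-spinning-data idea at the top of this thread: the journal location is set per OSD, either at creation time with ceph-deploy or in ceph.conf for manually managed OSDs. A rough sketch assuming dumpling-era tooling; the host name and device names are purely illustrative:

# ceph-deploy form: HOST:DATA_DISK:JOURNAL_DEVICE (example devices)
ceph-deploy osd create node1:sdb:/dev/sdg1

# Manual form: point an existing OSD's journal at an SSD partition in ceph.conf
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd.0]
    osd journal = /dev/sdg1
    osd journal size = 0     # 0 means use the whole block device as the journal
EOF

Since every write hits the journal before the data disk, an SSD journal mostly helps write latency; reads still come from the spinning disk or the page cache, so read-heavy small-block workloads will not change much.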
[ceph-users] Scaling RBD module
Hi, I am running Ceph on a 3 node cluster and each of my server node is running 10 OSDs, one for each disk. I have one admin node and all the nodes are connected with 2 X 10G network. One network is for cluster and other one configured as public network. Here is the status of my cluster. ~/fio_test# ceph -s cluster b2e0b4db-6342-490e-9c28-0aadf0188023 health HEALTH_WARN clock skew detected on mon. server-name-2, mon. server-name-3 monmap e1: 3 mons at {server-name-1=xxx.xxx.xxx.xxx:6789/0, server-name-2=xxx.xxx.xxx.xxx:6789/0, server-name-3=xxx.xxx.xxx.xxx:6789/0}, election epoch 64, quorum 0,1,2 server-name-1,server-name-2,server-name-3 osdmap e391: 30 osds: 30 up, 30 in pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 11145 GB / 11172 GB avail mdsmap e1: 0/0/1 up I started with rados bench command to benchmark the read performance of this Cluster on a large pool (~10K PGs) and found that each rados client has a limitation. Each client can only drive up to a certain mark. Each server node cpu utilization shows it is around 85-90% idle and the admin node (from where rados client is running) is around ~80-85% idle. I am trying with 4K object size. Now, I started running more clients on the admin node and the performance is scaling till it hits the client cpu limit. Server still has the cpu of 30-35% idle. With small object size I must say that the ceph per osd cpu utilization is not promising! After this, I started testing the rados block interface with kernel rbd module from my admin node. I have created 8 images mapped on the pool having around 10K PGs and I am not able to scale up the performance by running fio (either by creating a software raid or running on individual /dev/rbd* instances). For example, running multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2) the performance I am getting is half of what I am getting if running one instance. Here is my fio job script. [random-reads] ioengine=libaio iodepth=32 filename=/dev/rbd1 rw=randread bs=4k direct=1 size=2G numjobs=64 Let me know if I am following the proper procedure or not. But, If my understanding is correct, kernel rbd module is acting as a client to the cluster and in one admin node I can run only one of such kernel instance. If so, I am then limited to the client bottleneck that I stated earlier. The cpu utilization of the server side is around 85-90% idle, so, it is clear that client is not driving. My question is, is there any way to hit the cluster with more client from a single box while testing the rbd module ? Appreciate, if anybody can help me on this. Thanks Regards Somnath PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
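One way to push more parallel load from a single box with the kernel rbd client is to let one fio run fan out across all of the mapped /dev/rbd* devices, rather than starting one fio process per device; whether this helps depends on whether the limit really is per fio process or per krbd client. A rough sketch assuming /dev/rbd1 through /dev/rbd4 are already mapped; the device names, job counts and sizes are illustrative, not taken from Somnath's setup:

cat > multi-rbd-randread.fio <<'EOF'
[global]
ioengine=libaio
iodepth=32
rw=randread
bs=4k
direct=1
size=2G
numjobs=16

[rbd1]
filename=/dev/rbd1

[rbd2]
filename=/dev/rbd2

[rbd3]
filename=/dev/rbd3

[rbd4]
filename=/dev/rbd4
EOF

fio multi-rbd-randread.fio

If aggregate throughput still tops out at roughly the same level as a single device, the bottleneck is more likely the single kernel rbd client instance than fio itself, and the usual way to prove that is to repeat the test from two or more client hosts (or VMs) against the same pool and see whether the totals add up.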