Monitor crash
Hi all: I encountered this crash once. You can download the full log here: https://dl.dropbox.com/u/35107741/ceph-mon.log

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: /usr/bin/ceph-mon() [0x52569a]
 2: (()+0xfcb0) [0x7ffad0949cb0]
 3: (gsignal()+0x35) [0x7ffacf725425]
 4: (abort()+0x17b) [0x7ffacf728b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7ffad007769d]
 6: (()+0xb5846) [0x7ffad0075846]
 7: (()+0xb5873) [0x7ffad0075873]
 8: (()+0xb596e) [0x7ffad007596e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x5deb9e]
 10: (PaxosService::propose_pending()+0x2e1) [0x4965b1]
 11: (SafeTimer::timer_thread()+0x407) [0x5d6897]
 12: (SafeTimerThread::entry()+0xd) [0x5d740d]
 13: (()+0x7e9a) [0x7ffad0941e9a]
 14: (clone()+0x6d) [0x7ffacf7e2cbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
OSD crash on 0.48.2argonaut
Dear all: I encountered this issue on one of the OSD nodes. Is it a known issue? Thanks!

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: /usr/bin/ceph-osd() [0x6edaba]
 2: (()+0xfcb0) [0x7f08b112dcb0]
 3: (gsignal()+0x35) [0x7f08afd09445]
 4: (abort()+0x17b) [0x7f08afd0cbab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f08b065769d]
 6: (()+0xb5846) [0x7f08b0655846]
 7: (()+0xb5873) [0x7f08b0655873]
 8: (()+0xb596e) [0x7f08b065596e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1de) [0x7a82fe]
 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
 11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x159) [0x531ac9]
 12: (ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x15c) [0x53251c]
 13: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr<OpRequest>)+0x81) [0x54d241]
 14: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x1e3) [0x600883]
 15: (OSD::dequeue_op(PG*)+0x238) [0x5bfaf8]
 16: (ThreadPool::worker()+0x4d5) [0x79f835]
 17: (ThreadPool::WorkThread::entry()+0xd) [0x5d87cd]
 18: (()+0x7e9a) [0x7f08b1125e9a]
 19: (clone()+0x6d) [0x7f08afdc54bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Limitations of CephFS
Hi all: I have some questions about the limitations of CephFS. Would you please help answer them? Thanks!
1. Max file size
2. Max number of files
3. Max filename length
4. Filename character set (e.g. any byte except NUL and '/')
5. Max pathname length
And one question about RBD:
1. Max RBD size
RE: RBD boot from volume weirdness in OpenStack
Dear Josh and Travis: I am trying to set up an openstack+ceph environment too, but I am not using devstack. I deployed glance, cinder, nova, and keystone on different servers. All the basic functions work fine: I can import images, create volumes, and create virtual machines, so glance and cinder seem to access the ceph block devices correctly. (ceph version 0.53) But when I try to create a volume based on an existing image, it fails. I used the commands from http://ceph.com/docs/master/rbd/rbd-openstack/

root@glance:~# glance image-list
+--------------------------------------+--------------------+-------------+------------------+------------+--------+
| ID                                   | Name               | Disk Format | Container Format | Size       | Status |
+--------------------------------------+--------------------+-------------+------------------+------------+--------+
| cad779fc-c851-4581-ac4d-474c3773bf89 | Ubuntu-Precise-Raw | raw         | bare             | 2147483648 | active |
+--------------------------------------+--------------------+-------------+------------------+------------+--------+

root@glance:~# rbd info -p images cad779fc-c851-4581-ac4d-474c3773bf89
rbd image 'cad779fc-c851-4581-ac4d-474c3773bf89':
        size 2048 MB in 256 objects
        order 23 (8192 KB objects)
        block_name_prefix: rbd_data.28c076755ff
        format: 2
        features: layering

root@cinder:~# cinder create --image-id cad779fc-c851-4581-ac4d-474c3773bf89 10
root@cinder:~# cinder list
+--------------------------------------+--------+--------------+------+-------------+-------------+
| ID                                   | Status | Display Name | Size | Volume Type | Attached to |
+--------------------------------------+--------+--------------+------+-------------+-------------+
| b8af3932-b27a-41e4-a2cc-082b78083f79 | error  | None         | 10   | None        |             |
+--------------------------------------+--------+--------------+------+-------------+-------------+

Have you ever seen this error? Any suggestion is appreciated. Furthermore, I do not use cephx authentication, so I did not set CEPH_ARGS. Could that cause this issue? Thanks!
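Incidentally, the numbers in the `rbd info` output above are self-consistent. A quick sketch of how the fields relate (image size = object count × 2^order bytes):

```python
# Relation between the fields shown by `rbd info`: an image of order N is
# striped into objects of 2**N bytes, so total size = objects * 2**order.
# The values below come from the output above (order 23 -> 8192 KB objects).

def rbd_image_bytes(num_objects: int, order: int) -> int:
    """Total image size in bytes from the object count and order."""
    return num_objects * (1 << order)

size = rbd_image_bytes(num_objects=256, order=23)
print(size)                   # 2147483648, matching the glance image size
print(size // (1024 * 1024))  # 2048 (MB), matching "size 2048 MB in 256 objects"
```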
= /etc/cinder/cinder.conf =
[DEFAULT]
rootwrap_config = /etc/cinder/rootwrap.conf
api_paste_confg = /etc/cinder/api-paste.ini
sql_connection = mysql://cinder:password@localhost:3306/cinder
iscsi_helper = tgtadm
volume_name_template = volume-%s
volume_group = cinder-volumes
verbose = True
auth_strategy = keystone
state_path = /var/lib/cinder
volume_driver = cinder.volume.driver.RBDDriver
rabbit_password = password
my_ip = 172.17.123.12
glance_host = 172.17.123.16

= /var/log/cinder/cinder-volume.log =
2012-10-26 13:48:37 17411 DEBUG cinder.manager [-] Running periodic task VolumeManager._publish_service_capabilities periodic_tasks /usr/lib/python2.7/dist-packages/cinder/manager.py:164
2012-10-26 13:48:37 17411 DEBUG cinder.manager [-] Running periodic task VolumeManager._report_driver_status periodic_tasks /usr/lib/python2.7/dist-packages/cinder/manager.py:164
2012-10-26 13:48:38 17411 DEBUG cinder.openstack.common.rpc.amqp [-] received {u'_context_roles': [u'KeystoneServiceAdmin', u'KeystoneAdmin', u'admin'], u'_context_request_id': u'req-ec369d9d-581e-488b-84f1-e218b03ef1ea', u'_context_quota_class': None, u'args': {u'image_id': u'cad779fc-c851-4581-ac4d-474c3773bf89', u'snapshot_id': None, u'volume_id': u'b8af3932-b27a-41e4-a2cc-082b78083f79'}, u'_context_auth_token': 'SANITIZED', u'_context_is_admin': True, u'_context_project_id': u'eefa301a6a424e7da3d582649ad0e59e', u'_context_timestamp': u'2012-10-26T05:48:37.771007', u'_context_read_deleted': u'no', u'_context_user_id': u'fafd0583de8a4a1b93b924a6b2cb7eb5', u'method': u'create_volume', u'_context_remote_address': u'172.17.123.12'} _safe_log /usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/common.py:195
2012-10-26 13:48:38 17411 DEBUG cinder.openstack.common.rpc.amqp [-] unpacked context: {'user_id': u'fafd0583de8a4a1b93b924a6b2cb7eb5', 'roles': [u'KeystoneServiceAdmin', u'KeystoneAdmin', u'admin'], 'timestamp': u'2012-10-26T05:48:37.771007', 'auth_token': 'SANITIZED', 'remote_address': u'172.17.123.12', 'quota_class': None, 'is_admin': True, 'request_id': u'req-ec369d9d-581e-488b-84f1-e218b03ef1ea', 'project_id': u'eefa301a6a424e7da3d582649ad0e59e', 'read_deleted': u'no'} _safe_log /usr/lib/python2.7/dist-packages/cinder/openstack/common/rpc/common.py:195
2012-10-26 13:48:38 INFO cinder.volume.manager [req-ec369d9d-581e-488b-84f1-e218b03ef1ea fafd0583de8a4a1b93b924a6b2cb7eb5 eefa301a6a424e7da3d582649ad0e59e] volume volume-b8af3932-b27a-41e4-a2cc-082b78083f79: creating
2012-10-26 13:48:38 DEBUG cinder.volume.manager
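The cinder.conf above selects the RBD driver but sets no RBD-specific options, so the driver falls back to its defaults. For comparison, a minimal RBD section might look like the sketch below; the option names (rbd_pool, rbd_user, rbd_secret_uuid) are assumptions based on the Folsom-era driver and should be checked against your cinder version:

```ini
[DEFAULT]
volume_driver = cinder.volume.driver.RBDDriver
; Pool the driver creates volumes in (driver default is assumed to be 'rbd')
rbd_pool = volumes
; The two options below are only needed with cephx authentication;
; since cephx is disabled here, they are shown commented out.
; rbd_user = cinder
; rbd_secret_uuid = <uuid registered with libvirt>
```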
Re: rbd map error with new rbd format
Hi, Josh: You wrote earlier:
> Yeah, format 2 and layering support is in progress for kernel rbd, but not ready yet. The userspace side is all ready in the master branch, but it takes more time to implement in the kernel. Btw, instead of --new-format you should use --format 2. It's in the man page in the master branch.
As you mentioned before (http://www.spinics.net/lists/ceph-devel/msg08857.html), kernel rbd was not ready as of September, so we cannot map such an rbd image to a device. Would you mind estimating when this will be available, and in which kernel version (3.5 or 3.6)? Thanks!
Accidental corruption in osdmap
Dear all: My environment: two servers, with 12 hard disks on each server. Version: Ceph 0.48, Kernel: 3.2.0-27. We created a ceph cluster with 24 osds and 3 monitors: Osd.0 ~ osd.11 on server1, Osd.12 ~ osd.23 on server2, Mon.0 on server1, Mon.1 on server2, Mon.2 on server3, which has no osd. We created an rbd device and mounted it as an ext4 file system. While reading/writing data on the rbd device, one of the storage servers was shut down by accident. After rebooting the server, we could no longer access the rbd device. One of the logs shows that the osdmap is corrupted: Aug 5 15:37:24 ubuntu-002 kernel: [78579.998582] libceph: corrupt inc osdmap epoch 78 off 98 (c9000177d07e of c9000177d01c-c9000177edf2) We would like to know what kind of scenario can cause osdmap corruption and how to avoid it. It seems that osdmap corruption cannot be recovered by the ceph cluster itself. Is it the same issue as http://tracker.newdream.net/issues/2446? In which kernel version can we find that patch? Thanks!
= /var/log/kern.log =
Aug 5 15:31:44 ubuntu-002 kernel: [78240.712542] libceph: osd11 down
Aug 5 15:31:49 ubuntu-002 kernel: [78244.817151] libceph: osd12 down
Aug 5 15:31:52 ubuntu-002 kernel: [78248.151815] libceph: osd13 down
Aug 5 15:31:52 ubuntu-002 kernel: [78248.151913] libceph: osd14 down
Aug 5 15:31:53 ubuntu-002 kernel: [78249.250991] libceph: get_reply unknown tid 96452 from osd7
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833033] libceph: osd15 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833037] libceph: osd16 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833039] libceph: osd17 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833040] libceph: osd18 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833042] libceph: osd19 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833062] libceph: osd20 down
Aug 5 15:31:59 ubuntu-002 kernel: [78254.833064] libceph: osd21 down
Aug 5 15:36:46 ubuntu-002 kernel: [78541.813963] libceph: osd11 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811236] libceph: osd12 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811238] libceph: osd13 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811264] libceph: osd14 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811265] libceph: osd15 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811266] libceph: osd16 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811271] libceph: osd17 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811272] libceph: osd18 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811273] libceph: osd19 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811314] libceph: osd20 weight 0x0 (out)
Aug 5 15:37:09 ubuntu-002 kernel: [78564.811315] libceph: osd21 weight 0x0 (out)
Aug 5 15:37:24 ubuntu-002 kernel: [78579.998582] libceph: corrupt inc osdmap epoch 78 off 98 (c9000177d07e of c9000177d01c-c9000177edf2)
Aug 5 15:37:24 ubuntu-002 kernel: [78579.998737] osdmap: : 05 00 70 d6 52 f9 b3 cc 44 c5 a2 eb c1 33 1d a2 ..p.R...D3..
Aug 5 15:37:24 ubuntu-002 kernel: [78579.998739] osdmap: 0010: 45 3d 4e 00 00 00 b3 22 1e 50 d0 b3 f3 2d ff ff E=N.P...-..
Aug 5 15:37:24 ubuntu-002 kernel: [78579.998742] osdmap: 0020: ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ff ff ...
RE: cannot startup one of the osd
Hi, Samuel: Also, the ceph cluster stays in an unhealthy status. How can we fix it? There are 230 unfound objects and we cannot access some rbd devices now; it hangs at `rbd info <image_name>`.

root@ubuntu:~$ ceph -s
   health HEALTH_WARN 96 pgs backfill; 96 pgs degraded; 96 pgs recovering; 96 pgs stuck unclean; recovery 4978/138644 degraded (3.590%); 230/69322 unfound (0.332%)
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 6, quorum 0,1,2 006,008,009
   osdmap e2944: 24 osds: 23 up, 23 in
   pgmap v297084: 4608 pgs: 4512 active+clean, 50 active+recovering+degraded+remapped+backfill, 46 active+recovering+degraded+backfill; 257 GB data, 952 GB used, 19367 GB / 21390 GB avail; 4978/138644 degraded (3.590%); 230/69322 unfound (0.332%)
   mdsmap e1: 0/0/1 up

-Original Message- From: Eric YH Chen/WYHQ/Wiwynn Sent: Wednesday, August 01, 2012 9:01 AM To: 'Samuel Just' Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn Subject: RE: cannot startup one of the osd Hi, Samuel: It happens every startup, I cannot fix it now. -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, August 01, 2012 1:36 AM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn Subject: Re: cannot startup one of the osd This crash happens on each startup? -Sam On Tue, Jul 31, 2012 at 2:32 AM, eric_yh_c...@wiwynn.com wrote: Hi, all: My Environment: two servers, and 12 hard-disk on each server. 
Version: Ceph 0.48, Kernel: 3.2.0-27 We create a ceph cluster with 24 osd, 3 monitors Osd.0 ~ osd.11 is on server1 Osd.12 ~ osd.23 is on server2 Mon.0 is on server1 Mon.1 is on server2 Mon.2 is on server3 which has no osd root@ubuntu:~$ ceph -s health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 564, quorum 0,1,2 006,008,009 osdmap e1911: 24 osds: 23 up, 24 in pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%) mdsmap e1: 0/0/1 up I find one of the osd cannot startup anymore. Before that, I am testing HA of Ceph cluster. Step1: shutdown server1, wait 5 min Step2: bootup server1, wait 5 min until ceph enter health status Step3: shutdown server2, wait 5 min Step4: bootup server2, wait 5 min until ceph enter health status Repeat Step1~ Step4 several times, then I met this problem. 
Log of ceph-osd.22.log 2012-07-31 17:18:15.120678 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps 2012-07-31 17:18:15.122081 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2012-07-31 17:18:15.128544 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-07-31 17:18:15.257302 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-07-31 17:18:15.273163 7f9375300780 1 journal close /srv/disk10/journal 2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited size xattrs -- filestore_xattr_use_omap enabled 2012-07-31 17:18:15.275169 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is supported and appears to work 2012-07-31 17:18:15.275180 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2012-07-31 17:18:15.275312 7f9375300780 0 filestore(/srv/disk10/data) mount did NOT detect btrfs 2012-07-31 17:18:15.276060 7f9375300780 0 filestore(/srv/disk10/data) mount syncfs(2) syscall fully supported (by glib and kernel) 2012-07-31 17:18:15.276154 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps 2012-07-31 17:18:15.277031 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2012-07-31 17:18:15.280906 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-07-31 17:18:15.307761 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0 2012-07-31 17:18:19.466921 7f9360a97700 0 -- 192.168.200.82:6830/18744 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 192.168.200.83:0/3485583732
RE: High-availability testing of ceph
Hi, Josh: Thanks for your reply. However, I had asked a question about the replica setting before: http://www.spinics.net/lists/ceph-devel/msg07346.html I assumed that if the performance of an rbd device is n MB/s under replica=2, the total io throughput on the hard disks would be over 3 * n MB/s, because I thought the total number of copies was 3. So that assumption was wrong: the total number of copies is only 2, and the total io throughput on disk should be 2 * n MB/s. Right? -Original Message- From: Josh Durgin [mailto:josh.dur...@inktank.com] Sent: Tuesday, July 31, 2012 1:56 PM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn Subject: Re: High-availability testing of ceph On 07/30/2012 07:46 PM, eric_yh_c...@wiwynn.com wrote: Hi, all: I am testing high-availability of ceph. Environment: two servers, and 12 hard-disk on each server. Version: Ceph 0.48 Kernel: 3.2.0-27 We create a ceph cluster with 24 osd. Osd.0 ~ osd.11 is on server1 Osd.12 ~ osd.23 is on server2 The crush rule is using default rule. rule rbd { ruleset 2 type replicated min_size 1 max_size 10 step take default step chooseleaf firstn 0 type host step emit } pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1536 pgp_num 1536 last_change 1172 owner 0 Test case 1: 1. Create a rbd device and read/write to it 2. Random turn off one osd on server1 (service ceph stop osd.0) 3. check the read/write of rbd device Test case 2: 1. Create a rbd device and read/write to it 2. Random turn off one osd on server1 (service ceph stop osd.0) 2. Random turn off one osd on server2 (service ceph stop osd.12) 3. check the read/write of rbd device About test case 1, we can access the rbd device as normal. But about test case 2, we would hang there and no response. Is it a correct scenario ? I imagine that we can turn off any two osd when we set the replication as 2. 
Because without the master data, we have two other copies on two different osd. Even when we turn off two osd, we can find the data on third osd. Any misunderstanding? Thanks! rep size is the total number of copies, so stopping two osds with rep size 2 may cause you to lose access to some objects. Josh
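Josh's correction, in numbers: "rep size" counts all copies including the primary, so a client write stream multiplies into rep_size times as much raw disk traffic, not rep_size + 1. A minimal sketch (journal write amplification ignored):

```python
# "rep size" is the TOTAL number of copies, primary included. A client
# writing n MB/s therefore generates rep_size * n MB/s of raw disk writes
# across the cluster (ignoring journaling, which adds its own factor).

def cluster_disk_write_mb_s(client_mb_s: float, rep_size: int) -> float:
    """Aggregate disk write bandwidth implied by a client write stream."""
    return client_mb_s * rep_size

# With rep size 2 as in this pool: 2 * n, not 3 * n.
print(cluster_disk_write_mb_s(100.0, rep_size=2))  # 200.0
```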
The cluster is not aware that some osds have disappeared
Dear all: My environment: two servers, with 12 hard disks on each server. Version: Ceph 0.48, Kernel: 3.2.0-27. We created a ceph cluster with 24 osds and 3 monitors: Osd.0 ~ osd.11 on server1, Osd.12 ~ osd.23 on server2, Mon.0 on server1, Mon.1 on server2, Mon.2 on server3, which has no osd. When I turn off the network of server1, we expect server2 to notice that the 12 osds on server1 have disappeared. However, when I type `ceph -s`, it still shows 24 osds there. And from the logs of osd.0 and osd.11, we can see heartbeat checks on server1, but not on server2. What happened to server2? Can we restart the heartbeat service? Thanks!

root@wistor-002:~# ceph -s
   health HEALTH_WARN 1 mons down, quorum 1,2 008,009
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 522, quorum 1,2 008,009
   osdmap e1388: 24 osds: 24 up, 24 in
   pgmap v288663: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
   mdsmap e1: 0/0/1 up

Log of `ceph -w` (we turned off server1 around 15:20, which caused the new monitor election):
2012-07-31 15:21:25.966572 mon.0 [INF] pgmap v288658: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:20:10.400566 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.030473 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.079772 mon.2 [INF] mon.009 calling new monitor election
2012-07-31 15:21:46.102587 mon.1 [INF] mon.008@1 won leader election with quorum 1,2
2012-07-31 15:21:46.273253 mon.1 [INF] pgmap v288659: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:21:46.273379 mon.1 [INF] mdsmap e1: 0/0/1 up
2012-07-31 15:21:46.273495 mon.1 [INF] osdmap e1388: 24 osds: 24 up, 24 in
2012-07-31 15:21:46.273814 mon.1 [INF] monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}
2012-07-31 15:21:46.587679 mon.1 [INF] pgmap v288660:
4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:01.245813 mon.1 [INF] pgmap v288661: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:33.970838 mon.1 [INF] pgmap v288662: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail

Log of osd.0 (on server 1):
2012-07-31 15:20:25.309264 7fdc06470700 0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6840/8772 pipe(0x4dbea00 sd=52 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:20:25.310887 7fdc1c551700 0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6833/15570 pipe(0x4dbec80 sd=51 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:21:46.861458 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.12 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861496 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.13 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861506 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.14 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861514 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.15 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861522 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.16 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861530 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.17 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861538 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.18 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861546 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.19 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861556 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.20 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861576 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.21 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861609 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.22 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861618 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.23 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)

Log of osd.12 (on server 2):
2012-07-31 15:20:31.475815 7f9eac5ba700 0 osd.12 1387 pg[2.16f( v 1356'10485 (465'9480,1356'10485] n=42 ec=1 les/c 1387/1387 1383/1383/1383) [12,0] r=0 lpr=1383 mlcod 0'0 active+clean] watch: oi.user_version=45 2012-07-31
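The heartbeat_check lines above follow a simple rule: a peer is reported when its last reply is older than now minus a grace period. A sketch of that check, with the times taken from the osd.0 log; the 20-second grace value is an assumption for illustration (the real value comes from the `osd heartbeat grace` option):

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=20)  # stand-in for the 'osd heartbeat grace' option

def heartbeat_failed(now: datetime, last_reply: datetime) -> bool:
    """True when the peer's last reply predates the cutoff (now - grace)."""
    cutoff = now - GRACE
    return last_reply < cutoff

# Times from the osd.0 log line for osd.12: check time 15:21:46.861458,
# last reply 15:21:26.770108 -> cutoff 15:21:26.861458, so the check fires.
now = datetime(2012, 7, 31, 15, 21, 46, 861458)
last = datetime(2012, 7, 31, 15, 21, 26, 770108)
print(heartbeat_failed(now, last))  # True -> "no reply from osd.12 since ..."
```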
cannot startup one of the osd
Hi, all: My environment: two servers, with 12 hard disks on each server. Version: Ceph 0.48, Kernel: 3.2.0-27. We created a ceph cluster with 24 osds and 3 monitors: Osd.0 ~ osd.11 on server1, Osd.12 ~ osd.23 on server2, Mon.0 on server1, Mon.1 on server2, Mon.2 on server3, which has no osd.

root@ubuntu:~$ ceph -s
   health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 564, quorum 0,1,2 006,008,009
   osdmap e1911: 24 osds: 23 up, 24 in
   pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
   mdsmap e1: 0/0/1 up

I find that one of the osds cannot start up anymore. Before that, I was testing HA of the Ceph cluster:
Step 1: shut down server1, wait 5 min
Step 2: boot up server1, wait 5 min until ceph reaches healthy status
Step 3: shut down server2, wait 5 min
Step 4: boot up server2, wait 5 min until ceph reaches healthy status
After repeating Step 1 ~ Step 4 several times, I hit this problem.
Log of ceph-osd.22.log:
2012-07-31 17:18:15.120678 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.122081 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.128544 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.257302 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.273163 7f9375300780 1 journal close /srv/disk10/journal
2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited size xattrs -- filestore_xattr_use_omap enabled
2012-07-31 17:18:15.275169 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is supported and appears to work
2012-07-31 17:18:15.275180 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-07-31 17:18:15.275312 7f9375300780 0 filestore(/srv/disk10/data) mount did NOT detect btrfs
2012-07-31 17:18:15.276060 7f9375300780 0 filestore(/srv/disk10/data) mount syncfs(2) syscall fully supported (by glib and kernel)
2012-07-31 17:18:15.276154 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.277031 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.280906 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.307761 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:19.466921 7f9360a97700 0 -- 192.168.200.82:6830/18744 >> 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7f9363a9d700 time 2012-07-31 17:18:19.670082
os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())
ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x6a3123]
 2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
 3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
 4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
 5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
 6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x17b) [0x55406b]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x9de) [0x56305e]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x199) [0x5fda89]
 9: (OSD::dequeue_op(PG*)+0x238) [0x5bf668]
 10: (ThreadPool::worker()+0x605) [0x796d55]
 11: (ThreadPool::WorkThread::entry()+0xd) [0x5d5d0d]
 12: (()+0x7e9a) [0x7f9374794e9a]
 13: (clone()+0x6d) [0x7f93734344bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events --- -21 2012-07-31
RE: cannot startup one of the osd
Hi, Samuel: It happens every startup, I cannot fix it now. -Original Message- From: Samuel Just [mailto:sam.j...@inktank.com] Sent: Wednesday, August 01, 2012 1:36 AM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn Subject: Re: cannot startup one of the osd This crash happens on each startup? -Sam On Tue, Jul 31, 2012 at 2:32 AM, eric_yh_c...@wiwynn.com wrote: Hi, all: My Environment: two servers, and 12 hard-disk on each server. Version: Ceph 0.48, Kernel: 3.2.0-27 We create a ceph cluster with 24 osd, 3 monitors Osd.0 ~ osd.11 is on server1 Osd.12 ~ osd.23 is on server2 Mon.0 is on server1 Mon.1 is on server2 Mon.2 is on server3 which has no osd root@ubuntu:~$ ceph -s health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 564, quorum 0,1,2 006,008,009 osdmap e1911: 24 osds: 23 up, 24 in pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%) mdsmap e1: 0/0/1 up I find one of the osd cannot startup anymore. Before that, I am testing HA of Ceph cluster. Step1: shutdown server1, wait 5 min Step2: bootup server1, wait 5 min until ceph enter health status Step3: shutdown server2, wait 5 min Step4: bootup server2, wait 5 min until ceph enter health status Repeat Step1~ Step4 several times, then I met this problem. 
Log of ceph-osd.22.log:
2012-07-31 17:18:15.120678 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.122081 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.128544 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.257302 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.273163 7f9375300780 1 journal close /srv/disk10/journal
2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited size xattrs -- filestore_xattr_use_omap enabled
2012-07-31 17:18:15.275169 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is supported and appears to work
2012-07-31 17:18:15.275180 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-07-31 17:18:15.275312 7f9375300780 0 filestore(/srv/disk10/data) mount did NOT detect btrfs
2012-07-31 17:18:15.276060 7f9375300780 0 filestore(/srv/disk10/data) mount syncfs(2) syscall fully supported (by glib and kernel)
2012-07-31 17:18:15.276154 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.277031 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.280906 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.307761 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:19.466921 7f9360a97700 0 -- 192.168.200.82:6830/18744 >> 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7f9363a9d700 time 2012-07-31 17:18:19.670082
os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())
ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x6a3123]
 2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
 3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
 4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
 5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
 6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&,
High-availability testing of ceph
Hi, all: I am testing the high availability of Ceph.

Environment: two servers, with 12 hard disks on each server. Version: Ceph 0.48. Kernel: 3.2.0-27.

We created a ceph cluster with 24 osds: osd.0 ~ osd.11 on server1, osd.12 ~ osd.23 on server2. The crush rule is the default rule:

rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1536 pgp_num 1536 last_change 1172 owner 0

Test case 1:
1. Create an rbd device and read/write to it
2. Randomly turn off one osd on server1 (service ceph stop osd.0)
3. Check read/write on the rbd device

Test case 2:
1. Create an rbd device and read/write to it
2. Randomly turn off one osd on server1 (service ceph stop osd.0)
3. Randomly turn off one osd on server2 (service ceph stop osd.12)
4. Check read/write on the rbd device

In test case 1 we can access the rbd device as normal, but in test case 2 it hangs with no response. Is this the expected behavior? I imagined that we could turn off any two osds when replication is set to 2: even without the master data, there would be two other copies on two different osds, so even after turning off two osds the data could still be found on a third. Am I misunderstanding something? Thanks!
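[Editor's note] A toy placement model may help picture the failure. This is NOT real CRUSH, just an illustration of the rule's structure: with "step chooseleaf firstn 0 type host" and rep size 2 across two hosts, every PG keeps exactly one copy on each host, so stopping one OSD on each host can take both copies of some PGs offline at once (the hash-based pick below is an assumption, not Ceph's algorithm):

```python
# Toy model: each PG places exactly one replica per host bucket.
import hashlib

HOST1 = [f"osd.{i}" for i in range(0, 12)]    # server1
HOST2 = [f"osd.{i}" for i in range(12, 24)]   # server2

def place(pg, bucket_name, bucket):
    # deterministic pseudo-random pick of one OSD inside a host bucket
    h = int(hashlib.md5(f"{pg}:{bucket_name}".encode()).hexdigest(), 16)
    return bucket[h % len(bucket)]

# Stop the two OSDs (one per server) that happen to serve PG 0, as in test case 2.
down = {place(0, "h1", HOST1), place(0, "h2", HOST2)}
unavailable = [pg for pg in range(1536)
               if place(pg, "h1", HOST1) in down and place(pg, "h2", HOST2) in down]
print(len(unavailable) >= 1)   # True: those PGs lost every replica, so I/O hangs
```

With rep size 2 there are only two copies in total, so any PG whose two OSDs are both stopped has no surviving replica.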
RE: rbd map fail when the crushmap algorithm changed to tree
Hi, Gregory:

OS: Ubuntu 12.04, kernel: 3.2.0-26, ceph: 0.48, filesystem: ext4

My steps to assign the new crush map:
1. ceph osd getcrushmap -o curmap
2. crushtool -d curmap -o curmap.txt
3. modify curmap.txt and rename it to newmap.txt
4. service ceph -a stop (destroys the cluster)
5. mkcephfs -a -c ceph.conf --crushmap newmap
6. service ceph -a start
7. rbd map image_name

I do not find any error log in dmesg or /var/log/ceph/osd.log. It just hangs at step 7.

-----Original Message-----
From: Gregory Farnum [mailto:g...@inktank.com] Sent: Saturday, July 07, 2012 12:59 AM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn Subject: Re: rbd map fail when the crushmap algorithm changed to tree

On Fri, Jul 6, 2012 at 12:27 AM, eric_yh_c...@wiwynn.com wrote:
Hi all: Here is the original crushmap. I changed the algorithm of the host bucket to tree and set the map back into the ceph cluster. However, when I try to map one image to a rados block device (RBD), it hangs with no response until I press ctrl-c (run rbd map, then it hangs). Is there anything wrong in the crushmap? Thanks for the help.

Hmm, your crush map looks okay to me. What are the versions of everything (cluster, rbd tool, kernel), what's the exact command you run, and does it output anything? Is there any information in dmesg?
-Greg

=
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host store-001 {
	id -2		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
}
rack unknownrack {
	id -3		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store-001 weight 12.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item unknownrack weight 12.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
rbd map fail when the crushmap algorithm changed to tree
Hi all: Here is the original crushmap. I changed the algorithm of the host bucket to tree and set the map back into the ceph cluster. However, when I try to map one image to a rados block device (RBD), it hangs with no response until I press ctrl-c (run rbd map, then it hangs). Is there anything wrong in the crushmap? Thanks for the help.

=
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host store-001 {
	id -2		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
}
rack unknownrack {
	id -3		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store-001 weight 12.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item unknownrack weight 12.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}
What does replica size mean?
Hi, all: Just want to make sure of one thing. If I set the replica size to 2, does that mean one piece of data is kept as 2 copies? Therefore, if I measure the performance of rbd at 100 MB/s, can I assume the actual I/O throughput on the hard disks is over 100 MB/s * 3 = 300 MB/s? Am I correct? Thanks!
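[Editor's note] A quick back-of-envelope sketch of the arithmetic in question (an illustration, assuming journals live on the same disks as the data): "rep size 2" means two copies in total, not one original plus two extra copies, and with co-located journals each copy is written to disk twice:

```python
# Rough disk-traffic arithmetic for 100 MB/s of client writes.
replica_size = 2      # total copies of every object under rep size 2
journal_factor = 2    # each copy hits the journal AND the data partition

client_mb_s = 100
disk_mb_s = client_mb_s * replica_size * journal_factor
print(disk_mb_s)      # aggregate disk writes: 400 MB/s, not 100 * 3 = 300
```

The "* 3" in the question would only follow if replica size 2 meant "1 original + 2 copies".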
RE: Performance benchmark of rbd
Hi, Mark and all: I think you may have missed this mail before, so I am sending it again.
==
I forgot to mention one thing: I create the rbd on the same machine and test it there, so the network latency may be lower than in the normal case.

1. I use ext4 as the backend filesystem with the following attributes: data=writeback,noatime,nodiratime,user_xattr
2. I use the default replication number; I think it is 2, right?
3. On my platform, I have 192GB memory.
4. Sorry, the column names were left-right reversed. Here is the correct table:

         Seq-write  Seq-read
32 KB    23 MB/s    690 MB/s
512 KB   26 MB/s    960 MB/s
4 MB     27 MB/s    1290 MB/s
32 MB    36 MB/s    1435 MB/s

5. If I put all the journal data on an SSD device (Intel 520), the sequential write performance reaches 135 MB/s instead of the original 27 MB/s (object size = 4MB). The others show no difference, including random write. I am curious why the SSD device doesn't help random-write performance.
6. For random read/write, the data I provided before was correct, but here is the detail. Is it higher than what you expected?

rand-write-4k    rand-write-16k
bw     iops      bw     iops
3,524  881       9,032  564

mix-4k (50/50)
r:bw   r:iops   w:bw   w:iops
2,925  731      2,924  731

mix-8k (50/50)
r:bw   r:iops   w:bw   w:iops
4,509  563      4,509  563

mix-16k (50/50)
r:bw   r:iops   w:bw   w:iops
8,366  522      8,345  521

7. Here is the HW RAID cache policy we use now:
Write Policy: Write Back with BBU
Read Policy: ReadAhead

If you are interested in how HW RAID helps performance, I can do a little to help, since we also want to know the best configuration for our platform. Any test you want to see? Furthermore, is there any suggestion for our platform that could improve performance? Thanks!

-----Original Message-----
From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday, June 13, 2012 8:30 PM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org Subject: Re: Performance benchmark of rbd

Hi Eric!
On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote:

Hi, all: I am doing some benchmark of rbd. The platform is on a NAS storage.
CPU: Intel E5640 2.67GHz
Memory: 192 GB
Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12, 7200 rpm (H1 ~ H12)
RAID Card: LSI 9260-4i
OS: Ubuntu 12.04 with Kernel 3.2.0-24
Network: 1 Gb/s
We create 12 OSDs on H1 ~ H12, with the journals put on H0.

Just to make sure I understand, you have a single node with 12 OSDs and 3 mons, and all 12 OSDs are using the H0 disk for their journals? What filesystem are you using for the OSDs? How much replication?

We also create 3 MONs in the cluster. In brief, we set up an all-in-one ceph cluster with 3 monitors and 12 OSDs. The benchmark tool we used is fio 2.0.3. We had 7 basic test cases:
1) sequential write with bs=64k
2) sequential read with bs=64k
3) random write with bs=4k
4) random write with bs=16k
5) mixed read/write with bs=4k
6) mixed read/write with bs=8k
7) mixed read/write with bs=16k

We create several rbds with different object sizes for the benchmark:
1. size = 20G, object size = 32KB
2. size = 20G, object size = 512KB
3. size = 20G, object size = 4MB
4. size = 20G, object size = 32MB

Given how much memory you have, you may want to increase the amount of data you are writing during each test to rule out caching.

We have some conclusions after the benchmark.
a. We can get better performance for sequential read/write when the object size is bigger.

         Seq-read  Seq-write
32 KB    23 MB/s   690 MB/s
512 KB   26 MB/s   960 MB/s
4 MB     27 MB/s   1290 MB/s
32 MB    36 MB/s   1435 MB/s

Which test are these results from? I'm suspicious that the write numbers are so high. Figure that even with a local client and 1X replication, your journals and data partitions are each writing out a copy of the data. You don't have enough disk in that box to sustain 1.4GB/s to both even under perfectly ideal conditions. Given that it sounds like you are using a single 7200rpm disk for 12 journals, I would expect far lower numbers...

b. There is no obvious influence on random read/write when the object size is different. All the results are in a range of not more than 10%.

rand-write-4K  rand-write-16K  mix-4K
881 iops       564 iops        1462 iops
Performance benchmark of rbd
Hi, all: I am doing some benchmark of rbd. The platform is on a NAS storage.

CPU: Intel E5640 2.67GHz
Memory: 192 GB
Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12, 7200 rpm (H1 ~ H12)
RAID Card: LSI 9260-4i
OS: Ubuntu 12.04 with Kernel 3.2.0-24
Network: 1 Gb/s

We create 12 OSDs on H1 ~ H12, with the journals put on H0. We also create 3 MONs in the cluster. In brief, we set up an all-in-one ceph cluster with 3 monitors and 12 OSDs. The benchmark tool we used is fio 2.0.3. We had 7 basic test cases:
1) sequential write with bs=64k
2) sequential read with bs=64k
3) random write with bs=4k
4) random write with bs=16k
5) mixed read/write with bs=4k
6) mixed read/write with bs=8k
7) mixed read/write with bs=16k

We create several rbds with different object sizes for the benchmark:
1. size = 20G, object size = 32KB
2. size = 20G, object size = 512KB
3. size = 20G, object size = 4MB
4. size = 20G, object size = 32MB

We have some conclusions after the benchmark.
a. We can get better performance for sequential read/write when the object size is bigger.

         Seq-read  Seq-write
32 KB    23 MB/s   690 MB/s
512 KB   26 MB/s   960 MB/s
4 MB     27 MB/s   1290 MB/s
32 MB    36 MB/s   1435 MB/s

b. There is no obvious influence on random read/write when the object size is different. All the results are in a range of not more than 10%.

rand-write-4K  rand-write-16K  mix-4K     mix-8k     mix-16k
881 iops       564 iops        1462 iops  1127 iops  1044 iops

c. If we change the environment so that every 3 hard drives are bound together by RAID0 (LSI 9260-4i RAID card), the ceph cluster becomes 3 MONs and 4 OSDs (3T each). We get better performance on all items, around 10% ~ 20% enhancement.

d. If we change H0 to an SSD device and also put all the journals on it, we get better performance on sequential write; it reaches 135 MB/s. However, there is no difference for the other test items.

We want to check with you whether all these conclusions seem reasonable to you, or whether any seem strange? Thanks! Here is some data from the command provided by rados.
rados -p rbd bench 120 write -t 8
Total time run:        120.751713
Total writes made:     930
Write size:            4194304
Bandwidth (MB/sec):    30.807
Average Latency:       1.03807
Max latency:           2.63197
Min latency:           0.205726

[INF] bench: wrote 1024 MB in blocks of 4096 KB in 13.219819 sec at 79318 KB/sec
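[Editor's note] The rados bench summary is internally consistent, which is a useful sanity check when comparing it against fio numbers: bandwidth is just total data written divided by elapsed time, with MB here meaning 2**20 bytes:

```python
# Cross-check the rados bench summary above.
writes, secs = 930, 120.751713
write_size = 4194304                           # bytes per write (4 MiB)
bw_mb_s = writes * write_size / (1 << 20) / secs
print(round(bw_mb_s, 3))                       # 30.807, matching "Bandwidth (MB/sec)"
```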
RE: Performance benchmark of rbd
Hi, Mark:

I forgot to mention one thing: I create the rbd on the same machine and test it there, so the network latency may be lower than in the normal case.

1. I use ext4 as the backend filesystem with the following attributes: data=writeback,noatime,nodiratime,user_xattr
2. I use the default replication number; I think it is 2, right?
3. On my platform, I have 192GB memory.
4. Sorry, the column names were left-right reversed. Here is the correct table:

         Seq-write  Seq-read
32 KB    23 MB/s    690 MB/s
512 KB   26 MB/s    960 MB/s
4 MB     27 MB/s    1290 MB/s
32 MB    36 MB/s    1435 MB/s

5. If I put all the journal data on an SSD device (Intel 520), the sequential write performance reaches 135 MB/s instead of the original 27 MB/s (object size = 4MB). The others show no difference, including random write. I am curious why the SSD device doesn't help random-write performance.
6. For random read/write, the data I provided before was correct, but here is the detail. Is it higher than what you expected?

rand-write-4k    rand-write-16k
bw     iops      bw     iops
3,524  881       9,032  564

mix-4k (50/50)
r:bw   r:iops   w:bw   w:iops
2,925  731      2,924  731

mix-8k (50/50)
r:bw   r:iops   w:bw   w:iops
4,509  563      4,509  563

mix-16k (50/50)
r:bw   r:iops   w:bw   w:iops
8,366  522      8,345  521

7. Here is the HW RAID cache policy we use now:
Write Policy: Write Back with BBU
Read Policy: ReadAhead

If you are interested in how HW RAID helps performance, I can do a little to help, since we also want to know the best configuration for our platform. Any test you want to see? Furthermore, is there any suggestion for our platform that could improve performance? Thanks!

-----Original Message-----
From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday, June 13, 2012 8:30 PM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org Subject: Re: Performance benchmark of rbd

Hi Eric!

On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote:

Hi, all: I am doing some benchmark of rbd. The platform is on a NAS storage.
CPU: Intel E5640 2.67GHz
Memory: 192 GB
Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12, 7200 rpm (H1 ~ H12)
RAID Card: LSI 9260-4i
OS: Ubuntu 12.04 with Kernel 3.2.0-24
Network: 1 Gb/s
We create 12 OSDs on H1 ~ H12, with the journals put on H0.

Just to make sure I understand, you have a single node with 12 OSDs and 3 mons, and all 12 OSDs are using the H0 disk for their journals? What filesystem are you using for the OSDs? How much replication?

We also create 3 MONs in the cluster. In brief, we set up an all-in-one ceph cluster with 3 monitors and 12 OSDs. The benchmark tool we used is fio 2.0.3. We had 7 basic test cases:
1) sequential write with bs=64k
2) sequential read with bs=64k
3) random write with bs=4k
4) random write with bs=16k
5) mixed read/write with bs=4k
6) mixed read/write with bs=8k
7) mixed read/write with bs=16k

We create several rbds with different object sizes for the benchmark:
1. size = 20G, object size = 32KB
2. size = 20G, object size = 512KB
3. size = 20G, object size = 4MB
4. size = 20G, object size = 32MB

Given how much memory you have, you may want to increase the amount of data you are writing during each test to rule out caching.

We have some conclusions after the benchmark.
a. We can get better performance for sequential read/write when the object size is bigger.

         Seq-read  Seq-write
32 KB    23 MB/s   690 MB/s
512 KB   26 MB/s   960 MB/s
4 MB     27 MB/s   1290 MB/s
32 MB    36 MB/s   1435 MB/s

Which test are these results from? I'm suspicious that the write numbers are so high. Figure that even with a local client and 1X replication, your journals and data partitions are each writing out a copy of the data. You don't have enough disk in that box to sustain 1.4GB/s to both even under perfectly ideal conditions. Given that it sounds like you are using a single 7200rpm disk for 12 journals, I would expect far lower numbers...

b. There is no obvious influence on random read/write when the object size is different. All the results are in a range of not more than 10%.

rand-write-4K  rand-write-16K  mix-4K     mix-8k     mix-16k
881 iops       564 iops        1462 iops  1127 iops  1044 iops

c. If we change the environment, for every 3
Journal size of each disk
Dear all: I would like to know whether the journal size influences the performance of the disk. If each of my disks is 1 TB, how much space should I prepare for the journal? Thanks for any comment.
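[Editor's note] A common sizing rule from the Ceph documentation of this era is that the journal should absorb a few seconds of writes, not scale with disk capacity. The concrete values below are assumptions for illustration, not measurements from this thread:

```python
# Rule of thumb: osd journal size >= 2 * expected throughput * filestore max sync interval
expected_mb_s = 100        # assumed sustained write speed of one 7200rpm disk
sync_interval_s = 5        # assumed 'filestore max sync interval' setting
journal_mb = 2 * expected_mb_s * sync_interval_s
print(journal_mb)          # 1000 MB -- driven by throughput, not the 1 TB capacity
```

So a journal of roughly 1 GB per OSD is a typical starting point regardless of whether the data disk is 1 TB or larger.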
How to turn on async write of rbd ?
Dear All: I saw that rbd has supported async writes since 0.36 (http://ceph.com/2011/09/), but I cannot find a related document on how to turn it on. Should I just write "enabled" to /sys/devices/rbd/0/power/async? One more thing: if I want to implement iSCSI multipath with RBD, just like http://ceph.com/wiki/ISCSI, can I turn on async writes in that situation? Thanks!
Cannot restart the osd successful after reboot machine
Dear All: In my testing environment, we deployed a ceph cluster with version 0.43, kernel 3.2.0. (We deployed it several months ago, so the version is not the latest one.) There are 5 MONs and 8 OSDs in the cluster: five servers for the monitors, and two storage servers with 4 OSDs each.

We met a situation where we cannot restart the osd services successfully after rebooting one of the storage servers (which contains 4 OSDs). Let me describe the scenario in more detail.

1. One of the storage servers had a network problem, so we lost four OSDs in the cluster. When I typed 'ceph -s', I got some strange messages like this (sorry that I did not copy the exact message immediately):
-276/3108741 degraded (the number is negative, I am sure)
8 osds: 4 up, 4 in
2. After fixing the broken network, I tried to restart the four OSDs on it, but some of the OSDs would fail.
3. I repeated 'service ceph start' on the storage server. After maybe 10 attempts, all the OSDs finally worked fine and 'ceph health' returned HEALTH_OK.

I would appreciate any comment on this situation, thanks! If you want the complete logs for all the OSDs, I can send them to you.
2012-06-07 14:42:36.482616 7f62a02547a0 ceph version 0.43 (commit:9fa8781c0147d66fcef7c2dd0e09cd3c69747d37), process ceph-osd, pid 7146
2012-06-07 14:42:36.510945 7f62a02547a0 filestore(/srv/disk0) mount FIEMAP ioctl is supported
2012-06-07 14:42:36.511002 7f62a02547a0 filestore(/srv/disk0) mount did NOT detect btrfs
2012-06-07 14:42:36.511372 7f62a02547a0 filestore(/srv/disk0) mount found snaps
2012-06-07 14:42:36.640990 7f62a02547a0 filestore(/srv/disk0) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-06-07 14:42:36.816868 7f62a02547a0 journal _open /srv/disk0.journal fd 16: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.848522 7f62a02547a0 journal read_entry 410750976 : seq 1115076 1278 bytes
2012-06-07 14:42:36.848582 7f62a02547a0 journal read_entry 410759168 : seq 1115077 1275 bytes
2012-06-07 14:42:36.848810 7f62a02547a0 journal read_entry 410767360 : seq 1115078 1275 bytes
2012-06-07 14:42:36.848835 7f62a02547a0 journal read_entry 410775552 : seq 1115079 1272 bytes
2012-06-07 14:42:36.848859 7f62a02547a0 journal read_entry 410783744 : seq 1115080 1281 bytes
2012-06-07 14:42:36.848872 7f62a02547a0 journal read_entry 410791936 : seq 1115081 1281 bytes
2012-06-07 14:42:36.849181 7f62a02547a0 journal read_entry 410800128 : seq 1115082 1275 bytes
2012-06-07 14:42:36.849207 7f62a02547a0 journal read_entry 410808320 : seq 1115083 1278 bytes
2012-06-07 14:42:36.849225 7f62a02547a0 journal read_entry 410816512 : seq 1115084 1281 bytes
2012-06-07 14:42:36.849239 7f62a02547a0 journal read_entry 410824704 : seq 1115085 1275 bytes
2012-06-07 14:42:36.849255 7f62a02547a0 journal read_entry 410832896 : seq 1115086 1281 bytes
2012-06-07 14:42:36.849267 7f62a02547a0 journal read_entry 410841088 : seq 1115087 1278 bytes
2012-06-07 14:42:36.849282 7f62a02547a0 journal read_entry 410849280 : seq 1115088 1275 bytes
2012-06-07 14:42:36.849293 7f62a02547a0 journal read_entry 410857472 : seq 1115089 1275 bytes
2012-06-07 14:42:36.849328 7f62a02547a0 journal _open /srv/disk0.journal fd 16: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.851593 7f62a02547a0 journal close /srv/disk0.journal
2012-06-07 14:42:36.852625 7f62a02547a0 filestore(/srv/disk0) mount FIEMAP ioctl is supported
2012-06-07 14:42:36.852642 7f62a02547a0 filestore(/srv/disk0) mount did NOT detect btrfs
2012-06-07 14:42:36.852695 7f62a02547a0 filestore(/srv/disk0) mount found snaps
2012-06-07 14:42:36.852714 7f62a02547a0 filestore(/srv/disk0) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-06-07 14:42:36.855399 7f62a02547a0 journal _open /srv/disk0.journal fd 24: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-06-07 14:42:36.855429 7f62a02547a0 journal read_entry 410750976 : seq 1115076 1278 bytes
2012-06-07 14:42:36.855450 7f62a02547a0 journal read_entry 410759168 : seq 1115077 1275 bytes
2012-06-07 14:42:36.855464 7f62a02547a0 journal read_entry 410767360 : seq 1115078 1275 bytes
2012-06-07 14:42:36.855476 7f62a02547a0 journal read_entry 410775552 : seq 1115079 1272 bytes
2012-06-07 14:42:36.855487 7f62a02547a0 journal read_entry 410783744 : seq 1115080 1281 bytes
2012-06-07 14:42:36.855501 7f62a02547a0 journal read_entry 410791936 : seq 1115081 1281 bytes
2012-06-07 14:42:36.855514 7f62a02547a0 journal read_entry 410800128 : seq 1115082 1275 bytes
2012-06-07 14:42:36.855525 7f62a02547a0 journal read_entry 410808320 : seq 1115083 1278 bytes
2012-06-07 14:42:36.855536 7f62a02547a0 journal read_entry 410816512 : seq 1115084 1281 bytes
2012-06-07 14:42:36.855547 7f62a02547a0 journal read_entry 410824704 : seq 1115085 1275 bytes
2012-06-07 14:42:36.88 7f62a02547a0 journal read_entry 410832896 : seq 1115086 1281 bytes
2012-06-07 14:42:36.855569
Snapshot/Clone in RBD
Hi all: According to the documentation, a snapshot in RBD is read-only. That is to say, if I want to clone an image, I should use rbd_copy, right? I am curious whether this function is optimized, e.g. with copy-on-write, to speed it up. What I want to do is integrate rbd with kvm, and I hope I can clone an image as fast as possible to speed up VM creation (one base image, and create the other VMs from it). Do you have any suggestions on this point? Thanks a lot!
OSD id in configuration file
Hi, all: For scalability reasons, we would like to name the first hard disk on the first server osd.00101, and the first hard disk on the second server osd.00201. The ceph.conf looks like this:

[osd]
    osd data = /srv/osd.$id
    osd journal = /srv/osd.$id.journal
    osd journal size = 1000

[osd.00101]
    host = server-001
    btrfs dev = /dev/sda

[osd.00102]
    host = server-001
    btrfs dev = /dev/sdb

[osd.00103]
    host = server-001
    btrfs dev = /dev/sdc

[osd.00201]
    host = server-002
    btrfs dev = /dev/sda

[osd.00202]
    host = server-002
    btrfs dev = /dev/sdb

[osd.00203]
    host = server-002
    btrfs dev = /dev/sdc

[osd.00301]
    host = server-003
    btrfs dev = /dev/sda

[osd.00302]
    host = server-003
    btrfs dev = /dev/sdb

[osd.00303]
    host = server-003
    btrfs dev = /dev/sdc

But we are worried about whether this is an acceptable configuration for ceph. As I see it, the maximum osd is 304 now, although there are only 9 osds in the cluster. Will this configuration influence performance? And what if we add osd.00204 in the future? Thanks!
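[Editor's note] The "maximum osd is 304" observation follows directly from the naming scheme: OSD names are integer ids under the hood, so "osd.00101" is just osd 101, and the osdmap is sized by the highest id in use plus one. A small sketch of that arithmetic:

```python
# Sparse OSD ids leave empty slots in the osdmap: with ids up to 303,
# the map must cover 304 slots even though only 9 daemons exist.
ids = [101, 102, 103, 201, 202, 203, 301, 302, 303]   # the 9 OSDs above
max_osd = max(ids) + 1
print(max_osd, len(ids))   # 304 9 -> matches "the maximum osd is 304 now"
```

Whether those ~295 empty slots matter in practice (map size, tooling assumptions) is exactly the question posed to the list.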
Ceph based on ext4
Hi, all: We want to test the stability and performance of ceph on the ext4 filesystem. Here is the information from `mount`; did we set all the attributes correctly?

/dev/sda1 on /srv/osd.0 type ext4 (rw,noatime,nodiratime,errors=remount-ro,data=writeback,user_xattr)

Following the tutorial at http://ceph.newdream.net/wiki/Creating_a_new_file_system, we also disabled the journal functionality with `tune2fs -O ^has_journal /dev/sda1`. Any extra comments on getting better performance? Thanks!
Why only support odd number monitors in ceph cluster?
Hi, all: I am curious why a ceph cluster should only have an odd number of monitors. If we lose 1 monitor (leaving an even number), would it cause any problem if we do not handle the situation within a short time? Thanks!
Do not understand some terms about cluster health
Hi, All: When I type 'ceph health' to get the status of the cluster, it shows some information. Would you please explain the terms?

Ex: HEALTH_WARN 3/54 degraded (5.556%)
What does degraded mean? Is it a serious error, and how do I fix it?

Ex: HEALTH_WARN 264 pgs degraded, 6/60 degraded (10.000%); 3/27 unfound (11.111%)
What does unfound mean? Could we recover the data? Would it cause the whole data in the rbd image to be corrupted and never accessible?

When I type 'ceph pg dump', it shows output like this. Would you please explain what "hb in" and "hb out" are?

osdstat  kbused     kbavail      kb           hb in            hb out
0        17300872   1884175720   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
1        16661664   1884808728   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
2        15695664   1886027584   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
3        16463440   1885005328   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
4        14101016   1888130760   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
5        14015804   1888215124   1906311168   [6,7,8,9,10,11]  [6,7,8,9,10,11]
6        19312280   1881660776   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
7        14451992   1887521200   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
8        16336028   1885393468   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
9        16697868   1884773940   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
10       13530456   1888695776   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
11       13921908   1888307364   1906311168   [0,1,2,3,4,5]    [0,1,2,3,4,5]
sum      188488992  22632715768  22875734016

And from the latest document, I see that we can take a cluster snapshot with "ceph osd cluster_snap name". Does that mean we can roll back the data from the snapshot? Do you have any related document showing how to operate it? Thanks a lot!
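[Editor's note] One part of the question can be answered by arithmetic alone: the fractions in 'ceph health' are straightforward object-replica counts, and the percentages are just those ratios:

```python
# "3/54 degraded (5.556%)": 3 of 54 object replicas have fewer copies than desired.
degraded, total = 3, 54
pct = round(degraded / total * 100, 3)
print(pct)                 # 5.556, matching the first HEALTH_WARN line

# "3/27 unfound (11.111%)": same arithmetic on the unfound objects.
unfound_pct = round(3 / 27 * 100, 3)
print(unfound_pct)         # 11.111, matching the second HEALTH_WARN line
```

What the states *mean* operationally (and whether unfound data is recoverable) is the substantive question left for the list.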
Hang at 'rbd info image'
Hi, all: I met a situation where ceph hangs at 'rbd info <image>'. It can only be reproduced with one particular image. How can I find out what happened? Is there any log I can provide to you for analysis? Thanks!
Hang when mapping a long name rbd image
Hi, all: My ceph version is ceph version 0.39 (commit:321ecdaba2ceeddb0789d8f4b7180a8ea5785d83). When I try to map an rbd image with a long name to a device, it hangs for a long time. For example:

sudo rbd map iqn.2012-01.com.sample:storage.ttt --secret /etc/ceph/secretfile
sudo rbd map iqn.2012-01.com.sample:storage.abcdef --secret /etc/ceph/secretfile

It does not hang for every long image name; it only happens when the image name is very long. Is this a known issue? Thanks!
How to know the status of the monitors?
Hi, all: Is there any command or API that can retrieve the status of all monitors? I use 'ceph mon stat' to get this information. However, mon.e is not available (the server is shut down):

2011-12-13 17:08:35.499022 mon <- [mon,stat]
2011-12-13 17:08:35.499716 mon.1 -> 'e3: 5 mons at {a=172.16.33.5:6789/0,b=172.16.33.6:6789/0,c=172.16.33.7:6789/0,d=172.16.33.71:6789/0,e=172.16.33.72:6789/0}, election epoch 16, quorum 0,1,2,3' (0)

Even with 'ceph mon dump', I still cannot tell whether a monitor is alive or not:

0: 172.16.33.5:6789/0 mon.a
1: 172.16.33.6:6789/0 mon.b
2: 172.16.33.7:6789/0 mon.c
3: 172.16.33.71:6789/0 mon.d
4: 172.16.33.72:6789/0 mon.e

Please give some help. Thanks!
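[Editor's note] The 'ceph mon stat' output above does encode liveness indirectly: the "quorum 0,1,2,3" field lists the ranks currently in quorum, so any rank absent from it is not participating. A sketch of reading it that way (the string below is a simplified stand-in for the real output, and the trailing-field parse is an assumption about its layout):

```python
# Rank 4 (mon.e) is missing from the quorum list -> it is the dead monitor.
stat = "e3: 5 mons at {...}, election epoch 16, quorum 0,1,2,3"
all_ranks = set(range(5))                        # 5 monitors -> ranks 0..4
in_quorum = {int(r) for r in stat.rsplit("quorum ", 1)[1].split(",")}
print(sorted(all_ranks - in_quorum))             # [4] -> mon.e is unreachable
```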
RE: rbd device would disappear after re-boot
Hi, Tommi:

What I see is like this:

lrwxrwxrwx 1 root root 10 2011-12-07 16:48 foo1:0 -> ../../rbd0
lrwxrwxrwx 1 root root 10 2011-12-07 16:50 foo2:1 -> ../../rbd1

The extra number (:0 and :1) appended to the image name means the problem still exists.

-----Original Message-----
From: Tommi Virtanen [mailto:tommi.virta...@dreamhost.com] Sent: Thursday, December 08, 2011 1:31 AM To: Eric YH Chen/WHQ/Wistron Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WHQ/Wistron Subject: Re: rbd device would disappear after re-boot

On Tue, Dec 6, 2011 at 18:38, eric_yh_c...@wistron.com wrote:

I have another little question. Could I map a specific image to a specific device? For example, before re-boot:

id  pool  image  snap  device
0   rbd   foo1   -     /dev/rbd0
1   rbd   foo3   -     /dev/rbd2

How could I re-map the image (foo3) to the device (/dev/rbd2) but skip rbd1? The commands provided by the ceph CLI cannot achieve this. If I re-map the image to another device, it would not be in sync with the iSCSI configuration and may cause problems.

I'm just confirming what Damien said. Don't rely on the numbering to be consistent even from one boot to another; we provide a udev helper that assigns more permanent names, based on the pool, image name, and potentially snapshot name. You'll get a symlink like /dev/rbd/POOL/IMAGE or /dev/rbd/POOL/IMAGE@SNAP; use that name.
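[Editor's note] A minimal sketch of the stable naming scheme Tommi describes, assuming only the /dev/rbd/POOL/IMAGE[@SNAP] layout stated in the reply (this is not the actual udev rule, just the name construction it implies):

```python
# Build the persistent symlink path for an rbd device.
def rbd_symlink(pool, image, snap=None):
    name = f"{pool}/{image}" + (f"@{snap}" if snap else "")
    return f"/dev/rbd/{name}"

print(rbd_symlink("rbd", "foo1"))            # /dev/rbd/rbd/foo1
print(rbd_symlink("rbd", "foo3", "s1"))      # /dev/rbd/rbd/foo3@s1
```

Tooling such as an iSCSI target config can then reference these paths instead of the unstable /dev/rbdN numbers.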
How to sync data on different server but with the same image
Dear All: I map the same rbd image to the rbd device on two different servers. For example: 1. create rbd image named foo 2. map foo to /dev/rbd0 on server A, mount /dev/rbd0 to /mnt 3. map foo to /dev/rbd0 on server B, mount /dev/rbd0 to /mnt If I add a file to /mnt via server A, I hope I can see the same file on server B. However, I can't see it until I umount /mnt on server A and re-mount /mnt on server B. Do you have any comment on this scenario? How could I force the data synchronization? Actually, I want to implement iSCSI High Availability multipath as on http://ceph.newdream.net/wiki/ISCSI. Therefore, I tried this small experiment first, but failed. Would you please give me some suggestions before I start to implement it? Thanks!
rbd device would disappear after re-boot
Dear All: Another question about the rbd device: all rbd devices disappear after a server re-boot. Do you have any plan to implement a feature so the server would re-map all the devices during initialization? If not, do you have any suggestions (on the technical part) if we want to implement this feature ourselves? Furthermore, we would like to send a pull request after we finish it; are there any policies we should take care of? Thanks a lot!
RE: How to sync data on different server but with the same image
Hi, Wido: This is a preliminary experiment before implementing iSCSI High Availability multipath. http://ceph.newdream.net/wiki/ISCSI Therefore, we use Ceph as an rbd device instead of a filesystem. -Original Message- From: Wido den Hollander [mailto:w...@widodh.nl] Sent: Tuesday, December 06, 2011 5:33 PM To: Eric YH Chen/WHQ/Wistron; ceph-devel@vger.kernel.org Cc: Chris YT Huang/WHQ/Wistron Subject: Re: How to sync data on different server but with the same image Hi, - Original message - Dear All: I map the same rbd image to the rbd device on two different servers. For example: 1. create rbd image named foo 2. map foo to /dev/rbd0 on server A, mount /dev/rbd0 to /mnt 3. map foo to /dev/rbd0 on server B, mount /dev/rbd0 to /mnt If I add a file to /mnt via server A, I hope I can see the same file on server B. However, I can't see it until I umount /mnt on server A and re-mount /mnt on server B. You'd have to use a cluster filesystem like GFS or OCFS2 for this to work. But why not use Ceph as a filesystem instead of RBD? That seems to do what you want. Wido Do you have any comment on this scenario? How could I force the data synchronization? Actually, I want to implement iSCSI High Availability multipath as on http://ceph.newdream.net/wiki/ISCSI. Therefore, I tried this small experiment first, but failed. Would you please give me some suggestions before I start to implement it? Thanks!
RE: How to sync data on different server but with the same image
Hi, Brian: I would like to use the SCST target or LIO target, but I found they are not supported on ubuntu 11.10 server. (The LIO iSCSI target was released in the 3.1 kernel, but 11.10 server uses a 3.0 kernel.) Could you kindly share how to use the SCST target or LIO target on ubuntu? I tried multipath with STGT, but it doesn't work well. I think that is because it doesn't support BLOCKIO, am I right? Thanks! -Original Message- From: Brian Chrisman [mailto:brchris...@gmail.com] Sent: Wednesday, December 07, 2011 12:01 AM To: Eric YH Chen/WHQ/Wistron Cc: w...@widodh.nl; ceph-devel@vger.kernel.org; Chris YT Huang/WHQ/Wistron Subject: Re: How to sync data on different server but with the same image Eric, If you export the rbd device directly via your iSCSI target driver, it should work. I verified this with the SCST target, but the LIO target should work as well. As Wido said, you don't want to mount the same rbd device on multiple clients without a shared filesystem (and in that case, you might as well use cephfs), but exporting rbd over iSCSI works. On Tue, Dec 6, 2011 at 1:38 AM, eric_yh_c...@wistron.com wrote: Hi, Wido: This is a preliminary experiment before implementing iSCSI High Availability multipath. http://ceph.newdream.net/wiki/ISCSI Therefore, we use Ceph as an rbd device instead of a filesystem. -Original Message- From: Wido den Hollander [mailto:w...@widodh.nl] Sent: Tuesday, December 06, 2011 5:33 PM To: Eric YH Chen/WHQ/Wistron; ceph-devel@vger.kernel.org Cc: Chris YT Huang/WHQ/Wistron Subject: Re: How to sync data on different server but with the same image Hi, - Original message - Dear All: I map the same rbd image to the rbd device on two different servers. For example: 1. create rbd image named foo 2. map foo to /dev/rbd0 on server A, mount /dev/rbd0 to /mnt 3. map foo to /dev/rbd0 on server B, mount /dev/rbd0 to /mnt If I add a file to /mnt via server A, I hope I can see the same file on server B. 
However, I can't see it until I umount /mnt on server A and re-mount /mnt on server B. You'd have to use a cluster filesystem like GFS or OCFS2 for this to work. But why not use Ceph as a filesystem instead of RBD? That seems to do what you want. Wido Do you have any comment on this scenario? How could I force the data synchronization? Actually, I want to implement iSCSI High Availability multipath as on http://ceph.newdream.net/wiki/ISCSI. Therefore, I tried this small experiment first, but failed. Would you please give me some suggestions before I start to implement it? Thanks!
RE: rbd device would disappear after re-boot
Hi, Tommi: Thanks for your suggestion. We will try to implement a workaround solution for our internal experiment. I have another little question: could I map a specific image to a specific device? For example, before re-boot:
id  pool  image  snap  device
0   rbd   foo1   -     /dev/rbd0
1   rbd   foo3   -     /dev/rbd2
How could I re-map the image (foo3) to the device (/dev/rbd2) but skip rbd1? The command provided by the ceph CLI cannot achieve this. If I re-map the image to another device, it would not be in sync with the iSCSI configuration and may cause problems. Thanks a lot! -Original Message- From: Tommi Virtanen [mailto:tommi.virta...@dreamhost.com] Sent: Wednesday, December 07, 2011 3:11 AM To: Eric YH Chen/WHQ/Wistron Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WHQ/Wistron Subject: Re: rbd device would disappear after re-boot On Tue, Dec 6, 2011 at 01:31, eric_yh_c...@wistron.com wrote: Another question about the rbd device. All rbd devices disappear after a server re-boot. Do you have any plan to implement a feature so the server would re-map all the devices during initialization? If not, do you have any suggestions (on the technical part) if we want to implement this feature ourselves? Furthermore, we would like to send a pull request after we finish it; are there any policies we should take care of? Yes, all mappings need to be re-established after a reboot. There is no convention yet on configuring what should be re-established. Right now, putting an rbd map in something like /etc/rc.local is a decent workaround. I created http://tracker.newdream.net/issues/1790 to track this feature.
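The rc.local workaround Tommi suggests amounts to replaying a list of `rbd map` commands at boot. A minimal sketch of generating those commands, assuming the `rbd map POOL/IMAGE` CLI form; the helper and its config format are hypothetical, not part of ceph:

```python
def boot_remap_commands(mappings):
    """Given [(pool, image), ...], return the `rbd map` commands to run
    at boot (e.g. from /etc/rc.local).

    Note: the kernel assigns /dev/rbdN numbers in map order, so there is
    no way to pin foo3 to /dev/rbd2; anything that must survive reboots
    (such as iSCSI target config) should reference the stable
    /dev/rbd/POOL/IMAGE udev symlinks instead of /dev/rbdN.
    """
    return ["rbd map %s/%s" % (pool, image) for pool, image in mappings]

print(boot_remap_commands([("rbd", "foo1"), ("rbd", "foo3")]))
# ['rbd map rbd/foo1', 'rbd map rbd/foo3']
```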
Suggest to return [] when no image in the pool
Hi, developers, When I used the API in rbd.py, I found RBD().list(ioctx) fails instead of returning an empty list when there is no image in the pool. I suggest it should return [] in this case; it would avoid some programming problems. Thanks! regards, Eric/Pjack
RE: Cannot execute rados.py with sudoer
Hi, Tommi, Here is my ceph.conf. The /var/log/ceph folder was created by myself, because the script in 0.37 didn't create it. Maybe the problem is that I did not set the correct permissions on the folder.

; global
[global]
    auth supported = cephx
    max open files = 131072
    log file = /var/log/ceph/$name.log
    pid file = /var/run/ceph/$name.pid
    keyring = /etc/ceph/$name.keyring
[mon]
    mon data = /srv/mon.$id
[mon.a]
    host = ubuntu1104-64-5
    mon addr = 172.16.33.5:6789
[mds]
[mds.a]
    host = ubuntu1104-64-5
[osd]
    osd data = /srv/osd.$id
    osd journal = /srv/osd.$id.journal
    osd journal size = 1000 ; journal size, in megabytes
[osd.0]
    host = ubuntu1104-64-6
    btrfs devs = /dev/mapper/ubuntu1104--64--6-lvol0
[osd.1]
    host = ubuntu64-33-7
    btrfs devs = /dev/mapper/ubuntu64--33--7-lvol0
[osd.2]
    host = ubuntu1104-64-5
    btrfs devs = /dev/mapper/ubuntu1104--64--5-lvol0

regards, Eric/Pjack -Original Message- From: Tommi Virtanen [mailto:tommi.virta...@dreamhost.com] Sent: Friday, November 04, 2011 1:51 AM To: Eric YH Chen/WHQ/Wistron Cc: gregory.far...@dreamhost.com; ceph-devel@vger.kernel.org Subject: Re: Cannot execute rados.py with sudoer On Wed, Nov 2, 2011 at 22:24, eric_yh_c...@wistron.com wrote: The log is generated by the ceph service at runtime. Even if I change the permissions, they would be overwritten by the service someday. Did you change ceph.conf and set one of the log options? The default config writes to /var/log only from the daemons, not from the libraries. Can you please share your configuration? As for needing to be able to read client.admin, that file is not changed by the ceph services starting, so you can safely chown/chmod it. Alternatively, give the non-root user a new key, and authorize that with ceph auth add.
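The `log file = /var/log/ceph/$name.log` line in this config is why librados tries to open /var/log/ceph/client.admin.log: ceph.conf paths use $name (and $id, the part after the dot) as metavariables expanded per daemon or client. A tiny illustrative re-implementation of that expansion, not ceph's own parser:

```python
def expand_conf_path(template, name):
    """Expand the $name/$id metavariables used in ceph.conf path options.

    `name` is the entity name, e.g. "client.admin" or "mon.a"; $id is
    the portion after the first dot.
    """
    id_part = name.split('.', 1)[1] if '.' in name else name
    return template.replace('$name', name).replace('$id', id_part)

print(expand_conf_path('/var/log/ceph/$name.log', 'client.admin'))
# /var/log/ceph/client.admin.log
print(expand_conf_path('/srv/mon.$id', 'mon.a'))
# /srv/mon.a
```

This is why every client, not just the daemons, needs write access under /var/log/ceph when a global `log file` option is set.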
Cannot execute rados.py with sudoer
Hi, all: When I use rados.py, I met some problems even though the user is a sudoer. I found it would access /etc/ceph/client.admin.keyring and /var/log/ceph/client.admin.log, which are only available to root. Do you have any suggestion? I cannot execute the python program with the “root” account, as it would cause some security issues. Thanks a lot! Here is the sample code.

import rados
cluster = rados.Rados()
cluster.conf_read_file()
failed to open log file '/var/log/ceph/client.admin.log': error 13: Permission denied
cluster.connect()
2011-11-03 11:49:20.937991 7f9fe5320720 monclient(hunting): MonClient::init(): Failed to create keyring
2011-11-03 11:49:50.938235 7f9fe5320720 monclient(hunting): authenticate timed out after 30
2011-11-03 11:49:50.938283 7f9fe5320720 librados: client.admin authentication error error 110: Connection timed out
Traceback (most recent call last):
  File stdin, line 1, in module
  File /usr/lib/python2.7/rados.py, line 182, in connect
    raise make_ex(ret, error calling connect)
rados.Error: error calling connect: error code 110

-rw------- 1 root root 92 2011-11-02 18:13 client.admin.keyring
-rw------- 1 root root  0 2011-11-03 07:47 client.admin.log

regards, Eric/Pjack
RE: Cannot execute rados.py with sudoer
Hi, Greg, The log is generated by the ceph service at runtime. Even if I change the permissions, they would be overwritten by the service someday. I am afraid there may be other permission problems when I execute other commands, e.g. I may need to modify more files' permissions, or the library may use some API in kernel space. Anyway, thanks for your reply; I will try to modify the two files' permissions first. regards, Eric/Pjack -Original Message- From: Gregory Farnum [mailto:gregory.far...@dreamhost.com] Sent: Thursday, November 03, 2011 12:49 PM To: Eric YH Chen/WHQ/Wistron Cc: ceph-devel@vger.kernel.org Subject: Re: Cannot execute rados.py with sudoer This looks like your standard permissions issue to me. The keyring and log were probably created by mkcephfs running under sudo? But if you give your current user the ability to read/write from them everything should work fine. -Greg On Wed, Nov 2, 2011 at 8:55 PM, eric_yh_c...@wistron.com wrote: Hi, all: When I use rados.py, I met some problems even though the user is a sudoer. I found it would access /etc/ceph/client.admin.keyring and /var/log/ceph/client.admin.log, which are only available to root. Do you have any suggestion? I cannot execute the python program with the “root” account, as it would cause some security issues. Thanks a lot! Here is the sample code. 
import rados
cluster = rados.Rados()
cluster.conf_read_file()
failed to open log file '/var/log/ceph/client.admin.log': error 13: Permission denied
cluster.connect()
2011-11-03 11:49:20.937991 7f9fe5320720 monclient(hunting): MonClient::init(): Failed to create keyring
2011-11-03 11:49:50.938235 7f9fe5320720 monclient(hunting): authenticate timed out after 30
2011-11-03 11:49:50.938283 7f9fe5320720 librados: client.admin authentication error error 110: Connection timed out
Traceback (most recent call last):
  File stdin, line 1, in module
  File /usr/lib/python2.7/rados.py, line 182, in connect
    raise make_ex(ret, error calling connect)
rados.Error: error calling connect: error code 110

-rw------- 1 root root 92 2011-11-02 18:13 client.admin.keyring
-rw------- 1 root root  0 2011-11-03 07:47 client.admin.log

regards, Eric/Pjack
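Greg's diagnosis can be checked up front: before calling connect(), verify the current user can actually read the keyring and write the log, which are exactly the two failures in the traceback above. A small illustrative helper, not part of rados.py (the paths are the defaults shown in the error messages):

```python
import os

def unreadable_client_files(keyring, log):
    """Return the client files the current user cannot access.

    librados needs to read the keyring, and (when a log file option is
    configured) write the log, before authentication can succeed.
    """
    problems = []
    if not os.access(keyring, os.R_OK):
        problems.append(keyring)
    if not os.access(log, os.W_OK):
        problems.append(log)
    return problems

print(unreadable_client_files('/no/such/client.admin.keyring',
                              '/no/such/client.admin.log'))
# ['/no/such/client.admin.keyring', '/no/such/client.admin.log']
```

A pre-flight check like this turns the opaque "error code 110" timeout into an immediate, actionable permissions report.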
Some question about Ceph system (release 0.37)
Hi, everyone Nice to meet you. I would like to do some experiments on the ceph system. After I installed the latest release 0.37, the ceph system seems to work well (type ceph -s). However, I found some error messages in dmesg that I never saw in release 0.34. My environment is ubuntu 11.04 with linux kernel 2.6.38. Would you please give me some hint about where the problem is? Maybe it is just a configuration issue. Thanks!

Message from dmesg:
[2304674.857044] libceph: mon0 172.16.33.5:6789 session established
[2304677.935334] libceph: auth method 'x' error -1
[2304697.923903] libceph: auth method 'x' error -1
[2304709.822745] libceph: tid 154730 timed out on osd0, will reset osd
[2304717.912381] libceph: auth method 'x' error -1
[2304737.900906] libceph: auth method 'x' error -1
[2304757.889406] libceph: auth method 'x' error -1
[2304769.788260] libceph: tid 154730 timed out on osd0, will reset osd
[2304777.877900] libceph: auth method 'x' error -1
[2304797.866390] libceph: auth method 'x' error -1

And in mon.a.log:
2011-10-26 09:48:11.297463 7f7cb8560700 -- 172.16.33.5:6789/0 172.16.33.5:0/325292109 pipe(0x28e2c80 sd=12 pgs=0 cs=0 l=0).accept peer addr is really 172.16.33.5:0/325292109 (socket is 172.16.33.5:39994/0)
2011-10-26 09:48:11.297952 7f7cb9d6a700 cephx server client.admin: unexpected key: req.key=9366eb4b71c5f40d expected_key=dc17c2ee925c8134
2011-10-26 09:48:16.412964 7f7cb9d6a700 cephx server client.admin: unexpected key: req.key=bd7776ed4b075d7b expected_key=b2b0f2b114bed067
2011-10-26 09:48:36.453005 7f7cb9d6a700 cephx server client.admin: unexpected key: req.key=8b03356c9af2d0f8 expected_key=83a1d2af73204029
2011-10-26 09:48:56.492979 7f7cb9d6a700 cephx server client.admin: unexpected key: req.key=c09d0833fe6b4819 expected_key=cb1e41b59fd22966

Text by ceph -s:
2011-10-26 09:51:33.846500 pg v433: 396 pgs: 345 active, 51 active+clean+degraded; 40985 KB data, 84016 KB used, 395 GB / 400 GB avail; 4/64 degraded (6.250%)
2011-10-26 09:51:33.847231 mds e4: 1/1/1 up {0=a=up:active}
2011-10-26 09:51:33.847253 osd e22: 2 osds: 2 up, 2 in
2011-10-26 09:51:33.847431 log 2011-10-25 17:29:29.924855 osd.0 172.16.33.6:6800/22812 475 : [INF] 2.6f scrub ok
2011-10-26 09:51:33.847473 mon e1: 1 mons at {a=172.16.33.5:6789/0}

Text by ceph auth list:
2011-10-26 09:51:58.966092 mon - [auth,list]
2011-10-26 09:51:58.966678 mon.0 - 'installed auth entries:
mon.
  key: AQAq559O8IroBRAAOkS/pDrdo5jNFvemZenGGg==
mds.a
  key: AQAq559OkPp5ABAAHgj9X0lesN5OrnjgTdv4QQ==
  caps: [mds] allow
  caps: [mon] allow rwx
  caps: [osd] allow *
osd.0
  key: AQAu559O0GPAFRAAJ6XaaFsc7AXBypxAGcmYbA==
  caps: [mon] allow rwx
  caps: [osd] allow *
osd.1
  key: AQAb559OWKdAKRAA69ho1SgvmNYbBXXgzJYn2g==
  caps: [mon] allow rwx
  caps: [osd] allow *
osd.2
  key: AQDQ659OQN6aIxAAprfrLDYFmS+XR/+aoRxqmA==
  caps: [mon] allow rwx
  caps: [osd] allow *
client.admin
  key: AQAq559OsFC0BBAABpVi56Uvr0H1qBck7dHhbg==
  caps: [mds] allow
  caps: [mon] allow *
  caps: [osd] allow *

regards, Eric/Pjack
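The repeated "auth method 'x' error" lines in dmesg, paired with the monitor's "unexpected key" entries, indicate the kernel client keeps failing the cephx handshake; common causes include a wrong or stale client secret on the mounting host, or clock skew between client and monitors (cephx tickets are time-sensitive). A small illustrative sketch for spotting these failures in captured dmesg output; the helper is hypothetical, not a ceph tool:

```python
import re

def cephx_auth_failures(dmesg_text):
    """Count kernel-client cephx failures in dmesg output.

    Repeated "auth method 'x' error" lines mean the cephx handshake
    keeps failing rather than failing once and recovering.
    """
    return len(re.findall(r"libceph: auth method 'x' error", dmesg_text))

sample = """[2304677.935334] libceph: auth method 'x' error -1
[2304697.923903] libceph: auth method 'x' error -1
[2304709.822745] libceph: tid 154730 timed out on osd0, will reset osd"""
print(cephx_auth_failures(sample))  # 2
```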