Re: [ceph-users] trouble starting second monitor
[celtic][DEBUG ] create the mon path if it does not exist

    mkdir /var/lib/ceph/mon/

2014-12-01 4:32 GMT+03:00 K Richard Pixley <r...@noir.com>:

> What does this mean, please? --rich
>
>     ceph@adriatic:~/my-cluster$ ceph status
>         cluster 1023db58-982f-4b78-b507-481233747b13
>          health HEALTH_OK
>          monmap e1: 1 mons at {black=192.168.1.77:6789/0}, election epoch 2, quorum 0 black
>          mdsmap e7: 1/1/1 up {0=adriatic=up:active}, 3 up:standby
>          osdmap e17: 4 osds: 4 up, 4 in
>           pgmap v48: 192 pgs, 3 pools, 1884 bytes data, 20 objects
>                 29134 MB used, 113 GB / 149 GB avail
>                      192 active+clean
>
>     ceph@adriatic:~/my-cluster$ ceph-deploy mon create celtic
>     [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
>     [ceph_deploy.cli][INFO ] Invoked (1.5.20): /usr/bin/ceph-deploy mon create celtic
>     [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts celtic
>     [ceph_deploy.mon][DEBUG ] detecting platform for host celtic ...
>     [celtic][DEBUG ] connection detected need for sudo
>     [celtic][DEBUG ] connected to host: celtic
>     [celtic][DEBUG ] detect platform information from remote host
>     [celtic][DEBUG ] detect machine type
>     [ceph_deploy.mon][INFO ] distro info: Ubuntu 14.04 trusty
>     [celtic][DEBUG ] determining if provided host has same hostname in remote
>     [celtic][DEBUG ] get remote short hostname
>     [celtic][DEBUG ] deploying mon to celtic
>     [celtic][DEBUG ] get remote short hostname
>     [celtic][DEBUG ] remote hostname: celtic
>     [celtic][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>     [celtic][DEBUG ] create the mon path if it does not exist
>     [celtic][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-celtic/done
>     [celtic][DEBUG ] create a done file to avoid re-doing the mon deployment
>     [celtic][DEBUG ] create the init path if it does not exist
>     [celtic][DEBUG ] locating the `service` executable...
>     [celtic][INFO ] Running command: sudo initctl emit ceph-mon cluster=ceph id=celtic
>     [celtic][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.celtic.asok mon_status
>     [celtic][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
>     [celtic][WARNIN] monitor: mon.celtic, might not be running yet
>     [celtic][INFO ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.celtic.asok mon_status
>     [celtic][ERROR ] admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
>     [celtic][WARNIN] celtic is not defined in `mon initial members`
>     [celtic][WARNIN] monitor celtic does not exist in monmap
>     [celtic][WARNIN] neither `public_addr` nor `public_network` keys are defined for monitors
>     [celtic][WARNIN] monitors may not be able to form quorum

--
Regards,
Фасихов Ирек Нургаязович
Mob.: +79229045757
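The four WARNIN lines at the end point at the actual problem: celtic is not in `mon initial members` and no public network is defined, so the new mon cannot join the existing monmap. A minimal sketch of the usual fix follows; the 192.168.1.0/24 subnet is an assumption inferred from the mon address above, so adjust names and network to your cluster:

    # ceph.conf on the admin node (subnet/hostnames assumed)
    [global]
    mon_initial_members = black, celtic
    public_network = 192.168.1.0/24

    # push the updated config, then retry
    ceph-deploy --overwrite-conf config push black celtic
    ceph-deploy mon create celtic

On an already-running cluster, `ceph-deploy mon add celtic` may be the more appropriate subcommand than `mon create`.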
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:

> Ilya, further to your email I have switched back to the 3.18 kernel that
> you've sent and I got similar looking dmesg output as I had on the 3.17
> kernel. Please find it attached for your reference.
>
> As before, this is the command I've run on the client:
>
>     time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct
>     time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct

Can you run that command again on the 3.18 kernel, to completion, and paste:

- the entire dmesg
- the time results for each dd

Compare those to your results with four dds (or any other number which
doesn't trigger page allocation failures).

Thanks,

    Ilya
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
> I think if you enable TRIM support on your RBD, then run fstrim on your
> filesystems inside the guest (assuming ext4 / XFS guest filesystem), Ceph
> should reclaim the trimmed space.

Yes, it's working fine. (You need to use virtio-scsi and enable the discard
option.)

----- Original Message -----
From: Daniel Swarbrick <daniel.swarbr...@profitbricks.com>
To: ceph-users@lists.ceph.com
Sent: Friday, 28 November 2014 17:16:14
Subject: Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?

Take a look at
http://ceph.com/docs/master/rbd/qemu-rbd/#enabling-discard-trim

I think if you enable TRIM support on your RBD, then run fstrim on your
filesystems inside the guest (assuming ext4 / XFS guest filesystem), Ceph
should reclaim the trimmed space.

On 28/11/14 17:05, Christoph Adomeit wrote:
> Hi,
>
> I would like to shrink a thin provisioned rbd image which has grown to
> maximum. 90% of the data in the image is deleted data which is still
> hidden in the image and marked as deleted.
>
> So I think I can fill the whole image with zeroes and then qemu-img
> convert it, so the newly created image should be only 10% of the maximum
> size. I will do something like:
>
>     qemu-img convert -O raw rbd:pool/origimage rbd:pool/smallimage
>     rbd rename origimage origimage-saved
>     rbd rename smallimage origimage
>
> Would this be the best and fastest way, or are there other ways to do this?
>
> Thanks
>   Christoph
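For reference, a minimal sketch of the virtio-scsi + discard setup described above, as a libvirt domain XML fragment; the pool/image names are placeholders, and discard='unmap' requires a reasonably recent QEMU/libvirt:

    <controller type='scsi' model='virtio-scsi'/>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' discard='unmap'/>
      <source protocol='rbd' name='pool/vmdisk'/>
      <target dev='sda' bus='scsi'/>
    </disk>

Inside the guest, running `fstrim -v /` on each mounted filesystem then releases the deleted blocks back to Ceph.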
[ceph-users] Removing Snapshots Killing Cluster Performance
Hi!

We take regular (nightly) snapshots of our Rados Gateway pools for backup
purposes. This allows us - with some manual pokery - to restore clients'
documents should they delete them accidentally.

The cluster is a 4 server setup with 12x4TB spinning disks each, totaling
about 175TB. We are running Firefly.

We have now completed our first month of snapshots and want to remove the
oldest ones. Unfortunately, doing so practically kills everything else that
is using the cluster, because performance drops to almost zero while the
OSDs work their disks at 100% (as per iostat). It seems this is the same
phenomenon I asked about some time ago when we were deleting whole pools. I
could not find any way to throttle the background deletion activity (the
command returns almost immediately).

Here is a graph of the I/O operations waiting (colored by device) while
deleting a few snapshots. Each of the blocks in the graph shows one snapshot
being removed. The big one in the middle was a snapshot of the .rgw.buckets
pool. It took about 15 minutes, during which basically nothing relying on
the cluster was working due to immense slowdowns. This included users
getting kicked off their SSH sessions due to timeouts.

https://public.centerdevice.de/8c95f1c2-a7c3-457f-83b6-834688e0d048

While this is a big issue in itself for us, we would at least like to
estimate how long the process will take per snapshot / per pool. I assume
the time needed is a function of the number of objects that were modified
between two snapshots. We tried to get an idea of at least how many objects
were added/removed in total by running `rados df` with a snapshot specified
as a parameter, but it seems we still always get the current values:

    $ sudo rados -p .rgw df --snap backup-20141109
    selected snap 13 'backup-20141109'
    pool name    category    KB        objects
    .rgw         -           276165    1368545

    $ sudo rados -p .rgw df --snap backup-20141124
    selected snap 28 'backup-20141124'
    pool name    category    KB        objects
    .rgw         -           276165    1368546

    $ sudo rados -p .rgw df
    pool name    category    KB        objects
    .rgw         -           276165    1368547

So there are a few questions:

1) Is there any way to control how much such an operation will tax the
   cluster (we would be happy to have it run longer, if that meant not
   utilizing all disks fully during that time)?
2) Is there a way to get a decent approximation of how much work deleting a
   specific snapshot will entail (in terms of objects, time, whatever)?
3) Would SSD journals help here? Or any other hardware configuration change
   for that matter?
4) Any other recommendations?

We definitely need to remove the data - not because of a lack of space (at
least not at the moment), but because when customers delete stuff / cancel
accounts, we are obliged to remove their data at least after a reasonable
amount of time.

Cheers,
Daniel
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
Hi,

Which version of Ceph are you using? This could be related:
http://tracker.ceph.com/issues/9487

See "ReplicatedPG: don't move on to the next snap immediately"; basically,
the OSD is getting into a tight loop trimming the snapshot objects. The fix
above breaks out of that loop more frequently, and then you can use the
"osd snap trim sleep" option to throttle it further. I'm not sure the fix
above will be sufficient if you have many objects to remove per snapshot.

That commit is only in Giant at the moment. The backport to Dumpling is in
the dumpling branch but not yet in a release, and the Firefly backport is
still pending.

Cheers, Dan

On 01 Dec 2014, at 10:51, Daniel Schneller
<daniel.schnel...@centerdevice.com> wrote:
> [...]
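For the archive: once a release carrying the fix is running, the throttle Dan mentions can be applied roughly like this (the 0.1s value is an illustrative starting point, not a tested recommendation):

    # inject at runtime on all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'

    # or persist it in ceph.conf
    [osd]
    osd snap trim sleep = 0.1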
Re: [ceph-users] Compile from source with Kinetic support
I'm sorry, but the compilation still fails after including the cpp-client
headers:

      CXX      os/libos_la-KeyValueDB.lo
    os/KeyValueDB.cc: In static member function 'static KeyValueDB* KeyValueDB::create(CephContext*, const string&, const string&)':
    os/KeyValueDB.cc:18:16: error: expected type-specifier before 'KineticStore'
         return new KineticStore(cct);
                    ^
    os/KeyValueDB.cc:18:16: error: expected ';' before 'KineticStore'
    os/KeyValueDB.cc:18:32: error: 'KineticStore' was not declared in this scope
         return new KineticStore(cct);
                                    ^
    os/KeyValueDB.cc: In static member function 'static int KeyValueDB::test_init(const string&, const string&)':
    os/KeyValueDB.cc:36:12: error: 'KineticStore' has not been declared
         return KineticStore::_test_init(g_ceph_context);
                ^
      CXX      os/libos_la-KeyValueStore.lo
    make[3]: *** [os/libos_la-KeyValueDB.lo] Error 1
    make[3]: *** Waiting for unfinished jobs
    In file included from os/KeyValueStore.cc:53:0:
    os/KineticStore.h:13:29: fatal error: kinetic/kinetic.h: No such file or directory
     #include <kinetic/kinetic.h>
                                 ^
    compilation terminated.
    make[3]: *** [os/libos_la-KeyValueStore.lo] Error 1

--
Julien

On 11/28/2014 08:54 PM, Nigel Williams wrote:
> On Sat, Nov 29, 2014 at 5:19 AM, Julien Lutran <julien.lut...@ovh.net> wrote:
>> Where can I find this kinetic devel package ?
>
> I guess you want this (C++ kinetic client)? It has kinetic.h at least.
> https://github.com/Seagate/kinetic-cpp-client
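For what it's worth, a build sequence along the following lines should make the header visible to the Ceph tree; this is a sketch that assumes the kinetic-cpp-client installs system-wide and that the configure switch matches the WITH_KINETIC guard in the source - verify both against your checkout:

    git clone https://github.com/Seagate/kinetic-cpp-client
    cd kinetic-cpp-client
    cmake . && make && sudo make install   # should provide kinetic/kinetic.h
    cd ../ceph
    ./autogen.sh && ./configure --with-kinetic && make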
[ceph-users] LevelDB support status is still experimental on Giant?
Hi guys,

I'm interested in using a key/value store as the backend of a Ceph OSD.
When Firefly was released, LevelDB support was mentioned as experimental;
is the status the same in the Giant release?

Regards,

Satoru Funai
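For context, enabling the experimental backend looked roughly like the following in the Firefly/Giant era; treat the option names as assumptions and check the release notes for your exact version:

    # ceph.conf -- experimental, not for data you care about
    [osd]
    osd objectstore = keyvaluestore-dev   # instead of the default filestore
    keyvaluestore backend = leveldb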
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya,

I will try doing that once again tonight, as this is a production cluster,
and when the dds trigger that dmesg error the cluster's I/O becomes very
bad and I have to reboot the server to get things back on track. Most of my
vms start having 70-90% iowait until that server is rebooted.

I've actually checked what you've asked the last time I ran the test. When
I do 4 dds concurrently, nothing appears in the dmesg output - no messages
at all. The kern.log file that I sent last time is what I got about a
minute after I started 8 dds. I've pasted the full output. The 8 dds did
actually complete, but it took a rather long time. I was getting about
6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds
were running.

Do you still want me to run this, or is the information I've provided
enough?

Cheers

Andrei

----- Original Message -----
From: Ilya Dryomov <ilya.dryo...@inktank.com>
To: Andrei Mikhailovsky <and...@arhont.com>
Cc: ceph-users <ceph-users@lists.ceph.com>, Gregory Farnum <g...@gregs42.com>
Sent: Monday, 1 December, 2014 8:22:08 AM
Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

> [...]
Re: [ceph-users] large reads become 512 kbyte reads on qemu-kvm rbd
On Mon, Dec 1, 2014 at 1:09 PM, Dan Van Der Ster
<daniel.vanders...@cern.ch> wrote:
> Hi Ilya,
>
> On 28 Nov 2014, at 17:56, Ilya Dryomov <ilya.dryo...@inktank.com> wrote:
>> On Fri, Nov 28, 2014 at 5:46 PM, Dan Van Der Ster
>> <daniel.vanders...@cern.ch> wrote:
>>> Hi Andrei,
>>> Yes, I'm testing from within the guest. Here is an example. First, I do
>>> 2MB reads when max_sectors_kb=512, and we see the reads are split into 4
>>> (fio sees 25 iops, though iostat reports 100 smaller iops):
>>>
>>>     # echo 512 > /sys/block/vdb/queue/max_sectors_kb   # this is the default
>>>     # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
>>>     /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
>>>     fio-2.0.13
>>>     Starting 1 process
>>>     Jobs: 1 (f=1): [R] [100.0% done] [51200K/0K/0K /s] [25 /0 /0 iops] [eta 00m:00s]
>>>
>>> Meanwhile iostat is reporting 100 iops of average size 1024 sectors
>>> (i.e. 512kB):
>>>
>>>     Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>>     vdb        0.00    0.00 100.00  0.00  50.00   0.00   1024.00      3.02  30.25  10.00 100.00
>>>
>>> Now increase max_sectors_kb to 4MB, and the IOs are no longer split:
>>>
>>>     # echo 4096 > /sys/block/vdb/queue/max_sectors_kb
>>>     # fio --readonly --name /dev/vdb --rw=read --size=1G --ioengine=libaio --direct=1 --runtime=10s --blocksize=2m
>>>     /dev/vdb: (g=0): rw=read, bs=2M-2M/2M-2M/2M-2M, ioengine=libaio, iodepth=1
>>>     fio-2.0.13
>>>     Starting 1 process
>>>     Jobs: 1 (f=1): [R] [100.0% done] [200.0M/0K/0K /s] [100 /0 /0 iops] [eta 00m:00s]
>>>
>>> iostat reports 100 iops, 4096 sectors each read (i.e. 2MB):
>>>
>>>     Device:  rrqm/s  wrqm/s   r/s   w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
>>>     vdb      300.00    0.00 100.00  0.00 200.00   0.00   4096.00      0.99   9.94   9.94  99.40
>>
>> We set the hard request size limit to rbd object size (4M typically):
>>
>>     blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
>
> Are you referring to librbd or krbd? My observations are limited to
> librbd at the moment. (I didn't try this on krbd.)

Yes, I was referring to krbd. But it looks like that patch from Christoph
will change this for qemu+librbd as well - an artificial soft limit imposed
by the VM kernel will disappear. CC'ing Josh.

>> but the block layer then sets the soft limit for fs requests to 512K:
>>
>>     BLK_DEF_MAX_SECTORS = 1024,
>>     limits->max_sectors = min_t(unsigned int, max_hw_sectors, BLK_DEF_MAX_SECTORS);
>>
>> which you are supposed to change on a per-device basis via sysfs. We
>> could probably raise the soft limit to rbd object size by default as
>> well - I don't see any harm in that.
>
> Indeed, this patch which was being targeted for 3.19:
> https://lkml.org/lkml/2014/9/6/123

Oh good, I was just about to send a patch for krbd.

Thanks,

    Ilya
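Until such a default lands, the sysfs knob has to be reapplied after every boot; one way to persist it is a udev rule, sketched here for the vdb example above (rule file name and device match are assumptions):

    # /etc/udev/rules.d/99-rbd-max-sectors.rules
    ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/max_sectors_kb}="4096"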
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> Ilya,
>
> I will try doing that once again tonight, as this is a production cluster,
> and when the dds trigger that dmesg error the cluster's I/O becomes very
> bad and I have to reboot the server to get things back on track. Most of
> my vms start having 70-90% iowait until that server is rebooted.

That's easily explained - those splats in dmesg indicate a case of severe
memory pressure.

> [...]
>
> Do you still want me to run this, or is the information I've provided
> enough?

No, no need if it's a production cluster.

Thanks,

    Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya,

I see. My server has 24GB of RAM + 3GB of swap. While running the tests,
I've noticed that the server had 14GB of RAM shown as cached and only 2MB
used from the swap. Not sure if this is helpful to your debugging.

Andrei

--
Andrei Mikhailovsky
Director, Arhont Information Security

----- Original Message -----
From: Ilya Dryomov <ilya.dryo...@inktank.com>
To: Andrei Mikhailovsky <and...@arhont.com>
Cc: ceph-users <ceph-users@lists.ceph.com>, Gregory Farnum <g...@gregs42.com>
Sent: Monday, 1 December, 2014 11:06:37 AM
Subject: Re: [ceph-users] Giant + nfs over cephfs hang tasks

> [...]
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
On 01/12/14 10:22, Alexandre DERUMIER wrote:
> Yes, it's working fine. (You need to use virtio-scsi and enable the
> discard option.)

Does it work with virtio-blk if you attach the RBD as a LUN? Supposedly,
SCSI pass-through works in this mode, e.g.:

    <disk type='block' device='lun'>
      <target dev='vda' bus='virtio'/>
      ...
    </disk>

However, it seems that virtio-scsi is slowly becoming preferred over
virtio-blk. Are there any disadvantages to using virtio-scsi now? Does it
support live migration?
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
On 2014-12-01 10:03:35 +0000, Dan Van Der Ster said:

> Which version of Ceph are you using? This could be related:
> http://tracker.ceph.com/issues/9487

Firefly. I had seen this ticket earlier (when deleting a whole pool) and
hoped the backport of the fix would be available some time soon. I must
admit I did not look this up before posting, because I had forgotten about
it.

> See "ReplicatedPG: don't move on to the next snap immediately"; basically,
> the OSD is getting into a tight loop trimming the snapshot objects. The
> fix above breaks out of that loop more frequently, and then you can use
> the "osd snap trim sleep" option to throttle it further. I'm not sure the
> fix above will be sufficient if you have many objects to remove per
> snapshot.

Just so I get this right: with the fix alone you are not sure it would be
nice enough, so adjusting the snap trim sleep option in addition might be
needed? I assume the loop that will be broken up with 9487 does not take
the sleep time into account?

> That commit is only in Giant at the moment. The backport to Dumpling is
> in the dumpling branch but not yet in a release, and the Firefly backport
> is still pending.

Holding my breath :)

Any thoughts on the other items I had in the original post?

> 2) Is there a way to get a decent approximation of how much work deleting
>    a specific snapshot will entail (in terms of objects, time, whatever)?
> 3) Would SSD journals help here? Or any other hardware configuration
>    change for that matter?

Thanks!
Daniel
Re: [ceph-users] Ceph Degraded
Hi Andrei!

I had a similar setting with replicated size 2 and min_size also 2.
Changing that didn't change the status of the cluster. I've also tried to
remove the pools and recreate them, without success. Removing and re-adding
the OSDs also didn't have any influence! Therefore, and since I didn't have
any data at all, I performed a forced recreate on all PGs, and after that
things went back to normal.

Thanks for your reply!

Best,

George

On Sat, 29 Nov 2014 11:39:51 +0000 (GMT), Andrei Mikhailovsky wrote:
> I think I had a similar issue recently when I added a new pool. All pgs
> that corresponded to the new pool were shown as degraded/unclean. After
> doing a bit of testing I realized that my issue was down to this:
>
>     replicated size 2
>     min_size 2
>
> The replicated size and min size were the same. In my case, I've got 2
> osd servers with a total replica count of 2. The min size should be set
> to 1, so that the cluster would still work with at least one replica
> available. After I changed min_size to 1, the cluster sorted itself out.
> Try doing this for your pools.
>
> Andrei
>
> ----- FROM: Georgios Dimitrakakis
> TO: ceph-users@lists.ceph.com
> SENT: Saturday, 29 November, 2014 11:13:05 AM
> SUBJECT: [ceph-users] Ceph Degraded
>
>> Hi all!
>>
>> I am setting up a new cluster with 10 OSDs and the state is degraded!
>>
>>     # ceph health
>>     HEALTH_WARN 940 pgs degraded; 1536 pgs stuck unclean
>>
>> There are only the default pools:
>>
>>     # ceph osd lspools
>>     0 data,1 metadata,2 rbd,
>>
>> with each one having 512 pg_num and 512 pgp_num:
>>
>>     # ceph osd dump | grep replic
>>     pool 0 'data' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 286 flags hashpspool crash_replay_interval 45 stripe_width 0
>>     pool 1 'metadata' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 287 flags hashpspool stripe_width 0
>>     pool 2 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 288 flags hashpspool stripe_width 0
>>
>> No data yet, so is there something I can do to repair it as it is?
>>
>> Best regards,
>>
>> George
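For later readers, the two remedies discussed in this thread look roughly like this (pool names taken from the quoted output above; note that force_create_pg discards whatever data a PG had, so it is only sensible on an empty cluster or for PGs whose data is already lost):

    # relax min_size on the affected pools
    ceph osd pool set data min_size 1
    ceph osd pool set metadata min_size 1
    ceph osd pool set rbd min_size 1

    # last resort on an empty cluster: force-recreate a stuck PG
    ceph health detail | grep stuck    # find the PG IDs
    ceph pg force_create_pg 0.1a       # placeholder PG ID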
Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?
> Does it work with virtio-blk if you attach the RBD as a LUN?

virtio-blk doesn't support discard and trimming.

> Supposedly, SCSI pass-through works in this mode, e.g.

SCSI pass-through works only with virtio-scsi, not virtio-blk.

> However, it seems that virtio-scsi is slowly becoming preferred over
> virtio-blk. Are there any disadvantages to using virtio-scsi now?

It's a little bit slower sometimes (but it can be faster than virtio-blk
with multiqueue and iSCSI passthrough). With librbd, I see a small slowdown
vs virtio-blk (maybe 20% slower).

> Does it support live migration?

Yes, of course.

----- Original Message -----
From: Daniel Swarbrick <daniel.swarbr...@profitbricks.com>
To: ceph-users@lists.ceph.com
Sent: Monday, 1 December 2014 13:32:15
Subject: Re: [ceph-users] Fastest way to shrink/rewrite rbd image ?

> [...]
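To complement the libvirt fragment earlier in this digest, the equivalent raw QEMU invocation is sketched below; image and device IDs are placeholders, and discard=unmap needs QEMU 1.5 or newer as far as I recall:

    qemu-system-x86_64 ... \
      -device virtio-scsi-pci,id=scsi0 \
      -drive file=rbd:pool/vmdisk,format=raw,if=none,id=drive0,discard=unmap \
      -device scsi-hd,bus=scsi0.0,drive=drive0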
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Yeah, it's mainly used in test environments.

On Mon, Dec 1, 2014 at 6:29 PM, Satoru Funai <satoru.fu...@gmail.com> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] LevelDB support status is still experimental on Giant?
We have tested it for a while; basically it seems kind of stable but shows
terribly bad performance. This is not the fault of Ceph but of LevelDB -
or, more generally, of all K-V storage with an LSM design (RocksDB, etc.).
The LSM tree structure naturally introduces very large write amplification,
10X to 20X, when you have tens of GB of data per OSD. So you always see
very bad sequential write performance (~200MB/s for a 12-SSD setup). We can
share more details in the performance meeting.

To this end, a key-value backend with LevelDB is not usable for RBD, but
it may be workable (not tested) in LOSF cases (tons of small objects stored
via rados, where a K-V backend can prevent the FS metadata from becoming
the bottleneck).

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai Wang
Sent: Monday, December 1, 2014 9:48 PM
To: Satoru Funai
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

> [...]
[ceph-users] do I have to use sudo for CEPH install
Hi.

Do I have to install sudo in Debian Wheezy to deploy Ceph successfully? I
don't normally use sudo.

Thank you,

Jiri
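For reference: when run as a non-root user, ceph-deploy invokes sudo on the remote nodes (the "connection detected need for sudo" lines elsewhere in this digest show exactly that), so in practice you want it installed. A minimal passwordless setup sketch, assuming the deploy user is called "ceph":

    apt-get install sudo
    echo "ceph ALL = (root) NOPASSWD:ALL" > /etc/sudoers.d/ceph
    chmod 0440 /etc/sudoers.d/ceph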
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Exactly. I'm just looking forward to a better DB backend suitable for
KeyValueStore - maybe a traditional B-tree design. I originally thought
Kinetic would be a good backend, but it doesn't support range queries :-(

On Mon, Dec 1, 2014 at 10:04 PM, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] Compile from source with Kinetic support
Hmm, src/os/KeyValueDB.cc lacks these lines:

    #ifdef WITH_KINETIC
    #include "KineticStore.h"
    #endif

On Mon, Dec 1, 2014 at 6:14 PM, Julien Lutran <julien.lut...@ovh.net> wrote:
> [...]

--
Best Regards,

Wheat
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Range query is not that important on today's SSDs; you can see very high
random read IOPS in SSD specs, and they get higher day by day. The key
problem here is trying to match one query (get/put) exactly to one SSD IO
(read/write), eliminating the read/write amplification. We kind of believe
OpenNvmKV may be the right approach.

Back to the context of Ceph: can we find a use case for today's key-value
backend? We would like to learn from the community what the workload
pattern is if you want a K-V backed Ceph - or is it just to have a try? I
think before we get a suitable DB backend, we had better optimize the
key-value backend code to support a specific kind of load.

From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Monday, December 1, 2014 10:14 PM
To: Chen, Xiaoxi
Cc: Satoru Funai; ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

> [...]
Re: [ceph-users] Removing Snapshots Killing Cluster Performance
Thanks for your input. We will see what we can find out with the logs and
how to proceed from there.
Re: [ceph-users] To clarify requirements for Monitors
Thank you, Paulo. Metadata = MDS, so the metadata server should have CPU
power.

--Roman

On 14-11-28 05:34 PM, Paulo Almeida wrote:
> On Fri, 2014-11-28 at 16:37 -0500, Roman Naumenko wrote:
>> And if I understand correctly, monitors are the access points to the
>> cluster, so they should provide enough aggregated network output for
>> all connected clients, based on the number of OSDs in the cluster?
>
> I'm not sure what you mean by "access points to the cluster", but the
> monitors only provide the cluster map to the client, which then
> communicates directly with the OSDs. Quoting the documentation[1]:
>
> "Ceph eliminates the centralized gateway to enable clients to interact
> with Ceph OSD Daemons directly. (...) Before Ceph Clients can read or
> write data, they must contact a Ceph Monitor to obtain the most recent
> copy of the cluster map."
>
> [1] http://ceph.com/docs/master/architecture/
>
> Cheers,
> Paulo
[ceph-users] Rsync mirror for repository?
Is there a place I can download the entire repository for Giant? I'm really
just looking for an rsync server that presents all the files here:
http://download.ceph.com/ceph/giant/centos6.5/

I know that eu.ceph.com runs one, but I'm not sure how up to date that is
(because of http://eu.ceph.com/rpm-giant/el6/x86_64/ - it has two versions
in that directory). Ceph is fairly critical to us, so we don't want to rely
on an external mirror (we've had issues with other software where the files
on the external mirror suddenly became broken). For now, I downloaded it
via `wget -r`, but this really isn't ideal.

I already tried:

    $ rsync rsync://download.ceph.com
    rsync: failed to connect to download.ceph.com: Connection refused (111)

    $ rsync rsync://ceph.com --contimeout=2
    rsync error: timeout waiting for daemon connection (code 35) at socket.c(279) [receiver=3.0.6]
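One hedged workaround while download.ceph.com refuses rsync connections: mirror from eu.ceph.com, which does run an rsync daemon; the module path below is a guess, so list the available modules first:

    # list the modules the eu mirror exposes
    rsync rsync://eu.ceph.com/

    # then pull the tree you need (module/path assumed)
    rsync -avP rsync://eu.ceph.com/ceph/rpm-giant/el6/ ./mirror/rpm-giant/el6/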
[ceph-users] Problems with pgs incomplete
Hi all,

I have a Ceph cluster + rgw. Now I have a problem with one of the OSDs;
it's down now. I check ceph status and see this information:

    [root@node-1 ceph-0]# ceph -s
        cluster fc8c3ecc-ccb8-4065-876c-dc9fc992d62d
         health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
         monmap e1: 3 mons at {a=10.29.226.39:6789/0,b=10.29.226.29:6789/0,c=10.29.226.40:6789/0}, election epoch 294, quorum 0,1,2 b,a,c
         osdmap e418: 6 osds: 5 up, 5 in
          pgmap v23588: 312 pgs, 16 pools, 141 kB data, 594 objects
                5241 MB used, 494 GB / 499 GB avail
                     308 active+clean
                       4 incomplete

Why am I having 4 pgs incomplete in pool .rgw.buckets if I have replicated
size 2 and min_size 2?

My osd tree:

    [root@node-1 ceph-0]# ceph osd tree
    # id  weight  type name                 up/down  reweight
    -1    4       root croc
    -2    4         region ru
    -4    3           datacenter vol-5
    -5    1             host node-1
    0     1               osd.0             down     0
    -6    1             host node-2
    1     1               osd.1             up       1
    -7    1             host node-3
    2     1               osd.2             up       1
    -3    1           datacenter comp
    -8    1             host node-4
    3     1               osd.3             up       1
    -9    1             host node-5
    4     1               osd.4             up       1
    -10   1             host node-6
    5     1               osd.5             up       1

Additional information:

    [root@node-1 ceph-0]# ceph health detail
    HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
    pg 13.6 is stuck inactive for 1547.665758, current state incomplete, last acting [1,3]
    pg 13.4 is stuck inactive for 1547.652111, current state incomplete, last acting [1,2]
    pg 13.5 is stuck inactive for 4502.009928, current state incomplete, last acting [1,3]
    pg 13.2 is stuck inactive for 4501.979770, current state incomplete, last acting [1,3]
    pg 13.6 is stuck unclean for 4501.969914, current state incomplete, last acting [1,3]
    pg 13.4 is stuck unclean for 4502.001114, current state incomplete, last acting [1,2]
    pg 13.5 is stuck unclean for 4502.009942, current state incomplete, last acting [1,3]
    pg 13.2 is stuck unclean for 4501.979784, current state incomplete, last acting [1,3]
    pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.6 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.4 is incomplete, acting [1,2] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')
    pg 13.5 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')

    [root@node-1 ceph-0]# ceph osd dump | grep 'pool'
    pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
    pool 1 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 34 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 2 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 36 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 3 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 38 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 4 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 flags hashpspool stripe_width 0
    pool 5 '.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 40 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 6 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 42 owner 18446744073709551615 flags hashpspool stripe_width 0
    pool 7 '.users' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 44 flags hashpspool stripe_width 0
    pool 8 '.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 46 flags hashpspool stripe_width 0
    pool 9 '.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 48 flags hashpspool stripe_width 0
    pool 10 'test' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 136 pgp_num 136 last_change 68 flags hashpspool stripe_width 0
    pool 11 '.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0
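When diagnosing incomplete PGs like these, querying one of the PGs from the health output usually shows what is blocking peering; a short sketch using an ID from above:

    ceph pg 13.2 query          # inspect the "recovery_state" section
    ceph pg dump_stuck inactive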
Re: [ceph-users] Problems with pgs incomplete
Hi!

I had a very similar issue a few days ago. For me it wasn't too much of a
problem since the cluster was new, without data, and I could force-recreate
the PGs. I really hope that in your case it won't be necessary to do the
same thing.

As a first step, try to reduce min_size from 2 to 1 as suggested for the
.rgw.buckets pool and see if this can bring your cluster back to health.

Regards,

George

On Mon, 01 Dec 2014 17:09:31 +0300, Butkeev Stas wrote:
> [...]
Re: [ceph-users] Problems with pgs incomplete
On Mon, Dec 01, 2014 at 05:09:31PM +0300, Butkeev Stas wrote:
> [...]
Re: [ceph-users] Revisiting MDS memory footprint
I meant to chime in earlier here, but then the weekend happened; comments
inline.

On Sun, Nov 30, 2014 at 7:20 PM, Wido den Hollander <w...@42on.com> wrote:
> Why would you want all CephFS metadata in memory? With any filesystem
> that will be a problem.

The latency associated with a cache miss (a RADOS OMAP dirfrag read) is
fairly high, so the goal when sizing will be to allow the MDSs to keep a
very large proportion of the metadata in RAM. In a local FS, the filesystem
metadata in RAM is relatively small, and the speed to disk is relatively
high. In CephFS, that is reversed: we want to compensate for the cache miss
latency by having lots of RAM in the MDS and a big cache.

Hot-standby MDSs are another manifestation of the expected large cache: we
expect these caches to be big, to the point where refilling from the
backing store on a failure would be annoyingly slow, and it's worth keeping
that hot standby cache.

Also, remember that because we embed inodes in dentries, when we load a
directory fragment we are also loading all the inodes in that directory
fragment - if you have only one file open, but it has an ancestor with lots
of files, then you'll have more files in cache than you might have
expected.

> We do however need a good rule of thumb of how much memory is used for
> each inode.

Yes - and ideally some practical measurements too :-)

One important point that I don't think anyone has mentioned so far: the
memory consumption per inode depends on how many clients have capabilities
on the inode. So if many clients hold a read capability on a file, more
memory will be used MDS-side for that file. If designing a benchmark for
this, the client count and the level of overlap in the client workloads
would be an important dimension.

The number of *open* files on clients strongly affects the ability of the
MDS to trim its cache, since the MDS pins in cache any inode which is in
use by a client. We recently added health checks so that the MDS can
complain about clients that are failing to respond to requests to trim
their caches, and the way we test this is to have a client obstinately keep
some number of files open.

We also allocate memory for pending metadata updates (so-called 'projected
inodes') while they are in the journal, so the memory usage will depend on
the journal size and the number of writes in flight.

It would be really useful to come up with a test script that monitors MDS
memory consumption as a function of the number of files in cache, the
number of files opened by clients, and the number of clients opening the
same files. I feel a 3D chart plot coming on :-)

Cheers,

John
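A rough starting point for such a script is sketched below; the mount point, batch sizes, and the assumption of a single local ceph-mds process are all placeholders:

    #!/bin/bash
    # Grow the file count in batches and record MDS RSS after each batch.
    MNT=/mnt/cephfs/memtest
    mkdir -p "$MNT"
    for batch in $(seq 1 100); do
        for i in $(seq 1 1000); do
            touch "$MNT/f_${batch}_${i}"
        done
        # resident set size of the (assumed single) local ceph-mds, in kB
        rss_kb=$(ps -o rss= -C ceph-mds | head -1)
        echo "$((batch * 1000)) files: ${rss_kb} kB RSS" >> mds_mem.log
    done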
Re: [ceph-users] Problems with pgs incomplete
On 01/12/2014 15:09, Butkeev Stas wrote: pg 13.2 is incomplete, acting [1,3] (reducing pool .rgw.buckets min_size from 2 may help; search ceph.com/docs for 'incomplete')

The answer is in the logs: your .rgw.buckets pool is using min_size = 2, so it doesn't have enough valid pg replicas to start recovering. IIRC from past messages on this list, you must have at least min_size replicas available to recover from a failed OSD, as Ceph doesn't try to use the available data to recover if min_size isn't met. I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process. Best regards, Lionel Bouton
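For anyone following along, the advice above translates to something like this (pool name taken from the health output):

ceph osd pool set .rgw.buckets min_size 1
# ...wait for recovery to finish, then restore the original value:
ceph osd pool set .rgw.buckets min_size 2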
Re: [ceph-users] Problems with pgs incomplete
On 01/12/2014 17:08, Lionel Bouton wrote: I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process.

Ignore this part: I wasn't paying enough attention to the osd tree output and mixed up the osd/host levels. Others have pointed out that you have size = 3 for some pools. In this case you might have lost an OSD before a previous recovery finished, which would explain your current state (in this case my earlier advice still applies). Best regards, Lionel
Re: [ceph-users] Revisiting MDS memory footprint
On Fri, Nov 28, 2014 at 1:48 PM, Florian Haas flor...@hastexo.com wrote: Out of curiosity: would it matter at all whether or not a significant fraction of the files in CephFS were hard links? Clearly the only thing that differs in metadata between individual hard-linked files is the file name, but I wonder if the Ceph MDS actually takes this into consideration. In other words, I'm not sure whether the MDS simply adds another pointer to the same set of metadata, or whether that set of metadata is actually duplicated in MDS memory. I am guessing the latter, but it would be nice to be sure.

When we load a hard link dentry (in CDir::_omap_fetched), if we already have the inode in cache then we just refer to that copy -- we never have two of the same inode (CInode object) in memory. If we don't have the inode in cache, then the inode isn't loaded until someone tries to traverse the dentry (i.e. touch the file in any way), at which point we go to fetch the backtrace from the RADOS object for that file. So hard links may incur less memory overhead when loading a directory fragment, but you will take an I/O hammering when dereferencing them if the linked inode is not already in cache, as each individual hard link has to be followed via a separate RADOS object. In general I would be very cautious about workloads that do a lot of reads of cold hard linked files, e.g. if benchmarking this case for backups then you should try to create the hard links, let the files fall out of cache, then observe the performance of a restore where many hard links are being dereferenced via backtraces. I'm mostly reading this from the code rather than from memory, so I'm sure Greg or Sage will jump in if I'm getting any of these cases wrong. Cheers, John
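A rough sketch of the benchmark John describes, with made-up paths, counts and monitor address; the point is simply to create the links, go cold, then time a traversal that dereferences them:

# create hard links to a large set of existing files
for i in $(seq 1 100000); do
    ln /mnt/cephfs/data/file$i /mnt/cephfs/backup/file$i
done
# let the files fall out of cache, e.g. remount the client
# (and restart the MDS so its cache is cold too)
umount /mnt/cephfs
mount -t ceph mon1:6789:/ /mnt/cephfs
# time a 'restore' pass that follows every hard link via its backtrace
time tar cf /dev/null /mnt/cephfs/backup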
Re: [ceph-users] Problems with pgs incomplete
Thank you Lionel. Indeed, I had forgotten about the size/min_size requirement. I have set min_size to 1 and my cluster is UP now. I have deleted the crashed OSD and have set size to 3 and min_size to 2. --- With regards, Stanislav

01.12.2014, 19:15, Lionel Bouton lionel-subscript...@bouton.name: On 01/12/2014 17:08, Lionel Bouton wrote: I may be wrong here (I'm surprised you only have 4 incomplete pgs; I'd expect ~1/3rd of your pgs to be incomplete given your ceph osd tree output), but reducing min_size to 1 should be harmless and should unfreeze the recovery process. Ignore this part: I wasn't paying enough attention to the osd tree output and mixed up the osd/host levels. Others have pointed out that you have size = 3 for some pools. In this case you might have lost an OSD before a previous recovery finished, which would explain your current state (in this case my earlier advice still applies). Best regards, Lionel
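For the archives, the sequence Stas describes corresponds roughly to the following (OSD id and pool name taken from earlier in the thread):

ceph osd pool set .rgw.buckets min_size 1
# once recovery completes, remove the dead OSD...
ceph osd out 0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0
# ...and restore the stronger replication settings
ceph osd pool set .rgw.buckets size 3
ceph osd pool set .rgw.buckets min_size 2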
[ceph-users] How to see which crush tunables are active in a ceph-cluster?
Hi all, http://ceph.com/docs/master/rados/operations/crush-map/#crush-tunables describes how to set the tunables to legacy, argonaut, bobtail, firefly or optimal. But how can I see which profile is active in a Ceph cluster? With ceph osd getcrushmap I don't get much info (only tunable choose_local_tries 0, tunable choose_local_fallback_tries 0, tunable choose_total_tries 50). Udo
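Two ways that should answer this; the first subcommand may be absent on older releases, and on recent ones its JSON output also names the matching profile:

ceph osd crush show-tunables
# or decompile the map; the tunable lines appear at the top of the text file
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
head /tmp/crushmap.txt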
Re: [ceph-users] Compile from source with Kinetic support
On 11/28/14 7:04 AM, Haomai Wang wrote: Yeah, the ceph source repo doesn't contain the Kinetic header files and library source; you need to install the kinetic devel package separately.

Hi Haomai, I'm wondering if we need AC_CHECK_HEADER([kinetic/kinetic.h], ...) in configure.ac to double-check when the user specifies --with-kinetic? It might help to avoid some user confusion if we can have ./configure bail out early instead of continuing all the way through the build. Something like this? (completely untested)

--- a/configure.ac
+++ b/configure.ac
@@ -557,7 +557,13 @@ AC_ARG_WITH([kinetic],
 #AS_IF([test "x$with_kinetic" = "xyes"],
 #      [PKG_CHECK_MODULES([KINETIC], [kinetic_client], [], [true])])
 AS_IF([test "x$with_kinetic" = "xyes"],
-      [AC_DEFINE([HAVE_KINETIC], [1], [Defined if you have kinetic enabled])])
+      [AC_CHECK_HEADER([kinetic/kinetic.h],
+        [AC_DEFINE(
+          [HAVE_KINETIC], [1], [Defined if you have kinetic enabled])],
+        [AC_MSG_FAILURE(
+          [Can't find kinetic headers; please install them])
+      )]
+])
 AM_CONDITIONAL(WITH_KINETIC, [ test "$with_kinetic" = "yes" ])
Re: [ceph-users] Compile from source with Kinetic support
Sorry, it didn't change anything:

root@host:~/sources/ceph# head -12 src/os/KeyValueDB.cc
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
// vim: ts=8 sw=2 smarttab
#include "KeyValueDB.h"
#include "LevelDBStore.h"
#ifdef HAVE_LIBROCKSDB
#include "RocksDBStore.h"
#endif
#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif

root@host:~/sources/ceph# make
[...]
  CXX      os/libos_la-KeyValueDB.lo
os/KeyValueDB.cc: In static member function 'static KeyValueDB* KeyValueDB::create(CephContext*, const string&, const string&)':
os/KeyValueDB.cc:21:16: error: expected type-specifier before 'KineticStore'
     return new KineticStore(cct);
                ^
os/KeyValueDB.cc:21:16: error: expected ';' before 'KineticStore'
os/KeyValueDB.cc:21:32: error: 'KineticStore' was not declared in this scope
     return new KineticStore(cct);
                                ^
os/KeyValueDB.cc: In static member function 'static int KeyValueDB::test_init(const string&, const string&)':
os/KeyValueDB.cc:39:12: error: 'KineticStore' has not been declared
     return KineticStore::_test_init(g_ceph_context);
            ^
make[3]: *** [os/libos_la-KeyValueDB.lo] Error 1

On 12/01/2014 03:22 PM, Haomai Wang wrote:
#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif
Re: [ceph-users] Compile from source with Kinetic support
Sorry, it's a typo: s/WITH_KINETIC/HAVE_KINETIC/ :-)

On Tue, Dec 2, 2014 at 12:51 AM, Julien Lutran julien.lut...@ovh.net wrote: Sorry, it didn't change anything:

#ifdef WITH_KINETIC
#include "KineticStore.h"
#endif

[...]

-- Best Regards, Wheat
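Applying that substitution, the guard in src/os/KeyValueDB.cc should test the macro that configure actually defines:

#ifdef HAVE_KINETIC
#include "KineticStore.h"
#endif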
Re: [ceph-users] Compile from source with Kinetic support
On Tue, Dec 2, 2014 at 12:38 AM, Ken Dreyer kdre...@redhat.com wrote: [...]

Yeah, it's better. Could anyone help add this? You can close https://github.com/ceph/ceph/pull/3046 and create a PR. I don't have a std-c++11 env to test it at all :-( -- Best Regards, Wheat
Re: [ceph-users] Optimal or recommended threads values
I'm still using the default values, mostly because I haven't had time to test.

On Thu, Nov 27, 2014 at 2:44 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hi Craig, Are you keeping the filestore, disk and op threads at their default values, or did you also change them? Cheers

Tuning these values depends on a lot more than just the SSDs and HDDs. Which kernel and IO scheduler are you using? Does your HBA do write caching? It also depends on what your goals are. Tuning for a RadosGW cluster is different than for an RBD cluster. The short answer is that you are the only person that can tell you what your optimal values are. As always, the best benchmark is production load. In my small cluster (5 nodes, 44 osds), I'm optimizing to minimize latency during recovery. When the cluster is healthy, bandwidth and latency are more than adequate for my needs. Even with journals on SSDs, I've found that reducing the number of operations and threads has reduced my average latency. I use injectargs to try out new values while I monitor cluster latency. I monitor latency while the cluster is healthy and recovering. If a change is deemed better, only then will I persist the change to ceph.conf. This gives me a fallback: any change that causes massive problems can be undone with a restart or reboot. So far, the configs that I've written to ceph.conf are:

[global]
mon osd down out interval = 900
mon osd min down reporters = 9
mon osd min down reports = 12
osd pool default flag hashpspool = true

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

I have it on my list to investigate filestore max sync interval. And now that I've pasted that, I need to revisit the min down reports/reporters. I have some nodes with 10 OSDs, and I don't want any one node to be able to mark the rest of the cluster as down (it happened once).

On Sat, Nov 22, 2014 at 6:24 AM, Andrei Mikhailovsky and...@arhont.com wrote: Hello guys, Could someone comment on the optimal or recommended values for the various thread settings in ceph.conf? At the moment I have the following settings:

filestore_op_threads = 8
osd_disk_threads = 8
osd_op_threads = 8
filestore_merge_threshold = 40
filestore_split_multiple = 8

Are these reasonable for a small cluster made of 7.2K SAS disks with SSD journals at a ratio of 4:1? What are the settings that other people are using? Thanks Andrei
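A minimal sketch of the injectargs workflow Craig describes (values are examples; the change is live and does not persist across restarts):

# try lower recovery settings on all OSDs without touching ceph.conf
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# watch latency while healthy and while recovering; only if the change
# helps, write the same values into ceph.conf so they survive a restart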
Re: [ceph-users] ceph-fs-common ceph-mds on ARM Raspberry Debian 7.6
Hi Paulo, Thanks a lot. I've just added the backports line below into /etc/apt/sources.list:

deb http://ftp.debian.org/debian/ wheezy-backports main

and ran:

apt-get update

But ceph-deploy still threw the errors. So I installed the packages manually (to take them from wheezy-backports):

apt-get -t wheezy-backports install ceph ceph-mds ceph-common ceph-fs-common gdisk

And ceph-deploy is now OK:

root@socrate:~/cluster# ceph-deploy install socrate.flox-arts.in
…
[socrate.flox-arts.in][DEBUG ] ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)

Thanks Florent Monthel

On 1 Dec 2014, at 00:03, Paulo Almeida palme...@igc.gulbenkian.pt wrote: Hi, You should be able to use the wheezy-backports repository, which has ceph 0.80.7. Cheers, Paulo

On Sun, 2014-11-30 at 19:31 +0100, Florent MONTHEL wrote: Hi, I'm trying to deploy CEPH (with ceph-deploy) on Raspberry Debian 7.6 and I get the below error on the ceph-deploy install command:

[socrate.flox-arts.in][INFO ] Running command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get -q -o Dpkg::Options::=--force-confnew --no-install-recommends --assume-yes install -- ceph ceph-mds ceph-common ceph-fs-common gdisk
[socrate.flox-arts.in][DEBUG ] Reading package lists...
[socrate.flox-arts.in][DEBUG ] Building dependency tree...
[socrate.flox-arts.in][DEBUG ] Reading state information...
[socrate.flox-arts.in][WARNIN] E: Unable to locate package ceph-mds
[socrate.flox-arts.in][WARNIN] E: Unable to locate package ceph-fs-common
[socrate.flox-arts.in][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get -q -o Dpkg::Options::=--force-confnew --no-install-recommends --assume-yes install -- ceph ceph-mds ceph-common ceph-fs-common gdisk

Do you know how I can get these 2 packages on this platform? Thanks Florent Monthel
Re: [ceph-users] do I have to use sudo for CEPH install
You have to be a root user, either via login, su or sudo. So no, you don't have to use sudo - just log on as root.

On 2 December 2014 at 00:05, Jiri Kanicky ji...@ganomi.com wrote: Hi. Do I have to install sudo in Debian Wheezy to deploy CEPH successfully? I don't normally use sudo. Thank you Jiri

-- Lindsay
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote:

I've been deleting a bucket which originally had 60TB of data in it; with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 Browser, and I can see the bucket usage is now down to around 2.5TB, or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include-all) and it just reports square brackets []. I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df', the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets' usage shows that the space has been freed from the buckets, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at?

Can you run 'radosgw-admin gc list --include-all'? Yehuda

I've done it before, and it just returns square brackets [] (see below): radosgw-admin gc list --include-all []

Do you know which of the rados pools has all that extra data? Try to list that pool's objects and verify that there are no surprises there (e.g., use 'rados -p <pool> ls'). Yehuda

I'm just running that command now, and it's taking some time. There is a large number of objects. Once it has finished, what should I be looking for?

I assume the pool in question is the one that holds your objects' data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, composing a list of the bucket prefixes for the existing buckets, and then looking for objects that have different prefixes. Yehuda

Any ideas? I've found the prefix; the number of objects in the pool that match that prefix is in the 21 millions. The actual 'radosgw-admin bucket stats' command reports the bucket as only having 1.2 million.

Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have far too many of these. You can try to check whether there are pending multipart uploads that were never completed, using the S3 API. At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this:
1. rados ls -p <data pool> (keep the list sorted)
2. list objects in the bucket
3. for each object in (2), do: radosgw-admin object stat --bucket=<bucket> --object=<object> --rgw-cache-enabled=false (disabling the cache so that it goes quicker)
4. look at the result of (3), and generate a list of all the parts
5. sort the result of (4) and compare it to (1)
Note that if you're running firefly or later, the raw objects are not specified explicitly in the output of the command you run at (3), so you might need a different procedure, e.g., find out the random string that the raw objects use, remove it from the list generated in (1), etc. That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by:
- create an object (let's say ~10MB in size)
- radosgw-admin object stat --bucket=<bucket> --object=<object> (keep this info, see below)
- remove the object
- run radosgw-admin gc list --include-all and verify that the raw parts are listed there
- wait a few hours, repeat the last step, and see that the parts don't appear there anymore
- run rados -p <pool> ls, and check whether the raw objects still exist
Yehuda

Not sure where to go from here, and our cluster is slowly filling up while not clearing any space. I did the last section: I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=<bucket>
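A rough, untested sketch of steps (1)-(5), with placeholder pool/bucket names and an assumed JSON layout for the bucket listing (adjust the field extraction to whatever radosgw-admin actually prints on your version):

# (1) all raw objects in the data pool, sorted
rados ls -p .rgw.buckets | sort > raw_objects.txt
# (2) all logical objects in the bucket
radosgw-admin bucket list --bucket=mybucket > bucket_listing.json
grep '"name"' bucket_listing.json | cut -d'"' -f4 > bucket_objects.txt
# (3) stat every object with the cache disabled
while read -r obj; do
    radosgw-admin object stat --bucket=mybucket --object="$obj" --rgw-cache-enabled=false
done < bucket_objects.txt > stat_dump.txt
# (4) extract the raw part names from stat_dump.txt into expected_raw.txt, then
# (5) anything in the pool that no logical object claims is an orphan candidate
sort expected_raw.txt > expected_sorted.txt
comm -23 raw_objects.txt expected_sorted.txt > orphan_candidates.txt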
Re: [ceph-users] Radosgw agent only syncing metadata
On 25/11/14 12:40, Mark Kirkwood wrote: On 25/11/14 11:58, Yehuda Sadeh wrote: On Mon, Nov 24, 2014 at 2:43 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 22/11/14 10:54, Yehuda Sadeh wrote: On Thu, Nov 20, 2014 at 6:52 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

Fri Nov 21 02:13:31 2014 x-amz-copy-source:bucketbig/_multipart_big.dat.2/fjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta /bucketbig/__multipart_big.dat.2%2Ffjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta
2014-11-21 15:13:31.914925 7fb5e3f87700 15 generated auth header: AWS us-west key:tk7RgBQMD92je2Nz1m2D/GV+VNM=
2014-11-21 15:13:31.914964 7fb5e3f87700 20 sending request to http://ceph2:80/bucketbig/__multipart_big.dat.2%2Ffjid6CneDQYKisHf0pRFOT5cEWF_EQr.meta?rgwx-uid=us-west&rgwx-region=us&rgwx-prepend-metadata=us
2014-11-21 15:13:31.920510 7fb5e3f87700 10 receive_http_header
2014-11-21 15:13:31.920525 7fb5e3f87700 10 received header:HTTP/1.1 411 Length Required

It looks like you're running the wrong fastcgi module. Yehuda

Thanks Yehuda - so what would be the right fastcgi? Do you mean http://gitbuilder.ceph.com/libapache-mod-fastcgi-deb-precise-x86_64-basic/ref/master/ ?

This one should work, yeah.

Looks like that was the issue:

$ rados df | grep bucket
.us-east.rgw.buckets          -  93740  24  0  0  0  349  3746  216  93740
.us-east.rgw.buckets.index    -      0   1  0  0  0   24    25   27      0
.us-west.rgw.buckets          -  93740  24  0  0  0    0     0  215  93740
.us-west.rgw.buckets.index    -      0   1  0  0  0   19    18   19      0

Now I reinstalled the Ceph-patched apache2 and fastcgi module (not sure if apache2 was needed as well):

$ cat /etc/apt/sources.list.d/ceph.list
...
deb http://gitbuilder.ceph.com/libapache-mod-fastcgi-deb-precise-x86_64-basic/ref/master/ precise main
deb http://gitbuilder.ceph.com/apache2-deb-precise-x86_64-basic/ref/master/ precise main

Now that I've got that working I'll look at getting a more complex setup.

Just for the record, using these apache and fastcgi modules seems to be the story - I've managed to run through the more complicated examples:
- zones in different ceph clusters
- zones in different regions
... and get replication working (on Ubuntu 12.04 and 14.04 with Ceph 0.87). Thanks for your help. I have some further questions that I'll ask in a new thread (as they are not really about 'how to make it work'). regards Mark
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. 
for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours, repeat last step, see that the parts don't appear there anymore - run rados -p pool ls, check to see if the raw objects still exist Yehuda Not sure where to go from here, and our cluster is slowly filling up while not clearing any space. I
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. At the moment there's no easy way to figure out which raw objects are not supposed to exist. 
The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours, repeat last step, see that the parts don't appear there anymore - run rados -p pool ls, check to see if the raw objects still exist
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On Mon, Dec 1, 2014 at 4:23 PM, Ben b@benjackson.email wrote: On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. 
At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 11:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 4:23 PM, Ben b@benjackson.email wrote: On 2014-12-02 11:21, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:47 PM, Ben b@benjackson.email wrote: On 2014-12-02 10:40, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 3:20 PM, Ben b@benjackson.email wrote: On 2014-12-02 09:25, Yehuda Sadeh wrote: On Mon, Dec 1, 2014 at 2:10 PM, Ben b@benjackson.email wrote: On 2014-12-02 08:39, Yehuda Sadeh wrote: On Sat, Nov 29, 2014 at 2:26 PM, Ben b@benjackson.email wrote: On 29/11/14 11:40, Yehuda Sadeh wrote: On Fri, Nov 28, 2014 at 1:38 PM, Ben b@benjackson.email wrote: On 29/11/14 01:50, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 9:22 PM, Ben b@benjackson.email wrote: On 2014-11-28 15:42, Yehuda Sadeh wrote: On Thu, Nov 27, 2014 at 2:15 PM, b b@benjackson.email wrote: On 2014-11-27 11:36, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:49 PM, b b@benjackson.email wrote: On 2014-11-27 10:21, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 3:09 PM, b b@benjackson.email wrote: On 2014-11-27 09:38, Yehuda Sadeh wrote: On Wed, Nov 26, 2014 at 2:32 PM, b b@benjackson.email wrote: I've been deleting a bucket which originally had 60TB of data in it, with our cluster doing only 1 replication, the total usage was 120TB. I've been deleting the objects slowly using S3 browser, and I can see the bucket usage is now down to around 2.5TB or 5TB with duplication, but the usage in the cluster has not changed. I've looked at garbage collection (radosgw-admin gc list --include all) and it just reports square brackets [] I've run radosgw-admin temp remove --date=2014-11-20, and it doesn't appear to have any effect. Is there a way to check where this space is being consumed? Running 'ceph df' the USED space in the buckets pool is not showing any of the 57TB that should have been freed up from the deletion so far. Running 'radosgw-admin bucket stats | jshon | grep size_kb_actual' and adding up all the buckets usage, this shows that the space has been freed from the bucket, but the cluster is all sorts of messed up. ANY IDEAS? What can I look at? Can you run 'radosgw-admin gc list --include-all'? Yehuda I've done it before, and it just returns square brackets [] (see below) radosgw-admin gc list --include-all [] Do you know which of the rados pools have all that extra data? Try to list that pool's objects, verify that there are no surprises there (e.g., use 'rados -p pool ls'). Yehuda I'm just running that command now, and its taking some time. There is a large number of objects. Once it has finished, what should I be looking for? I assume the pool in question is the one that holds your objects data? You should be looking for objects that are not expected to exist anymore, and objects of buckets that don't exist anymore. The problem here is to identify these. I suggest starting by looking at all the existing buckets, compose a list of all the bucket prefixes for the existing buckets, and try to look whether there are objects that have different prefixes. Yehuda Any ideas? I've found the prefix, the number of objects in the pool that match that prefix numbers in the 21 millions The actual 'radosgw-admin bucket stats' command reports it as only having 1.2 million. Well, the objects you're seeing are raw objects, and since rgw stripes the data, it is expected to have more raw objects than objects in the bucket. Still, it seems that you have much too many of these. You can try to check whether there are pending multipart uploads that were never completed using the S3 api. 
At the moment there's no easy way to figure out which raw objects are not supposed to exist. The process would be like this: 1. rados ls -p data pool keep the list sorted 2. list objects in the bucket 3. for each object in (2), do: radosgw-admin object stat --bucket=bucket --object=object --rgw-cache-enabled=false (disabling the cache so that it goes quicker) 4. look at the result of (3), and generate a list of all the parts. 5. sort result of (4), compare it to (1) Note that if you're running firefly or later, the raw objects are not specified explicitly in the command you run at (3), so you might need a different procedure, e.g., find out the raw objects random string that is being used, remove it from the list generated in 1, etc.) That's basically it. I'll be interested to figure out what happened, why the garbage collection didn't work correctly. You could try verifying that it's working by: - create an object (let's say ~10MB in size). - radosgw-admin object stat --bucket=bucket --object=object (keep this info, see - remove the object - run radosgw-admin gc list --include-all and verify that the raw parts are listed there - wait a few hours,
[ceph-users] Incomplete PGs
Hi all, I have a problem with some incomplete pgs. Here's the backstory: I had a pool that I had accidentally left with a size of 2. On one of the osd nodes, the system hdd started to fail and I attempted to rescue it by sacrificing one of my osd nodes. That went ok and I was able to bring the node back up minus the one osd. Now I have 11 incomplete pgs. I believe these are mostly from the pool that only had size 2, but I can't tell for sure.

I found another thread on here that talked about using ceph_objectstore_tool to add or remove pg data to get out of an incomplete state. Let's start with the one pg I've been playing with; this is a loose description of where I've been. First I saw that it had the missing osd in "down_osds_we_would_probe" when I queried it, and some reading around told me to recreate the missing osd, so I did that. It (obviously) didn't have the missing data, but it took the pg from down+incomplete to just incomplete. Then I tried pg_force_create and that didn't seem to make a difference.

Some more googling then brought me to ceph_objectstore_tool, and I started to take a closer look at the results from pg query. I noticed that the list of probing osds gets longer and longer, till the end of the query has something like:

    probing_osds: [0, 3, 4, 16, 23, 26, 35, 41, 44, 51, 56],

So I took a look at those osds and noticed that some of them have data in the directory for the troublesome pg and others don't. So I tried picking the one with the *most* data and used ceph_objectstore_tool to export the pg. It was 6G, so a fair amount of data is still there. I then imported it (after removing) into all the others in that list. Unfortunately, it is still incomplete. I'm not sure what my next step should be here. Here's some other stuff from the query on it:

    info: {
      pgid: 0.63b,
      last_update: 50495'8246,
      last_complete: 50495'8246,
      log_tail: 20346'5245,
      last_user_version: 8246,
      last_backfill: MAX,
      purged_snaps: [],
      history: {
        epoch_created: 1,
        last_epoch_started: 51102,
        last_epoch_clean: 50495,
        last_epoch_split: 0,
        same_up_since: 68312,
        same_interval_since: 68312,
        same_primary_since: 68190,
        last_scrub: 28158'8240,
        last_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_deep_scrub: 28158'8240,
        last_deep_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_clean_scrub_stamp: 2014-11-18 17:08:49.368486},
      stats: {
        version: 50495'8246,
        reported_seq: 84279,
        reported_epoch: 69394,
        state: down+incomplete,
        last_fresh: 2014-12-01 23:23:07.355308,
        last_change: 2014-12-01 21:28:52.771807,
        last_active: 2014-11-24 13:37:09.784417,
        last_clean: 2014-11-22 21:59:49.821836,
        last_became_active: 0.00,
        last_unstale: 2014-12-01 23:23:07.355308,
        last_undegraded: 2014-12-01 23:23:07.355308,
        last_fullsized: 2014-12-01 23:23:07.355308,
        mapping_epoch: 68285,
        log_start: 20346'5245,
        ondisk_log_start: 20346'5245,
        created: 1,
        last_epoch_clean: 50495,
        parent: 0.0,
        parent_split_bits: 0,
        last_scrub: 28158'8240,
        last_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_deep_scrub: 28158'8240,
        last_deep_scrub_stamp: 2014-11-18 17:08:49.368486,
        last_clean_scrub_stamp: 2014-11-18 17:08:49.368486,
        log_size: 3001,
        ondisk_log_size: 3001,

Also, in the peering section, all the peers now have the same last_update, which makes me think it should just pick up and take off.

There is another thing I'm having problems with, and I'm not sure if it's related or not.
I set a crush map manually, as I have a mix of ssd and platter osds, and it seems to work when I set it; the cluster starts rebalancing, etc. But if I do a restart ceph-all on all my nodes, the crush map seems to revert to the one I didn't set. I don't know if it's being blocked from taking by these incomplete pgs, or if I'm missing a step to get it to "stick". It makes me think that when I'm stopping and starting these osds to use ceph_objectstore_tool on them, they may be getting out of sync with the cluster. Any insights would be greatly appreciated, Aaron
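For readers following along, the export/import sequence described above looks roughly like this. This is only a sketch: the pg id 0.63b and osd ids are taken from the post, the data/journal paths assume default locations, and the flags should be checked against the ceph_objectstore_tool build actually in use:

    # stop the OSD holding the most complete copy of the pg, then export it
    sudo service ceph stop osd.35
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-35 \
        --journal-path /var/lib/ceph/osd/ceph-35/journal \
        --op export --pgid 0.63b --file /tmp/0.63b.export

    # on each other OSD in probing_osds: remove the stale copy, then import
    sudo service ceph stop osd.41
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-41 \
        --journal-path /var/lib/ceph/osd/ceph-41/journal \
        --op remove --pgid 0.63b
    sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-41 \
        --journal-path /var/lib/ceph/osd/ceph-41/journal \
        --op import --file /tmp/0.63b.export
    sudo service ceph start osd.41

On the crush map reverting, one thing worth checking is the 'osd crush update on start' option: by default each osd re-declares its crush location when it starts, which can silently overwrite a hand-edited map. If that is what is happening here, setting it in ceph.conf should make the hand-set map stick:

    [osd]
    osd crush update on start = false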
Re: [ceph-users] Client forward compatibility
Dan van der Ster (Thu, Nov 20, 2014): Hi all, what is the compatibility/incompatibility of dumpling clients talking to firefly and giant clusters?

Gregory Farnum (Mon, Nov 24): We sadly don't have a good matrix about this yet, but in general you should assume that anything which changed the way the data is physically placed on the cluster will prevent them from communicating; if you don't enable those features, then they should remain compatible.

Dan: It would be good to have such a compat matrix, as I was confused, probably others are confused, and if I'm not wrong, even you are confused ... see below. In particular, I know that tunables=firefly will prevent dumpling clients from talking to a firefly cluster, but how about the existence or not of erasure pools?

Greg: As you mention, updating the tunables will prevent old clients from accessing them (although that shouldn't be the case in future, now that they're all set by the crush map for later interpretation). Erasure pools are a special case (precisely because people had issues with them), and you should be able to communicate with a cluster that has EC pools while using old clients.

Dan: That's what we'd hoped, but alas we get the same error mentioned here: http://tracker.ceph.com/issues/8178 In our case (0.67.11 clients talking to the latest firefly gitbuilder build) we get: protocol feature mismatch, my 407 peer 417 missing 10. By adding an EC pool, we lose connectivity for dumpling clients to even the replicated pools. The good news is that when we remove the EC pool, the 10 feature bit is removed, so dumpling clients can connect again. But nevertheless, it leaves open the possibility of accidentally breaking the users' access.

Greg: Yep. Sorry, apparently we tried to do this and didn't quite make it all the way. :/ We discussed last week trying to build and maintain a forward compatibility matrix, but haven't done it yet. There's one floating around somewhere in the docs for the kernel client, but a userspace one just hasn't been anything people have asked for previously, so we never thought of it. Meanwhile, I'm sure it's not the most pleasant way to do things, but if you go over the upgrade notes for each major release, they should include the possible break points. -Greg
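For anyone wanting to check the tunables side of this before letting old clients connect, the cluster's current profile can be inspected, and if need be dialed back, from an admin node (commands exist as of firefly; output fields vary by release):

    # show the crush tunables the cluster is advertising
    ceph osd crush show-tunables

    # revert to pre-bobtail behaviour so older clients can connect
    ceph osd crush tunables legacy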
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Andrei Mikhailovsky (Sun, Nov 30, 2014): Greg, thanks for your comment. Could you please share what OS, kernel, and nfs/cephfs settings you've used to achieve the stability you mention? Also, what kind of tests have you run to check that?

Gregory Farnum: We're just doing it on our testing cluster with the teuthology/ceph-qa-suite stuff in https://github.com/ceph/ceph-qa-suite/tree/master/suites/knfs/basic So that'll be running our ceph-client kernel, which I believe is usually a recent rc release with the new Ceph changes on top, with knfs exporting a kcephfs mount, and then running each of the tasks named in the tasks folder on top of a client of that knfs export. -Greg
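For anyone reproducing this arrangement by hand, knfs over a kernel CephFS mount amounts to roughly the following on the NFS server (a sketch; the monitor address, secret file, and fsid are placeholders):

    # kernel CephFS mount that will be exported
    mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # add to /etc/exports -- an explicit fsid is needed because cephfs
    # has no block device for knfs to derive one from:
    #   /mnt/cephfs  *(rw,no_subtree_check,fsid=100)

    exportfs -ra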
Re: [ceph-users] Revisiting MDS memory footprint
John Spray (Mon, Dec 1, 2014): I meant to chime in earlier here, but then the weekend happened; comments inline.

Wido den Hollander (Sun, Nov 30): Why would you want all CephFS metadata in memory? With any filesystem that will be a problem.

John: The latency associated with a cache miss (a RADOS OMAP dirfrag read) is fairly high, so the goal when sizing will be to allow the MDSs to keep a very large proportion of the metadata in RAM. In a local FS, the filesystem metadata in RAM is relatively small and the speed to disk is relatively high. In CephFS, that is reversed: we want to compensate for the cache-miss latency by having lots of RAM in the MDS and a big cache. Hot-standby MDSs are another manifestation of the expected large cache: we expect these caches to be big, to the point where refilling from the backing store on a failure would be annoyingly slow, and it's worth keeping that hot standby cache.

Gregory Farnum: I actually don't think the cache misses should be *dramatically* more expensive than local FS misses. They'll be larger since it's remote, and a leveldb lookup is a bit slower than hitting the right spot on disk, but everything's nicely streamed in and such, so it's not too bad. But I'm also making this up as much as you are the rest of it, which looks good to me. :) The one thing I'd also bring up is to be a bit more explicit about CephFS in-memory inode size having nothing to do with that of a local FS. We don't need to keep track of things like block locations, but we do keep track of file capabilities (leases) and a whole bunch of other state, like the scrubbing/fsck status of the inode (coming soon!), the clean/dirty status in a lot more detail than the kernel does, any old versions of the inode that have been snapshotted, etc. Once upon a time Sage did have some numbers indicating that a cached dentry took about 1KB, but things change in both directions pretty frequently, and memory use will likely be something we look at around the time we're wondering if we should declare CephFS ready for community use in production previews. -Greg

John: Also, remember that because we embed inodes in dentries, when we load a directory fragment we are also loading all the inodes in that directory fragment -- if you have only one file open, but it has an ancestor with lots of files, then you'll have more files in cache than you might have expected.

Wido: We do, however, need a good rule of thumb for how much memory is used per inode.

John: Yes -- and ideally some practical measurements too :-) One important point that I don't think anyone has mentioned so far: the memory consumption per inode depends on how many clients have capabilities on the inode. So if many clients hold a read capability on a file, more memory will be used MDS-side for that file. If designing a benchmark for this, the client count and the level of overlap in the client workloads would be an important dimension. The number of *open* files on clients strongly affects the ability of the MDS to trim its cache, since the MDS pins in cache any inode which is in use by a client. We recently added health checks so that the MDS can complain about clients that are failing to respond to requests to trim their caches, and the way we test this is to have a client obstinately keep some number of files open.
We also allocate memory for pending metadata updates (so-called 'projected inodes') while they are in the journal, so the memory usage will also depend on the journal size and the number of writes in flight. It would be really useful to come up with a test script that monitors MDS memory consumption as a function of the number of files in cache, the number of files opened by clients, and the number of clients opening the same files. I feel a 3D chart plot coming on :-) Cheers, John
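As a starting point for the kind of script John describes, something like the following could sample MDS memory against cache counters. A sketch only: the daemon id "a" is a placeholder, and the perf-counter names (mds/inodes, mds/caps) are assumptions that should be checked against 'ceph daemon mds.<id> perf dump' on the version in use:

    #!/bin/bash
    # sample MDS RSS alongside cached-inode and capability counts every 10s
    MDS=a     # placeholder daemon id
    while sleep 10; do
        PID=$(pgrep -f ceph-mds | head -1)
        RSS_KB=$(awk '/VmRSS/ {print $2}' "/proc/$PID/status")
        INODES=$(ceph daemon "mds.$MDS" perf dump | jshon -e mds -e inodes)
        CAPS=$(ceph daemon "mds.$MDS" perf dump | jshon -e mds -e caps)
        echo "$(date +%s) rss_kb=$RSS_KB inodes=$INODES caps=$CAPS"
    done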
Re: [ceph-users] trouble starting second monitor
Hm. Already exists. And now I'm completely confused.

Ok, so I'm trying to start over. I've ceph-deploy purge'd all my machines a few times, with ceph-deploy purgedata intermixed. I've manually removed all the files I could see that were generated, except my osd directories, which I apparently can't remove:

    ceph@adriatic:~$ sudo rm -rf osd
    rm: cannot remove '...': Operation not permitted
    rm: cannot remove '...': Operation not permitted
    rm: cannot remove '...': Operation not permitted

What's up with that, and how do I get rid of it in order to start over?

--rich

On 12/1/14 00:01, Irek Fasikhov wrote: [...]
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
...

Ben: How can I tell if the shard has an object in it from the logs?

Yehuda Sadeh: Search for a different sequence (e.g., search for rgw.gc_remove).

Ben: 0 results in the logs for rgw.gc_remove.

Yehuda: Well, something is modifying the gc log. Do you happen to have more than one radosgw running on the same cluster?

Ben: We have 2 radosgw servers, obj01 and obj02.

Yehuda: Are both of them pointing at the same zone?

Ben: Yes, they are load balanced.

Yehuda: Well, the gc log shows entries and then it doesn't, so something clears these up. Try reproducing again with logs on, and see if you see new entries in the rgw logs. If you don't see these, maybe try turning on 'debug ms = 1' on your osds (ceph tell osd.* injectargs '--debug_ms 1') and look in your osd logs for such messages. These might give you some hint as to their origin. Also, could it be that you ran 'radosgw-admin gc process' instead of waiting for the gc cycle to complete? Yehuda
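To make the reproduce-with-logging step concrete, the cycle Yehuda asks for could look like this (a sketch; s3cmd stands in for whatever S3 client is handy, and the bucket/object names are placeholders):

    # turn up messenger logging on the OSDs
    ceph tell osd.\* injectargs '--debug_ms 1'

    # create, stat, and delete a ~10MB probe object
    dd if=/dev/zero of=./10mb.bin bs=1M count=10
    s3cmd put ./10mb.bin s3://testbucket/gc-probe
    radosgw-admin object stat --bucket=testbucket --object=gc-probe > /tmp/gc-probe.json
    s3cmd del s3://testbucket/gc-probe

    # the raw parts should now be queued for gc...
    radosgw-admin gc list --include-all

    # ...and after the gc cycle has run (hours, by default), gone again
    radosgw-admin gc list --include-all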
Re: [ceph-users] Deleting buckets and objects fails to reduce reported cluster usage
On 2014-12-02 15:03, Yehuda Sadeh wrote: [...]

I did another test, this time with a 600MB file. I uploaded it, then deleted the file and did a gc list --include-all. It displayed around 143 _shadow_ files. I let GC process by itself (I did not force it), and I checked the pool afterwards by running 'rados ls -p .rgw.buckets | grep <gc-listed shadow file>'; they no longer exist.

I've added the debug ms to the OSDs, and I'll do another test with the 600MB file.

Also worth noting: I have started clearing out files from the .rgw.buckets pool that are from a bucket which has been deleted and is no longer visible, by running 'rados -p .rgw.buckets rm' over all the _shadow_ files contained in that bucket's prefix, default.4804.14__shadow_. Is this alright to do, or is there a better way to clear out these files?
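For reference, the manual cleanup Ben describes boils down to something like this (a sketch using the prefix from his post; make very sure the bucket really is gone before removing its raw objects):

    PREFIX=default.4804.14__shadow_
    rados ls -p .rgw.buckets | grep "^$PREFIX" | while read -r OBJ; do
        rados -p .rgw.buckets rm "$OBJ"
    done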
Re: [ceph-users] LevelDB support status is still experimental on Giant?
Xiaoxi Chen: Compared to Filestore on SSD (we run LevelDB on top of SSD). The usage pattern is RBD sequential write (64K * QD8) and random write (4K * QD8); read seems on par. I would suspect a KV backend on HDD will be even worse compared to Filestore on HDD.

From: Satoru Funai [mailto:satoru.fu...@gmail.com]
Sent: Tuesday, December 2, 2014 1:27 PM
To: Chen, Xiaoxi
Cc: ceph-us...@ceph.com; Haomai Wang
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Satoru Funai: Hi Xiaoxi, thanks for the very useful information. Can you share more details? The "terrible bad performance" is compared against what, and with what kind of usage pattern? I'm just interested in a key/value backend for better cost/performance without expensive HW such as ssd/fusion-io. Regards, Satoru Funai

From: Xiaoxi Chen xiaoxi.c...@intel.com
To: Haomai Wang haomaiw...@gmail.com
Cc: Satoru Funai satoru.fu...@gmail.com, ceph-us...@ceph.com
Sent: Monday, December 1, 2014, 11:26:56 PM
Subject: RE: [ceph-users] LevelDB support status is still experimental on Giant?

Xiaoxi: Range query is not that important on today's SSDs; you can see very high random read IOPS in SSD specs, getting higher day by day. The key problem here is trying to match one query (get/put) exactly to one SSD IO (read/write), eliminating the read/write amplification. We kind of believe OpenNvmKV may be the right approach. Back to the context of Ceph: can we find a use case for today's key-value backends? We would like to learn from the community what the workload pattern is if you want a K-V backed Ceph, or is it just to have a try? I think before we get a suitable DB backend, we had better optimize the key-value backend code to support a specific kind of load.

From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Monday, December 1, 2014 10:14 PM
To: Chen, Xiaoxi
Cc: Satoru Funai; ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Haomai Wang: Exactly. I'm just looking forward to a better DB backend suitable for KeyValueStore; maybe a traditional B-tree design. Kinetic I originally thought was a good backend, but it doesn't support range query :-(

On Mon, Dec 1, 2014 at 10:04 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:

Xiaoxi: We have tested it for a while; basically it seems kind of stable, but it shows terribly bad performance. This is not the fault of Ceph but of LevelDB, or, more generally, of all K-V storage with an LSM design (RocksDB, etc.): the LSM tree structure naturally introduces very large write amplification, 10X to 20X, once you have tens of GB of data per OSD. So you always see very bad sequential write performance (~200MB/s for a 12-SSD setup); we can share more details at the performance meeting. To this end, a key-value backend with LevelDB is not usable for RBD usage, but it may be workable (not tested) in LOSF cases (tons of small objects stored via rados, where a K-V backend can prevent the FS metadata from becoming the bottleneck).

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai Wang
Sent: Monday, December 1, 2014 9:48 PM
To: Satoru Funai
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] LevelDB support status is still experimental on Giant?

Haomai: Yeah, mainly used by test env.

On Mon, Dec 1, 2014 at 6:29 PM, Satoru Funai satoru.fu...@gmail.com wrote:

Satoru: Hi guys, I'm interested in using a key/value store as a backend for the Ceph OSD. When firefly was released, LevelDB support was mentioned as experimental; is it the same status on the Giant release? Regards, Satoru Funai

-- Best Regards, Wheat
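For anyone who still wants to experiment despite the caveats above, the key/value backend in this era was selected per-OSD in ceph.conf. A sketch only: the exact backend name has varied between releases (firefly shipped it as keyvaluestore-dev), so check the release notes for the version in use:

    [osd]
    # experimental key/value object store instead of the default filestore
    osd objectstore = keyvaluestore-dev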