[ceph-users] TRIM / DISCARD run at low priority by the OSDs?
Hi All, Is it possible to give TRIM/DISCARD initiated by krbd low priority on the OSDs? I know it is possible to run fstrim at idle priority on the rbd mount point, e.g. ionice -c Idle fstrim -v $MOUNT. But this idle priority (it appears) only applies within the context of the node executing fstrim. Even when the node executing fstrim is idle, the OSDs are very busy and performance suffers. Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at low priority as well?

Thanks!
Chad.
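The best workaround I have found so far is client-side only: spread the trim out over time so the OSDs never see one large burst of deletes. A rough sketch, assuming util-linux fstrim with -o/-l support; the mount point, chunk size and sleep interval are placeholders to tune per cluster:

```bash
#!/bin/bash
# Trim the filesystem in 1 GiB chunks at idle I/O priority, pausing
# between chunks so the OSDs keep room to service client I/O.
MOUNT=/mnt/rbd                        # placeholder mount point
CHUNK=$((1024 * 1024 * 1024))         # 1 GiB per pass
FSSIZE=$(df -B1 --output=size "$MOUNT" | tail -n1 | tr -d ' ')

offset=0
while [ "$offset" -lt "$FSSIZE" ]; do
    ionice -c 3 fstrim -v -o "$offset" -l "$CHUNK" "$MOUNT"   # class 3 = idle
    offset=$((offset + CHUNK))
    sleep 5
done
```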
Re: [ceph-users] radosgw hanging - blocking rgw.bucket_list ops
tried removing, but no luck:

```
rados -p .be-east.rgw.buckets rm be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
error removing .be-east.rgw.buckets>be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity: (2) No such file or directory
```

anyone?

On 21-08-15 13:06, Sam Wouters wrote: [...]
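For reference, the bucket check I'm planning to try next; a sketch only, the bucket name is a placeholder:

```bash
# Dry run first: report index/object inconsistencies without touching anything.
radosgw-admin bucket check --bucket=mybucket --check-objects
# Then, if the report looks sane, let it repair the bucket index:
radosgw-admin bucket check --bucket=mybucket --check-objects --fix
```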
Re: [ceph-users] radosgw hanging - blocking rgw.bucket_list ops
I suspect these to be the cause:

```
rados ls -p .be-east.rgw.buckets | grep sanity
be-east.5436.1__:2bpm.1OR-cqyOLUHek8m2RdPVRZ.pDT__sanity
be-east.5436.1__sanity
be-east.5436.1__:2vBijaGnVQF4Q0IjZPeyZSKeUmBGn9X__sanity
be-east.5436.1__sanity
be-east.5436.1__:4JTCVFxB1qoDWPu1nhuMDuZ3QNPaq5n__sanity
be-east.5436.1__sanity
be-east.5436.1__:9jFwd8xvqJMdrqZuM8Au4mi9M62ikyo__sanity
be-east.5436.1__sanity
be-east.5436.1__:BlfbGYGvLi92QPSiabT2mP7OeuETz0P__sanity
be-east.5436.1__sanity
be-east.5436.1__:MigpcpJKkan7Po6vBsQsSD.hEIRWuim__sanity
be-east.5436.1__sanity
be-east.5436.1__:QDTxD5p0AmVlPW4v8OPU3vtDLzenj4y__sanity
be-east.5436.1__sanity
be-east.5436.1__:S43EiNAk5hOkzgfbOynbOZOuLtUv0SB__sanity
be-east.5436.1__sanity
be-east.5436.1__:UKlOVMQBQnlK20BHJPyvnG6m.2ogBRW__sanity
be-east.5436.1__sanity
be-east.5436.1__:kkb6muzJgREie6XftdEJdFHxR2MaFeB__sanity
be-east.5436.1__sanity
be-east.5436.1__:oqPhWzFDSQ-sNPtppsl1tPjoryaHNZY__sanity
be-east.5436.1__sanity
be-east.5436.1__:pLhygPGKf3uw7C7OxSJNCw8rQEMOw5l__sanity
be-east.5436.1__sanity
be-east.5436.1__:tO1Nf3S2WOfmcnKVPv0tMeXbwa5JR36__sanity
be-east.5436.1__sanity
be-east.5436.1__:ye4oRwDDh1cGckbMbIo56nQvM7OEyPM__sanity
be-east.5436.1__sanity
be-east.5436.1___sanity
be-east.5436.1__sanity
```

would it be safe and/or help to remove those with rados rm, and try a bucket check --fix --check-objects?

On 21-08-15 11:28, Sam Wouters wrote: [...]
Re: [ceph-users] Testing CephFS
On Thu, Aug 20, 2015 at 11:07 AM, Simon Hallam s...@pml.ac.uk wrote:

Hey all, We are currently testing CephFS on a small (3 node) cluster. The setup is currently: each server has 12 OSDs, 1 monitor and 1 MDS running on it. The servers are running 0.94.2-0.el7; the clients are running Ceph 0.80.10-1.fc21 on kernel 4.0.6-200.fc21.x86_64.

```
ceph -s
    cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
     health HEALTH_OK
     monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
            election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3
     mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby
     osdmap e389: 36 osds: 36 up, 36 in
      pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects
            95526 MB used, 196 TB / 196 TB avail
                8256 active+clean
```

Our ceph.conf is relatively simple at the moment:

```
cat /etc/ceph/ceph.conf
[global]
fsid = 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd
mon_initial_members = ceph1, ceph2, ceph3
mon_host = 10.15.0.1,10.15.0.2,10.15.0.3
mon_pg_warn_max_per_osd = 1000
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
```

When I pulled the plug on the master MDS last time (ceph1), it stopped all IO until I plugged it back in. I was under the assumption that the MDS would fail over to the other 2 MDSs and IO would continue? Is there something I need to do to allow the MDSs to fail over from each other without too much interruption? Or is this because of the clients' ceph version?

That's quite strange. How long did you wait for it to fail over? Did the output of ceph -s (or ceph -w, whichever) change during that time? By default the monitors should have detected the MDS was dead after 30 seconds and put one of the other MDS nodes into replay and active. ...I wonder if this is because you lost a monitor at the same time as the MDS. What kind of logging do you have available from during your test?
-Greg

Cheers,
Simon Hallam
Linux Support Development Officer
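If you can re-run the test, something like this on a surviving node would capture what the monitors see during the failover (a sketch; the mon id is a placeholder and the daemon command must run on that mon's host):

```bash
# Watch cluster events (including mdsmap transitions) while the plug is pulled:
ceph -w | tee /tmp/mds-failover.log
# In another terminal; the standby should go standby -> replay -> active:
watch -n1 'ceph mds stat'
# How long the monitors tolerate missed MDS beacons before failing it over (seconds):
ceph daemon mon.ceph2 config get mds_beacon_grace
```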
Re: [ceph-users] Rados: Undefined symbol error
It sounds like you have the rados CLI tool from an earlier Ceph release (< Hammer) installed and it is attempting to use the librados shared library from a newer (>= Hammer) version of Ceph.

Jason

- Original Message -
From: Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com
To: ceph-us...@ceph.com
Sent: Thursday, August 20, 2015 11:47:26 PM
Subject: [ceph-users] Rados: Undefined symbol error

Hello, I cloned the master branch of Ceph and after setting up the cluster, when I tried to use the rados commands, I got this error:

```
rados: symbol lookup error: rados: undefined symbol: _ZN5MutexC1ERKSsbbbP11CephContext
```

I saw a similar post here: http://tracker.ceph.com/issues/12563 but I am not clear on the solution for this problem. I am not performing an upgrade here but the error seems to be similar. Could anybody shed more light on the issue and how to solve it? Thanks a lot!

Aakanksha
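A quick way to confirm that kind of mismatch (a sketch; library paths vary by distro):

```bash
which rados                             # which binary is actually being run
rados --version                         # version of the CLI tool
ldd "$(which rados)" | grep librados    # which librados it resolves to at runtime
ls -l /usr/lib*/librados.so*            # librados versions installed on disk
```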
Re: [ceph-users] Bad performances in recovery
Hi, First of all, we are sure that the return to the default configuration fixed it. As soon as we restarted only one of the ceph nodes with the default configuration, it sped up recovery tremendously. We had already restarted before with the old conf and recovery was never that fast. Regarding the configuration, here's the old one with comments:

```
[global]
fsid = *
mon_initial_members = cephmon1
mon_host = ***
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true        // Lets you use xattributes of xfs/ext4/btrfs filesystems
osd_pool_default_pgp_num = 450         // default pgp number for new pools
osd_pg_bits = 12                       // number of bits used to designate pgps. Lets you have 2^12 pgps
osd_pool_default_size = 3              // default copy number for new pools
osd_pool_default_pg_num = 450          // default pg number for new pools
public_network = *
cluster_network = ***
osd_pgp_bits = 12                      // number of bits used to designate pgps. Lets you have 2^12 pgps

[osd]
filestore_queue_max_ops = 5000         // set to 500 by default. Defines the maximum number of in-progress operations the file store accepts before blocking on queuing new operations.
filestore_fd_cache_random = true
journal_queue_max_ops = 100            // set to 500 by default. Number of operations allowed in the journal queue
filestore_omap_header_cache_size = 100 // Determines the size of the LRU used to cache object omap headers. Larger values use more memory but may reduce lookups on omap.
filestore_fd_cache_size = 100          // not in the ceph documentation. Seems to be a common tweak for SSD clusters though.
max_open_files = 100                   // lets ceph set the max file descriptor in the OS to prevent running out of file descriptors
osd_journal_size = 1                   // journal max size for each OSD
```

New conf:

```
[global]
fsid = *
mon_initial_members = cephmon1
mon_host =
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = **
cluster_network = **
```

You might notice I have a few undocumented settings in the old configuration. These are settings I took from a certain OpenStack summit presentation and they may have contributed to this whole problem. Here's a list of settings that I think might be a possible cause for these speed issues:

filestore_fd_cache_random = true
filestore_fd_cache_size = 100

Additionally, my colleague thinks these settings may have contributed:

filestore_queue_max_ops = 5000
journal_queue_max_ops = 100

We will do further tests on these settings once we have our lab ceph test environment, as we are also curious as to exactly what caused this.

On 2015-08-20 11:43 AM, Alex Gorbachev wrote:

Just to update the mailing list, we ended up going back to default ceph.conf without any additional settings than what is mandatory. We are now reaching speeds we never reached before, both in recovery and in regular usage. There was definitely something we set in the ceph.conf bogging everything down.

Could you please share the old and new ceph.conf, or the section that was removed?

Best regards, Alex

On 2015-08-20 4:06 AM, Christian Balzer wrote:

Hello, from all the pertinent points by Somnath, the one about pre-conditioning would be pretty high on my list, especially if this slowness persists and nothing else (scrub) is going on. This might be fixed by doing an fstrim. Additionally the LevelDBs per OSD are of course sync'ing heavily during reconstruction, so that might not be the favorite thing for your type of SSDs. But ultimately situational awareness is very important, as in what is actually going on and slowing things down. As usual my recommendations would be to use atop, iostat or similar on all your nodes and see if your OSD SSDs are indeed the bottleneck, or if it is maybe just one of them or something else entirely.

Christian

On Wed, 19 Aug 2015 20:54:11 +0000 Somnath Roy wrote:

Also, check if scrubbing started in the cluster or not. That may considerably slow down the cluster.

-----Original Message-----
From: Somnath Roy
Sent: Wednesday, August 19, 2015 1:35 PM
To: 'J-P Methot'; ceph-us...@ceph.com
Subject: RE: [ceph-users] Bad performances in recovery

All the writes will go through the journal. It may happen your SSDs are not preconditioned well and after a lot of writes during recovery IOs are stabilized to lower number. This is quite common for SSDs if that is the case.
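To make Christian's suggestion concrete, a minimal sketch of what to watch on each OSD node during recovery (the intervals and osd id are arbitrary examples):

```bash
iostat -xm 5                  # per-device utilization, await, queue depth
atop 5                        # overall CPU / disk / network pressure
# Per-daemon latency counters from the admin socket:
ceph daemon osd.0 perf dump | less
```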
Re: [ceph-users] Bad performances in recovery
filestore_fd_cache_random = true

not true

Shinobu

On Fri, Aug 21, 2015 at 10:20 PM, Jan Schermer j...@schermer.cz wrote: [...]
Re: [ceph-users] Bad performances in recovery
Thanks for the config, a few comments inline; not really related to the issue.

On 21 Aug 2015, at 15:12, J-P Methot jpmet...@gtcomm.net wrote:

> [...]
> filestore_xattr_use_omap = true // Lets you use xattributes of xfs/ext4/btrfs filesystems

This actually did the opposite, but this option doesn't exist anymore.

> osd_pool_default_pgp_num = 450 // default pgp number for new pools
> osd_pg_bits = 12 // number of bits used to designate pgps. Lets you have 2^12 pgps

Could someone comment on those? What exactly does it do? What if I have more PGs than num_osds*osd_pg_bits?

> [...]
> filestore_fd_cache_random = true

No docs, I don't see this in my ancient cluster :-)

> [...]
> filestore_fd_cache_size = 100 // not in the ceph documentation. Seems to be a common tweak for SSD clusters though.

You don't really need to set this so high, but I'm not sure what the implications are if you go too high (it probably doesn't eat more memory until it opens that many files). If you have 4MB objects on a 1TB drive, then you really only need 250K to keep all files open.

> max_open_files = 100 // lets ceph set the max file descriptor in the OS to prevent running out of file descriptors

This is too low if you were really using all of the fd_cache. There are going to be thousands of TCP connections which need to be accounted for as well (in my experience there can be hundreds to thousands of TCP connections from just one RBD client and 200 OSDs, which is a lot).

> [...]
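One more tip: check what the OSDs are actually running with via the admin socket rather than trusting ceph.conf; a quick sketch:

```bash
# Show the effective values of the settings discussed above on osd.0:
ceph daemon osd.0 config show | egrep 'filestore_fd_cache|queue_max_ops|max_open_files'
```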
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just sj...@redhat.com wrote: Odd, did you happen to capture osd logs? No, but the reproducer is trivial to cut & paste.

Thanks,
Ilya
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Odd, did you happen to capture osd logs?
-Sam

On Thu, Aug 20, 2015 at 8:10 PM, Ilya Dryomov idryo...@gmail.com wrote:

On Fri, Aug 21, 2015 at 2:02 AM, Samuel Just sj...@redhat.com wrote: What's supposed to happen is that the client transparently directs all requests to the cache pool rather than the cold pool when there is a cache pool. If the kernel is sending requests to the cold pool, that's probably where the bug is. Odd. It could also be a bug specific to 'forward' mode, either in the client or on the osd. Why did you have it in that mode?

I think I reproduced this on today's master. Setup, cache mode is writeback:

```
$ ./ceph osd pool create foo 12 12
pool 'foo' created
$ ./ceph osd pool create foo-hot 12 12
pool 'foo-hot' created
$ ./ceph osd tier add foo foo-hot
pool 'foo-hot' is now (or already was) a tier of 'foo'
$ ./ceph osd tier cache-mode foo-hot writeback
set cache-mode for pool 'foo-hot' to writeback
$ ./ceph osd tier set-overlay foo foo-hot
overlay for 'foo' is now (or already was) 'foo-hot'
```

Create an image:

```
$ ./rbd create --size 10M --image-format 2 foo/bar
$ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
$ sudo mkfs.ext4 /mnt/bar
$ sudo umount /mnt
```

Create a snapshot, take md5sum:

```
$ ./rbd snap create foo/bar@snap
$ ./rbd export foo/bar /tmp/foo-1
Exporting image: 100% complete...done.
$ ./rbd export foo/bar@snap /tmp/snap-1
Exporting image: 100% complete...done.
$ md5sum /tmp/foo-1
83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-1
$ md5sum /tmp/snap-1
83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-1
```

Set the cache mode to forward and do a flush. Hashes don't match - the snap is empty - we bang on the hot tier and don't get redirected to the cold tier, I suspect:

```
$ ./ceph osd tier cache-mode foo-hot forward
set cache-mode for pool 'foo-hot' to forward
$ ./rados -p foo-hot cache-flush-evict-all
rbd_data.100a6b8b4567.0002
rbd_id.bar
rbd_directory
rbd_header.100a6b8b4567
bar.rbd
rbd_data.100a6b8b4567.0001
rbd_data.100a6b8b4567.
$ ./rados -p foo-hot cache-flush-evict-all
$ ./rbd export foo/bar /tmp/foo-2
Exporting image: 100% complete...done.
$ ./rbd export foo/bar@snap /tmp/snap-2
Exporting image: 100% complete...done.
$ md5sum /tmp/foo-2
83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-2
$ md5sum /tmp/snap-2
f1c9645dbc14efddc7d8a322685f26eb  /tmp/snap-2
$ od /tmp/snap-2
0000000 000000 000000 000000 000000 000000 000000 000000 000000
*
50000000
```

Disable the cache tier and we are back to normal:

```
$ ./ceph osd tier remove-overlay foo
there is now (or already was) no overlay for 'foo'
$ ./rbd export foo/bar /tmp/foo-3
Exporting image: 100% complete...done.
$ ./rbd export foo/bar@snap /tmp/snap-3
Exporting image: 100% complete...done.
$ md5sum /tmp/foo-3
83f5d244bb65eb19eddce0dc94bf6dda  /tmp/foo-3
$ md5sum /tmp/snap-3
83f5d244bb65eb19eddce0dc94bf6dda  /tmp/snap-3
```

I first reproduced it with the kernel client; rbd export was just to take it out of the equation.

Also, Igor sort of raised a question in his second message: if, after setting the cache mode to forward and doing a flush, I open an image (not a snapshot, so may not be related to the above) for write (e.g. with rbd-fuse), I get an rbd header object in the hot pool, even though it's in forward mode:

```
$ sudo ./rbd-fuse -p foo -c $PWD/ceph.conf /mnt
$ sudo mount /mnt/bar /media
$ sudo umount /media
$ sudo umount /mnt
$ ./rados -p foo-hot ls
rbd_header.100a6b8b4567
$ ./rados -p foo ls | grep rbd_header
rbd_header.100a6b8b4567
```

It's been a while since I looked into tiering, is that how it's supposed to work? It looks like it happens because rbd_header op replies don't redirect?

Thanks,
Ilya
[ceph-users] radosgw only delivers what's cached if latency between key request and actual download is above 90s
We heavily use radosgw here for most of our work and we have seen a weird truncation issue with radosgw/s3 requests. We have noticed that if the time between the initial ticket to grab the object key and grabbing the data is greater than 90 seconds, the object returned is truncated to whatever RGW has grabbed/cached after the initial connection, and this seems to be around 512k. Here is some PoC; this will work on most objects (I have tested mostly 1G to 5G keys in RGW):

```python
#!/usr/bin/env python
import os
import sys
import json
import time

import boto
import boto.s3.connection

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Delayed download.')
    parser.add_argument('credentials', type=argparse.FileType('r'),
                        help='Credentials file.')
    parser.add_argument('endpoint')
    parser.add_argument('bucket')
    parser.add_argument('key')
    args = parser.parse_args()

    credentials = json.load(args.credentials)[args.endpoint]
    conn = boto.connect_s3(
        aws_access_key_id=credentials.get('access_key'),
        aws_secret_access_key=credentials.get('secret_key'),
        host=credentials.get('host'),
        port=credentials.get('port'),
        is_secure=credentials.get('is_secure', False),
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )
    key = conn.get_bucket(args.bucket).get_key(args.key)
    key.BufferSize = 1048576
    key.open_read(headers={})
    time.sleep(120)
    key.get_contents_to_file(sys.stdout)
```

The format of the credentials file is just standard:

```json
{
    "cluster": {
        "access_key": "blahblahblah",
        "secret_key": "blahblahblah",
        "host": "blahblahblah",
        "port": 443,
        "is_secure": true
    }
}
```

From here your object will almost always be truncated to whatever the gateway has cached in the time after the initial key request. This can be a huge issue, as when the radosgw or cluster is under load, some requests can take minutes. You can end up grabbing the rest of the object by doing a range request against the gateway, so I know the data is intact, but I don't think the radosgw should be acting as if the download completed successfully; I think it should instead return an error of some kind if it can no longer service the request. We are using hammer (ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)) and civetweb as our gateway. This is on a 3 node test cluster, but I have tried it on our larger cluster with the same behavior. If I can provide any other information please let me know.
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
I think I found the bug -- need to whiteout the snapset (or decache it) upon evict. http://tracker.ceph.com/issues/12748
-Sam

On Fri, Aug 21, 2015 at 8:04 AM, Ilya Dryomov idryo...@gmail.com wrote: On Fri, Aug 21, 2015 at 5:59 PM, Samuel Just sj...@redhat.com wrote: Odd, did you happen to capture osd logs? No, but the reproducer is trivial to cut & paste. Thanks, Ilya
Re: [ceph-users] Object Storage and POSIX Mix
Shouldn't this already be possible with HTTP Range requests? I don't work with RGW or S3, so please ignore me if I'm talking crazy.

- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Fri, Aug 21, 2015 at 3:27 PM, Scottix wrote: [...]
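For the partial-read half, a range GET is just a standard HTTP header; a quick sketch against an anonymously readable object (the host, bucket and object names are placeholders):

```bash
# Ask RGW for only the first 1 KiB of the object; an HTTP 206
# Partial Content response shows the gateway honoured the Range header.
curl -v -H "Range: bytes=0-1023" \
     http://rgw.example.com/mybucket/myobject -o first-1k.bin
```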
[ceph-users] OSD GHz vs. Cores Question
We are looking to purchase our next round of Ceph hardware and, based off the work by Nick Fisk [1], our previous thought of cores over clock is being revisited. I have two camps of thought and would like to get some feedback, even if it is only theoretical. We currently have 12 disks per node (2 SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used for journals and cache tier (when [2] and fstrim are resolved). We also want to stay with a single processor for cost, power and NUMA considerations.

1. For 12 disks with three threads each (2 client and 1 background), lots of slower cores would allow I/O (Ceph code) to be scheduled as soon as a core is available.
2. Faster cores would get through the Ceph code faster, but there would be fewer cores, so some I/O may have to wait to be scheduled.

I'm leaning towards #2 for these reasons; please expose anything I may be missing:

* The latency will only really be improved by faster clock speed in the SSD I/O: all writes and any reads from the cache tier. So 8 fast cores might be sufficient; reading from spindle and flushing the journal will have a substantial amount of sleep to allow other Ceph I/O to be hyperthreaded.
* Even though SSDs are much faster than spindles, they are still orders of magnitude slower than the processor, so it is still possible to get more lines of code executed between SSD I/Os with a faster processor, even with fewer cores.
* As the Ceph code is optimized and less code has to be executed for each I/O, faster clock speeds will provide even more benefit (lower latency, less waiting for cores) as the delay shifts more from CPU to disk.

Since our workload is typically small I/O (12K-18K), latency means a lot to our performance. Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz.

[1] http://www.spinics.net/lists/ceph-users/msg19305.html
[2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713

Thanks,
- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
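If anyone wants to compare numbers across clock speeds, this is roughly how I'd measure it (a sketch; assumes a throwaway pool named "bench" exists, and osd.0 is just an example):

```bash
# Drive a small-write workload similar to ours (16K writes, 16 threads):
rados -p bench bench 60 write -b 16384 -t 16
# Then read the OSD's own write-latency counters from the admin socket:
ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 '"op_w_latency"'
```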
Re: [ceph-users] radosgw only delivers what's cached if latency between key request and actual download is above 90s
I just tried this (with some smaller objects, maybe 4.5 MB, as well as with a 16 GB file) and it worked fine. However, I am using the apache + fastcgi interface to rgw, rather than civetweb.

-Ben

On Fri, Aug 21, 2015 at 12:19 PM, Sean seapasu...@uchicago.edu wrote: [...]
[ceph-users] Object Storage and POSIX Mix
I saw this article on Linux Today and immediately thought of Ceph: http://www.enterprisestorageforum.com/storage-management/object-storage-vs.-posix-storage-something-in-the-middle-please-1.html

I was thinking: would it theoretically be possible with RGW to do a GET and set a BEGIN_SEEK and OFFSET to only retrieve a specific portion of the file? The other option would be to append data to an RGW object instead of rewriting the entire object. And so on... Just food for thought.
Re: [ceph-users] Object Storage and POSIX Mix
On Fri, Aug 21, 2015 at 10:27 PM, Scottix scot...@gmail.com wrote: [...]

Raw RADOS (i.e., librados users) get access significantly more powerful than what he's describing in that article. :) I don't know if anybody will ever punch more of that functionality through RGW or not.
-Greg
Re: [ceph-users] Question about reliability model result
Hi, I have crossposted this issue here and on github, but no response yet. Any advice?

On Mon, Aug 10, 2015 at 10:21 AM, dahan dahan...@gmail.com wrote:

Hi all, I have tried the reliability model: https://github.com/ceph/ceph-tools/tree/master/models/reliability I ran the tool with the default configuration, and cannot understand the result.

```
storage              durability  PL(site)    PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-------              ----------  --------    ----------  -------    -------    --------
Disk: Enterprise     99.119%     0.000e+00   0.721457%   0.159744%  0.000e+00  8.812e+12
RADOS: 1 cp          99.279%     0.000e+00   0.721457%   0.000865%  0.000e+00  5.411e+12
RADOS: 2 cp          7-nines     0.000e+00   0.49%       0.003442%  0.000e+00  9.704e+06
RADOS: 3 cp          11-nines    0.000e+00   5.090e-11   3.541e-09  0.000e+00  6.655e+02
```

```
storage              durability  PL(site)    PL(copies)  PL(NRE)    PL(rep)    loss/PiB
-------              ----------  --------    ----------  -------    -------    --------
Site (1 PB)          99.900%     0.099950%   0.000e+00   0.000e+00  0.000e+00  9.995e+11
RADOS: 1-site, 1-cp  99.179%     0.099950%   0.721457%   0.000865%  0.000e+00  1.010e+12
RADOS: 1-site, 2-cp  99.900%     0.099950%   0.49%       0.003442%  0.000e+00  9.995e+11
RADOS: 1-site, 3-cp  99.900%     0.099950%   5.090e-11   3.541e-09  0.000e+00  9.995e+11
```

The two result tables have different trends. In the first table, durability is 1 cp < 2 cp < 3 cp. However, the second table results in 1 cp < 2 cp = 3 cp. The two tables have the same PL(copies), PL(NRE), and PL(rep); the only difference is PL(site). PL(site) is constant, since the number of sites is constant, so the trend should be the same. How to explain the result? Anything I missed? Thanks
Re: [ceph-users] Question
Hi, I've done that before, and when I try to write a file into rbd it freezes. Besides resources, is there any other reason it is not recommended to combine mon and osd?

Best wishes,
Mika

2015-08-18 15:52 GMT+08:00 Межов Игорь Александрович me...@yuterra.ru:

Hi!

You can run mons on the same hosts, though it is not recommended. The MON daemon itself is not resource hungry: 1-2 cores and 2-4 GB RAM are enough in most small installs. But there are some pitfalls:

- MONs use LevelDB as a backing store, and make wide use of direct writes to ensure DB consistency. So, if a MON daemon coexists with OSDs, not only on the same host but on the same volume/disk/controller, it will severely reduce the disk IO available to the OSDs, and thus greatly reduce overall performance. Moving the MON root to a separate spindle, or better, a separate SSD, will keep MONs running fine with OSDs on the same host.

- When the cluster is in a healthy state, MONs are not resource consuming, but when the cluster is in a changing state (adding/removing OSDs, backfilling, etc.), CPU and memory usage for a MON can rise significantly.

And yes, in a small cluster it is not always possible to get 3 separate hosts for MONs only.

Megov Igor
CIO, Yuterra

--
From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Luis Periquito periqu...@gmail.com
Sent: 17 August 2015 17:09
To: Kris Vaes
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question

yes. The issue is resource sharing as usual: the MONs will use disk I/O, memory and CPU. If the cluster is small (test?) then there's no problem in using the same disks. If the cluster starts to get bigger you may want to dedicate resources (e.g. the disk for the MONs isn't used by an OSD). If the cluster is big enough you may want to dedicate a node to being a MON.

On Mon, Aug 17, 2015 at 2:56 PM, Kris Vaes k...@s3s.eu wrote:

Hi, Maybe this seems like a strange question, but I could not find this info in the docs. I have the following question: for the ceph cluster you need osd daemons and monitor daemons. On a host you can run several osd daemons (best one per drive, as read in the docs). But now my question: can you run the monitor daemon on the same host where you already run some osd daemons? Is this possible, and what are the implications of doing this?

Met Vriendelijke Groeten
Cordialement
Kind Regards
Cordialmente
С приятелски поздрави
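As a concrete version of Igor's advice, a minimal sketch of moving the mon store onto a dedicated SSD; "mon data" is the real option, while the /ssd mount point is a placeholder:

```bash
# Point the monitor store at the SSD (do this before creating the mon,
# or stop the mon and move its store first), so LevelDB's sync writes
# never compete with the OSD disks.
cat >> /etc/ceph/ceph.conf <<'EOF'
[mon]
mon data = /ssd/ceph/mon/$cluster-$id
EOF
```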
Re: [ceph-users] Broken snapshots... CEPH 0.94.2
Exactly as in our case. Ilya, same for images from our side: headers opened from the hot tier.

On Friday, 21 August 2015, Ilya Dryomov wrote: [...]
[ceph-users] radosgw hanging - blocking rgw.bucket_list ops
Hi, We are running hammer 0.94.2 and have an increasing number of "heartbeat_map is_healthy 'RGWProcess::m_tp thread 0x7f38c77e6700' had timed out after 600" messages in our radosgw logs, with radosgw eventually stalling. A restart of the radosgw helps for a few minutes, but after that it hangs again. ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests shows "call rgw.bucket_list" ops. No new bucket lists are requested, so those ops seem to stay there. Anyone any idea how to get rid of those? Restarting the affected OSD didn't help either. I'm not sure if it's related, but we have an object called _sanity in the bucket the listing was performed on. I know there is some bug with objects starting with _. Any help would be much appreciated.

r, Sam
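For anyone debugging the same thing, these are the commands I'm using to look at the stuck ops (a sketch: the client asok glob and the osd id are examples, adjust to your deployment):

```bash
# In-flight requests as seen from the radosgw client side:
ceph daemon /var/run/ceph/ceph-client.*.asok objecter_requests
# And on the OSD holding the bucket index object:
ceph daemon osd.12 dump_ops_in_flight
```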