Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev wrote:
> On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev wrote:
>> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov wrote:
>>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev wrote:
>>>>> I'm confused. How can a 4M discard not free anything? It's either
>>>>> going to hit an entire object or two adjacent objects, truncating the
>>>>> tail of one and zeroing the head of another. Using rbd diff:
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 4194304 data
>>>>> 29360128 4194304 data
>>>>>
>>>>> # a 4M discard at 1M into a RADOS object
>>>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>>>>
>>>>> $ rbd diff test | grep -A 1 25165824
>>>>> 25165824 1048576 data
>>>>> 29360128 4194304 data
>>>>
>>>> I have tested this on a small RBD device with such offsets and indeed,
>>>> the discard works as you describe, Ilya.
>>>>
>>>> Looking more into why ESXi's discard is not working. I found this
>>>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
>>>> unmap_alignment 0:
>>>>
>>>> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin provisioning for device /dev/rbd/spin1/unmap1t
>>>> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
>>>> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI target virtual disk p_iSCSILun_sclun945 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, nblocks=838860800, cyln=409600)
>>>> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
>>>> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, initiator copy_manager_sess)
>>>>
>>>> even though:
>>>>
>>>> root@e1:/sys/block/rbd29# cat discard_alignment
>>>> 4194304
>>>>
>>>> So somehow the discard_alignment is not making it into the LUN. Could
>>>> this be the issue?
>>>
>>> No, if you are not seeing *any* effect, the alignment is pretty much
>>> irrelevant. Can you do the following on a small test image?
>>>
>>> - capture "rbd diff" output
>>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
>>> - issue a few discards with blkdiscard
>>> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
>>> - capture "rbd diff" output again
>>>
>>> and attach all of the above? (You might need to install a blktrace
>>> package.)
>>>
>>
>> Latest results from VMWare validation tests:
>>
>> Each test creates and deletes a virtual disk, then calls ESXi unmap
>> for what ESXi maps to that volume:
>>
>> Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829
>>
>> Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837
>>
>> Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824
>>
>> Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837
>>
>> Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837
>>
>> At the end, the compounded used size via rbd diff is 608 GB from 775GB
>> of data. So we release only about 20% via discards in the end.
>
> Ilya has analyzed the discard pattern, and indeed the problem is that
> ESXi appears to disregard the discard alignment attribute. Therefore,
> discards are shifted by 1M, and are not hitting the tail of objects.
>
> Discards work much better on the EagerZeroedThick volumes, likely due
> to contiguous data.
>
> I will proceed with the rest of testing, and will post any tips or
> best practice results as they become available.
>
> Thank you for everyone's help and advice!

Testing completed - the discards definitely follow the alignment pattern:

- 4MB objects and VMFS5 - only some discards free space, because a 1MB discard does not often hit the tail of an object

- 1MB objects - practically 100% space reclaim

I have not tried shifting the VMFS5 filesystem, as the test will not work with that. I am also not sure how to properly incorporate that into routine VMware operation.
So, as a best practice: if you want efficient ESXi space reclaim with RBD and VMFS5, use a 1 MB object size in Ceph.

Best regards,
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
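The difference between 4 MB and 1 MB objects can be illustrated with a small model. This is only a sketch under stated assumptions, not how Ceph accounts for space internally: every object is assumed to be fully written, each discard is judged independently, and space is counted as freed only when the discard covers a whole object (delete) or reaches the object's tail (truncate). The function names are made up for the illustration.

```python
MIB = 1 << 20


def freed_by_discard(offset, length, object_size):
    """Bytes freed by one discard under the simplified delete/truncate model.

    Assumption: each touched object is full, and only the part of the
    discard that runs up to the end of an object releases space.
    """
    end = offset + length
    freed = 0
    idx = offset // object_size
    while idx * object_size < end:
        obj_start = idx * object_size
        obj_end = obj_start + object_size
        lo = max(offset, obj_start)
        hi = min(end, obj_end)
        if hi == obj_end:  # whole object (delete) or tail (truncate)
            freed += hi - lo
        idx += 1
    return freed


def reclaim_ratio(discard_size, object_size, span):
    """Fraction of `span` bytes freed by back-to-back discards of one size."""
    freed = sum(
        freed_by_discard(off, discard_size, object_size)
        for off in range(0, span, discard_size)
    )
    return freed / span


# 1 MiB discards on 1 MiB boundaries, as ESXi/VMFS5 appears to issue them
print(reclaim_ratio(MIB, 4 * MIB, 16 * MIB))  # 0.25 with 4 MiB objects
print(reclaim_ratio(MIB, 1 * MIB, 16 * MIB))  # 1.0 with 1 MiB objects
```

In this model only one 1 MiB discard in four reaches the tail of a 4 MiB object, which is in the same range as the roughly 20% reclaim measured in the tests, while with 1 MiB objects every discard removes a whole object.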
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomovwrote: > On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev > wrote: >>> I'm confused. How can a 4M discard not free anything? It's either >>> going to hit an entire object or two adjacent objects, truncating the >>> tail of one and zeroing the head of another. Using rbd diff: >>> >>> $ rbd diff test | grep -A 1 25165824 >>> 25165824 4194304 data >>> 29360128 4194304 data >>> >>> # a 4M discard at 1M into a RADOS object >>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 >>> >>> $ rbd diff test | grep -A 1 25165824 >>> 25165824 1048576 data >>> 29360128 4194304 data >> >> I have tested this on a small RBD device with such offsets and indeed, >> the discard works as you describe, Ilya. >> >> Looking more into why ESXi's discard is not working. I found this >> message in kern.log on Ubuntu on creation of the SCST LUN, which shows >> unmap_alignment 0: >> >> Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) >> Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin >> provisioning for device /dev/rbd/spin1/unmap1t >> Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, >> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 >> Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI >> target virtual disk p_iSCSILun_sclun945 >> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, >> nblocks=838860800, cyln=409600) >> Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: >> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 >> Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: >> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, >> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, >> initiator copy_manager_sess) >> >> even though: >> >> root@e1:/sys/block/rbd29# cat discard_alignment >> 4194304 >> >> So somehow the discard_alignment is not making it into the LUN. Could >> this be the issue? 
> > No, if you are not seeing *any* effect, the alignment is pretty much > irrelevant. Can you do the following on a small test image? > > - capture "rbd diff" output > - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace > - issue a few discards with blkdiscard > - issue a few unmaps with ESXi, preferrably with SCST debugging enabled > - capture "rbd diff" output again > > and attach all of the above? (You might need to install a blktrace > package.) > Latest results from VMWare validation tests: Each test creates and deletes a virtual disk, then calls ESXi unmap for what ESXi maps to that volume: Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829 Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837 Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824 Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837 Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837 At the end, the compounded used size via rbd diff is 608 GB from 775GB of data. So we release only about 20% via discards in the end. Thank you, Alex ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachevwrote: >> I'm confused. How can a 4M discard not free anything? It's either >> going to hit an entire object or two adjacent objects, truncating the >> tail of one and zeroing the head of another. Using rbd diff: >> >> $ rbd diff test | grep -A 1 25165824 >> 25165824 4194304 data >> 29360128 4194304 data >> >> # a 4M discard at 1M into a RADOS object >> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0 >> >> $ rbd diff test | grep -A 1 25165824 >> 25165824 1048576 data >> 29360128 4194304 data > > I have tested this on a small RBD device with such offsets and indeed, > the discard works as you describe, Ilya. > > Looking more into why ESXi's discard is not working. I found this > message in kern.log on Ubuntu on creation of the SCST LUN, which shows > unmap_alignment 0: > > Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945) > Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin > provisioning for device /dev/rbd/spin1/unmap1t > Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, > unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1 > Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI > target virtual disk p_iSCSILun_sclun945 > (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, > nblocks=838860800, cyln=409600) > Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: > scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32 > Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: > scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, > d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, > initiator copy_manager_sess) > > even though: > > root@e1:/sys/block/rbd29# cat discard_alignment > 4194304 > > So somehow the discard_alignment is not making it into the LUN. Could > this be the issue? No, if you are not seeing *any* effect, the alignment is pretty much irrelevant. 
Can you do the following on a small test image?

- capture "rbd diff" output
- blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
- issue a few discards with blkdiscard
- issue a few unmaps with ESXi, preferably with SCST debugging enabled
- capture "rbd diff" output again

and attach all of the above? (You might need to install a blktrace package.)

Thanks,

Ilya
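To compare the two "rbd diff" captures from the steps above, the extent lengths can be summed the same way the awk one-liner elsewhere in this thread does. The sketch below assumes the three-column "offset length type" text output shown in the thread; the helper name is hypothetical and the sample data is copied from Ilya's earlier example.

```python
def rbd_diff_bytes(diff_output):
    """Sum the length column of `rbd diff` text output.

    Assumes whitespace-separated "offset length type" rows; the header
    row and any malformed line are skipped.
    """
    total = 0
    for line in diff_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        total += int(fields[1])
    return total


# Sample captures taken from the example earlier in this thread
BEFORE = """\
25165824 4194304 data
29360128 4194304 data
"""

AFTER = """\
25165824 1048576 data
29360128 4194304 data
"""

freed = rbd_diff_bytes(BEFORE) - rbd_diff_bytes(AFTER)
print(freed)  # 3145728 bytes, i.e. 3 MiB released by the discard
```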
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Friday, August 5, 2016, matthew patton wrote:
> > - ESXI's VMFS5 is aligned on 1MB, so 4MB discards never actually free
> > anything
>
> the proper solution here is to:
> * quit worrying about it and buy sufficient disk in the first place, it's
> not exactly expensive

I would do that for one or a couple of environments, or if I sold drives :). However, the two use cases that are of importance to my group still warrant figuring this out:

- Large medical image collections or frequently modified database files (quite a few deletes and creates)

- Passing VMware certification. It means a lot to people without deep storage knowledge when they decide whether to adopt a technology

> * ask VMware to have the decency to add a flag to vmkfstools to specify
> the offset

Preaching to the choir! I will ask. Hope someone will listen.

> * create a small dummy VMFS on the block device that allows you to create
> a second filesystem behind it that's aligned on a 4MB boundary. Or perhaps
> simpler, create a thick-zeroed VMDK (3+minimum size + extra) on the VMFS
> such that the next VMDK created falls on the desired boundary.

I wonder how to do this for the test, or use a small partition like Vlad described. I will try that with one of their unmap tests.

> * use NFS like *deity* intended like any other sane person, nobody uses
> block storage anymore for precisely these kinds of reasons.

Working in that direction too. A bit concerned about the double writes of the backing filesystem, then double writes for RADOS. Per Nick, this still works better than block. But having gone through 95% of certification for block, I feel like I should finish it before jumping on to the next thing.

Thank you for your input, it is very practical and helpful long term.

Alex

--
Alex Gorbachev
Storcium
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
> I'm confused. How can a 4M discard not free anything? It's either
> going to hit an entire object or two adjacent objects, truncating the
> tail of one and zeroing the head of another. Using rbd diff:
>
> $ rbd diff test | grep -A 1 25165824
> 25165824 4194304 data
> 29360128 4194304 data
>
> # a 4M discard at 1M into a RADOS object
> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>
> $ rbd diff test | grep -A 1 25165824
> 25165824 1048576 data
> 29360128 4194304 data

I have tested this on a small RBD device with such offsets and indeed, the discard works as you describe, Ilya.

Looking more into why ESXi's discard is not working. I found this message in kern.log on Ubuntu on creation of the SCST LUN, which shows unmap_alignment 0:

Aug 6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
Aug 6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin provisioning for device /dev/rbd/spin1/unmap1t
Aug 6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192, unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
Aug 6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI target virtual disk p_iSCSILun_sclun945 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512, nblocks=838860800, cyln=409600)
Aug 6 22:02:33 e1 kernel: [300378.136847] [4682]: scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
Aug 6 22:02:33 e1 kernel: [300378.136853] [4682]: scst: scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0, d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945, initiator copy_manager_sess)

even though:

root@e1:/sys/block/rbd29# cat discard_alignment
4194304

So somehow the discard_alignment is not making it into the LUN. Could this be the issue?

Thanks,
Alex
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Sat, Aug 6, 2016 at 1:10 AM, Alex Gorbachev wrote:
> Is there a way to perhaps increase the discard granularity? The way I see
> it based on the discussion so far, here is why discard/unmap is failing to
> work with VMWare:
>
> - RBD provides space in 4MB blocks, which must be discarded entirely, or at
> least hitting the tail.
>
> - SCST communicates to ESXi that discard alignment is 4MB and discard
> granularity is also 4MB
>
> - ESXI's VMFS5 is aligned on 1MB, so 4MB discards never actually free
> anything
>
> What if it were possible to make a 6MB discard granularity?

I'm confused. How can a 4M discard not free anything? It's either going to hit an entire object or two adjacent objects, truncating the tail of one and zeroing the head of another. Using rbd diff:

$ rbd diff test | grep -A 1 25165824
25165824 4194304 data
29360128 4194304 data

# a 4M discard at 1M into a RADOS object
$ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0

$ rbd diff test | grep -A 1 25165824
25165824 1048576 data
29360128 4194304 data

Thanks,

Ilya
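The reasoning in the reply above (whole object, tail, or head) can be written down as a small classifier. This is an illustrative sketch only: it follows the three cases described in this thread, treats a discard that reaches neither end of an object as "zero" as well, and assumes the default 4 MiB object size. The function name is hypothetical and this is not the kernel RBD driver's actual code.

```python
OBJECT_SIZE = 4 << 20  # default RADOS object size for RBD (4 MiB)


def classify_discard(offset, length, object_size=OBJECT_SIZE):
    """Map a discard byte range onto per-object operations.

    Returns a list of (object_index, operation) pairs, where operation
    is "delete" (whole object), "truncate" (range reaches the object's
    tail) or "zero" (anything else; frees no space in this model).
    """
    if length <= 0:
        return []
    end = offset + length
    ops = []
    for idx in range(offset // object_size, (end - 1) // object_size + 1):
        obj_start = idx * object_size
        obj_end = obj_start + object_size
        lo = max(offset, obj_start)
        hi = min(end, obj_end)
        if lo == obj_start and hi == obj_end:
            ops.append((idx, "delete"))
        elif hi == obj_end:
            ops.append((idx, "truncate"))
        else:
            ops.append((idx, "zero"))
    return ops


# The 4M discard starting 1M into the object at byte offset 25165824
print(classify_discard(25165824 + (1 << 20), 4 << 20))
# [(6, 'truncate'), (7, 'zero')]
```

The result matches the rbd diff output quoted above: the first object shrinks to 1048576 bytes (truncate), while the second stays at 4194304 bytes because only its head was zeroed.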
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Wed, Aug 3, 2016 at 10:54 AM, Alex Gorbachevwrote: > On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev > wrote: >> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin wrote: >>> Alex Gorbachev wrote on 08/02/2016 07:56 AM: On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: > On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev > wrote: >> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin >> wrote: >>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: Hi Ilya, On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: > On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev > wrote: >> RBD illustration showing RBD ignoring discard until a certain >> threshold - why is that? This behavior is unfortunately incompatible >> with ESXi discard (UNMAP) behavior. >> >> Is there a way to lower the discard sensitivity on RBD devices? >> >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 819200 KB >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 782336 KB > > Think about it in terms of underlying RADOS objects (4M by default). > There are three cases: > > discard range | command > - > whole object| delete > object's tail | truncate > object's head | zero > > Obviously, only delete and truncate free up space. In all of your > examples, except the last one, you are attempting to discard the head > of the (first) object. > > You can free up as little as a sector, as long as it's the tail: > > OffsetLength Type > 0 4194304 data > > # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 > > OffsetLength Type > 0 4193792 data Looks like ESXi is sending in each discard/unmap with the fixed granularity of 8192 sectors, which is passed verbatim by SCST. 
There is a slight reduction in size via rbd diff method, but now I understand that actual truncate only takes effect when the discard happens to clip the tail of an image. So far looking at https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513 ...the only variable we can control is the count of 8192-sector chunks and not their size. Which means that most of the ESXi discard commands will be disregarded by Ceph. Vlad, is 8192 sectors coming from ESXi, as in the debug: Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192) >>> >>> Yes, correct. However, to make sure that VMware is not (erroneously) >>> enforced to do this, you need to perform one more check. >>> >>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here >>> correct granularity and alignment (4M, I guess?) >> >> This seems to reflect the granularity (4194304), which matches the >> 8192 pages (8192 x 512 = 4194304). However, there is no alignment >> value. >> >> Can discard_alignment be specified with RBD? > > It's exported as a read-only sysfs attribute, just like > discard_granularity: > > # cat /sys/block/rbd0/discard_alignment > 4194304 Ah thanks Ilya, it is indeed there. Vlad, your email says to look for discard_alignment in /sys/block//queue, but for RBD it's in /sys/block/ - could this be the source of the issue? >>> >>> No. As you can see below, the alignment reported correctly. So, this must >>> be VMware >>> issue, because it is ignoring the alignment parameter. You can try to align >>> your VMware >>> partition on 4M boundary, it might help. >> >> Is this not a mismatch: >> >> - From sg_inq: Unmap granularity alignment: 8192 >> >> - From "cat /sys/block/rbd0/discard_alignment": 4194304 >> >> I am compiling the latest SCST trunk now. > > Scratch that, please, I just did a test that shows correct calculation > of 4MB in sectors. 
> > - On iSCSI client node: > > dd if=/dev/urandom of=/dev/sdf bs=1M count=800 > blkdiscard -o 0 -l 4194304 /dev/sdf > > - On iSCSI server node: > > Aug 3 10:50:57 e1 kernel: [ 893.444538] [1381]: > vdisk_unmap_range:3832:Discarding (start_sector
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachevwrote: > On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin wrote: >> Alex Gorbachev wrote on 08/02/2016 07:56 AM: >>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev wrote: > On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin > wrote: >> Alex Gorbachev wrote on 08/01/2016 04:05 PM: >>> Hi Ilya, >>> >>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote: > RBD illustration showing RBD ignoring discard until a certain > threshold - why is that? This behavior is unfortunately incompatible > with ESXi discard (UNMAP) behavior. > > Is there a way to lower the discard sensitivity on RBD devices? > >>> > > root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 > root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { > print SUM/1024 " KB" }' > 819200 KB > > root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 > root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { > print SUM/1024 " KB" }' > 782336 KB Think about it in terms of underlying RADOS objects (4M by default). There are three cases: discard range | command - whole object| delete object's tail | truncate object's head | zero Obviously, only delete and truncate free up space. In all of your examples, except the last one, you are attempting to discard the head of the (first) object. You can free up as little as a sector, as long as it's the tail: OffsetLength Type 0 4194304 data # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 OffsetLength Type 0 4193792 data >>> >>> Looks like ESXi is sending in each discard/unmap with the fixed >>> granularity of 8192 sectors, which is passed verbatim by SCST. There >>> is a slight reduction in size via rbd diff method, but now I >>> understand that actual truncate only takes effect when the discard >>> happens to clip the tail of an image. 
>>> >>> So far looking at >>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513 >>> >>> ...the only variable we can control is the count of 8192-sector chunks >>> and not their size. Which means that most of the ESXi discard >>> commands will be disregarded by Ceph. >>> >>> Vlad, is 8192 sectors coming from ESXi, as in the debug: >>> >>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >>> 1342099456, nr_sects 8192) >> >> Yes, correct. However, to make sure that VMware is not (erroneously) >> enforced to do this, you need to perform one more check. >> >> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here >> correct granularity and alignment (4M, I guess?) > > This seems to reflect the granularity (4194304), which matches the > 8192 pages (8192 x 512 = 4194304). However, there is no alignment > value. > > Can discard_alignment be specified with RBD? It's exported as a read-only sysfs attribute, just like discard_granularity: # cat /sys/block/rbd0/discard_alignment 4194304 >>> >>> Ah thanks Ilya, it is indeed there. Vlad, your email says to look for >>> discard_alignment in /sys/block//queue, but for RBD it's in >>> /sys/block/ - could this be the source of the issue? >> >> No. As you can see below, the alignment reported correctly. So, this must be >> VMware >> issue, because it is ignoring the alignment parameter. You can try to align >> your VMware >> partition on 4M boundary, it might help. > > Is this not a mismatch: > > - From sg_inq: Unmap granularity alignment: 8192 > > - From "cat /sys/block/rbd0/discard_alignment": 4194304 > > I am compiling the latest SCST trunk now. Scratch that, please, I just did a test that shows correct calculation of 4MB in sectors. 
- On iSCSI client node: dd if=/dev/urandom of=/dev/sdf bs=1M count=800 blkdiscard -o 0 -l 4194304 /dev/sdf - On iSCSI server node: Aug 3 10:50:57 e1 kernel: [ 893.444538] [1381]: vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192) (8192 * 512 = 4194304) Now proceeding to test discard again with the latest SCST trunk build. > > Thanks, > Alex > >> >>> Here is what I get querying the iscsi-exported RBD device on Linux: >>> >>>
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitinwrote: > Alex Gorbachev wrote on 08/02/2016 07:56 AM: >> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov wrote: >>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev >>> wrote: On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote: > Alex Gorbachev wrote on 08/01/2016 04:05 PM: >> Hi Ilya, >> >> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: >>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev >>> wrote: RBD illustration showing RBD ignoring discard until a certain threshold - why is that? This behavior is unfortunately incompatible with ESXi discard (UNMAP) behavior. Is there a way to lower the discard sensitivity on RBD devices? >> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 819200 KB root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }' 782336 KB >>> >>> Think about it in terms of underlying RADOS objects (4M by default). >>> There are three cases: >>> >>> discard range | command >>> - >>> whole object| delete >>> object's tail | truncate >>> object's head | zero >>> >>> Obviously, only delete and truncate free up space. In all of your >>> examples, except the last one, you are attempting to discard the head >>> of the (first) object. >>> >>> You can free up as little as a sector, as long as it's the tail: >>> >>> OffsetLength Type >>> 0 4194304 data >>> >>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 >>> >>> OffsetLength Type >>> 0 4193792 data >> >> Looks like ESXi is sending in each discard/unmap with the fixed >> granularity of 8192 sectors, which is passed verbatim by SCST. There >> is a slight reduction in size via rbd diff method, but now I >> understand that actual truncate only takes effect when the discard >> happens to clip the tail of an image. 
>> >> So far looking at >> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513 >> >> ...the only variable we can control is the count of 8192-sector chunks >> and not their size. Which means that most of the ESXi discard >> commands will be disregarded by Ceph. >> >> Vlad, is 8192 sectors coming from ESXi, as in the debug: >> >> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector >> 1342099456, nr_sects 8192) > > Yes, correct. However, to make sure that VMware is not (erroneously) > enforced to do this, you need to perform one more check. > > 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here > correct granularity and alignment (4M, I guess?) This seems to reflect the granularity (4194304), which matches the 8192 pages (8192 x 512 = 4194304). However, there is no alignment value. Can discard_alignment be specified with RBD? >>> >>> It's exported as a read-only sysfs attribute, just like >>> discard_granularity: >>> >>> # cat /sys/block/rbd0/discard_alignment >>> 4194304 >> >> Ah thanks Ilya, it is indeed there. Vlad, your email says to look for >> discard_alignment in /sys/block//queue, but for RBD it's in >> /sys/block/ - could this be the source of the issue? > > No. As you can see below, the alignment reported correctly. So, this must be > VMware > issue, because it is ignoring the alignment parameter. You can try to align > your VMware > partition on 4M boundary, it might help. Is this not a mismatch: - From sg_inq: Unmap granularity alignment: 8192 - From "cat /sys/block/rbd0/discard_alignment": 4194304 I am compiling the latest SCST trunk now. 
Thanks, Alex > >> Here is what I get querying the iscsi-exported RBD device on Linux: >> >> root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf >> VPD INQUIRY: Block limits page (SBC) >> Maximum compare and write length: 255 blocks >> Optimal transfer length granularity: 8 blocks >> Maximum transfer length: 16384 blocks >> Optimal transfer length: 1024 blocks >> Maximum prefetch, xdread, xdwrite transfer length: 0 blocks >> Maximum unmap LBA count: 8192 >> Maximum unmap block descriptor count: 4294967295 >> Optimal unmap granularity: 8192 >> Unmap granularity alignment valid: 1 >> Unmap granularity alignment: 8192 > ___ ceph-users mailing list ceph-users@lists.ceph.com
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On 08/02/2016 07:26 PM, Ilya Dryomov wrote:
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304). However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Note that this is the standard way Linux exports alignment for storage discard for *any* kind of storage, so it is worth using :)

Ric
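Since the attribute locations caused some confusion in this thread, here is a minimal sketch that reads the discard limits from sysfs for one device. It assumes the usual Linux layout, with discard_alignment directly under /sys/block/<dev>/ and discard_granularity and discard_max_bytes under the queue/ subdirectory. The helper name and the sysfs_root parameter are made up for this example; the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_discard_limits(device, sysfs_root="/sys/block"):
    """Return the discard-related sysfs attributes of a block device.

    Values are in bytes. An attribute that is missing or unreadable is
    reported as None instead of raising.
    """
    base = Path(sysfs_root) / device
    paths = {
        "discard_alignment": base / "discard_alignment",
        "discard_granularity": base / "queue" / "discard_granularity",
        "discard_max_bytes": base / "queue" / "discard_max_bytes",
    }
    limits = {}
    for name, path in paths.items():
        try:
            limits[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    # "rbd0" is only an example device name
    print(read_discard_limits("rbd0"))
```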
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomovwrote: > On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev > wrote: >> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote: >>> Alex Gorbachev wrote on 08/01/2016 04:05 PM: Hi Ilya, On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote: > On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev > wrote: >> RBD illustration showing RBD ignoring discard until a certain >> threshold - why is that? This behavior is unfortunately incompatible >> with ESXi discard (UNMAP) behavior. >> >> Is there a way to lower the discard sensitivity on RBD devices? >> >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 819200 KB >> >> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28 >> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { >> print SUM/1024 " KB" }' >> 782336 KB > > Think about it in terms of underlying RADOS objects (4M by default). > There are three cases: > > discard range | command > - > whole object| delete > object's tail | truncate > object's head | zero > > Obviously, only delete and truncate free up space. In all of your > examples, except the last one, you are attempting to discard the head > of the (first) object. > > You can free up as little as a sector, as long as it's the tail: > > OffsetLength Type > 0 4194304 data > > # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28 > > OffsetLength Type > 0 4193792 data Looks like ESXi is sending in each discard/unmap with the fixed granularity of 8192 sectors, which is passed verbatim by SCST. There is a slight reduction in size via rbd diff method, but now I understand that actual truncate only takes effect when the discard happens to clip the tail of an image. 
>>>> So far looking at
>>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>>
>>>> ...the only variable we can control is the count of 8192-sector chunks
>>>> and not their size. Which means that most of the ESXi discard
>>>> commands will be disregarded by Ceph.
>>>>
>>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>>
>>>> Aug 1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>>> 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously)
>>> enforced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct
>>> granularity and alignment (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304). However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Ah thanks Ilya, it is indeed there. Vlad, your email says to look for discard_alignment in /sys/block//queue, but for RBD it's in /sys/block/ - could this be the source of the issue?

Here is what I get querying the iscsi-exported RBD device on Linux:

root@kio1:/sys/block/sdf# sg_inq -p 0xB0 /dev/sdf
VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 255 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 16384 blocks
  Optimal transfer length: 1024 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192

> Thanks,
>
> Ilya
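The sg_inq values above are counted in logical blocks, while the sysfs attributes are in bytes, which makes the two easy to misread as a mismatch. The sketch below converts the unmap fields to bytes. It assumes the plain "label: value" layout shown in this message and a 512-byte logical block (the SCST log elsewhere in the thread reports bs=512). The function name and the result keys are hypothetical.

```python
def unmap_limits_in_bytes(sg_inq_output, block_size=512):
    """Convert the unmap fields of `sg_inq -p 0xB0` text output to bytes.

    Only bare numeric values are converted; fields printed with a unit
    suffix, or fields not listed below, are ignored.
    """
    wanted = {
        "Maximum unmap LBA count": "max_unmap_bytes",
        "Optimal unmap granularity": "unmap_granularity_bytes",
        "Unmap granularity alignment": "unmap_alignment_bytes",
    }
    result = {}
    for line in sg_inq_output.splitlines():
        label, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and label in wanted and value.isdigit():
            result[wanted[label]] = int(value) * block_size
    return result


# Sample copied from the sg_inq output in this message
SAMPLE = """\
VPD INQUIRY: Block limits page (SBC)
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192
"""

print(unmap_limits_in_bytes(SAMPLE))
# every reported field converts to 4194304 bytes (8192 x 512)
```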
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote:
>>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
>>>>> RBD illustration showing RBD ignoring discard until a certain
>>>>> threshold - why is that?  This behavior is unfortunately incompatible
>>>>> with ESXi discard (UNMAP) behavior.
>>>>>
>>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>>
>>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>> 819200 KB
>>>>>
>>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>>> 782336 KB
>>>>
>>>> Think about it in terms of underlying RADOS objects (4M by default).
>>>> There are three cases:
>>>>
>>>>     discard range | command
>>>>     -----------------------------
>>>>      whole object | delete
>>>>     object's tail | truncate
>>>>     object's head | zero
>>>>
>>>> Obviously, only delete and truncate free up space.  In all of your
>>>> examples, except the last one, you are attempting to discard the head
>>>> of the (first) object.
>>>>
>>>> You can free up as little as a sector, as long as it's the tail:
>>>>
>>>> Offset  Length   Type
>>>> 0       4194304  data
>>>>
>>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>>
>>>> Offset  Length   Type
>>>> 0       4193792  data
>>>
>>> Looks like ESXi is sending each discard/unmap with a fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via the rbd diff method, but now I
>>> understand that an actual truncate only takes effect when the discard
>>> happens to clip the tail of an image.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously)
>> forced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct
>> granularity and alignment here (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
> value.
>
> Can discard_alignment be specified with RBD?

It's exported as a read-only sysfs attribute, just like
discard_granularity:

# cat /sys/block/rbd0/discard_alignment
4194304

Thanks,

                Ilya
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
>>>> RBD illustration showing RBD ignoring discard until a certain
>>>> threshold - why is that?  This behavior is unfortunately incompatible
>>>> with ESXi discard (UNMAP) behavior.
>>>>
>>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>>
>>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>> 819200 KB
>>>>
>>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>>> 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>>     discard range | command
>>>     -----------------------------
>>>      whole object | delete
>>>     object's tail | truncate
>>>     object's head | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> Offset  Length   Type
>>> 0       4194304  data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> Offset  Length   Type
>>> 0       4193792  data
>>
>> Looks like ESXi is sending each discard/unmap with a fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via the rbd diff method, but now I
>> understand that an actual truncate only takes effect when the discard
>> happens to clip the tail of an image.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>
>> ...the only variable we can control is the count of 8192-sector chunks
>> and not their size.  Which means that most of the ESXi discard
>> commands will be disregarded by Ceph.
>>
>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192)
>
> Yes, correct. However, to make sure that VMware is not (erroneously) forced
> to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report the correct
> granularity and alignment here (4M, I guess?)

This seems to reflect the granularity (4194304), which matches the
8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
value.

Can discard_alignment be specified with RBD?

>
> 2. Connect to this iSCSI device from a Linux box and run sg_inq -p 0xB0
> /dev/
>
> SCST should correctly report those values for unmap parameters (in blocks).
>
> If in both cases you see the same correct values, then this is a VMware issue,
> because it is ignoring what it is told to do (generate appropriately sized
> and aligned UNMAP requests). If either Ceph or SCST doesn't show correct
> numbers, then the broken party should be fixed.
>
> Vlad
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Tue, Aug 2, 2016 at 1:05 AM, Alex Gorbachev wrote:
> Hi Ilya,
>
> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote:
>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
>>> RBD illustration showing RBD ignoring discard until a certain
>>> threshold - why is that?  This behavior is unfortunately incompatible
>>> with ESXi discard (UNMAP) behavior.
>>>
>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>> 819200 KB
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>>> 782336 KB
>>
>> Think about it in terms of underlying RADOS objects (4M by default).
>> There are three cases:
>>
>>     discard range | command
>>     -----------------------------
>>      whole object | delete
>>     object's tail | truncate
>>     object's head | zero
>>
>> Obviously, only delete and truncate free up space.  In all of your
>> examples, except the last one, you are attempting to discard the head
>> of the (first) object.
>>
>> You can free up as little as a sector, as long as it's the tail:
>>
>> Offset  Length   Type
>> 0       4194304  data
>>
>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>
>> Offset  Length   Type
>> 0       4193792  data
>
> Looks like ESXi is sending each discard/unmap with a fixed
> granularity of 8192 sectors, which is passed verbatim by SCST.  There
> is a slight reduction in size via the rbd diff method, but now I
> understand that an actual truncate only takes effect when the discard
> happens to clip the tail of an image.

... the tail of the *object*.  And again, with "filestore punch hole =
true", page-sized discards anywhere within the image would free up
space, but "rbd diff" won't reflect that.

>
> So far looking at
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>
> ...the only variable we can control is the count of 8192-sector chunks
> and not their size.  Which means that most of the ESXi discard
> commands will be disregarded by Ceph.
>
> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>
> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192)

They won't be disregarded, but it would definitely work better if they
were aligned.  Start sector 1342099456 isn't 4M-aligned.

Thanks,

                Ilya
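The alignment remark can be checked with a line of shell arithmetic. This is only a sketch: `sector_offset_in_object` is a made-up name, and it assumes 512-byte sectors and the default 4M object size, so one object spans 8192 sectors.

```shell
#!/bin/sh
# Hypothetical helper: offset of a start sector inside its 4 MiB RADOS object,
# assuming 512-byte sectors (4 MiB / 512 = 8192 sectors per object).
SECTORS_PER_OBJ=$(( (4 << 20) / 512 ))

sector_offset_in_object() {
    echo $(( $1 % SECTORS_PER_OBJ ))
}

# Start sector taken from the SCST trace quoted in this thread.
sector_offset_in_object 1342099456   # prints 4096
```

A result of 0 would mean the discard starts on an object boundary. Here the offset is 4096 sectors, so under these assumptions the 8192-sector discard starts 2M into one object and ends 2M into the next: the first half truncates a tail, the second half only zeroes a head.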
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Ilya,

On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
>     discard range | command
>     -----------------------------
>      whole object | delete
>     object's tail | truncate
>     object's head | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> Offset  Length   Type
> 0       4194304  data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> Offset  Length   Type
> 0       4193792  data

Looks like ESXi is sending each discard/unmap with a fixed
granularity of 8192 sectors, which is passed verbatim by SCST.  There
is a slight reduction in size via the rbd diff method, but now I
understand that an actual truncate only takes effect when the discard
happens to clip the tail of an image.

So far looking at
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513

...the only variable we can control is the count of 8192-sector chunks
and not their size.  Which means that most of the ESXi discard
commands will be disregarded by Ceph.

Vlad, is 8192 sectors coming from ESXi, as in the debug:

Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector 1342099456, nr_sects 8192)

Thank you,
Alex

>
> Thanks,
>
>                 Ilya
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 1, 2016 at 9:07 PM, Ilya Dryomov wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
>     discard range | command
>     -----------------------------
>      whole object | delete
>     object's tail | truncate
>     object's head | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> Offset  Length   Type
> 0       4194304  data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> Offset  Length   Type
> 0       4193792  data

Just realized I've left out the most interesting bit.  You can make
the zero operation punch holes, but that's disabled by default in
Jewel.  The option is "filestore punch hole = true".  Note that space
freed this way won't be reflected in "rbd diff" output.

Thanks,

                Ilya
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
> 782336 KB

Think about it in terms of underlying RADOS objects (4M by default).
There are three cases:

    discard range | command
    -----------------------------
     whole object | delete
    object's tail | truncate
    object's head | zero

Obviously, only delete and truncate free up space.  In all of your
examples, except the last one, you are attempting to discard the head
of the (first) object.

You can free up as little as a sector, as long as it's the tail:

Offset  Length   Type
0       4194304  data

# blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

Offset  Length   Type
0       4193792  data

Thanks,

                Ilya
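The delete/truncate/zero table above can be turned into a small shell sketch that splits a byte-range discard at object boundaries and names the operation each piece would map to. This is an illustration only, not how the kernel RBD client is written: `discard_ops` is a made-up name, the 4M object size is the default mentioned above, and treating a range in the middle of an object like a head (zero) is an assumption, since the table only lists head, tail and whole object.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph: show how a byte-range discard
# (offset, length) would map onto 4 MiB RADOS objects per the table above.
OBJ_SIZE=$((4 << 20))   # default RBD object size (assumed)

discard_ops() {
    off=$1
    end=$(( $1 + $2 ))                      # exclusive end of the range
    while [ "$off" -lt "$end" ]; do
        obj=$(( off / OBJ_SIZE ))
        obj_start=$(( obj * OBJ_SIZE ))
        obj_end=$(( obj_start + OBJ_SIZE ))
        chunk_end=$end                      # clip the piece to this object
        if [ "$chunk_end" -gt "$obj_end" ]; then
            chunk_end=$obj_end
        fi
        if [ "$off" -eq "$obj_start" ] && [ "$chunk_end" -eq "$obj_end" ]; then
            op=delete                       # whole object: frees space
        elif [ "$chunk_end" -eq "$obj_end" ]; then
            op=truncate                     # tail of the object: frees space
        else
            op=zero                         # head (or middle, assumed): no space freed
        fi
        echo "object $obj: $op $(( chunk_end - off )) bytes"
        off=$chunk_end
    done
}
```

For example, `discard_ops 0 4096` reports a single zero, and `discard_ops $(( (4 << 20) - 512 )) 512` reports a 512-byte truncate. A 40960000-byte discard from offset 0 (a length assumed here because it reproduces the 36864 KB drop in the last experiment) reports nine deletes, objects 0 through 8, followed by a zero on the head of object 9.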
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
RBD illustration showing RBD ignoring discard until a certain
threshold - why is that?  This behavior is unfortunately incompatible
with ESXi discard (UNMAP) behavior.

Is there a way to lower the discard sensitivity on RBD devices?

root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 40960000 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END { print SUM/1024 " KB" }'
782336 KB

--
Alex Gorbachev
Storcium

On Sat, Jul 30, 2016 at 9:11 PM, Alex Gorbachev wrote:
>>
>> On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote:
>>>
>>>
>>> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
>>> > One other experiment: just running blkdiscard against the RBD block
>>> > device completely clears it, to the point where the rbd-diff method
>>> > reports 0 blocks utilized.  So to summarize:
>>> >
>>> > - ESXi sending UNMAP via SCST does not seem to release storage from
>>> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
>>> >
>>> > - blkdiscard does release the space
>>>
>>> How did you run blkdiscard? It might be that blkdiscard discarded big
>>> areas, while ESXi
>>> sending UNMAP commands for areas smaller, than min size, which could be
>>> discarded, or
>>> not aligned as needed, so those discard requests just ignored.
> > Here is the output of the debug, many more of these statements before > and after. Is it correct to state then that SCST is indeed executing > the discard and the RBD device is ignoring it (since the used size in > ceph is not diminishing)? > > Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.242771] 
[22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570855424, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.256001] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570863616, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.259204] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 570871808, nr_sects > 8192) > Jul 30 21:08:46 e1 kernel: [ 3032.261368] [22016]: > vdisk_unmap_range:3830:Discarding (start_sector 57088,
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
>
> On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote:
>>
>>
>> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
>> > One other experiment: just running blkdiscard against the RBD block
>> > device completely clears it, to the point where the rbd-diff method
>> > reports 0 blocks utilized.  So to summarize:
>> >
>> > - ESXi sending UNMAP via SCST does not seem to release storage from
>> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
>> >
>> > - blkdiscard does release the space
>>
>> How did you run blkdiscard? It might be that blkdiscard discarded big
>> areas, while ESXi
>> sending UNMAP commands for areas smaller, than min size, which could be
>> discarded, or
>> not aligned as needed, so those discard requests just ignored.

Here is the output of the debug, many more of these statements before
and after.  Is it correct to state then that SCST is indeed executing
the discard and the RBD device is ignoring it (since the used size in
ceph is not diminishing)?

Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.242771] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570855424, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.256001] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570863616, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.259204] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570871808, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.261368] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570880000, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.264025] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570888192, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.266737] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570896384, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.270143] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570904576, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.273975] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570912768, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.278163] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570920960, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.282250] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570929152, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.285932] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570937344, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.289736] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570945536, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.292506] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570953728, nr_sects 8192)
Jul 30 21:08:46 e1 kernel: [ 3032.294706] [22016]: vdisk_unmap_range:3830:Discarding (start_sector 570961920, nr_sects 8192)

Thank you,
Alex

>
>
> I indeed ran blkdiscard on the whole device.  So the question to the Ceph
> list is below
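To see how the traced discards sit relative to 4M object boundaries, the start sectors can be pulled out of the log and reduced modulo 8192. A minimal sketch, assuming 512-byte sectors and the default 4M object size; `unmap_offsets` is a made-up name, and the only log format it understands is the `start_sector N` text shown above.

```shell
#!/bin/sh
# Hypothetical helper: read SCST "Discarding (start_sector N, nr_sects M)"
# trace lines on stdin and print N followed by its offset, in sectors,
# inside a 4 MiB object (assumes 512-byte sectors, 8192 sectors per object).
unmap_offsets() {
    awk 'match($0, /start_sector [0-9]+/) {
        # "start_sector " is 13 characters; keep only the digits after it
        s = substr($0, RSTART + 13, RLENGTH - 13)
        print s, s % 8192
    }'
}
```

It could be fed from the kernel log, for instance `dmesg | unmap_offsets`. For the first trace line above (start sector 570716160) it prints an offset of 4096 sectors, and since each following start sector is 8192 higher, the rest of the run has the same offset.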
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Vlad,

On Wednesday, July 27, 2016, Vladislav Bolkhovitin wrote:
>
> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
> > One other experiment: just running blkdiscard against the RBD block
> > device completely clears it, to the point where the rbd-diff method
> > reports 0 blocks utilized.  So to summarize:
> >
> > - ESXi sending UNMAP via SCST does not seem to release storage from
> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
> >
> > - blkdiscard does release the space
>
> How did you run blkdiscard? It might be that blkdiscard discarded big
> areas, while ESXi
> sending UNMAP commands for areas smaller, than min size, which could be
> discarded, or
> not aligned as needed, so those discard requests just ignored.

I indeed ran blkdiscard on the whole device.  So the question to the Ceph
list is: below what length is a discard ignored?  I saw at least one other
user post a similar issue with ESXi-SCST-RBD.

>
> For completely correct test you need to run blkdiscard for exactly the
> same areas, both
> start and size, as the ESXi UNMAP requests you are seeing on the SCST
> traces.

I am running a test with the debug settings you provided, and will keep
this thread updated with results.

Much appreciate the guidance.

Alex

>
> Vlad
>

--
Alex Gorbachev
Storcium
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
One other experiment: just running blkdiscard against the RBD block device completely clears it, to the point where the rbd-diff method reports 0 blocks utilized. So to summarize: - ESXi sending UNMAP via SCST does not seem to release storage from RBD (BLOCKIO handler that is supposed to work with UNMAP) - blkdiscard does release the space -- Alex Gorbachev Storcium On Wed, Jul 27, 2016 at 11:55 AM, Alex Gorbachevwrote: > Hi Vlad, > > On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin wrote: >> Hi, >> >> I would suggest to rebuild SCST in the debug mode (after "make 2debug"), >> then before >> calling the unmap command enable "scsi" and "debug" logging for scst and >> scst_vdisk >> modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level; echo "add scsi" >>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level; echo "add debug" >>>/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level', then check, if for >>>the unmap >> command vdisk_unmap_range() is reporting running blkdev_issue_discard() in >> the kernel >> logs. >> >> To double check, you might also add trace statement just before >> blkdev_issue_discard() >> in vdisk_unmap_range(). > > With the debug settings on, I am seeing the below output - this means > that discard is being sent to the backing (RBD) device, correct? > > Including the ceph-users list to see if there is a reason RBD is not > processing this discard/unmap. 
> > Thank you, > -- > Alex Gorbachev > Storcium > > Jul 26 08:23:38 e1 kernel: [ 858.324715] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.324740] [20426]: > vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.324743] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.336218] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.336232] [20426]: > vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.336234] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.351446] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.351468] [20426]: > vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.351471] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192) > Jul 26 08:23:38 e1 kernel: [ 858.373407] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:23:38 e1 kernel: [ 858.373422] [20426]: > vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0, > data_len 24 > Jul 26 08:23:38 e1 kernel: [ 858.373424] [20426]: > vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192) > > Jul 26 08:24:04 e1 kernel: [ 884.170201] [6290]: scst_cmd_init_done:829:CDB: > Jul 26 08:24:04 e1 kernel: [ 884.170202] > (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F > Jul 26 08:24:04 e1 kernel: [ 884.170205]0: 42 00 00 00 00 00 00 > 00 18 00 00 00 
00 00 00 00 B... > Jul 26 08:24:04 e1 kernel: [ 884.170268] [6290]: scst: > scst_parse_cmd:1312:op_name (cmd 88201b556300), > direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, > out_bufflen=0, (expected len data 24, expected len DIF 0, out expected > len 0), flags=0x80260, internal 0, naca 0 > Jul 26 08:24:04 e1 kernel: [ 884.173983] [20426]: scst: > scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0, > host_status 0, driver_status 0, resp_data_len 0 > Jul 26 08:24:04 e1 kernel: [ 884.173998] [20426]: > vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0, > data_len 24 > Jul 26 08:24:04 e1 kernel: [ 884.174001] [20426]: > vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192) > Jul 26 08:24:04 e1 kernel: [ 884.174224] [6290]: scst: > scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator > iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1, > queue_type 1, tag 4005936 (cmd 88201b5565c0, sess > 880ffa2c) > Jul 26 08:24:04 e1 kernel: [ 884.174227] [6290]: scst_cmd_init_done:829:CDB: > Jul 26 08:24:04 e1 kernel: [ 884.174228] > (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F > Jul 26 08:24:04 e1 kernel: [ 884.174231]0: 42 00 00 00 00 00 00 > 00 18 00 00 00 00 00 00 00 B... > Jul 26 08:24:04 e1 kernel: [ 884.174256] [6290]: scst: > scst_parse_cmd:1312:op_name (cmd 88201b5565c0), > direction=1 (expected 1, set yes),
Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's
Hi Vlad,

On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin wrote:
> Hi,
>
> I would suggest to rebuild SCST in the debug mode (after "make 2debug"), then before
> calling the unmap command enable "scsi" and "debug" logging for scst and scst_vdisk
> modules by 'echo add scsi >/sys/kernel/scst_tgt/trace_level;
> echo "add scsi" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level;
> echo "add debug" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level',
> then check, if for the unmap command vdisk_unmap_range() is reporting running
> blkdev_issue_discard() in the kernel logs.
>
> To double check, you might also add trace statement just before blkdev_issue_discard()
> in vdisk_unmap_range().

With the debug settings on, I am seeing the below output - this means that discard is being sent to the backing (RBD) device, correct?

Including the ceph-users list to see if there is a reason RBD is not processing this discard/unmap.

Thank you,
--
Alex Gorbachev
Storcium

Jul 26 08:23:38 e1 kernel: [ 858.324715] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [ 858.324740] [20426]: vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0, data_len 24
Jul 26 08:23:38 e1 kernel: [ 858.324743] [20426]: vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [ 858.336218] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [ 858.336232] [20426]: vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0, data_len 24
Jul 26 08:23:38 e1 kernel: [ 858.336234] [20426]: vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [ 858.351446] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [ 858.351468] [20426]: vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0, data_len 24
Jul 26 08:23:38 e1 kernel: [ 858.351471] [20426]: vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [ 858.373407] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [ 858.373422] [20426]: vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0, data_len 24
Jul 26 08:23:38 e1 kernel: [ 858.373424] [20426]: vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192)
Jul 26 08:24:04 e1 kernel: [ 884.170201] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [ 884.170202] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [ 884.170205]0: 42 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 B...
Jul 26 08:24:04 e1 kernel: [ 884.170268] [6290]: scst: scst_parse_cmd:1312:op_name (cmd 88201b556300), direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, out_bufflen=0, (expected len data 24, expected len DIF 0, out expected len 0), flags=0x80260, internal 0, naca 0
Jul 26 08:24:04 e1 kernel: [ 884.173983] [20426]: scst: scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0, host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:24:04 e1 kernel: [ 884.173998] [20426]: vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0, data_len 24
Jul 26 08:24:04 e1 kernel: [ 884.174001] [20426]: vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192)
Jul 26 08:24:04 e1 kernel: [ 884.174224] [6290]: scst: scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1, queue_type 1, tag 4005936 (cmd 88201b5565c0, sess 880ffa2c)
Jul 26 08:24:04 e1 kernel: [ 884.174227] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [ 884.174228] (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [ 884.174231]0: 42 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 B...
Jul 26 08:24:04 e1 kernel: [ 884.174256] [6290]: scst: scst_parse_cmd:1312:op_name (cmd 88201b5565c0), direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24, out_bufflen=0, (expected len data 24, expected len DIF 0, out expected len 0), flags=0x80260, internal 0, naca 0

>
> Alex Gorbachev wrote on 07/23/2016 08:48 PM:
>> Hi Nick, Vlad, SCST Team,
>>
> I have been looking at using the rbd-nbd tool, so that the caching is provided by librbd and then use BLOCKIO with SCST. This will however need some work on the SCST resource agents to ensure the librbd cache is invalidated on ALUA state change.
>
> The other thing I have seen is this
>
> https://lwn.net/Articles/691871/
>
> Which may mean FILEIO will