Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-18 Thread Alex Gorbachev
On Sat, Aug 13, 2016 at 4:51 PM, Alex Gorbachev  
wrote:
> On Sat, Aug 13, 2016 at 12:36 PM, Alex Gorbachev  
> wrote:
>> On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov  wrote:
>>> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev  
>>> wrote:
> I'm confused.  How can a 4M discard not free anything?  It's either
> going to hit an entire object or two adjacent objects, truncating the
> tail of one and zeroing the head of another.  Using rbd diff:
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  4194304 data
> 29360128  4194304 data
>
> # a 4M discard at 1M into a RADOS object
> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  1048576 data
> 29360128  4194304 data

 I have tested this on a small RBD device with such offsets and indeed,
 the discard works as you describe, Ilya.

 Looking more into why ESXi's discard is not working.  I found this
 message in kern.log on Ubuntu on creation of the SCST LUN, which shows
 unmap_alignment 0:

 Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
 Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
 provisioning for device /dev/rbd/spin1/unmap1t
 Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
 unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
 Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
 target virtual disk p_iSCSILun_sclun945
 (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
 nblocks=838860800, cyln=409600)
 Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
 scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
 Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
 scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
 d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
 initiator copy_manager_sess)

 even though:

 root@e1:/sys/block/rbd29# cat discard_alignment
 4194304

 So somehow the discard_alignment is not making it into the LUN.  Could
 this be the issue?
>>>
>>> No, if you are not seeing *any* effect, the alignment is pretty much
>>> irrelevant.  Can you do the following on a small test image?
>>>
>>> - capture "rbd diff" output
>>> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
>>> - issue a few discards with blkdiscard
>>> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
>>> - capture "rbd diff" output again
>>>
>>> and attach all of the above?  (You might need to install a blktrace
>>> package.)
>>>
>>
>> Latest results from VMWare validation tests:
>>
>> Each test creates and deletes a virtual disk, then calls ESXi unmap
>> for what ESXi maps to that volume:
>>
>> Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829
>>
>> Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837
>>
>> Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824
>>
>> Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837
>>
>> Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837
>>
>> At the end, the compounded used size via rbd diff is 608 GB from 775GB
>> of data.  So we release only about 20% via discards in the end.
>
> Ilya has analyzed the discard pattern, and indeed the problem is that
> ESXi appears to disregard the discard alignment attribute.  Therefore,
> discards are shifted by 1M, and are not hitting the tail of objects.
>
> Discards work much better on the EagerZeroedThick volumes, likely due
> to contiguous data.
>
> I will proceed with the rest of testing, and will post any tips or
> best practice results as they become available.
>
> Thank you for everyone's help and advice!

Testing completed - the discards definitely follow the alignment pattern:

- 4MB objects and VMFS5: only partial reclaim, because the 1MB-aligned
discards rarely hit the tail of a 4MB object

- 1MB objects: practically 100% space reclaim

I have not tried shifting the VMFS5 filesystem, as the test will not
work with that, and I am not sure how to properly incorporate such an
offset into routine VMware operation.  So, as a best practice:

If you want efficient ESXi space reclaim with RBD and VMFS5, use a 1 MB
object size in Ceph.
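
For example (a sketch; the pool and image names are placeholders, and newer
rbd releases also accept --object-size 1M in place of --order 20):

$ rbd create --size 102400 --order 20 spin1/unmap1m   # 100 GB image, 2^20 = 1 MB objects
$ rbd info spin1/unmap1m                              # verify the reported object size/order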

Best regards,
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-13 Thread Alex Gorbachev
On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov  wrote:
> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev  
> wrote:
>>> I'm confused.  How can a 4M discard not free anything?  It's either
>>> going to hit an entire object or two adjacent objects, truncating the
>>> tail of one and zeroing the head of another.  Using rbd diff:
>>>
>>> $ rbd diff test | grep -A 1 25165824
>>> 25165824  4194304 data
>>> 29360128  4194304 data
>>>
>>> # a 4M discard at 1M into a RADOS object
>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>>
>>> $ rbd diff test | grep -A 1 25165824
>>> 25165824  1048576 data
>>> 29360128  4194304 data
>>
>> I have tested this on a small RBD device with such offsets and indeed,
>> the discard works as you describe, Ilya.
>>
>> Looking more into why ESXi's discard is not working.  I found this
>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
>> unmap_alignment 0:
>>
>> Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
>> Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
>> provisioning for device /dev/rbd/spin1/unmap1t
>> Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
>> Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
>> target virtual disk p_iSCSILun_sclun945
>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
>> nblocks=838860800, cyln=409600)
>> Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
>> Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
>> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
>> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
>> initiator copy_manager_sess)
>>
>> even though:
>>
>> root@e1:/sys/block/rbd29# cat discard_alignment
>> 4194304
>>
>> So somehow the discard_alignment is not making it into the LUN.  Could
>> this be the issue?
>
> No, if you are not seeing *any* effect, the alignment is pretty much
> irrelevant.  Can you do the following on a small test image?
>
> - capture "rbd diff" output
> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
> - issue a few discards with blkdiscard
> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
> - capture "rbd diff" output again
>
> and attach all of the above?  (You might need to install a blktrace
> package.)
>

Latest results from the VMware validation tests:

Each test creates and deletes a virtual disk, then issues an ESXi unmap
for the space ESXi maps to that volume:

Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829

Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837

Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824

Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837

Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837

At the end, the cumulative used size via rbd diff is 608 GB out of 775 GB
of data written, so the discards released only about 20% (roughly 167 GB).

Thank you,
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-08 Thread Ilya Dryomov
On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev  wrote:
>> I'm confused.  How can a 4M discard not free anything?  It's either
>> going to hit an entire object or two adjacent objects, truncating the
>> tail of one and zeroing the head of another.  Using rbd diff:
>>
>> $ rbd diff test | grep -A 1 25165824
>> 25165824  4194304 data
>> 29360128  4194304 data
>>
>> # a 4M discard at 1M into a RADOS object
>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>
>> $ rbd diff test | grep -A 1 25165824
>> 25165824  1048576 data
>> 29360128  4194304 data
>
> I have tested this on a small RBD device with such offsets and indeed,
> the discard works as you describe, Ilya.
>
> Looking more into why ESXi's discard is not working.  I found this
> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
> unmap_alignment 0:
>
> Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
> Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
> provisioning for device /dev/rbd/spin1/unmap1t
> Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
> Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
> target virtual disk p_iSCSILun_sclun945
> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
> nblocks=838860800, cyln=409600)
> Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
> Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
> initiator copy_manager_sess)
>
> even though:
>
> root@e1:/sys/block/rbd29# cat discard_alignment
> 4194304
>
> So somehow the discard_alignment is not making it into the LUN.  Could
> this be the issue?

No, if you are not seeing *any* effect, the alignment is pretty much
irrelevant.  Can you do the following on a small test image?

- capture "rbd diff" output
- blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
- issue a few discards with blkdiscard
- issue a few unmaps with ESXi, preferably with SCST debugging enabled
- capture "rbd diff" output again

and attach all of the above?  (You might need to install a blktrace
package.)
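
A concrete sketch of that capture sequence (spin1/testimg is a placeholder
image name):

# terminal 1: trace block I/O on the RBD device until interrupted
$ blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace

# terminal 2: record state, issue the discards/unmaps, record state again
$ rbd diff spin1/testimg > before.diff
$ blkdiscard -o $(( (4 << 20) - 512 )) -l 512 /dev/rbd0
  (issue the ESXi unmaps here)
$ rbd diff spin1/testimg > after.diff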

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-07 Thread Alex Gorbachev
On Friday, August 5, 2016, matthew patton  wrote:

> > - ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free
> anything
>
> the proper solution here is to:
> * quit worrying about it and buy sufficient disk in the first place, it's
> not exactly expensive


I would do that for one or a couple of environments, or if I sold drives :).
However, the two use cases that are of importance to my group still
warrant figuring this out:

- Large medical image collections or frequently modified database files
(quite a few deletes and creates)

- Passing VMware certification.  It means a lot to people without deep
storage knowledge when deciding whether to adopt a technology.


> * ask VMware to have the decency to add a flag to vmkfstools to specify
> the offset


Preaching to the choir!  I will ask.  Hope someone will listen.

>
> * create a small dummy VMFS on the block device that allows you to create
> a second filesystem behind it that's aligned on a 4MB boundary. Or perhaps
> simpler, create a thick-zeroed VMDK (3+minimum size + extra) on the VMFS
> such that the next VMDK created falls on the desired boundary.


I wonder how to do this for the test, or use a small partition like Vlad
described.  I will try that with one of their unmap tests.

>
> * use NFS like *deity* intended like any other sane person, nobody uses
> block storage anymore for precisely these kinds of reasons.
>

Working in that direction too.  A bit concerned about the double writes of
the backing filesystem, then double writes for RADOS.  Per Nick, this still
works better than block.  But having gone through 95% of certification for
block, I feel like I should finish it before jumping on to the next thing.

Thank you for your input, it is very practical and helpful long term.

Alex

>
>


-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-07 Thread Alex Gorbachev
> I'm confused.  How can a 4M discard not free anything?  It's either
> going to hit an entire object or two adjacent objects, truncating the
> tail of one and zeroing the head of another.  Using rbd diff:
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  4194304 data
> 29360128  4194304 data
>
> # a 4M discard at 1M into a RADOS object
> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>
> $ rbd diff test | grep -A 1 25165824
> 25165824  1048576 data
> 29360128  4194304 data

I have tested this on a small RBD device with such offsets and indeed,
the discard works as you describe, Ilya.

Looking more into why ESXi's discard is not working.  I found this
message in kern.log on Ubuntu on creation of the SCST LUN, which shows
unmap_alignment 0:

Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
provisioning for device /dev/rbd/spin1/unmap1t
Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
target virtual disk p_iSCSILun_sclun945
(file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
nblocks=838860800, cyln=409600)
Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
initiator copy_manager_sess)

even though:

root@e1:/sys/block/rbd29# cat discard_alignment
4194304

So somehow the discard_alignment is not making it into the LUN.  Could
this be the issue?

Thanks,
Alex



>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-06 Thread Ilya Dryomov
On Sat, Aug 6, 2016 at 1:10 AM, Alex Gorbachev  wrote:
> Is there a way to perhaps increase the discard granularity?  The way I see
> it based on the discussion so far, here is why discard/unmap is failing to
> work with VMWare:
>
> - RBD provides space in 4MB objects, which must be discarded entirely, or at
> least have their tails hit
>
> - SCST communicates to ESXi that discard alignment is 4MB and discard
> granularity is also 4MB
>
> - ESXi's VMFS5 is aligned on 1MB, so 4MB discards never actually free
> anything
>
> What if it were possible to make the discard granularity 6MB?

I'm confused.  How can a 4M discard not free anything?  It's either
going to hit an entire object or two adjacent objects, truncating the
tail of one and zeroing the head of another.  Using rbd diff:

$ rbd diff test | grep -A 1 25165824
25165824  4194304 data
29360128  4194304 data

# a 4M discard at 1M into a RADOS object
$ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0

$ rbd diff test | grep -A 1 25165824
25165824  1048576 data
29360128  4194304 data
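
To spell out the arithmetic with 4194304-byte objects: 25165824 is
object-aligned (4194304 x 6 = 25165824), so the discard covers the last 3M
of object 6 (truncated, leaving the 1048576 bytes shown) and the first 1M
of object 7 (zeroed, so its 4194304 bytes still appear in the diff):

$ echo $(( 25165824 % 4194304 ))      # 0 -> object boundary
$ echo $(( 25165824 + (1 << 20) ))    # 26214400, discard start
$ echo $(( 26214400 + (4 << 20) ))    # 30408704, discard end (1M into object 7)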

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-04 Thread Alex Gorbachev
On Wed, Aug 3, 2016 at 10:54 AM, Alex Gorbachev  
wrote:
> On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev  
> wrote:
>> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin  wrote:
>>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
 On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
> wrote:
>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  
>> wrote:
>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
 Hi Ilya,

 On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  
 wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev 
>  wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
 
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
> discard range   | command
> -
> whole object| delete
> object's tail   | truncate
> object's head   | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> OffsetLength  Type
> 0 4194304 data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> OffsetLength  Type
> 0 4193792 data

 Looks like ESXi is sending in each discard/unmap with the fixed
 granularity of 8192 sectors, which is passed verbatim by SCST.  There
 is a slight reduction in size via rbd diff method, but now I
 understand that actual truncate only takes effect when the discard
 happens to clip the tail of an image.

 So far looking at
 https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513

 ...the only variable we can control is the count of 8192-sector chunks
 and not their size.  Which means that most of the ESXi discard
 commands will be disregarded by Ceph.

 Vlad, is 8192 sectors coming from ESXi, as in the debug:

 Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>>> enforced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here 
>>> correct granularity and alignment (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

 Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
 discard_alignment in /sys/block//queue, but for RBD it's in
 /sys/block/ - could this be the source of the issue?
>>>
>>> No. As you can see below, the alignment reported correctly. So, this must 
>>> be VMware
>>> issue, because it is ignoring the alignment parameter. You can try to align 
>>> your VMware
>>> partition on 4M boundary, it might help.
>>
>> Is this not a mismatch:
>>
>> - From sg_inq: Unmap granularity alignment: 8192
>>
>> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>>
>> I am compiling the latest SCST trunk now.
>
> Scratch that, please, I just did a test that shows correct calculation
> of 4MB in sectors.
>
> - On iSCSI client node:
>
> dd if=/dev/urandom of=/dev/sdf bs=1M count=800
> blkdiscard -o 0 -l 4194304 /dev/sdf
>
> - On iSCSI server node:
>
> Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
> vdisk_unmap_range:3832:Discarding (start_sector 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-03 Thread Alex Gorbachev
On Wed, Aug 3, 2016 at 9:59 AM, Alex Gorbachev  wrote:
> On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin  wrote:
>> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
 On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
 wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  
> wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
 On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev 
  wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
>>> 
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 782336 KB

 Think about it in terms of underlying RADOS objects (4M by default).
 There are three cases:

 discard range   | command
 -
 whole object| delete
 object's tail   | truncate
 object's head   | zero

 Obviously, only delete and truncate free up space.  In all of your
 examples, except the last one, you are attempting to discard the head
 of the (first) object.

 You can free up as little as a sector, as long as it's the tail:

 OffsetLength  Type
 0 4194304 data

 # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

 OffsetLength  Type
 0 4193792 data
>>>
>>> Looks like ESXi is sending in each discard/unmap with the fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via rbd diff method, but now I
>>> understand that actual truncate only takes effect when the discard
>>> happens to clip the tail of an image.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>> 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>> enforced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here 
>> correct granularity and alignment (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
> value.
>
> Can discard_alignment be specified with RBD?

 It's exported as a read-only sysfs attribute, just like
 discard_granularity:

 # cat /sys/block/rbd0/discard_alignment
 4194304
>>>
>>> Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
>>> discard_alignment in /sys/block//queue, but for RBD it's in
>>> /sys/block/ - could this be the source of the issue?
>>
>> No. As you can see below, the alignment reported correctly. So, this must be 
>> VMware
>> issue, because it is ignoring the alignment parameter. You can try to align 
>> your VMware
>> partition on 4M boundary, it might help.
>
> Is this not a mismatch:
>
> - From sg_inq: Unmap granularity alignment: 8192
>
> - From "cat /sys/block/rbd0/discard_alignment": 4194304
>
> I am compiling the latest SCST trunk now.

Scratch that, please, I just did a test that shows correct calculation
of 4MB in sectors.

- On iSCSI client node:

dd if=/dev/urandom of=/dev/sdf bs=1M count=800
blkdiscard -o 0 -l 4194304 /dev/sdf

- On iSCSI server node:

Aug  3 10:50:57 e1 kernel: [  893.444538] [1381]:
vdisk_unmap_range:3832:Discarding (start_sector 0, nr_sects 8192)

(8192 * 512 = 4194304)

Now proceeding to test discard again with the latest SCST trunk build.


>
> Thanks,
> Alex
>
>>
>>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>>
>>> 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-03 Thread Alex Gorbachev
On Tue, Aug 2, 2016 at 10:49 PM, Vladislav Bolkhovitin  wrote:
> Alex Gorbachev wrote on 08/02/2016 07:56 AM:
>> On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
>>> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
>>> wrote:
 On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  
 wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev 
>>>  wrote:
 RBD illustration showing RBD ignoring discard until a certain
 threshold - why is that?  This behavior is unfortunately incompatible
 with ESXi discard (UNMAP) behavior.

 Is there a way to lower the discard sensitivity on RBD devices?

>> 

 root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 819200 KB

 root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>> discard range   | command
>>> -
>>> whole object| delete
>>> object's tail   | truncate
>>> object's head   | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> OffsetLength  Type
>>> 0 4194304 data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> OffsetLength  Type
>>> 0 4193792 data
>>
>> Looks like ESXi is sending in each discard/unmap with the fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via rbd diff method, but now I
>> understand that actual truncate only takes effect when the discard
>> happens to clip the tail of an image.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>
>> ...the only variable we can control is the count of 8192-sector chunks
>> and not their size.  Which means that most of the ESXi discard
>> commands will be disregarded by Ceph.
>>
>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>> 1342099456, nr_sects 8192)
>
> Yes, correct. However, to make sure that VMware is not (erroneously) 
> enforced to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here 
> correct granularity and alignment (4M, I guess?)

 This seems to reflect the granularity (4194304), which matches the
 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
 value.

 Can discard_alignment be specified with RBD?
>>>
>>> It's exported as a read-only sysfs attribute, just like
>>> discard_granularity:
>>>
>>> # cat /sys/block/rbd0/discard_alignment
>>> 4194304
>>
>> Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
>> discard_alignment in /sys/block//queue, but for RBD it's in
>> /sys/block/ - could this be the source of the issue?
>
> No. As you can see below, the alignment reported correctly. So, this must be 
> VMware
> issue, because it is ignoring the alignment parameter. You can try to align 
> your VMware
> partition on 4M boundary, it might help.

Is this not a mismatch:

- From sg_inq: Unmap granularity alignment: 8192

- From "cat /sys/block/rbd0/discard_alignment": 4194304

I am compiling the latest SCST trunk now.

Thanks,
Alex

>
>> Here is what I get querying the iscsi-exported RBD device on Linux:
>>
>> root@kio1:/sys/block/sdf#  sg_inq -p 0xB0 /dev/sdf
>> VPD INQUIRY: Block limits page (SBC)
>>   Maximum compare and write length: 255 blocks
>>   Optimal transfer length granularity: 8 blocks
>>   Maximum transfer length: 16384 blocks
>>   Optimal transfer length: 1024 blocks
>>   Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
>>   Maximum unmap LBA count: 8192
>>   Maximum unmap block descriptor count: 4294967295
>>   Optimal unmap granularity: 8192
>>   Unmap granularity alignment valid: 1
>>   Unmap granularity alignment: 8192
>
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ric Wheeler

On 08/02/2016 07:26 PM, Ilya Dryomov wrote:

>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304


Note that this is the standard way Linux exports discard alignment for *any*
kind of storage, so it is worth using :)


Ric


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Alex Gorbachev
On Tue, Aug 2, 2016 at 9:56 AM, Ilya Dryomov  wrote:
> On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  
> wrote:
>> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
>>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
 Hi Ilya,

 On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
> wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
 
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
> discard range   | command
> -
> whole object| delete
> object's tail   | truncate
> object's head   | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> OffsetLength  Type
> 0 4194304 data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> OffsetLength  Type
> 0 4193792 data

 Looks like ESXi is sending in each discard/unmap with the fixed
 granularity of 8192 sectors, which is passed verbatim by SCST.  There
 is a slight reduction in size via rbd diff method, but now I
 understand that actual truncate only takes effect when the discard
 happens to clip the tail of an image.

 So far looking at
 https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513

 ...the only variable we can control is the count of 8192-sector chunks
 and not their size.  Which means that most of the ESXi discard
 commands will be disregarded by Ceph.

 Vlad, is 8192 sectors coming from ESXi, as in the debug:

 Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
 1342099456, nr_sects 8192)
>>>
>>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>>> enforced to do this, you need to perform one more check.
>>>
>>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
>>> granularity and alignment (4M, I guess?)
>>
>> This seems to reflect the granularity (4194304), which matches the
>> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
>> value.
>>
>> Can discard_alignment be specified with RBD?
>
> It's exported as a read-only sysfs attribute, just like
> discard_granularity:
>
> # cat /sys/block/rbd0/discard_alignment
> 4194304

Ah thanks Ilya, it is indeed there.  Vlad, your email says to look for
discard_alignment in /sys/block//queue, but for RBD it's in
/sys/block/ - could this be the source of the issue?

Here is what I get querying the iscsi-exported RBD device on Linux:

root@kio1:/sys/block/sdf#  sg_inq -p 0xB0 /dev/sdf
VPD INQUIRY: Block limits page (SBC)
  Maximum compare and write length: 255 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 16384 blocks
  Optimal transfer length: 1024 blocks
  Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 4294967295
  Optimal unmap granularity: 8192
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 8192
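
(Both values are in fact consistent: the SBC block limits page reports its
values in 512-byte logical blocks, so "Optimal unmap granularity: 8192" and
"Unmap granularity alignment: 8192" are each 8192 x 512 = 4194304 bytes,
the same 4M that the rbd sysfs attributes report.)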


>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ilya Dryomov
On Tue, Aug 2, 2016 at 3:49 PM, Alex Gorbachev  wrote:
> On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
>> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>>> Hi Ilya,
>>>
>>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
 On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
 wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
>>> 
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 782336 KB

 Think about it in terms of underlying RADOS objects (4M by default).
 There are three cases:

 discard range   | command
 -
 whole object| delete
 object's tail   | truncate
 object's head   | zero

 Obviously, only delete and truncate free up space.  In all of your
 examples, except the last one, you are attempting to discard the head
 of the (first) object.

 You can free up as little as a sector, as long as it's the tail:

 OffsetLength  Type
 0 4194304 data

 # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

 OffsetLength  Type
 0 4193792 data
>>>
>>> Looks like ESXi is sending in each discard/unmap with the fixed
>>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>>> is a slight reduction in size via rbd diff method, but now I
>>> understand that actual truncate only takes effect when the discard
>>> happens to clip the tail of an image.
>>>
>>> So far looking at
>>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>>
>>> ...the only variable we can control is the count of 8192-sector chunks
>>> and not their size.  Which means that most of the ESXi discard
>>> commands will be disregarded by Ceph.
>>>
>>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>>
>>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>>> 1342099456, nr_sects 8192)
>>
>> Yes, correct. However, to make sure that VMware is not (erroneously) 
>> enforced to do this, you need to perform one more check.
>>
>> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
>> granularity and alignment (4M, I guess?)
>
> This seems to reflect the granularity (4194304), which matches the
> 8192 pages (8192 x 512 = 4194304).  However, there is no alignment
> value.
>
> Can discard_alignment be specified with RBD?

It's exported as a read-only sysfs attribute, just like
discard_granularity:

# cat /sys/block/rbd0/discard_alignment
4194304

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Alex Gorbachev
On Mon, Aug 1, 2016 at 11:03 PM, Vladislav Bolkhovitin  wrote:
> Alex Gorbachev wrote on 08/01/2016 04:05 PM:
>> Hi Ilya,
>>
>> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
>>> wrote:
 RBD illustration showing RBD ignoring discard until a certain
 threshold - why is that?  This behavior is unfortunately incompatible
 with ESXi discard (UNMAP) behavior.

 Is there a way to lower the discard sensitivity on RBD devices?

>> 

 root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 819200 KB

 root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
 root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
 print SUM/1024 " KB" }'
 782336 KB
>>>
>>> Think about it in terms of underlying RADOS objects (4M by default).
>>> There are three cases:
>>>
>>> discard range   | command
>>> -
>>> whole object| delete
>>> object's tail   | truncate
>>> object's head   | zero
>>>
>>> Obviously, only delete and truncate free up space.  In all of your
>>> examples, except the last one, you are attempting to discard the head
>>> of the (first) object.
>>>
>>> You can free up as little as a sector, as long as it's the tail:
>>>
>>> OffsetLength  Type
>>> 0 4194304 data
>>>
>>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>>
>>> OffsetLength  Type
>>> 0 4193792 data
>>
>> Looks like ESXi is sending in each discard/unmap with the fixed
>> granularity of 8192 sectors, which is passed verbatim by SCST.  There
>> is a slight reduction in size via rbd diff method, but now I
>> understand that actual truncate only takes effect when the discard
>> happens to clip the tail of an image.
>>
>> So far looking at
>> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>>
>> ...the only variable we can control is the count of 8192-sector chunks
>> and not their size.  Which means that most of the ESXi discard
>> commands will be disregarded by Ceph.
>>
>> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>>
>> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
>> 1342099456, nr_sects 8192)
>
> Yes, correct. However, to make sure that VMware is not (erroneously) enforced 
> to do this, you need to perform one more check.
>
> 1. Run cat /sys/block/rbd28/queue/discard*. Ceph should report here correct 
> granularity and alignment (4M, I guess?)

This seems to reflect the granularity (4194304), which matches the
8192 sectors (8192 x 512 = 4194304).  However, there is no alignment
value.

Can discard_alignment be specified with RBD?

>
> 2. Connect to this iSCSI device from a Linux box and run sg_inq -p 0xB0
> /dev/
>
> SCST should correctly report those values for unmap parameters (in blocks).
>
> If in both cases you see the same correct values, then this is a VMware
> issue, because it is ignoring what it is told to do (generating
> appropriately sized and aligned UNMAP requests). If either Ceph or SCST
> doesn't show correct numbers, then the broken party should be fixed.
>
> Vlad
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-02 Thread Ilya Dryomov
On Tue, Aug 2, 2016 at 1:05 AM, Alex Gorbachev  wrote:
> Hi Ilya,
>
> On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
>> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
>> wrote:
>>> RBD illustration showing RBD ignoring discard until a certain
>>> threshold - why is that?  This behavior is unfortunately incompatible
>>> with ESXi discard (UNMAP) behavior.
>>>
>>> Is there a way to lower the discard sensitivity on RBD devices?
>>>
> 
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>> print SUM/1024 " KB" }'
>>> 819200 KB
>>>
>>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>>> print SUM/1024 " KB" }'
>>> 782336 KB
>>
>> Think about it in terms of underlying RADOS objects (4M by default).
>> There are three cases:
>>
>> discard range   | command
>> -
>> whole object| delete
>> object's tail   | truncate
>> object's head   | zero
>>
>> Obviously, only delete and truncate free up space.  In all of your
>> examples, except the last one, you are attempting to discard the head
>> of the (first) object.
>>
>> You can free up as little as a sector, as long as it's the tail:
>>
>> OffsetLength  Type
>> 0 4194304 data
>>
>> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>>
>> OffsetLength  Type
>> 0 4193792 data
>
> Looks like ESXi is sending in each discard/unmap with the fixed
> granularity of 8192 sectors, which is passed verbatim by SCST.  There
> is a slight reduction in size via rbd diff method, but now I
> understand that actual truncate only takes effect when the discard
> happens to clip the tail of an image.

... the tail of the *object*.  And again, with "filestore punch hole
= true", page-sized discards anywhere within the image would free up
space, but "rbd diff" won't reflect that.

>
> So far looking at
> https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513
>
> ...the only variable we can control is the count of 8192-sector chunks
> and not their size.  Which means that most of the ESXi discard
> commands will be disregarded by Ceph.
>
> Vlad, is 8192 sectors coming from ESXi, as in the debug:
>
> Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
> 1342099456, nr_sects 8192)

They won't be disregarded, but it would definitely work better if they
were aligned.  1342099456 isn't 4M-aligned.
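
One way to check (512-byte sectors, 4M objects = 8192 sectors):

$ echo $(( (1342099456 % 8192) * 512 ))
2097152

i.e. that unmap starts 2M into a RADOS object, so only the 2M that lands on
the object's tail can be truncated; the 2M that lands on the head of the
next object is merely zeroed.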

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-01 Thread Alex Gorbachev
Hi Ilya,

On Mon, Aug 1, 2016 at 3:07 PM, Ilya Dryomov  wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
> wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>

>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
> discard range   | command
> -
> whole object| delete
> object's tail   | truncate
> object's head   | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> OffsetLength  Type
> 0 4194304 data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> OffsetLength  Type
> 0 4193792 data

Looks like ESXi is sending in each discard/unmap with the fixed
granularity of 8192 sectors, which is passed verbatim by SCST.  There
is a slight reduction in size via rbd diff method, but now I
understand that actual truncate only takes effect when the discard
happens to clip the tail of an image.

So far looking at
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US=displayKC=2057513

...the only variable we can control is the count of 8192-sector chunks
and not their size.  Which means that most of the ESXi discard
commands will be disregarded by Ceph.

Vlad, is 8192 sectors coming from ESXi, as in the debug:

Aug  1 19:01:36 e1 kernel: [168220.570332] Discarding (start_sector
1342099456, nr_sects 8192)

Thank you,
Alex

>
> Thanks,
>
> Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-01 Thread Ilya Dryomov
On Mon, Aug 1, 2016 at 9:07 PM, Ilya Dryomov  wrote:
> On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  
> wrote:
>> RBD illustration showing RBD ignoring discard until a certain
>> threshold - why is that?  This behavior is unfortunately incompatible
>> with ESXi discard (UNMAP) behavior.
>>
>> Is there a way to lower the discard sensitivity on RBD devices?
>>
>>
>>
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 819200 KB
>>
>> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
>> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
>> print SUM/1024 " KB" }'
>> 782336 KB
>
> Think about it in terms of underlying RADOS objects (4M by default).
> There are three cases:
>
> discard range   | command
> -
> whole object| delete
> object's tail   | truncate
> object's head   | zero
>
> Obviously, only delete and truncate free up space.  In all of your
> examples, except the last one, you are attempting to discard the head
> of the (first) object.
>
> You can free up as little as a sector, as long as it's the tail:
>
> OffsetLength  Type
> 0 4194304 data
>
> # blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28
>
> OffsetLength  Type
> 0 4193792 data

Just realized I've left out the most interesting bit.  You can make the
zero case punch holes too, but that's disabled by default in jewel.  The
option is "filestore punch hole = true".  Note that it won't be reflected
in "rbd diff" output.

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-01 Thread Ilya Dryomov
On Mon, Aug 1, 2016 at 7:55 PM, Alex Gorbachev  wrote:
> RBD illustration showing RBD ignoring discard until a certain
> threshold - why is that?  This behavior is unfortunately incompatible
> with ESXi discard (UNMAP) behavior.
>
> Is there a way to lower the discard sensitivity on RBD devices?
>
>
>
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 819200 KB
>
> root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
> root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
> print SUM/1024 " KB" }'
> 782336 KB

Think about it in terms of underlying RADOS objects (4M by default).
There are three cases:

discard range   | command
----------------+----------
whole object    | delete
object's tail   | truncate
object's head   | zero

Obviously, only delete and truncate free up space.  In all of your
examples, except the last one, you are attempting to discard the head
of the (first) object.

You can free up as little as a sector, as long as it's the tail:

OffsetLength  Type
0 4194304 data

# blkdiscard -o $(((4 << 20) - 512)) -l 512 /dev/rbd28

OffsetLength  Type
0 4193792 data
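
As a sketch, the same idea for an arbitrary object index N on a default
4M-object image (N is a placeholder shell variable):

$ N=6     # free the last 1M of object N
$ blkdiscard -o $(( (N + 1) * (4 << 20) - (1 << 20) )) -l $(( 1 << 20 )) /dev/rbd28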

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-01 Thread Alex Gorbachev
RBD illustration showing RBD ignoring discard until a certain
threshold - why is that?  This behavior is unfortunately incompatible
with ESXi discard (UNMAP) behavior.

Is there a way to lower the discard sensitivity on RBD devices?



root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 40960 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 409600 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 4096000 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
819200 KB

root@e1:/var/log# blkdiscard -o 0 -l 4096 /dev/rbd28
root@e1:/var/log# rbd diff spin1/testdis|awk '{ SUM += $2 } END {
print SUM/1024 " KB" }'
782336 KB
--
Alex Gorbachev
Storcium


On Sat, Jul 30, 2016 at 9:11 PM, Alex Gorbachev  
wrote:
>>
>> On Wednesday, July 27, 2016, Vladislav Bolkhovitin  wrote:
>>>
>>>
>>> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
>>> > One other experiment: just running blkdiscard against the RBD block
>>> > device completely clears it, to the point where the rbd-diff method
>>> > reports 0 blocks utilized.  So to summarize:
>>> >
>>> > - ESXi sending UNMAP via SCST does not seem to release storage from
>>> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
>>> >
>>> > - blkdiscard does release the space
>>>
>>> How did you run blkdiscard? It might be that blkdiscard discarded big
>>> areas, while ESXi was sending UNMAP commands for areas smaller than the
>>> minimum size that can be discarded, or not aligned as needed, so those
>>> discard requests were just ignored.
>
> Here is the output of the debug, many more of these statements before
> and after.  Is it correct to state then that SCST is indeed executing
> the discard and the RBD device is ignoring it (since the used size in
> ceph is not diminishing)?
>
> Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.242771] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570855424, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.256001] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570863616, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.259204] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 570871808, nr_sects
> 8192)
> Jul 30 21:08:46 e1 kernel: [ 3032.261368] [22016]:
> vdisk_unmap_range:3830:Discarding (start_sector 57088, 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-30 Thread Alex Gorbachev
>
> On Wednesday, July 27, 2016, Vladislav Bolkhovitin  wrote:
>>
>>
>> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
>> > One other experiment: just running blkdiscard against the RBD block
>> > device completely clears it, to the point where the rbd-diff method
>> > reports 0 blocks utilized.  So to summarize:
>> >
>> > - ESXi sending UNMAP via SCST does not seem to release storage from
>> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
>> >
>> > - blkdiscard does release the space
>>
>> How did you run blkdiscard? It might be that blkdiscard discarded big
>> areas, while ESXi was sending UNMAP commands for areas smaller than the
>> minimum size that can be discarded, or not aligned as needed, so those
>> discard requests were just ignored.

Here is the debug output; there are many more of these statements before
and after.  Is it correct to state, then, that SCST is indeed executing
the discard and the RBD device is ignoring it (since the used size in
Ceph is not diminishing)?

Jul 30 21:08:46 e1 kernel: [ 3032.199972] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570716160, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.202622] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570724352, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.207214] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570732544, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.210395] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570740736, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.212951] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570748928, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.216187] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570757120, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.219299] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570765312, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.222658] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570773504, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.225948] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570781696, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.230092] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570789888, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.234153] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570798080, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.238001] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570806272, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.240876] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570814464, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.242771] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570822656, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.244943] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570830848, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.247506] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570839040, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.250090] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570847232, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.253229] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570855424, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.256001] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570863616, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.259204] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570871808, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.261368] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570880000, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.264025] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570888192, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.266737] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570896384, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.270143] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570904576, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.273975] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570912768, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.278163] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570920960, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.282250] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570929152, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.285932] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570937344, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.289736] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570945536, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.292506] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570953728, nr_sects
8192)
Jul 30 21:08:46 e1 kernel: [ 3032.294706] [22016]:
vdisk_unmap_range:3830:Discarding (start_sector 570961920, nr_sects
8192)
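
For what it's worth, here is a quick way to check how these ranges line up
with RBD object boundaries (a rough sketch, assuming 512-byte logical
sectors and the default 4 MiB object size; adjust OBJ_SIZE if the image was
created with a different object order):

  OBJ_SIZE=$((4 << 20))
  check_range() {
      # $1 = start_sector, $2 = nr_sects, as printed in the SCST log
      off=$(( $1 * 512 )); len=$(( $2 * 512 ))
      echo "offset=$off len=$len misaligned_by=$(( off % OBJ_SIZE ))"
  }
  check_range 570716160 8192    # first range in the log above

Under those assumptions the first range above starts 2 MiB into an object,
so each 4 MiB discard straddles two objects rather than covering one.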


Thank you,
Alex

>
>
> I indeed ran blkdiscard on the whole device.  So the question to the Ceph
> list is below 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-30 Thread Alex Gorbachev
Hi Vlad,

On Wednesday, July 27, 2016, Vladislav Bolkhovitin  wrote:

>
> Alex Gorbachev wrote on 07/27/2016 10:33 AM:
> > One other experiment: just running blkdiscard against the RBD block
> > device completely clears it, to the point where the rbd-diff method
> > reports 0 blocks utilized.  So to summarize:
> >
> > - ESXi sending UNMAP via SCST does not seem to release storage from
> > RBD (BLOCKIO handler that is supposed to work with UNMAP)
> >
> > - blkdiscard does release the space
>
> How did you run blkdiscard? It might be that blkdiscard discarded big
> areas, while ESXi sent UNMAP commands for areas smaller than the minimum
> size that can be discarded, or not aligned as needed, so those discard
> requests were simply ignored.


I indeed ran blkdiscard on the whole device.  So the question to the Ceph
list is: below what length is a discard ignored?  I saw at least one other
user post a similar issue with ESXi-SCST-RBD.
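
One data point that may help narrow this down: the discard limits the
kernel reports for the mapped rbd device itself (a sketch; /dev/rbd0 is a
placeholder for the actual device):

  cat /sys/block/rbd0/discard_alignment
  cat /sys/block/rbd0/queue/discard_granularity
  cat /sys/block/rbd0/queue/discard_max_bytes

Comparing these with the granularity and alignment SCST advertises to ESXi
should show whether the initiator is being told the right minimums.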


>
> For completely correct test you need to run blkdiscard for exactly the
> same areas, both
> start and size, as the ESXi UNMAP requests you are seeing on the SCST
> traces.


I am running a test with the debug settings you provided, and will keep
this thread updated with results.  Much appreciate the guidance.
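
For the replay itself, one logged range can be reissued by hand with
something like this (a sketch only; it assumes 512-byte sectors, /dev/rbd0
stands in for the actual mapped device, and the two numbers are copied from
a single "Discarding (start_sector ..., nr_sects ...)" line):

  START_SECTOR=570716160
  NR_SECTS=8192
  blkdiscard -o $(( START_SECTOR * 512 )) -l $(( NR_SECTS * 512 )) /dev/rbd0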

Alex


>
> Vlad
>
>

--
Alex Gorbachev
Storcium


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-27 Thread Alex Gorbachev
One other experiment: just running blkdiscard against the RBD block
device completely clears it, to the point where the rbd-diff method
reports 0 blocks utilized.  So to summarize:

- ESXi sending UNMAP via SCST does not seem to release storage from
RBD (BLOCKIO handler that is supposed to work with UNMAP)

- blkdiscard does release the space
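
For reference, the test boils down to something like this (a sketch; the
pool/image and device names are placeholders):

  rbd diff rbd/test | awk '{ used += $2 } END { print used + 0, "bytes used" }'
  blkdiscard /dev/rbd0        # discard the entire mapped device
  rbd diff rbd/test | awk '{ used += $2 } END { print used + 0, "bytes used" }'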

--
Alex Gorbachev
Storcium


On Wed, Jul 27, 2016 at 11:55 AM, Alex Gorbachev  
wrote:
> Hi Vlad,
>
> On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin  wrote:
>> Hi,
>>
>> I would suggest to rebuild SCST in the debug mode (after "make 2debug"),
>> then before calling the unmap command enable "scsi" and "debug" logging
>> for the scst and scst_vdisk modules by
>> 'echo add scsi >/sys/kernel/scst_tgt/trace_level;
>> echo "add scsi" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level;
>> echo "add debug" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level',
>> then check if, for the unmap command, vdisk_unmap_range() reports running
>> blkdev_issue_discard() in the kernel logs.
>>
>> To double check, you might also add a trace statement just before
>> blkdev_issue_discard() in vdisk_unmap_range().
>
> With the debug settings on, I am seeing the below output - this means
> that discard is being sent to the backing (RBD) device, correct?
>
> Including the ceph-users list to see if there is a reason RBD is not
> processing this discard/unmap.
>
> Thank you,
> --
> Alex Gorbachev
> Storcium
>
> Jul 26 08:23:38 e1 kernel: [  858.324715] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.324740] [20426]:
> vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.324743] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.336218] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.336232] [20426]:
> vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.336234] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.351446] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.351468] [20426]:
> vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.351471] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192)
> Jul 26 08:23:38 e1 kernel: [  858.373407] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:23:38 e1 kernel: [  858.373422] [20426]:
> vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:23:38 e1 kernel: [  858.373424] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192)
>
> Jul 26 08:24:04 e1 kernel: [  884.170201] [6290]: scst_cmd_init_done:829:CDB:
> Jul 26 08:24:04 e1 kernel: [  884.170202]
> (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
> Jul 26 08:24:04 e1 kernel: [  884.170205]0: 42 00 00 00 00 00 00
> 00 18 00 00 00 00 00 00 00   B...
> Jul 26 08:24:04 e1 kernel: [  884.170268] [6290]: scst:
> scst_parse_cmd:1312:op_name  (cmd 88201b556300),
> direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
> out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
> len 0), flags=0x80260, internal 0, naca 0
> Jul 26 08:24:04 e1 kernel: [  884.173983] [20426]: scst:
> scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0,
> host_status 0, driver_status 0, resp_data_len 0
> Jul 26 08:24:04 e1 kernel: [  884.173998] [20426]:
> vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0,
> data_len 24
> Jul 26 08:24:04 e1 kernel: [  884.174001] [20426]:
> vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192)
> Jul 26 08:24:04 e1 kernel: [  884.174224] [6290]: scst:
> scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator
> iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1,
> queue_type 1, tag 4005936 (cmd 88201b5565c0, sess
> 880ffa2c)
> Jul 26 08:24:04 e1 kernel: [  884.174227] [6290]: scst_cmd_init_done:829:CDB:
> Jul 26 08:24:04 e1 kernel: [  884.174228]
> (h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
> Jul 26 08:24:04 e1 kernel: [  884.174231]0: 42 00 00 00 00 00 00
> 00 18 00 00 00 00 00 00 00   B...
> Jul 26 08:24:04 e1 kernel: [  884.174256] [6290]: scst:
> scst_parse_cmd:1312:op_name  (cmd 88201b5565c0),
> direction=1 (expected 1, set yes), 

Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-07-27 Thread Alex Gorbachev
Hi Vlad,

On Mon, Jul 25, 2016 at 10:44 PM, Vladislav Bolkhovitin  wrote:
> Hi,
>
> I would suggest to rebuild SCST in the debug mode (after "make 2debug"),
> then before calling the unmap command enable "scsi" and "debug" logging
> for the scst and scst_vdisk modules by
> 'echo add scsi >/sys/kernel/scst_tgt/trace_level;
> echo "add scsi" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level;
> echo "add debug" >/sys/kernel/scst_tgt/handlers/vdisk_fileio/trace_level',
> then check if, for the unmap command, vdisk_unmap_range() reports running
> blkdev_issue_discard() in the kernel logs.
>
> To double check, you might also add a trace statement just before
> blkdev_issue_discard() in vdisk_unmap_range().
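
(Once those trace levels are set, the relevant messages can be followed
live with something along these lines - a sketch, assuming a util-linux
dmesg new enough to support -w:)

  dmesg -w | grep -E 'vdisk_unmap_range|blkdev_issue_discard'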

With the debug settings on, I am seeing the below output - this means
that discard is being sent to the backing (RBD) device, correct?

Including the ceph-users list to see if there is a reason RBD is not
processing this discard/unmap.

Thank you,
--
Alex Gorbachev
Storcium

Jul 26 08:23:38 e1 kernel: [  858.324715] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552940, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.324740] [20426]:
vdisk_parse_offset:2930:cmd 88201b552c00, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.324743] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61779968 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.336218] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552c00, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.336232] [20426]:
vdisk_parse_offset:2930:cmd 88201b552ec0, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.336234] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61788160 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.351446] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b552ec0, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.351468] [20426]:
vdisk_parse_offset:2930:cmd 88201b553180, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.351471] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61796352 (blocks 8192)
Jul 26 08:23:38 e1 kernel: [  858.373407] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b553180, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:23:38 e1 kernel: [  858.373422] [20426]:
vdisk_parse_offset:2930:cmd 88201b553440, lba_start 0, loff 0,
data_len 24
Jul 26 08:23:38 e1 kernel: [  858.373424] [20426]:
vdisk_unmap_range:3810:Unmapping lba 61804544 (blocks 8192)

Jul 26 08:24:04 e1 kernel: [  884.170201] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [  884.170202]
(h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [  884.170205]0: 42 00 00 00 00 00 00
00 18 00 00 00 00 00 00 00   B...
Jul 26 08:24:04 e1 kernel: [  884.170268] [6290]: scst:
scst_parse_cmd:1312:op_name  (cmd 88201b556300),
direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
len 0), flags=0x80260, internal 0, naca 0
Jul 26 08:24:04 e1 kernel: [  884.173983] [20426]: scst:
scst_cmd_done_local:2272:cmd 88201b556b40, status 0, msg_status 0,
host_status 0, driver_status 0, resp_data_len 0
Jul 26 08:24:04 e1 kernel: [  884.173998] [20426]:
vdisk_parse_offset:2930:cmd 88201b556e00, lba_start 0, loff 0,
data_len 24
Jul 26 08:24:04 e1 kernel: [  884.174001] [20426]:
vdisk_unmap_range:3810:Unmapping lba 74231808 (blocks 8192)
Jul 26 08:24:04 e1 kernel: [  884.174224] [6290]: scst:
scst_cmd_init_done:828:NEW CDB: len 16, lun 16, initiator
iqn.1995-05.com.vihl2.ibft, target iqn.2008-10.net.storcium:scst.1,
queue_type 1, tag 4005936 (cmd 88201b5565c0, sess
880ffa2c)
Jul 26 08:24:04 e1 kernel: [  884.174227] [6290]: scst_cmd_init_done:829:CDB:
Jul 26 08:24:04 e1 kernel: [  884.174228]
(h)___0__1__2__3__4__5__6__7__8__9__A__B__C__D__E__F
Jul 26 08:24:04 e1 kernel: [  884.174231]0: 42 00 00 00 00 00 00
00 18 00 00 00 00 00 00 00   B...
Jul 26 08:24:04 e1 kernel: [  884.174256] [6290]: scst:
scst_parse_cmd:1312:op_name  (cmd 88201b5565c0),
direction=1 (expected 1, set yes), lba=0, bufflen=24, data len 24,
out_bufflen=0, (expected len data 24, expected len DIF 0, out expected
len 0), flags=0x80260, internal 0, naca 0




>
> Alex Gorbachev wrote on 07/23/2016 08:48 PM:
>> Hi Nick, Vlad, SCST Team,
>>
> I have been looking at using the rbd-nbd tool, so that the caching is
> provided by librbd and then use BLOCKIO with SCST. This will however need
> some work on the SCST resource agents to ensure the librbd cache is
> invalidated on ALUA state change.
>
> The other thing I have seen is this
>
> https://lwn.net/Articles/691871/
>
> Which may mean FILEIO will