Re: problem with discard granularity in sd

2017-04-14, David Buckley
Hi Martin,

I had assumed that the VPD values for the LUNs were valid, and hadn't
even thought to check if they were standards-compliant.  Based on the
SBC descriptions, it is clear that you are correct and this is an
issue with the storage not reporting a proper max_ws_blocks.  At least
now I have enough details to demonstrate to NetApp that this is an
issue on their side, and also have the relevant details from SBC.

Thank you once again for your help in understanding the underlying
technical issues responsible for the WS UNMAP failures in my
environment.  I really appreciate you taking the time to respond,
especially as this didn't turn out to be an issue with kernel code at
all.

-David


Re: problem with discard granularity in sd

2017-04-13, Martin K. Petersen
David Buckley  writes:

David,

> The underlying issue seems to be that (per VPD page) the maximum
> supported unmap request is 8192 * 512 = 4194304 bytes, while the
> maximum write same request is 0x4000 * 512 = 8388608 bytes.  It
> appears both of these values are correct, and in the case where a WS
> UNMAP request larger than 8192 blocks is received the UNMAP is ignored
> and the result is effectively a WS ZERO request.

That's broken on the filer, then.

> If I'm correct in my understanding, then it seems like the root cause
> of the failures is that the current code assumes a disk will always
> support WS UNMAP requests up to max_ws_blocks and does not account for
> a case where max_unmap_blocks is smaller than max_ws_blocks.

SBC says:

"A MAXIMUM UNMAP LBA COUNT field set to a non-zero value indicates
the maximum number of LBAs that may be unmapped by an UNMAP
command."

"A MAXIMUM WRITE SAME LENGTH field set to a non-zero value indicates
the maximum number of contiguous logical blocks that the device
server allows to be unmapped or written in a single WRITE SAME
command."

I'm confident that our behavior is correct and that the problem is on
the storage side.

The good news is that the new discard code would alleviate your problem
in that (based on what the filer reported in your setup) we'd use UNMAP
to deallocate and WRITE SAME to zero.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-12, David Buckley
Hi Martin,

I have identified the key difference between the older working kernels
and newer ones; on older kernels the provisioning_mode is set to
'unmap', while on newer kernels it is 'writesame_16'.  This is due to
a change in sd_read_block_limits which prevents selection of
SD_LBP_UNMAP if the disk reports lbprz=1, which I'm assuming is the
desired behavior.
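
For reference, the selected mode can be read back from sysfs via the
provisioning_mode attribute; a minimal sketch in C (the H:C:T:L value
below is only an example, substitute the LUN in question):

/* Minimal sketch: read back the provisioning_mode sd selected for a LUN.
 * The H:C:T:L below is just an example. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/class/scsi_disk/2:0:0:1/provisioning_mode";
	char mode[32] = "";
	FILE *f = fopen(path, "r");

	if (!f || !fgets(mode, sizeof(mode), f)) {
		perror(path);
		return 1;
	}
	/* Prints "unmap" on the older kernels, "writesame_16" on the newer ones. */
	printf("%s", mode);
	fclose(f);
	return 0;
}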

As I mentioned previously, one difference between working and failing
kernels is that discard_max_bytes is double the size on the latter.
To determine whether the NetApp incorrectly handles WRITE SAME with
UNMAP in general or if the increase in discard_max_bytes was
responsible, I swapped max_ws_blocks with max_unmap_blocks for WS16
requests in sd_config_discard in my test kernel.  Testing with this
change showed UNMAP requests working as expected, indicating that the
discard_max_bytes change is directly responsible for the failing
requests.

The underlying issue seems to be that (per VPD page) the maximum
supported unmap request is 8192 * 512 = 4194304 bytes, while the
maximum write same request is 0x4000 * 512 = 8388608 bytes.  It
appears both of these values are correct, and in the case where a WS
UNMAP request larger than 8192 blocks is received the UNMAP is ignored
and the result is effectively a WS ZERO request.
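
For concreteness, a quick arithmetic check of those two limits (a
standalone C sketch; the numbers are the ones the filer reports in its
Block Limits VPD page):

/* Quick check of the two limits reported by the filer (numbers from the
 * Block Limits VPD page quoted elsewhere in this thread). */
#include <stdio.h>

int main(void)
{
	const unsigned int logical_block_size = 512;
	const unsigned int max_unmap_blocks   = 8192;   /* MAXIMUM UNMAP LBA COUNT */
	const unsigned int max_ws_blocks      = 0x4000; /* MAXIMUM WRITE SAME LENGTH */

	printf("UNMAP limit:      %u bytes\n", max_unmap_blocks * logical_block_size);
	printf("WRITE SAME limit: %u bytes\n", max_ws_blocks * logical_block_size);
	/* 4194304 vs 8388608: a WRITE SAME w/ UNMAP sized to the larger limit
	 * exceeds what this filer will deallocate, so it just writes zeroes. */
	return 0;
}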

If I'm correct in my understanding, then it seems like the root cause
of the failures is that the current code assumes a disk will always
support WS UNMAP requests up to max_ws_blocks and does not account for
a case where max_unmap_blocks is smaller than max_ws_blocks.

I would think something like this could be a reasonable fix:

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fe0f799..b0233c2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -643,6 +643,8 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	struct request_queue *q = sdkp->disk->queue;
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int max_blocks = 0;
+	unsigned int max_ws_unmap_blocks =
+		min_not_zero(sdkp->max_ws_blocks, sdkp->max_unmap_blocks);
 
 	q->limits.discard_zeroes_data = 0;
 
@@ -679,7 +681,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 		break;
 
 	case SD_LBP_WS16:
-		max_blocks = min_not_zero(sdkp->max_ws_blocks,
+		max_blocks = min_not_zero(max_ws_unmap_blocks,
 					  (u32)SD_MAX_WS16_BLOCKS);
 		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;


Thanks yet again for all your help.

-David

On Tue, Apr 11, 2017 at 6:55 PM, Martin K. Petersen wrote:
>
> David,
>
>> I am wondering if part of the issue is that in my use case, UNMAP and
>> WRITE SAME zeros result in very different results.  With thin
>> provisioned LUNs, UNMAP requests result in the blocks being freed and
>> thus reduce the actual size of the LUN allocation on disk.  If WRITE
>> SAME requests are used to zero the blocks, they remain allocated and
>> thus the real size of the LUN grows to match the allocated size
>> (effectively thick-provisioning the LUN).
>
> The filer explicitly reported support for WRITE SAME(10/16) with UNMAP.
> It seems very odd that it would then completely ignore the UNMAP bit and
> do a regular WRITE SAME.
>
> Are you running the latest firmware, btw?
>
> In any case, the changes I mentioned are now queued up for 4.12. But
> it'll obviously take a while for those to trickle into the
> distributions...
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-11, Martin K. Petersen

David,

> I am wondering if part of the issue is that in my use case, UNMAP and
> WRITE SAME zeros result in very different results.  With thin
> provisioned LUNs, UNMAP requests result in the blocks being freed and
> thus reduce the actual size of the LUN allocation on disk.  If WRITE
> SAME requests are used to zero the blocks, they remain allocated and
> thus the real size of the LUN grows to match the allocated size
> (effectively thick-provisioning the LUN).

The filer explicitly reported support for WRITE SAME(10/16) with UNMAP.
It seems very odd that it would then completely ignore the UNMAP bit and
do a regular WRITE SAME.

Are you running the latest firmware, btw?

In any case, the changes I mentioned are now queued up for 4.12. But
it'll obviously take a while for those to trickle into the
distributions...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-11, David Buckley
(re-sending as I somehow lost the subject on my previous reply)

Martin,

I'm rather surprised nobody else has reported this previously either,
especially as NetApp hadn't received any reports.  The only plausible
explanation I can think of is that EL 7 is still based on a 3.10 kernel,
so it is too old to be affected, and that is likely what most NetApp
customers are running.



I've been testing some additional kernels to try to narrow down the
affected versions, and the change in discard_granularity does not
correspond with discard failing (as you suggested).

With kernel 3.13, discard works as expected with
discard_granularity=4096 and discard_max_bytes=4194304.

With kernel 3.19, discard stops working completely, with
discard_granularity=4096 and discard_max_bytes=8388608.

Do you think the change in discard_max_bytes could be relevant?  If
that comes from the VPD data it seems odd it would change.

I am wondering if part of the issue is that in my use case, UNMAP and
WRITE SAME zeros result in very different results.  With thin
provisioned LUNs, UNMAP requests result in the blocks being freed and
thus reduce the actual size of the LUN allocation on disk.  If WRITE SAME
requests are used to zero the blocks, they remain allocated and thus the
real size of the LUN grows to match the allocated size (effectively
thick-provisioning the LUN).  This matches what I am
seeing on kernels with discard not working; running 'fstrim
/lun/mount' results in the LUN growing to its max size.
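
For reference, the test is essentially the FITRIM ioctl that fstrim
issues against the mount point; a minimal reproduction sketch
(/lun/mount is just the example mount point from this setup):

/* Minimal fstrim-style reproduction: issue FITRIM against a mounted
 * filesystem.  Equivalent to 'fstrim <mountpoint>'; /lun/mount is only the
 * example mount point used in this setup. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	const char *mnt = argc > 1 ? argv[1] : "/lun/mount";
	struct fstrim_range range = {
		.start  = 0,
		.len    = ULLONG_MAX,	/* trim the whole filesystem */
		.minlen = 0,
	};
	int fd = open(mnt, O_RDONLY);

	if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
		perror(mnt);
		return 1;
	}
	/* On success the kernel updates range.len to the bytes actually trimmed. */
	printf("%llu bytes trimmed\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}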

Thank you again for your help with this!

-David

On Thu, Apr 6, 2017 at 10:34 AM, Martin K. Petersen wrote:
> David Buckley  writes:
>
> David,
>
>> As I mentioned previously, I'm fairly certain that the issue I'm
>> seeing is due to the fact that while NetApp LUNs are presented as 512B
>> logical/4K physical disks for compatibility, they actually don't
>> support requests smaller than 4K (which makes sense as NetApp LUNs are
>> actually just files allocated on the 4K-block WAFL filesystem).
>
> That may be. But they should still deallocate all the whole 4K blocks
> described by an UNMAP request. Even if head and tail are not aligned.
>
>> Let me know if there's any additional information I can provide. This
>> has resulted in a 2-3x increase in raw disk requirements for some
>> workloads (unfortunately on SSD too), and I'd love to find a solution
>> that doesn't require rolling back to a 3.10 kernel.
>
> I just posted some patches yesterday that will address this (using WRITE
> SAME w/ UNMAP for block zeroing and allowing discards to be sent using
> the UNMAP command, honoring the granularity and alignment suggested by
> the device). That's 4.13 material, though.
>
> The enterprise distros have many customers using NetApp filers. I'm a
> bit puzzled why this is the first we hear of this...
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-06, Martin K. Petersen
David Buckley  writes:

David,

> As I mentioned previously, I'm fairly certain that the issue I'm
> seeing is due to the fact that while NetApp LUNs are presented as 512B
> logical/4K physical disks for compatibility, they actually don't
> support requests smaller than 4K (which makes sense as NetApp LUNs are
> actually just files allocated on the 4K-block WAFL filesystem).

That may be. But they should still deallocate all the whole 4K blocks
described by an UNMAP request. Even if head and tail are not aligned.

> Let me know if there's any additional information I can provide. This
> has resulted in a 2-3x increase in raw disk requirements for some
> workloads (unfortunately on SSD too), and I'd love to find a solution
> that doesn't require rolling back to a 3.10 kernel.

I just posted some patches yesterday that will address this (using WRITE
SAME w/ UNMAP for block zeroing and allowing discards to be sent using
the UNMAP command, honoring the granularity and alignment suggested by
the device). That's 4.13 material, though.

The enterprise distros have many customers using NetApp filers. I'm a
bit puzzled why this is the first we hear of this...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-05, David Buckley
Hi Martin,

I really appreciate the response. Here is the VPD page data you asked for:

Logical block provisioning VPD page (SBC):
  Unmap command supported (LBPU): 1
  Write same (16) with unmap bit supported (LBWS): 1
  Write same (10) with unmap bit supported (LBWS10): 1
  Logical block provisioning read zeros (LBPRZ): 1
  Anchored LBAs supported (ANC_SUP): 0
  Threshold exponent: 0
  Descriptor present (DP): 0
  Provisioning type: 2

Block limits VPD page (SBC):
  Write same no zero (WSNZ): 1
  Maximum compare and write length: 1 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 8388607 blocks
  Optimal transfer length: 128 blocks
  Maximum prefetch length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 64
  Optimal unmap granularity: 8
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 0
  Maximum write same length: 0x4000 blocks


As I mentioned previously, I'm fairly certain that the issue I'm
seeing is due to the fact that while NetApp LUNs are presented as 512B
logical/4K physical disks for compatibility, they actually don't
support requests smaller than 4K (which makes sense as NetApp LUNs are
actually just files allocated on the 4K-block WAFL filesystem).

If the optimal granularity values from VPD are respected (as is the
case with 3.10 kernels), the result is:

minimum_io_size     = 512 * 8 = 4096
    (logical_block_size * VPD optimal transfer length granularity)
discard_granularity = 512 * 8 = 4096
    (logical_block_size * VPD optimal unmap granularity)


With recent kernels the value of minimum_io_size is unchanged, but
discard_granularity is explicitly set to logical_block_size, which
results in unmap requests either being dropped or ignored.
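
Roughly, the derivation looks like this (a simplified standalone sketch
of the behavior described above, not the actual sd.c code):

/* Simplified sketch of how the queue limits end up derived from the VPD
 * values above -- illustrating the observed behavior, not the actual sd.c
 * logic. */
#include <stdio.h>

int main(void)
{
	const unsigned int logical_block_size = 512;
	const unsigned int opt_xfer_gran      = 8;	/* optimal transfer length granularity */
	const unsigned int opt_unmap_gran     = 8;	/* optimal unmap granularity */
	const int lbprz                       = 1;	/* LBPRZ bit from the LBP VPD page */
	const int recent_kernel               = 1;	/* 0 for the 3.10-era behavior */

	unsigned int minimum_io_size     = logical_block_size * opt_xfer_gran;
	unsigned int discard_granularity = (recent_kernel && lbprz)
		? logical_block_size			/* forced down to 512 */
		: logical_block_size * opt_unmap_gran;	/* 4096, as on 3.10 */

	printf("minimum_io_size     = %u\n", minimum_io_size);
	printf("discard_granularity = %u\n", discard_granularity);
	return 0;
}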

I understand that the VPD values from LUNs may be atypical compared to
physical disks, and I expect there are some major differences in how
unmap (and zeroing) is handled between physical disks and LUNs.  But
it is unfortunate that a solution for misaligned I/O would break
fully-aligned requests which worked flawlessly in previous kernels.

Let me know if there's any additional information I can provide. This
has resulted in a 2-3x increase in raw disk requirements for some
workloads (unfortunately on SSD too), and I'd love to find a solution
that doesn't require rolling back to a 3.10 kernel.

Thanks again!
-David



On Tue, Apr 4, 2017 at 5:12 PM, Martin K. Petersen wrote:
> David Buckley  writes:
>
> David,
>
>> They result in discard granularity being forced to logical block size
>> if the disk reports LBPRZ is enabled (which the NetApp LUNs do).
>
> Block zeroing and unmapping are currently sharing some plumbing and that
> has led to some compromises. In this case there is a bias towards
> ensuring data integrity for zeroing at the expense of not aligning
> unmap requests.
>
> Christoph has worked on separating those two functions. His code is
> currently under review.
>
>> I'm not sure of the implications of either of the NetApp-side changes,
>> though reporting 4K logical blocks seems like a potential option, as it
>> is supported in newer OSes at least.
>
> Yes, but it may break legacy applications that assume a 512-byte logical
> block size.
>
>> The sd change potentially would at least partially undo the patches
>> referenced above.  But it would seem that (assuming an aligned
>> filesystem with 4k blocks and minimum_io_size=4096) there is no
>> possibility of a partial block discard or advantage to sending the
>> discard requests in 512-byte blocks?
>
> The unmap granularity inside a device is often much, much bigger than
> 4K. So aligning to that probably won't make a difference. And it's
> imperative to filesystems that zeroing works at logical block size
> granularity.
>
> The expected behavior for a device is that it unmaps whichever full
> unmap granularity chunks are described by a received request. And then
> explicitly zeroes any partial chunks at the head and tail. So I am
> surprised you see no reclamation whatsoever.
>
> With the impending zero/unmap separation things might fare better. But
> I'd still like to understand the behavior you observe. Please provide
> the output of:
>
> sg_vpd -p lbpv /dev/sdN
> sg_vpd -p bl /dev/sdN
>
> for one of the LUNs and I'll take a look.
>
> Thanks!
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Re: problem with discard granularity in sd

2017-04-04, Martin K. Petersen
David Buckley  writes:

David,

> They result in discard granularity being forced to logical block size
> if the disk reports LBPRZ is enabled (which the NetApp LUNs do).

Block zeroing and unmapping are currently sharing some plumbing and that
has led to some compromises. In this case there is a bias towards
ensuring data integrity for zeroing at the expense of not aligning
unmap requests.

Christoph has worked on separating those two functions. His code is
currently under review.

> I'm not sure of the implications of either of the NetApp-side changes,
> though reporting 4K logical blocks seems like a potential option, as it
> is supported in newer OSes at least.

Yes, but it may break legacy applications that assume a 512-byte logical
block size.

> The sd change potentially would at least partially undo the patches
> referenced above.  But it would seem that (assuming an aligned
> filesystem with 4k blocks and minimum_io_size=4096) there is no
> possibility of a partial block discard or advantage to sending the
> discard requests in 512-byte blocks?

The unmap granularity inside a device is often much, much bigger than
4K. So aligning to that probably won't make a difference. And it's
imperative to filesystems that zeroing works at logical block size
granularity.

The expected behavior for a device is that it unmaps whichever full
unmap granularity chunks are described by a received request. And then
explicitly zeroes any partial chunks at the head and tail. So I am
surprised you see no reclamation whatsoever.
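
As a rough illustration of that expected split, here is a standalone
sketch with made-up example numbers (not code from any device or
driver):

/* Illustrative sketch of the expected device-side handling of a discard:
 * deallocate every whole unmap-granularity chunk the request covers and
 * explicitly zero the misaligned head and tail.  Example numbers only. */
#include <stdio.h>

int main(void)
{
	const unsigned long long gran  = 8;	/* unmap granularity, in logical blocks */
	const unsigned long long align = 0;	/* unmap granularity alignment */
	const unsigned long long lba   = 3;	/* example request: misaligned start... */
	const unsigned long long nr    = 100;	/* ...and misaligned length */

	/* First and last granularity boundaries inside [lba, lba + nr). */
	unsigned long long first = ((lba - align + gran - 1) / gran) * gran + align;
	unsigned long long last  = ((lba + nr - align) / gran) * gran + align;

	if (first < last) {
		printf("deallocate LBAs %llu-%llu\n", first, last - 1);
		if (lba < first)
			printf("zero head  LBAs %llu-%llu\n", lba, first - 1);
		if (lba + nr > last)
			printf("zero tail  LBAs %llu-%llu\n", last, lba + nr - 1);
	} else {
		/* Request smaller than one chunk: nothing to deallocate, zero it all. */
		printf("zero LBAs %llu-%llu\n", lba, lba + nr - 1);
	}
	return 0;
}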

With the impending zero/unmap separation things might fare better. But
I'd still like to understand the behavior you observe. Please provide
the output of:

sg_vpd -p lbpv /dev/sdN
sg_vpd -p bl /dev/sdN

for one of the LUNs and I'll take a look.

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering