Re: [BUG] BLKZEROOUT on dm-crypt container cause OOM / kernel panic

2017-08-10 Thread Tom Yan
Never mind, I guess I was wrong, coz bio_add_page() is always called
with "offset" = 0 here, so:

offset == bv->bv_offset + bv->bv_len

is never matched. Hence,

if (bio->bi_vcnt >= bio->bi_max_vecs)
return 0;

will trigger the if-break.

So maybe this is after all just a dm-crypt issue on handling multiple
bios created with next_bio (a bio chain)?

On 10 August 2017 at 07:27, Tom Yan <tom.t...@gmail.com> wrote:
> I haven't really tested but I was aware of the commit before I send my
> last email. It doesn't seem relevant to be honest, because it doesn't
> change the fact that the inner loop wil only end if the whole request
> has been looped over. So still one big bio.
>
> There are a few things that seem suspicious to me. First of all, the
> inner loop has an if-break at its end that seem to practically do
> nothing, especially in terms of making the inner loop end when the bio
> reach its expected size (BIO_MAX_PAGES?).
>
> bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);
>
> If you take a look at bio_add_page():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/bio.c?h=v4.13-rc4#n820
>
> it will basically always return "len", which is "sz" here, unchanged.
>
> if (bi_size < sz)
> break;
>
> So this pretty much never happens, with two exceptions:
>
> if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> return 0;
>
> which is most likely irrelevant in this case,
>
> if (bio->bi_vcnt >= bio->bi_max_vecs)
> return 0;
>
> which seem to matter in this case. However:
>
> if (page == bv->bv_page &&
> offset == bv->bv_offset + bv->bv_len) {
> bv->bv_len += len;
> goto done;
> }
> ...
> if (bio->bi_vcnt >= bio->bi_max_vecs)
> return 0;
> ...
> bio->bi_vcnt++;
> done:
> bio->bi_iter.bi_size += len;
> return len;
>
> So if the first condition in the above quote is matched, the exception
> would never happen either.
>
> On 9 August 2017 at 21:38, h...@lst.de <h...@lst.de> wrote:
>> Does commit 615d22a51c04856efe62af6e1d5b450aaf5cc2c0
>> "block: Fix __blkdev_issue_zeroout loop" fix the issue for you?


Re: [BUG] BLKZEROOUT on dm-crypt container cause OOM / kernel panic

2017-08-09 Thread Tom Yan
I haven't really tested but I was aware of the commit before I send my
last email. It doesn't seem relevant to be honest, because it doesn't
change the fact that the inner loop wil only end if the whole request
has been looped over. So still one big bio.

There are a few things that seem suspicious to me. First of all, the
inner loop has an if-break at its end that seem to practically do
nothing, especially in terms of making the inner loop end when the bio
reach its expected size (BIO_MAX_PAGES?).

bi_size = bio_add_page(bio, ZERO_PAGE(0), sz, 0);

If you take a look at bio_add_page():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/bio.c?h=v4.13-rc4#n820

it will basically always return "len", which is "sz" here, unchanged.

if (bi_size < sz)
break;

So this pretty much never happens, with two exceptions:

if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return 0;

which is most likely irrelevant in this case,

if (bio->bi_vcnt >= bio->bi_max_vecs)
return 0;

which seem to matter in this case. However:

if (page == bv->bv_page &&
offset == bv->bv_offset + bv->bv_len) {
bv->bv_len += len;
goto done;
}
...
if (bio->bi_vcnt >= bio->bi_max_vecs)
return 0;
...
bio->bi_vcnt++;
done:
bio->bi_iter.bi_size += len;
return len;

So if the first condition in the above quote is matched, the exception
would never happen either.

On 9 August 2017 at 21:38, h...@lst.de  wrote:
> Does commit 615d22a51c04856efe62af6e1d5b450aaf5cc2c0
> "block: Fix __blkdev_issue_zeroout loop" fix the issue for you?


Re: [PATCH v2 2/2] sd: check BLK_DEF_MAX_SECTORS against max_dev_sectors

2016-08-14 Thread Tom Yan
On 14 August 2016 at 17:00, Tom Yan <tom.t...@gmail.com> wrote:
>
> That won't really work. min_t() would, though the line is gonna be a
> bit long; not sure if I can/should simply cast the type (unsigned int)
> to BLK_DEF_MAX_SECTORS. And which braces are you referring to?
>
Oh you mean the else-clause braces. Hmm I thought the coding style was
that if you needed braces in one clause, you should use braces for
others in the condition even if it consists of only one line of
statement. At least that is what I was told when I send patches for
libata.

So, should I use min_t() or just do a type casting in front to
BLK_DEF_MAX_SECTORS? Also do I need to break the line at a certain
point when it exceeds certain length?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] libata-scsi: make sure Maximum Write Same Length is not too large

2016-08-14 Thread Tom Yan
On 13 August 2016 at 04:56, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.t...@gmail.com> writes:
>
> Tom,
>
>>> put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM / (sector_size / 512), 
>>> [36]);
>
> How many 8-byte ranges fit in a 4096-byte sector?

The thing is, as of ACS-4, blocks that carry DSM/TRIM LBA Range
Entries are always 512-byte. Honestly, I have no idea how that would
work on a 4Kn SSD, if it is / will ever be a thing.

Anyway I was NOT trying to bump the number entries allowed, but
instead to decrease it accordingly for the increased sector size, only
to make sure that the Maximum Write Same Length advertised here is
gonna overflow the 32-bit limit when it is converted to a
representation in bytes (i.e. discard_max_bytes and
write_same_max_bytes), and probably also, the bio size.

It's also a suggestion from me for Shaun to determine how the Maximum
Write Same Length to be advertised, so that the size in bytes covered
by a single SCT Write Same will stay at ~2G in spite of the logical
sector size.

>
> Tom> So were you trying to pointing out something I am still missing, or
> Tom> were you merely confirming I was right?
>
> I suggest you drop ATA_MAX_TRIM_RNUM and do:
>
> enum {
>  ATA_TRIM_BLOCKS_PER_RANGE = 65535, /* 0x blocks per range desc. */
>  ATA_TRIM_RANGE_SIZE_SHIFT = 3, /* range descriptor is 8 bytes */
> };
>
> put_unaligned_be64(ATA_TRIM_BLOCKS_PER_RANGE *
>sector_size >> ATA_TRIM_RANGE_SIZE_SHIFT, [36]);
>
> Might be worthwhile to create an ata_max_lba_range_blocks() wrapper.
>
> --
> Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] sd: check BLK_DEF_MAX_SECTORS against max_dev_sectors

2016-08-14 Thread Tom Yan
On 13 August 2016 at 05:37, Martin K. Petersen
 wrote:
>
> It would be pretty unusual for a device that is smart enough to report a
> transfer length limit to be constrained to 1 MB and change.
>

Well, it is done pretty much for libata's SATL. Also since
opt_xfer_blocks is checked against dev_max, I think it makes sense in
logic that we do the equivalence for BLK_DEF_MAX_SECTORS. Not to
mention that since we check rw_max against the controller limit, no
reason not to make sure we check it against the device limit in both
cases.

>
> If you want to fix this please drop the braces and do:
>
> rw_max = min(BLK_DEF_MAX_SECTORS, q->limits.max_dev_sectors);

That won't really work. min_t() would, though the line is gonna be a
bit long; not sure if I can/should simply cast the type (unsigned int)
to BLK_DEF_MAX_SECTORS. And which braces are you referring to?

>
> --
> Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] sd: dynamically adjust SD_MAX_WS16_BLOCKS as per the actual logical block size

2016-08-14 Thread Tom Yan
On 13 August 2016 at 04:26, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.t...@gmail.com> writes:
>
> Tom,
>
> Tom> I don't really follow. What would this BLK_MAX_BIO_SECTORS be? It
> Tom> doesn't appear to me that a static value is going to address the
> Tom> problem I am addressing in this patch.
>
> 0x7f, the maximum number of block layer sectors that can be
> expressed in a single bio.

Hmm, so when we queue any of the limits, we convert a certain maximum
number of physical sectors (which we has already been doing) to its
corresponding maximum number of block layer sectors, and then make
sure that number does not exceed 0x7f, right?

>
> --
> Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] libata-scsi: use dev->max_sectors from libata-core appropriately

2016-08-12 Thread Tom Yan
On 12 August 2016 at 19:56,   wrote:
>
> Also note that ATA_HORKAGE_MAX_SEC_LBA48 is not supposed to work
> automatically anyway, even when max_hw_sectors is as high as 65535,
> since the effective max_sectors will be set by the SCSI disk driver.
>
I missed the fact that ATA_HORKAGE_MAX_SEC_LBA48 is only (and should
only be) used by a few ATAPI devices, and the SCSI cdrom devices will
not touch max_sectors like the SCSI disk driver. Therefore, I guess I
should really just use blk_queue_max_hw_sectors() to set both
max_hw_sectors and max_sectors for ATAPI devices. Sending a v2.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] libata-core: do not set dev->max_sectors for LBA48 devices

2016-08-12 Thread Tom Yan
On 12 August 2016 at 13:18, Tom Yan <tom.t...@gmail.com> wrote:
> On 12 August 2016 at 10:01, Martin K. Petersen
> <martin.peter...@oracle.com> wrote:
>>
>> Again, the point of max_hw_sectors and max_dev_sectors is to enforce the
>> hard limits of controller and device respectively. Nothing else.
>>
>
> Sounds like libata-scsi is doing something wrong then. It should not
> set max_hw_sectors to dev->max_sectors that is set by libata-core:
>
>> blk_queue_max_hw_sectors(q, dev->max_sectors);

Though we'll still have to "abuse" max_hw_sectors for ATAPI class
devices, since neither the SATL or sr cares about VPD. The only
devices that need ATA_HORKAGE_MAX_SEC_LBA48 are ATAPI devices as well.

>
> but instead it should report it as Maximum Transfer Length and let sd
> set it as max_dev_sectors:
>
>> put_unaligned_be32(dev->max_sectors, [8]);
>
> While max_hw_sectors will be left untouched (in the case of AHCI, for
> example, since its SCSI host template does not have max_sectors set;
> so max_hw_sectors will be SCSI_DEFAULT_MAX_SECTORS).
>
> Although that means setting dev->max_sectors to a value larger than
> 1024 will probably be a no-op, if that's really an issue, we should
> have the host templates in libata updated.
>
> Make sense?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] libata-core: do not set dev->max_sectors for LBA48 devices

2016-08-11 Thread Tom Yan
On 12 August 2016 at 10:01, Martin K. Petersen
 wrote:
>
> Again, the point of max_hw_sectors and max_dev_sectors is to enforce the
> hard limits of controller and device respectively. Nothing else.
>

Sounds like libata-scsi is doing something wrong then. It should not
set max_hw_sectors to dev->max_sectors that is set by libata-core:

> blk_queue_max_hw_sectors(q, dev->max_sectors);

but instead it should report it as Maximum Transfer Length and let sd
set it as max_dev_sectors:

> put_unaligned_be32(dev->max_sectors, [8]);

While max_hw_sectors will be left untouched (in the case of AHCI, for
example, since its SCSI host template does not have max_sectors set;
so max_hw_sectors will be SCSI_DEFAULT_MAX_SECTORS).

Although that means setting dev->max_sectors to a value larger than
1024 will probably be a no-op, if that's really an issue, we should
have the host templates in libata updated.

Make sense?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] libata-scsi: make sure Maximum Write Same Length is not too large

2016-08-11 Thread Tom Yan
On 12 August 2016 at 09:34, Martin K. Petersen
 wrote:
>> "Tom," == tom ty89  writes:
>
> Tom,
>
> +   put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
> +  (sector_size / 512), [36]);
>
> MAXIMUM WRITE SAME LENGTH is in units of the device's logical block
> size.

Yeah that's why:

+   sector_size = ata_id_logical_sector_size(args->id);

Where the logical sector size of the device is reported with the same
function in ata_scsi_dev_config() and ata_scsiop_read_cap().

So were you trying to pointing out something I am still missing, or
were you merely confirming I was right?

>
> --
> Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] libata-scsi: make sure Maximum Write Same Length is not too large

2016-08-11 Thread Tom Yan
The patch isn't about how the request from the block layer will be
processed (to form the SCSI commands).

What it addresses is blk_queue_max_write_same_sectors() and
blk_queue_max_discard_sectors() that are called in the SCSI disk
driver. You can see that they are called with an input of the Maximum
Write Same Length times (logical_block_size >> 9), which is to convert
the number of sectors to an appropriate value for the block layer
(512-byte block based). On 4Kn drives, the multiplier will be 4096 >>
9, which is 8. So if the reported Maximum Write Same Length is
4194240, when the value is passed onto the block layer to set the
limits for a 4Kn drive, it will be 4194240 * 8. (And when this value
is represented in bytes, it will be further multiplied by 512 and over
32-bit.)

`logical_block_size >> 9` is pretty much the same thing as
``logical_block_size / 512`. I should have probably used the bit shift
way instead.

On 11 August 2016 at 17:04, Shaun Tancheff <sh...@tancheff.com> wrote:
> On Thu, Aug 11, 2016 at 3:26 AM,  <tom.t...@gmail.com> wrote:
>> From: Tom Yan <tom.t...@gmail.com>
>>
>> Currently we advertise Maximum Write Same Length based on the
>> maximum number of sectors that one-block TRIM payload can cover.
>> The field are used to derived discard_max_bytes and
>> write_same_max_bytes limits in the block layer, which currently can
>> at max be 0x (32-bit).
>>
>> However, with a AF 4Kn drive, the derived limits would be 65535 *
>> 64 * 4096 = 0x3fffc (34-bit). Therefore, we now devide
>> ATA_MAX_TRIM_RNUM with (logical sector size / 512), so that the
>> derived limits will not overflow.
>>
>> The limits are now also consistent among drives with different
>> logical sector sizes. (Although that may or may not be what we
>> want ultimately when the SCSI / block layer allows larger
>> representation in the future.)
>>
>> Although 4Kn ATA SSDs may not be a thing on the market yet, this
>> patch is necessary for forthcoming SCT Write Same translation
>> support, which could be available on traditional HDDs where 4Kn is
>> already a thing. Also it should not change the current behavior on
>> drives with 512-byte logical sectors.
>>
>> Note: this patch is not about AF 512e drives.
>> Signed-off-by: Tom Yan <tom.t...@gmail.com>
>>
>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
>> index be9c76c..dcadcaf 100644
>> --- a/drivers/ata/libata-scsi.c
>> +++ b/drivers/ata/libata-scsi.c
>> @@ -2295,6 +2295,7 @@ static unsigned int ata_scsiop_inq_89(struct 
>> ata_scsi_args *args, u8 *rbuf)
>>  static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf)
>>  {
>> u16 min_io_sectors;
>> +   u32 sector_size;
>>
>> rbuf[1] = 0xb0;
>> rbuf[3] = 0x3c; /* required VPD size with unmap support */
>> @@ -2309,17 +2310,27 @@ static unsigned int ata_scsiop_inq_b0(struct 
>> ata_scsi_args *args, u8 *rbuf)
>> min_io_sectors = 1 << ata_id_log2_per_physical_sector(args->id);
>> put_unaligned_be16(min_io_sectors, [6]);
>>
>> -   /*
>> -* Optimal unmap granularity.
>> -*
>> -* The ATA spec doesn't even know about a granularity or alignment
>> -* for the TRIM command.  We can leave away most of the unmap related
>> -* VPD page entries, but we have specifify a granularity to signal
>> -* that we support some form of unmap - in thise case via WRITE SAME
>> -* with the unmap bit set.
>> -*/
>> +   sector_size = ata_id_logical_sector_size(args->id);
>> if (ata_id_has_trim(args->id)) {
>> -   put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM, [36]);
>> +   /*
>> +* Maximum write same length.
>> +*
>> +* Avoid overflow in discard_max_bytes and 
>> write_same_max_bytes
>> +* with AF 4Kn drives. Also make them consistent among drives
>> +* with different logical sector sizes.
>> +*/
>> +   put_unaligned_be64(65535 * ATA_MAX_TRIM_RNUM /
>> +  (sector_size / 512), [36]);
>
> I think the existing fixups in sd_setup_discard_cmnd() and
> sd_setup_write_same_cmnd()
> are 'doing the right thing'.
>
> If I understand the stack correctly:
>
> libata-scsi.c (and sd.c) both report a maximum in terms of 512 byte sectors.
> The upper layer stack works (mostly) on a mix of bytes and 512 byte sectors
> agnostic of the underlying hard

Re: [PATCH v2 2/2] libata-core: do not set dev->max_sectors for LBA48 devices

2016-08-11 Thread Tom Yan
On 11 August 2016 at 11:37, Martin K. Petersen
<martin.peter...@oracle.com> wrote:
>>>>>> "Tom" == Tom Yan <tom.t...@gmail.com> writes:
>
> I don't agree with conflating the optimal transfer size and the maximum
> supported ditto. Submitting the largest possible I/O to a device does
> not guarantee that you get the best overall performance.
>
>  - max_hw_sectors is gated by controller DMA constraints.
>
>  - max_dev_sectors is set for devices that explicitly report a transfer
>length limit.
>
>  - max_sectors, the soft limit for filesystem read/write requests,
>should be left at BLK_DEF_MAX_SECTORS unless the device explicitly
>requests transfers to be aligned multiples of a different value
>(typically the internal stripe size in large arrays).

Shouldn't we use Maximum Transfer Length to derive max_sectors (and
get rid of the almost useless max_dev_sectors)? Honestly it looks
pretty non-sensical to me that the SCSI disk driver uses Optimal
Transfer Length for max_sectors. Also in libata's case, this make
setting the effective max_sectors (e.g. see ATA_HORKAGE_MAX_SEC_LBA48)
impossible if we do not want to touch io_opt.

It would look to me that our block layer simply have a flawed design
if we really need to derive both io_opt and max_sectors from the same
field.

>
> The point of BLK_DEF_MAX_SECTORS is to offer a reasonable default for
> common workloads unless otherwise instructed by the storage device.
>
> We can have a discussion about what the right value for
> BLK_DEF_MAX_SECTORS should be. It has gone up over time but it used to
> be the case that permitting large transfers significantly impacted
> interactive I/O performance. And finding a sweet spot that works for a
> wide variety of hardware, interconnects and workloads is obviously
> non-trivial.
>

If BLK_DEF_MAX_SECTORS is supposed to be used as a fallback, then it
should be a safe value, especially when max_sectors_kb can be adjusted
through sysfs.

But the biggest problem isn't on bumping it, but the value picked is
totally irrational for a general default. I mean, given that it was
1024 (512k), try to double it? Fine. Try to quadruple it? Alright.
We'll need to deal with some alignment / boundary issue (like the
typical 65535 vs 65536 case)? Okay let's do it. But what's the sense
in picking a random RAID configuartion as the base to decide the
default? Also, if max_sectors need to concern about the number of
disks used and chunk sizes in a RAID configuartion, it should be
calculated in the device-mapper layer or a specific driver or so.
Changing a block layer default won't help anyway. Say 2560 will
accomodate a 10-disk 128k-chunk RAID. What about a 12-disk 128k-chunk
RAID then? Why not just decide the value base on an 8-disk 128k-chunk
RAID, which HAPPENED to be a double of 1024 as well?

It does not make sense that the SCSI disk driver uses it as the
fallback either. SCSI host templates that does not have max_sectors
set (as well as some specific driver) will use
SCSI_DEFAULT_MAX_SECTORS as the fallback, for such hosts
max_hw_sectors will be 1024, where the current BLK_DEF_MAX_SECTORS
cannot apply as max_sectors anyway. So we should use also
SCSI_DEFAULT_MAX_SECTORS in the SCSI disk driver as fallback for
max_sectors. If the value is considered to low even as a safe
fallback, then it should be bumped appropriately. (Or we might want to
replace it with BLK_DEF_MAX_SECTORS everywhere in the SCSI layer, that
said, after the value is fixed.)

> --
> Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regarding AHCI_MAX_SG and (ATA_HORKAGE_MAX_SEC_1024)

2016-08-10 Thread Tom Yan
On 10 August 2016 at 15:41, David Milburn  wrote:
> Hi,
>
> The 168 makes AHCI_CMD_TBL_SZ equal to 2816
>
> AHCI_CMD_TBL_SZ = AHCI_CMD_TBL_HDR_SZ + (AHCI_MAX_SG * 16)
> AHCI_CMD_TBL_SZ = 128 + (168 * 16)
>
> I think if you add in AHCI_CMD_SLOT_SZ (1024) and AHCI_RX_FIS_SZ (256)
> the DMA is 4K aligned, I think that is where the 168 came from.

Looks like the right guess. Though AHCI_PORT_PRIV_DMA_SZ is not:

AHCI_CMD_SLOT_SZ (1024) + AHCI_CMD_TBL_SZ (2816) + AHCI_RX_FIS_SZ (256) = 4096

but:

AHCI_CMD_SLOT_SZ (1024) + AHCI_CMD_TBL_AR_SZ (2816 * 32 = 90112) +
AHCI_RX_FIS_SZ (256) = 91392

and AHCI_PORT_PRIV_FBS_DMA_SZ is:

AHCI_CMD_SLOT_SZ (1024) + AHCI_CMD_TBL_AR_SZ (2816 * 32 = 90112) +
AHCI_RX_FIS_SZ * 16 (4096) = 95232

>
> Thanks,
> David
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regarding AHCI_MAX_SG and (ATA_HORKAGE_MAX_SEC_1024)

2016-08-10 Thread Tom Yan
On 10 August 2016 at 11:26, Tejun Heo  wrote:
> Hmmm.. why not?  The hardware limit is 64k and the driver is using a

Is that referring to the maximum number of entries allowed in the
PRDT, Physical Region Descriptor Table (which is, more precisely,
65535)?

> lower limit of 168 most likely because it doesn't make noticeable
> difference beyond certain point and it determines the size of
> contiguous memory which has to be allocated for the command table.
> Each sg entry is 16 bytes.  Pushing it to the hardware limit would
> require an order 9 allocation for each port.

That makes sense to me, and I didn't have the intention to push it to
the limit anyway.

> Not necessarily.  A single sg entry can point to an area larger than
> PAGE_SIZE.

You mean the 4MB limit of "Data Byte Count" in "DW3: Description
Information" of the PRDT? Is that what max_segment_size (which is set
to a general fallback of 65536:
http://lxr.free-electrons.com/ident?i=dma_get_max_seg_size) is about
in this case?

And my point was, it will be a multiple of 168 anyway, if 1344 is just
an example.

> As written above, that probably makes the ahci command table size
> nicely aligned.

I think that's what bothers me ultimately, cause I don't see how 168
makes it (more) nicely aligned (or even, aligned to what?).

I even checked out the AHCI driver of FreeBSD
(https://github.com/freebsd/freebsd/blob/master/sys/dev/ahci/ahci.h):

...
#define MAXPHYS 512 * 1024
...
#define AHCI_SG_ENTRIES (roundup(btoc(MAXPHYS) + 1, 8))
...
#define AHCI_CT_SIZE (128 + AHCI_SG_ENTRIES * 16)
...

Couldn't get the sense out of the `+ 1` and round up to 8 thing either.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] libata-scsi: do not call blk_queue_max_hw_sectors()

2016-08-09 Thread Tom Yan
Withdrawn. blk_queue_max_hw_sectors() is not called in hosts.c but in
scsi_lib.c. However, it does not check the dev->max_sectors set in
libata-core.c. So everything of this patch is wrong. Will rewrite and
resend the second patch.

On 9 August 2016 at 18:31,  <tom.t...@gmail.com> wrote:
> From: Tom Yan <tom.t...@gmail.com>
>
> We should just let the scsi driver (hosts.c) call the function. It
> has better heuristic anyway (i.e. use SCSI_DEFAULT_MAX_SECTORS as
> fallback when max_sectors is not set).
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 2bdb5da..495d916 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1204,9 +1204,6 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
> if (!ata_id_has_unload(dev->id))
> dev->flags |= ATA_DFLAG_NO_UNLOAD;
>
> -   /* configure max sectors */
> -   blk_queue_max_hw_sectors(q, dev->max_sectors);
> -
> if (dev->class == ATA_DEV_ATAPI) {
> void *buf;
>
> --
> 2.9.2
>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Regarding AHCI_MAX_SG and (ATA_HORKAGE_MAX_SEC_1024)

2016-08-07 Thread Tom Yan
So the (not so) recent bump of BLK_DEF_MAX_SECTORS from 1024 to 2560
(commit d2be537c3ba3) seemed to have caused trouble to some of the ATA
devices, which were then worked around with ATA_HORKAGE_MAX_SEC_1024.

However, I am suspecting that the bump of BLK_DEF_MAX_SECTORS is not
the "real" cause of the trouble, but the fact that AHCI_MAX_SG has
been set to a weird value of 168 (with a comment "hardware max is
64K", which neither seem to make any sense).

AHCI_MAX_SG is used to set the sg_tablesize (i.e. max_segments,
apparently), which is apparently used to derive the actual "request
size" (that is, if it is lower than max_sectors(_kb), it will be the
limiting factor instead).

For example, no matter if the drive has max_sectors set to 2560, or to
65535 (by adding it as the Optimal Transfer Length to libata's SATL,
which is also max_hw_sectors that is set from ATA_MAX_SECTORS_LBA48),
"avgrq-sz" in `iostat` will be capped at 1344 (168 * 8).

However, if I change AHCI_MAX_SG to 128 (which is also the
sg_tablesize set in libata.h from LIBATA_MAX_PRD), "avgrq-sz" in
`iostat` will be capped at 1024 (128 * 8), which should make
ATA_HORKAGE_MAX_SEC_1024 unnecessary.

So why has AHCI_MAX_SG been set to 168 anyway?
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html