Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same

2016-08-23 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 8:25 PM, Damien Le Moal  wrote:
>
> Shaun,
>
> On 8/23/16 09:22, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal  
>> wrote:

>> Also you may note that in my patch to get Host Aware working
>> with the zone cache I do not include the runt zone in the cache.
>
> Why not ? The RB-tree will handle it just fine (the insert and lookup
> code as Hannes had them was not relying on a constant zone size).

A good point. I didn't pay too much attention while bringing this
forward. I think a few of my hacks may be pointless now. I'll
try to rework it and get rid of the runt check.

>> So as it sits I need this fallback, otherwise doing blkdiscard over
>> the whole device ends in an error, as does mkfs.f2fs et al.
>
> Got it, but I do not see a problem with including it. I have not checked
> the code, but the split of a big discard call into "chunks" should
> already be handling the last chunk and making sure that the operation
> does not exceed the device capacity (in any case, that's easy to fix in
> the sd_zbc_setup_discard code).

Yes, I agree the split of big discards does handle the last chunk correctly.
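
As an aside, the clamp being discussed amounts to a bounds check on the
final chunk. A minimal sketch, with assumed variable names rather than
the actual sd_zbc_setup_discard code:

        /* Clamp the last discard chunk so it never runs past the end
         * of the device; get_capacity() returns the size in sectors. */
        if (sector + nr_sects > get_capacity(disk))
                nr_sects = get_capacity(disk) - sector;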

>>> Some 10TB host managed disks out there have 1% conventional zone space,
>>> that is 100GB of capacity. When issuing a "reset all", doing a write
>>> same in these zones will take forever... If the user really wants zeroes
>>> in those zones, let it issue a zeroout.
>>>
>>> I think that it would be a better choice to simply not report
>>> discard_zeroes_data as true and do nothing for conventional zone resets.
>>
>> I think that would be unfortunate for Host Managed but I think it's
>> the right choice for Host Aware at this time. So either we base
>> it on disk type or we have some other config flag added to sysfs.
>
> I do not see any difference between host managed and host aware. Both
> define the same behavior for reset, and both end up in a NOP for
> conventional zone reset (no data "erasure" required by the standard).
> For write pointer zones, reading unwritten LBAs returns the
> initialization pattern, with the exception of host-managed disks with
> the URSWRZ bit set to 0. But that case is covered in sd.c, so the
> behavior is consistent across all models. So why force data zeroing
> when the standards do not mandate it?

Well, you do have a point.
It appears that only mkfs and similar tools are really utilizing
discard zeroes data at the moment.

I did a quick test:

mkfs -t ext4 -b 4096 -g 32768 -G 32 \
  -E lazy_itable_init=0,lazy_journal_init=0,offset=0,num_backup_sb=0,packed_meta_blocks=1,discard \
  -O flex_bg,extent,sparse_super2

   - discard zeroes data true: 3 minutes
   - discard zeroes data false: 6 minutes

So for the smaller conventional space on the current HA drive
there is some advantage to enabling discard zeroes data.

However, for a larger conventional space you are correct: the overall
impact is worse performance.

For some reason I had been assuming that some file systems
used or relied on discard zeroes data during normal operation.
Now that I am looking for it, I don't seem to be finding any
evidence of that, so aside from mkfs I don't have as good an
argument for discard zeroes data as I thought I did.

Regards,
Shaun


Re: [PATCH v2 00/29] be2iscsi: driver update 11.2.0.0

2016-08-23 Thread Martin K. Petersen
> "Jitendra" == Jitendra Bhivare  writes:

Jitendra Bhivare (29):
  be2iscsi: Fix to use correct configuration values
  be2iscsi: Replace _bh version for mcc_lock spinlock
  be2iscsi: Reduce driver load/unload time
  be2iscsi: Set and return right iface v4/v6 states
  be2iscsi: Fix gateway APIs to support IPv4 & IPv6
  be2iscsi: Fix release of DHCP IP in static mode
  be2iscsi: Move VLAN code to common iface_set_param
  be2iscsi: Update iface handle before any set param
  be2iscsi: Rename iface get/set/create/destroy APIs
  be2iscsi: Handle only NET_PARAM in iface_get_param
  be2iscsi: Check all zeroes IP before issuing IOCTL
  be2iscsi: Remove alloc_mcc_tag & beiscsi_pci_soft_reset
  be2iscsi: Remove isr_lock and dead code
  be2iscsi: Fix checks for HBA in error state
  be2iscsi: Fix to make boot discovery non-blocking
  be2iscsi: Fix to add timer for UE detection
  be2iscsi: Add IOCTL to check UER supported
  be2iscsi: Move functions to right files
  be2iscsi: Fix POST check and reset sequence
  be2iscsi: Add V1 of EPFW cleanup IOCTL
  be2iscsi: Add TPE recovery feature
  be2iscsi: Fail the sessions immediately after TPE
  be2iscsi: Add FUNCTION_RESET during driver unload
  be2iscsi: Fix async PDU handling path
  be2iscsi: Fix bad WRB index error
  be2iscsi: Fix queue and connection parameters
  be2iscsi: Update copyright information
  be2iscsi: Update the driver version
  MAINTAINERS: Update be2iscsi contact info

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] megaraid_sas: Use memdup_user() rather than duplicating its implementation

2016-08-23 Thread Martin K. Petersen
> "SF" == SF Markus Elfring  writes:

SF> Reuse existing functionality from memdup_user() instead of keeping
SF> duplicate source code.

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2] fcoe: provide translation table between Ethernet and FC port speeds

2016-08-23 Thread Martin K. Petersen
> "Johannes" == Johannes Thumshirn  writes:

Johannes> Provide a translation table between Ethernet and FC port
Johannes> speeds so odd speeds (from an Ethernet POV) like 8 Gbit are
Johannes> correctly mapped to sysfs and open-fcoe's fcoeadm.

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 4/6] cxlflash: Transition to application close model

2016-08-23 Thread Martin K. Petersen
> "Matthew" == Matthew R Ochs  writes:

Matthew> Caching the adapter file descriptor and performing a close on
Matthew> behalf of an application is a poor design. This is due to the
Matthew> fact that once a file descriptor in installed, it is free to be
Matthew> altered without the knowledge of the cxlflash driver. This can
Matthew> lead to inconsistencies between the application and
Matthew> kernel. Furthermore, the nature of the former design is more
Matthew> exploitable and thus should be abandoned.

Matthew> To support applications performing a close on the adapter file
Matthew> that is associated with a context, a new flag is introduced to
Matthew> the user API to indicate to applications that they are
Matthew> responsible for the close following the cleanup (detach) of a
Matthew> context. The documentation is also updated to reflect this
Matthew> change in behavior.

Applied patches 4 through 6 to 4.9/scsi-queue.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: linux-next: Tree for Jun 6 (scsi_debug.c)

2016-08-23 Thread Masanari Iida
This one still exists in Linus's tree (as of v4.8-rc3).

  CALL    scripts/checksyscalls.sh
  CHK kernel/config_data.h
  Building modules, stage 2.
  MODPOST 720 modules
ERROR: "ip_compute_csum" [drivers/scsi/scsi_debug.ko] undefined!
scripts/Makefile.modpost:91: recipe for target '__modpost' failed
make[1]: *** [__modpost] Error 1
Makefile:1178: recipe for target 'modules' failed
make: *** [modules] Error 2

Masanari

On Tue, Jun 7, 2016 at 12:10 PM, Randy Dunlap  wrote:
> On 06/05/16 21:20, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20160603:
>
> on x86_64:
>
> ERROR: "ip_compute_csum" [drivers/scsi/scsi_debug.ko] undefined!
>
> CONFIG_GENERIC_CSUM doesn't seem to exist for x86, so lib/checksum.o is not 
> built.
>
>
> --
> ~Randy


Re: Oops when completing request on the wrong queue

2016-08-23 Thread Keith Busch
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
> >My workload looks similar to yours, in that it's high depth and with a
> >lot of jobs to keep most CPUs loaded. My bash script is different than
> >yours, I'll try that and see if it helps here.
> 
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

I haven't recreated this either, but I think I can logically see why
this failure is happening.

I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple reasons.

First, we can't release the nvmeq->tags when a hctx exits because
that nvmeq may be used by other namespaces that need to point to
the device's tag set.

The other reason is that blk-mq doesn't exit or init hardware contexts
when remapping for a CPU event, leaving the nvme driver unaware a hardware
context points to a different tag set.

So I think I see why this test would fail; don't know about a fix yet.
Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
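
A hedged sketch of that indirection, with illustrative field names
rather than the actual nvme layout: resolve the tags through the
device's tag set at completion time instead of caching a pointer at
init_hctx time, so a CPU remap cannot leave it stale.

struct nvme_queue {
        struct nvme_dev *dev;   /* valid for the queue's whole lifetime */
        u16 qid;
        /* no cached struct blk_mq_tags pointer: it may go stale */
};

static struct blk_mq_tags *nvme_queue_tags(struct nvme_queue *nvmeq)
{
        /* resolve at use time; the tags array is owned by blk-mq */
        return nvmeq->dev->tagset.tags[nvmeq->qid - 1];
}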


Re: Oops when completing request on the wrong queue

2016-08-23 Thread Jens Axboe

On 08/23/2016 03:11 PM, Jens Axboe wrote:
> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi  writes:
>>
>>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>>> this here.
>>>
>>> I was using the ppc64_cpu tool, which shouldn't do anything more than
>>> write to sysfs, but I just reproduced it with the script below.
>>>
>>> Note that this is ppc64le.  I don't have an x86 at hand to attempt to
>>> reproduce right now, but I'll look for one and see how it goes.
>>
>> Hi,
>>
>> Any luck on reproducing it?  We were initially reproducing with a
>> proprietary stress test, but I gave a try to a generated fio jobfile
>> associated with the SMT script I shared earlier and I could reproduce
>> the crash consistently in less than 10 minutes of execution.  This was
>> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.
>
> Nope, I have not been able to reproduce it. How long do the CPU
> offline/online actions take on ppc64? It's pretty slow on x86, which may
> hide the issue. I took out the various printk's associated with bringing
> a CPU off/online, as well as the IRQ breaking parts, but that didn't
> help in reproducing it.
>
>> The job file I used, as well as the smt.sh script, in case you want to
>> give it a try:
>>
>> jobfile: http://krisman.be/k/nvmejob.fio
>> smt.sh:  http://krisman.be/k/smt.sh
>>
>> Still, the trigger seems to be consistently a heavy load of IO
>> associated with CPU addition/removal.
>
> My workload looks similar to yours, in that it's high depth and with a
> lot of jobs to keep most CPUs loaded. My bash script is different than
> yours, I'll try that and see if it helps here.

Actually, I take that back. You're not using O_DIRECT, hence all your
jobs are running at QD=1, not the 256 specified. That looks odd, but
I'll try, maybe it'll hit something different.


--
Jens Axboe


Re: Oops when completing request on the wrong queue

2016-08-23 Thread Jens Axboe

On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi  writes:
>
>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>> this here.
>>
>> I was using the ppc64_cpu tool, which shouldn't do anything more than
>> write to sysfs, but I just reproduced it with the script below.
>>
>> Note that this is ppc64le.  I don't have an x86 at hand to attempt to
>> reproduce right now, but I'll look for one and see how it goes.
>
> Hi,
>
> Any luck on reproducing it?  We were initially reproducing with a
> proprietary stress test, but I gave a try to a generated fio jobfile
> associated with the SMT script I shared earlier and I could reproduce
> the crash consistently in less than 10 minutes of execution.  This was
> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

Nope, I have not been able to reproduce it. How long do the CPU
offline/online actions take on ppc64? It's pretty slow on x86, which may
hide the issue. I took out the various printk's associated with bringing
a CPU off/online, as well as the IRQ breaking parts, but that didn't
help in reproducing it.

> The job file I used, as well as the smt.sh script, in case you want to
> give it a try:
>
> jobfile: http://krisman.be/k/nvmejob.fio
> smt.sh:  http://krisman.be/k/smt.sh
>
> Still, the trigger seems to be consistently a heavy load of IO
> associated with CPU addition/removal.

My workload looks similar to yours, in that it's high depth and with a
lot of jobs to keep most CPUs loaded. My bash script is different than
yours, I'll try that and see if it helps here.


--
Jens Axboe


Re: Oops when completing request on the wrong queue

2016-08-23 Thread Gabriel Krisman Bertazi
Gabriel Krisman Bertazi  writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do anything more than
> write to sysfs, but I just reproduced it with the script below.
>
> Note that this is ppc64le.  I don't have an x86 at hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it?  We were initially reproducing with a
proprietary stress test, but I gave a try to a generated fio jobfile
associated with the SMT script I shared earlier and I could reproduce
the crash consistently in less than 10 minutes of execution.  This was
still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple days in the hope that it
rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism.  If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1.  Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount.  No luck, the BUG_ON was still
hit.

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least in the last time I
reproduced, it was a READ that came from the stress test task through
submit_bio.  So nothing remarkable about it too, as far as I can see.

I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where
the ctx got remapped to another hctx.  I'm still learning my way through
the cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't.  Some preliminary tests don't suggest it's the case
at play, but I wanna spend a little more time on this theory (maybe
for my lack of better ideas :).

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since the
CPU is dead, no more requests would be coming through its ctx.  I think we
could force a queue run in blk_mq_queue_reinit_notify, before remapping,
which would cause the hctx to fetch the remaining requests from that
dead ctx (since it's not unmapped yet).  This way, we could maintain a
single hotplug notification hook and simplify the hotplug path.  I
haven't written code for it yet, but I'll see if I can come up with
something and send to the list.

-- 
Gabriel Krisman Bertazi



Re: [PATCH 5/6] target/tcm_fc: Update debugging statements to match libfc usage

2016-08-23 Thread Bart Van Assche

On 08/22/2016 01:54 AM, Hannes Reinecke wrote:
> +   char *reason = "no session created";

Please use const char* instead of char* in the above declaration.

Thanks,

Bart.


Re: [GIT PULL] [PATCH v4 00/26] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-08-23 Thread Arnd Bergmann
On Monday, August 15, 2016 6:23:12 PM CEST Greg KH wrote:
> On Sat, Aug 13, 2016 at 03:48:12PM -0700, Deepa Dinamani wrote:
> > The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> > macros.
> > The macros are not y2038 safe. There is no plan to transition them into 
> > being
> > y2038 safe.
> > ktime_get_* api's can be used in their place. And, these are y2038 safe.
> 
> Who are you expecting to pull this huge patch series?

Dave Chinner suggested to have Al Viro pick up the whole series.

> Why not just introduce the new api call, wait for that to be merged, and
> then push the individual patches through the different subsystems?
> After half of those get ignored, then provide a single set of patches
> that can go through Andrew or my trees.

That was the original approach for v4.7, but (along with requesting
a number of reworks that Deepa incorporated) Linus preferred doing
the API change in one chunk; see
https://patchwork.kernel.org/patch/9134249/

Arnd
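
For context, a hedged example of the kind of conversion the series
performs (the pattern only; the actual patches vary per call site, and
the exact ktime_get_* accessor chosen here is an assumption):

        struct timespec64 now;

        /* y2038-safe replacement for the CURRENT_TIME macro */
        ktime_get_real_ts64(&now);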


Confirmation of your email address

2016-08-23 Thread Newsletter

Hello!

We are pleased that you have chosen our free newsletter.


From the moment you sign up, we will regularly deliver all the latest
news on your chosen topic to your email address, completely free of
charge.


Complete your registration by clicking the link below:
http://rdr.lcp-prios.com/c/6f27d9719e475471a41620e31d160c733497d466cd8ae2ad148e45e40e330acb0aa136118d9386acc4dedb63a5b929e1?sToken=listcompanion

Changed your mind, or didn't sign up for our service yourself? No
problem; in that case you don't need to do anything further. You will
not receive any more emails from us.


Your team at
ListCompanion



Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Tom Yan
On 23 August 2016 at 10:45, Shaun Tancheff  wrote:
> On Tue, Aug 23, 2016 at 5:17 AM, Tom Yan  wrote:
>> Wait a minute. I think you missed or misunderstood something when you
>> listened to someone's opinion that we should switch to the sglist buffer.
>
> No, I think I can trust Christoph Hellwig 

I didn't say that you cannot trust him; I just wonder whether you might
have misinterpreted what he said. I hadn't really followed your patch
series earlier, though.

>
>> I think the danger people referred to is exactly what is revealed when
>> the ugly code is removed in this commit (it doesn't mean that the code
>> should be kept though).
>>
>> The original buffer appears to be open:
>> buf = page_address(sg_page(scsi_sglist(scmd)));
>
> Which is unsafe.

Yup, but the while loop is not the right way to make it safe. It
probably prevents any overflow, but it does not handle / respond to the
request (n_block) properly.

It would only be practically unsafe if we ever called it with a
request that needs a multi-block payload.

>
>> While the new buffer you adopted in ata_format_sct_write_same() and
>> ata_format_dsm_trim_descr() is of fixed size:
>
> Yes ... it is a temporary response buffer for simulated commands used to
> copy data to and from the command sg_list so as not to hold irqs while
> modifying the buffer.
>
>> buffer = ((void *)ata_scsi_rbuf);
>>
>> sctpg = ((void *)ata_scsi_rbuf);
>>
>> because:
>>
>> #define ATA_SCSI_RBUF_SIZE  4096
>> ...
>> static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
>>
>> So the sglist buffer is always 4096 bytes.
>
> No. The sglist buffer attached to the write same / trim command

Don't you think it is inappropriate to use ata_scsi_rbuf here, then?

> is always sdp->sector_size

I can hardly see how sdp->sector_size is relevant to the sglist buffer
(especially as the sglist here faces the ATA device). A
scatter/gather list could have multiple entries of a certain size
(which is not necessarily, and is unlikely to be, the logical sector
size). Although I don't really know how scatter/gather lists work, it
hardly seems to make any sense to say that the sglist buffer is always
the logical sector size.

AFAIK, for example, the USB Attached SCSI driver does not limit the
maximum number of entries (max_segment), so the kernel falls back to
SG_MAX_SEGMENTS (2048), while it reports a max_segment_size of 4096,
regardless of the logical sector size, with dma_get_max_seg_size().

For AHCI, it reports a max_segment of 168, while it does not seem to
report any max_segment_size, so dma_get_max_seg_size() falls back to
65536 (though practically it seems 4096 will still be used in normal
writing).
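
For reference, a sketch of the fallback logic described above,
paraphrased from memory rather than quoted from the kernel headers:

static inline unsigned int dma_get_max_seg_size(struct device *dev)
{
        if (dev->dma_parms && dev->dma_parms->max_segment_size)
                return dev->dma_parms->max_segment_size;
        return SZ_64K;  /* 65536-byte default when no limit is set */
}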

>
>> And hence you can probably safely use ATA_SCSI_RBUF_SIZE as the buflen
>> param in the sg_copy_from_buffer() calls (at least in the case of
>> ata_format_sct_write_same()).
>
> No. SCT Write Same has a fixed single 512 byte transfer.

Never mind that. I had been presuming that you would always fully
copy ata_scsi_rbuf to the sglist.

>
>> However, the return value of
>> ata_format_dsm_trim_descr() should still always be used_bytes since
>> that is needed by the ata taskfile construction.
>
> So long as it does not exceed its sglist/sector_size buffer.
>
>> You may want to check (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE). If
>> it is true, then perhaps we may want to return 0, and make the SATL
>> response with invalid CDB field if we catch that.
>
No, that is not quite right: you need to check whether you are
overflowing either RBUF or sdp->sector_size.

Again, this does not make any sense. See above. Apparently you really
have no reason to use ata_scsi_rbuf.

>
>> Though IMHO this is really NOT a reason that is strong enough to
>> prevent this patch from entering the repo first.
>
>> On 23 August 2016 at 09:36, Tom Yan  wrote:
>>> On 23 August 2016 at 09:18, Shaun Tancheff  
>>> wrote:
>>>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan  wrote:
>>>>> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:

>>>>> If we really want/need to avoid hitting some real buffer limit (e.g.
>>>>> maximum length of scatter/gather list?), then we should in some way
>>>>> check n_block against that. If it is too large we then return
>>>>> used_bytes = 0 (optionally with some follow-up to add a response to
>>>>> such return value or so).

>>>> Yes there is a real buffer limit, I can think of these two options:
>>>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>>>    or sd_setup_write_same_cmnd() are providing an sglist of
>>>>    sdp->sector_size via scsi_init_io()
>>>
>>> That sounds completely wrong. The scatter/gather list we are talking
>>> about here has nothing to do with the SCSI or block layer anymore. The
>>> SATL has _already_ parsed the SCSI Write Same (16) command and is
>>> packing ranges/payload according to that in this stage. If there is
>>> any limit, it would probably be the max_segment allowed by the host
>>> driver (e.g. ahci).

Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 5:17 AM, Tom Yan  wrote:
> Wait a minute. I think you missed or misunderstood something when you
> listened to someone's opinion that we should switch to the sglist buffer.

No, I think I can trust Christoph Hellwig 

> I think the danger people referred to is exactly what is revealed when
> the ugly code is removed in this commit (it doesn't mean that the code
> should be kept though).
>
> The original buffer appears to be open:
> buf = page_address(sg_page(scsi_sglist(scmd)));

Which is unsafe.

> While the new buffer you adopted in ata_format_sct_write_same() and
> ata_format_dsm_trim_descr() is of fixed size:

Yes ... it is a temporary response buffer for simulated commands used to
copy data to and from the command sg_list so as not to hold irqs while
modifying the buffer.

> buffer = ((void *)ata_scsi_rbuf);
>
> sctpg = ((void *)ata_scsi_rbuf);
>
> because:
>
> #define ATA_SCSI_RBUF_SIZE  4096
> ...
> static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
>
> So the sglist buffer is always 4096 bytes.

No. The sglist buffer attached to the write same / trim command
is always sdp->sector_size

> And hence you can probably safely use ATA_SCSI_RBUF_SIZE as the buflen
> param in the sg_copy_from_buffer() calls (at least in the case of
> ata_format_sct_write_same()).

No. SCT Write Same has a fixed single 512 byte transfer.

> However, the return value of
> ata_format_dsm_trim_descr() should still always be used_bytes since
> that is needed by the ata taskfile construction.

So long as it does not exceed its sglist/sector_size buffer.

> You may want to check (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE). If
> it is true, then perhaps we may want to return 0, and make the SATL
> response with invalid CDB field if we catch that.

No, that is not quite right: you need to check whether you are
overflowing either RBUF or sdp->sector_size.

> Though IMHO this is really NOT a reason that is strong enough to
> prevent this patch from entering the repo first.

> On 23 August 2016 at 09:36, Tom Yan  wrote:
>> On 23 August 2016 at 09:18, Shaun Tancheff  
>> wrote:
>>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan  wrote:
>>>> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:
>>>
>>>> If we really want/need to avoid hitting some real buffer limit (e.g.
>>>> maximum length of scatter/gather list?), then we should in some way
>>>> check n_block against that. If it is too large we then return
>>>> used_bytes = 0 (optionally with some follow-up to add a response to
>>>> such return value or so).
>>>
>>> Yes there is a real buffer limit, I can think of these two options:
>>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>>or sd_setup_write_same_cmnd() are providing an sglist of
>>>sdp->sector_size via scsi_init_io()
>>
>> That sounds completely wrong. The scatter/gather list we are talking
>> about here has nothing to do with the SCSI or block layer anymore. The
>> SATL has _already_ parsed the SCSI Write Same (16) command and is
>> packing ranges/payload according to that in this stage. If there is
>> any limit, it would probably be the max_segment allowed by the host driver
>> (e.g. ahci).
>>
>> It doesn't seem to make sense to me either that we would need to
>> prevent sglist overflow in such level. Doesn't that mean we would need
>> to do the same checking (specifically, as in hard coding checks in all
>> kinds of procedures) in every use of scatter/gather list? That doesn't
>> sound right at all.
>>
>>>
>>> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
>>> sglist and calculate the available buffer size.
>>>
>>> #2 sounds like more fun but I'm not sure it's what people would prefer to 
>>> see.
>>
>> No idea if such a thing exists / makes sense at all.
>>
>>>
>>> --
>>> Shaun



-- 
Shaun Tancheff


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Tom Yan
Wait a minute. I think you missed or misunderstood something when you
listened to someone's opinion that we should switch to the sglist buffer.
I think the danger people referred to is exactly what is revealed when
the ugly code is removed in this commit (it doesn't mean that the code
should be kept, though).

The original buffer appears to be open:
buf = page_address(sg_page(scsi_sglist(scmd)));

While the new buffer you adopted in ata_format_sct_write_same() and
ata_format_dsm_trim_descr() is of fixed size:

buffer = ((void *)ata_scsi_rbuf);

sctpg = ((void *)ata_scsi_rbuf);

because:

#define ATA_SCSI_RBUF_SIZE  4096
...
static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];

So the sglist buffer is always 4096 bytes.

And hence you can probably safely use ATA_SCSI_RBUF_SIZE as the buflen
param in the sg_copy_from_buffer() calls (at least in the case of
ata_format_sct_write_same()). However, the return value of
ata_format_dsm_trim_descr() should still always be used_bytes since
that is needed by the ata taskfile construction.

You may want to check whether (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE).
If it is true, then perhaps we should return 0 and have the SATL respond
with an invalid-field-in-CDB error when we catch that.
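
A minimal sketch of that guard, assuming the caller maps a zero return
to an INVALID FIELD IN CDB response (DIV_ROUND_UP because a partial
final range still consumes a whole 8-byte entry):

        if (DIV_ROUND_UP(n_block, 65535) * 8 > ATA_SCSI_RBUF_SIZE)
                return 0;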

Though IMHO this is really NOT a reason that is strong enough to
prevent this patch from entering the repo first.

On 23 August 2016 at 09:36, Tom Yan  wrote:
> On 23 August 2016 at 09:18, Shaun Tancheff  wrote:
>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan  wrote:
>>> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:
>>
>>> If we really want/need to avoid hitting some real buffer limit (e.g.
>>> maximum length of scatter/gather list?), then we should in some way
>>> check n_block against that. If it is too large we then return
>>> used_bytes = 0 (optionally with some follow-up to add a response to
>>> such return value or so).
>>
>> Yes there is a real buffer limit, I can think of these two options:
>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>or sd_setup_write_same_cmnd() are providing an sglist of
>>sdp->sector_size via scsi_init_io()
>
> That sounds completely wrong. The scatter/gather list we are talking
> about here has nothing to do with the SCSI or block layer anymore. The
> SATL has _already_ parsed the SCSI Write Same (16) command and is
> packing ranges/payload according to that in this stage. If there is
> any limit, it would probably be the max_segment allowed by the host driver
> (e.g. ahci).
>
> It doesn't seem to make sense to me either that we would need to
> prevent sglist overflow in such level. Doesn't that mean we would need
> to do the same checking (specifically, as in hard coding checks in all
> kinds of procedures) in every use of scatter/gather list? That doesn't
> sound right at all.
>
>>
>> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
>> sglist and calculate the available buffer size.
>>
>> #2 sounds like more fun but I'm not sure it's what people would prefer to 
>> see.
>
> No idea if such a thing exists / makes sense at all.
>
>>
>> --
>> Shaun


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Shaun Tancheff
On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan  wrote:
> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:

> If we really want/need to avoid hitting some real buffer limit (e.g.
> maximum length of scatter/gather list?), then we should in some way
> check n_block against that. If it is too large we then return
> used_bytes = 0 (optionally with some follow-up to add a response to
> such return value or so).

Yes there is a real buffer limit, I can think of these two options:
1- Assume the setups from sd_setup_discard_cmnd() and/
   or sd_setup_write_same_cmnd() are providing an sglist of
   sdp->sector_size via scsi_init_io()

2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
sglist and calculate the available buffer size.

#2 sounds like more fun but I'm not sure it's what people would prefer to see.
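
A minimal sketch of option #2, assuming sg_get_size() is the
hypothetical helper named above (for_each_sg() is the standard
scatterlist iterator):

static unsigned int sg_get_size(struct scatterlist *sgl, unsigned int nents)
{
        struct scatterlist *sg;
        unsigned int bytes = 0;
        int i;

        for_each_sg(sgl, sg, nents, i)
                bytes += sg->length;    /* sum the segment lengths */

        return bytes;
}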

--
Shaun


RE: Observing Softlockup's while running heavy IOs

2016-08-23 Thread Kashyap Desai
> -Original Message-
> From: Elliott, Robert (Persistent Memory) [mailto:elli...@hpe.com]
> Sent: Saturday, August 20, 2016 2:58 AM
> To: Sreekanth Reddy
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> irqbala...@lists.infradead.org; Kashyap Desai; Sathya Prakash Veerichetty;
> Chaitra Basappa; Suganath Prabu Subramani
> Subject: RE: Observing Softlockup's while running heavy IOs
>
>
>
> > -Original Message-
> > From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> > Sent: Friday, August 19, 2016 6:45 AM
> > To: Elliott, Robert (Persistent Memory) 
> > Subject: Re: Observing Softlockup's while running heavy IOs
> >
> ...
> > Yes I am also observing that all the interrupts are routed to one CPU.
> > But still I observing softlockups (sometime hardlockups) even when I
> > set rq_affinity to 2.

How about the scenario below? For simplicity, assume an HBA with a single
MSI-x vector. (Whenever an HBA supports fewer MSI-x vectors than the
system has logical CPUs, we can see these issues frequently.)

Assume we have 32 logical CPUs (4 sockets, each with 8 logical CPUs).
CPU-0 is not participating in IO; the remaining CPUs, 1 through 31, are
submitting IO. In such a scenario, rq_affinity=2 and irqbalance
supporting the *exact* smp_affinity_hint will not help.

We may see soft/hard lockups on CPU-0. Are we going to resolve such
issues, or are they too rare to happen in the field?


>
> That'll ensure the block layer's completion handling is done there, but
> not your
> driver's interrupt handler (which precedes the block layer completion
> handling).
>
>
> > Is their any way to route the interrupts the same CPUs which has
> > submitted the corresponding IOs?
> > or
> > Is their any way/option in the irqbalance/kernel which can route
> > interrupts to CPUs (enabled in affinity_hint) in round robin manner
> > after specific time period.
>
> Ensure your driver creates one MSIX interrupt per CPU core, uses that
> interrupt
> for all submissions from that core, and reports that it would like that
> interrupt to
> be serviced by that core in /proc/irq/nnn/affinity_hint.
>
> Even with hyperthreading, this needs to be based on the logical CPU cores,
> not
> just the physical core or the physical socket.
> You can swamp a logical CPU core as easily as a physical CPU core.
>
> Then, provide an irqbalance policy script that honors the affinity_hint
> for your
> driver, or turn off irqbalance and manually set /proc/irq/nnn/smp_affinity
> to
> match the affinity_hint.
>
> Some versions of irqbalance honor the hints; some purposely don't and need
> to
> be overridden with a policy script.
>
>
> ---
> Robert Elliott, HPE Persistent Memory
>
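
As a hedged sketch of the per-CPU hint setup Robert describes above,
with illustrative names (a real driver would do this while allocating
its MSI-X vectors):

        int cpu;

        /* one vector per logical CPU: hint each IRQ to its submitter */
        for_each_online_cpu(cpu)
                irq_set_affinity_hint(msix_entries[cpu].vector,
                                      get_cpu_mask(cpu));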


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Tom Yan
On 23 August 2016 at 09:18, Shaun Tancheff  wrote:
> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan  wrote:
>> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:
>
>> If we really want/need to avoid hitting some real buffer limit (e.g.
>> maximum length of scatter/gather list?), then we should in some way
>> check n_block against that. If it is too large we then return
>> used_bytes = 0 (optionally with some follow-up to add a response to
>> such return value or so).
>
> Yes there is a real buffer limit, I can think of these two options:
> 1- Assume the setups from sd_setup_discard_cmnd() and/
>or sd_setup_write_same_cmnd() are providing an sglist of
>sdp->sector_size via scsi_init_io()

That sounds completely wrong. The scatter/gather list we are talking
about here has nothing to do with the SCSI or block layer anymore. The
SATL has _already_ parsed the SCSI Write Same (16) command and is
packing ranges/payload according to that in this stage. If there is
any limit, it would probably be the max_segment allowed by the host driver
(e.g. ahci).

It doesn't seem to make sense to me either that we would need to
prevent sglist overflow in such level. Doesn't that mean we would need
to do the same checking (specifically, as in hard coding checks in all
kinds of procedures) in every use of scatter/gather list? That doesn't
sound right at all.

>
> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
> sglist and calculate the available buffer size.
>
> #2 sounds like more fun but I'm not sure it's what people would prefer to see.

No idea if such a thing exists / makes sense at all.

>
> --
> Shaun


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Tom Yan
On 23 August 2016 at 08:37, Tom Yan  wrote:
> On 23 August 2016 at 07:30, Shaun Tancheff  wrote:
>>
>> But the "humanized" limit is the one you just added before proceeding
>> to change ata_set_lba_range_entries(). You changed it from a buffer
>> size to "num", and now you want to remove the protection entirely?
>
I am not sure what you are talking about here. What I did was just
avoid setting an (inappropriate and unnecessary) hard-coded limit (with
the original while-condition) for i (the unaligned buffer size) in this
general function (general in the sense that the TRIM payload is _not
always_ one 512-byte block, even if we may want to practice that in our
kernel forever).
>
> If we really want/need to avoid hitting some real buffer limit (e.g.
> maximum length of scatter/gather list?), then we should in some way
> check n_block against that. If it is too large we then return
> used_bytes = 0 (optionally with some follow-up to add a response to
> such return value or so).
>
We advertise 4194240 as the Maximum Write Same Length so that the
SCSI/block layer will know how to split the discard request, and then
we make sure that the SATL rejects SCSI commands that are not issued
through the discard / write same ioctl but with SG_IO /
sg_write_same. That's all we really need to do, end of story.
>
>>
>> Why not just put this in front of ata_set_lba_range_entries():
>>
>> if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
>>         fp = 2;
>>         goto invalid_fld;
>> }
>
Well, you can change the if-else clause to that. Apparently you've been
doing that in your patch series. But that has nothing to do with what
I am fixing here. Neither does it affect the actual behavior.
>
>>
>> And then restore ata_set_lba_range_entries() to how it looked
>> before you changed it in commit:
>>
>> 2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)
>
I don't see how that is relevant at all. Though I should have introduced
_this_ patch before that one to save some work. Unfortunately I wasn't
aware of how ugly ata_set_lba_range_entries was.
>
>>
>> Then you can have ata_set_lba_range_entries() take the buffer size ...
>> something like the following would be fine:
>>
>>   size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
>> block, n_block);
>>
>> Now things are happily protected against both exceeding the b0 limit(s) and
>> overflowing the sglist buffer.
>
We have no reason at all to bring the logical sector size in here.
Only if future ACS standards state that payload blocks are counted in
the logical block size instead of statically 512 would we need to
change the now-static "512" in ALIGN() into 4096.

Typo: I meant from the static "512" to the logical sector size, and I
really think we should do it ONLY if a later ACS revision changes its
statement.

>
For a possible sglist buffer overflow, see what I mentioned above about
checking n_block and returning 0. We would not want to do it the
original while-loop way anyway; all that would do is truncate a larger
request and partially complete it silently.
>
Can you tell exactly how the size of the sglist buffer is limited?
AFAIK it has nothing to do with the logical sector size. I doubt that we
will ever cause an overflow, given how conservative we have been about
the size of the TRIM payload. We would probably ignore the reported
value once it is larger than 16 or so, if we ever enable multi-block
payloads in the future.
>
Realistically speaking, I sent this patch not really to get the
function ready for multi-block payloads (nor have I mentioned that
in the commit message), but just to make the code proper and avoid the
silly effort (and confusion) you and I have spent in our patches.
>
>>
>> --
>> Shaun


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Tom Yan
On 23 August 2016 at 07:30, Shaun Tancheff  wrote:
>
> But the "humanized" limit is the one you just added before proceeding
> to change ata_set_lba_range_entries(). You changed it from a buffer
> size to "num", and now you want to remove the protection entirely?

I am not sure what you are talking about here. What I did was just
avoid setting an (inappropriate and unnecessary) hard-coded limit (with
the original while-condition) for i (the unaligned buffer size) in this
general function (general in the sense that the TRIM payload is _not
always_ one 512-byte block, even if we may want to practice that in our
kernel forever).

If we really want/need to avoid hitting some real buffer limit (e.g.
maximum length of scatter/gather list?), then we should in some way
check n_block against that. If it is too large we then return
used_bytes = 0 (optionally with some follow-up to add a response to
such return value or so).

We advertise 4194240 as the Maximum Write Same Length so that the
SCSI/block layer will know how to split the discard request, and then
we make sure that the SATL rejects SCSI commands that are not issued
through the discard / write same ioctl but with SG_IO /
sg_write_same. That's all we really need to do, end of story.

>
> Why not just put this in front of ata_set_lba_range_entries():
>
> if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
>         fp = 2;
>         goto invalid_fld;
> }

Well, you can change the if-else clause to that. Apparently you've been
doing that in your patch series. But that has nothing to do with what
I am fixing here. Neither does it affect the actual behavior.

>
> And then restore ata_set_lba_range_entries() to how it looked
> before you changed it in commit:
>
> 2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)

I don't see how that is relevant at all. Though I should have introduced
_this_ patch before that one to save some work. Unfortunately I wasn't
aware of how ugly ata_set_lba_range_entries was.

>
> Then you can have ata_set_lba_range_entries() take the buffer size ...
> something like the following would be fine:
>
>   size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
> block, n_block);
>
> Now things are happily protected against both exceeding the b0 limit(s) and
> overflowing the sglist buffer.

We have no reason at all to bring the logical sector size in here.
Only if future ACS standards state that payload blocks are counted in
the logical block size instead of statically 512 would we need to
change the now-static "512" in ALIGN() into 4096.
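
For reference, a sketch of the packing and padding being discussed,
loosely following the libata helper (8-byte entries holding a 48-bit
LBA plus a 16-bit sector count, payload padded to 512-byte blocks):

        buf[i++] = cpu_to_le64(sector | ((u64)count << 48));
        used_bytes = ALIGN(i * 8, 512); /* counted in 512-byte blocks */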

For a possible sglist buffer overflow, see what I mentioned above about
checking n_block and returning 0. We would not want to do it the
original while-loop way anyway; all that would do is truncate a larger
request and partially complete it silently.

Can you tell exactly how the size of the sglist buffer is limited?
AFAIK it has nothing to do with the logical sector size. I doubt that we
will ever cause an overflow, given how conservative we have been about
the size of the TRIM payload. We would probably ignore the reported
value once it is larger than 16 or so, if we ever enable multi-block
payloads in the future.

Realistically speaking, I sent this patch not really to get the
function ready for multi-block payloads (nor have I mentioned that
in the commit message), but just to make the code proper and avoid the
silly effort (and confusion) you and I have spent in our patches.

>
> --
> Shaun


Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()

2016-08-23 Thread Shaun Tancheff
On Mon, Aug 22, 2016 at 3:53 PM, Tom Yan  wrote:
> On 22 August 2016 at 20:32, Shaun Tancheff  wrote:
>> On Mon, Aug 22, 2016 at 3:07 PM, Tom Yan  wrote:
>>> I don't see how that's possible. count / n_block will always be
>>> smaller than 65535 * ATA_MAX_TRIM_RNUM(64) = 4194240. Not to mention
>>> that isn't even a "buffer limit" anyway. By SG_IO do you mean like
>>> SCSI Write Same commands that issued with sg_write_same or so? If
>>> that's the case, that's what exactly commit 5c79097a28c2
>>> ("libata-scsi: reject WRITE SAME (16) with n_block that exceeds
>>> limit") is for.
>>
>> Ah, I see. You are guarding the only user of ata_set_lba_range_entries().
>
> Yup. It is the only right thing to do anyway: we leave the
> function "open" and guard per context when we use it. Say
> ata_set_lba_range_entries() were to become a function shared by
> others; that would only make this commit more important. As I said, we
> did not guard it with a certain buffer limit, but merely redundantly
> guarded it with a ("humanized") limit that applies to TRIM only.

But the "humanized" limit is the one you just added before proceeding
to change ata_set_lba_range_entries(). You changed it from a buffer
size to "num", and now you want to remove the protection entirely?

Why not just put this in front of ata_set_lba_range_entries():

if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
        fp = 2;
        goto invalid_fld;
}

And then restore ata_set_lba_range_entries() to how it looked
before you changed it in commit:

2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)

Then you can have ata_set_lba_range_entries() take the buffer size ...
something like the following would be fine:

  size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
block, n_block);

Now things are happily protected against both exceeding the b0 limit(s) and
overflowing the sglist buffer.

--
Shaun