Re: [PATCH v2 2/4] On Discard either do Reset WP or Write Same
On Mon, Aug 22, 2016 at 8:25 PM, Damien Le Moal wrote:
>
> Shaun,
>
> On 8/23/16 09:22, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 6:57 PM, Damien Le Moal wrote:
>> Also you may note that in my patch to get Host Aware working
>> with the zone cache I do not include the runt zone in the cache.
>
> Why not? The RB-tree will handle it just fine (the insert and lookup
> code as Hannes had them was not relying on a constant zone size).

A good point. I didn't pay too much attention while bringing this
forward. I think a few of my hacks may be pointless now. I'll try to
rework it and get rid of the runt check.

>> So as it sits I need this fallback, otherwise doing blkdiscard over
>> the whole device ends in an error, as does mkfs.f2fs et al.
>
> Got it, but I do not see a problem with including it. I have not checked
> the code, but the split of a big discard call into "chunks" should
> already handle the last chunk and make sure that the operation does
> not exceed the device capacity (in any case, that's easy to fix in the
> sd_zbc_setup_discard code).

Yes, I agree the split of big discards does handle the last chunk
correctly.

>>> Some 10TB host managed disks out there have 1% conventional zone space,
>>> that is 100GB of capacity. When issuing a "reset all", doing a write
>>> same in these zones will take forever... If the user really wants zeroes
>>> in those zones, let it issue a zeroout.
>>>
>>> I think that it would be a better choice to simply not report
>>> discard_zeroes_data as true and do nothing for conventional zone reset.
>>
>> I think that would be unfortunate for Host Managed but I think it's
>> the right choice for Host Aware at this time. So either we base
>> it on disk type or we have some other config flag added to sysfs.
>
> I do not see any difference between host managed and host aware. Both
> define the same behavior for reset, and both end up in a NOP for
> conventional zone reset (no data "erasure" required by the standard).
> For write pointer zones, reading unwritten LBAs returns the
> initialization pattern, with the exception of host-managed disks with
> the URSWRZ bit set to 0. But that case is covered in sd.c, so the
> behavior is consistent across all models. So why force data zeroing
> when the standards do not mandate it?

Well, you do have a point. It appears to be only mkfs and similar tools
that are really utilizing discard zeroes data at the moment.

I did a quick test:

mkfs -t ext4 -b 4096 -g 32768 -G 32 \
  -E lazy_itable_init=0,lazy_journal_init=0,offset=0,num_backup_sb=0,packed_meta_blocks=1,discard \
  -O flex_bg,extent,sparse_super2

  - discard zeroes data true  - 3 minutes
  - discard zeroes data false - 6 minutes

So for the smaller conventional space on the current HA drive there is
some advantage to enabling discard zeroes data. However, for a larger
conventional space you are correct: the overall impact is worse
performance.

For some reason I had been assuming that some file systems used or
relied on discard zeroes data during normal operation. Now that I am
looking for it I don't seem to be finding any evidence of that, so
aside from mkfs I don't have as good an argument for discard zeroes
data as I thought I did.

Regards,
Shaun
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/29] be2iscsi: driver update 11.2.0.0
> "Jitendra" == Jitendra Bhivare writes:

Jitendra Bhivare (29):
  be2iscsi: Fix to use correct configuration values
  be2iscsi: Replace _bh version for mcc_lock spinlock
  be2iscsi: Reduce driver load/unload time
  be2iscsi: Set and return right iface v4/v6 states
  be2iscsi: Fix gateway APIs to support IPv4 & IPv6
  be2iscsi: Fix release of DHCP IP in static mode
  be2iscsi: Move VLAN code to common iface_set_param
  be2iscsi: Update iface handle before any set param
  be2iscsi: Rename iface get/set/create/destroy APIs
  be2iscsi: Handle only NET_PARAM in iface_get_param
  be2iscsi: Check all zeroes IP before issuing IOCTL
  be2iscsi: Remove alloc_mcc_tag & beiscsi_pci_soft_reset
  be2iscsi: Remove isr_lock and dead code
  be2iscsi: Fix checks for HBA in error state
  be2iscsi: Fix to make boot discovery non-blocking
  be2iscsi: Fix to add timer for UE detection
  be2iscsi: Add IOCTL to check UER supported
  be2iscsi: Move functions to right files
  be2iscsi: Fix POST check and reset sequence
  be2iscsi: Add V1 of EPFW cleanup IOCTL
  be2iscsi: Add TPE recovery feature
  be2iscsi: Fail the sessions immediately after TPE
  be2iscsi: Add FUNCTION_RESET during driver unload
  be2iscsi: Fix async PDU handling path
  be2iscsi: Fix bad WRB index error
  be2iscsi: Fix queue and connection parameters
  be2iscsi: Update copyright information
  be2iscsi: Update the driver version
  MAINTAINERS: Update be2iscsi contact info

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen      Oracle Linux Engineering
Re: [PATCH] megaraid_sas: Use memdup_user() rather than duplicating its implementation
> "SF" == SF Markus Elfring writes:

SF> Reuse existing functionality from memdup_user() instead of keeping
SF> duplicate source code.

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen      Oracle Linux Engineering
Re: [PATCH v2] fcoe: provide translation table between Ethernet and FC port speeds
> "Johannes" == Johannes Thumshirn writes:

Johannes> Provide a translation table between Ethernet and FC port
Johannes> speeds so odd speeds (from an Ethernet POV) like 8 Gbit are
Johannes> correctly mapped to sysfs and open-fcoe's fcoeadm.

Applied to 4.9/scsi-queue.

-- 
Martin K. Petersen      Oracle Linux Engineering
Re: [PATCH 4/6] cxlflash: Transition to application close model
> "Matthew" == Matthew R Ochs writes:

Matthew> Caching the adapter file descriptor and performing a close on
Matthew> behalf of an application is a poor design. This is due to the
Matthew> fact that once a file descriptor is installed, it is free to be
Matthew> altered without the knowledge of the cxlflash driver. This can
Matthew> lead to inconsistencies between the application and the
Matthew> kernel. Furthermore, the nature of the former design is more
Matthew> exploitable and thus should be abandoned.

Matthew> To support applications performing a close on the adapter file
Matthew> that is associated with a context, a new flag is introduced to
Matthew> the user API to indicate to applications that they are
Matthew> responsible for the close following the cleanup (detach) of a
Matthew> context. The documentation is also updated to reflect this
Matthew> change in behavior.

Applied patches 4 through 6 to 4.9/scsi-queue.

-- 
Martin K. Petersen      Oracle Linux Engineering
Re: linux-next: Tree for Jun 6 (scsi_debug.c)
This one still exists on Linus's tree (as of v4.8-rc3).

  CALL    scripts/checksyscalls.sh
  CHK     kernel/config_data.h
  Building modules, stage 2.
  MODPOST 720 modules
ERROR: "ip_compute_csum" [drivers/scsi/scsi_debug.ko] undefined!
scripts/Makefile.modpost:91: recipe for target '__modpost' failed
make[1]: *** [__modpost] Error 1
Makefile:1178: recipe for target 'modules' failed
make: *** [modules] Error 2

Masanari

On Tue, Jun 7, 2016 at 12:10 PM, Randy Dunlap wrote:
> On 06/05/16 21:20, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20160603:
>
> on x86_64:
>
> ERROR: "ip_compute_csum" [drivers/scsi/scsi_debug.ko] undefined!
>
> CONFIG_GENERIC_CSUM doesn't seem to exist for x86, so lib/checksum.o
> is not built.
>
> --
> ~Randy
Re: Oops when completing request on the wrong queue
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
>> My workload looks similar to yours, in that it's high depth and with a
>> lot of jobs to keep most CPUs loaded. My bash script is different than
>> yours, I'll try that and see if it helps here.
>
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

I haven't recreated this either, but I think I can logically see why
this failure is happening.

I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple of reasons.

First, we can't release the nvmeq->tags when a hctx exits, because that
nvmeq may be used by other namespaces that need to point to the
device's tag set. The other reason is that blk-mq doesn't exit or init
hardware contexts when remapping for a CPU event, leaving the nvme
driver unaware that a hardware context points to a different tag set.

So I think I see why this test would fail; I don't know about a fix
yet. Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
Re: Oops when completing request on the wrong queue
On 08/23/2016 03:11 PM, Jens Axboe wrote:
> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi writes:
>>>> Can you share what you ran to online/offline CPUs? I can't
>>>> reproduce this here.
>>>
>>> I was using the ppc64_cpu tool, which shouldn't do anything more
>>> than write to sysfs. But I just reproduced it with the script below.
>>>
>>> Note that this is ppc64le. I don't have an x86 in hand to attempt to
>>> reproduce right now, but I'll look for one and see how it goes.
>>
>> Hi,
>>
>> Any luck on reproducing it? We were initially reproducing with a
>> proprietary stress test, but I gave a try to a generated fio jobfile
>> associated with the SMT script I shared earlier and I could reproduce
>> the crash consistently in less than 10 minutes of execution. This was
>> still ppc64le, though. I couldn't get my hands on nvme on x86 yet.
>
> Nope, I have not been able to reproduce it. How long do the CPU
> offline/online actions take on ppc64? It's pretty slow on x86, which
> may hide the issue. I took out the various printk's associated with
> bringing a CPU off/online, as well as IRQ breaking parts, but that
> didn't help in reproducing it.
>
>> The job file I used, as well as the smt.sh script, in case you want
>> to give it a try:
>>
>> jobfile: http://krisman.be/k/nvmejob.fio
>> smt.sh: http://krisman.be/k/smt.sh
>>
>> Still, the trigger seems to be consistently a heavy load of IO
>> associated with CPU addition/removal.
>
> My workload looks similar to yours, in that it's high depth and with a
> lot of jobs to keep most CPUs loaded. My bash script is different than
> yours, I'll try that and see if it helps here.

Actually, I take that back. You're not using O_DIRECT, hence all your
jobs are running at QD=1, not the 256 specified. That looks odd, but
I'll try, maybe it'll hit something different.

-- 
Jens Axboe
Re: Oops when completing request on the wrong queue
On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi writes:
>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>> this here.
>>
>> I was using the ppc64_cpu tool, which shouldn't do anything more than
>> write to sysfs. But I just reproduced it with the script below.
>>
>> Note that this is ppc64le. I don't have an x86 in hand to attempt to
>> reproduce right now, but I'll look for one and see how it goes.
>
> Hi,
>
> Any luck on reproducing it? We were initially reproducing with a
> proprietary stress test, but I gave a try to a generated fio jobfile
> associated with the SMT script I shared earlier and I could reproduce
> the crash consistently in less than 10 minutes of execution. This was
> still ppc64le, though. I couldn't get my hands on nvme on x86 yet.

Nope, I have not been able to reproduce it. How long do the CPU
offline/online actions take on ppc64? It's pretty slow on x86, which
may hide the issue. I took out the various printk's associated with
bringing a CPU off/online, as well as IRQ breaking parts, but that
didn't help in reproducing it.

> The job file I used, as well as the smt.sh script, in case you want to
> give it a try:
>
> jobfile: http://krisman.be/k/nvmejob.fio
> smt.sh: http://krisman.be/k/smt.sh
>
> Still, the trigger seems to be consistently a heavy load of IO
> associated with CPU addition/removal.

My workload looks similar to yours, in that it's high depth and with a
lot of jobs to keep most CPUs loaded. My bash script is different than
yours, I'll try that and see if it helps here.

-- 
Jens Axboe
Re: Oops when completing request on the wrong queue
Gabriel Krisman Bertazi writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do anything more than
> write to sysfs. But I just reproduced it with the script below.
>
> Note that this is ppc64le. I don't have an x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it? We were initially reproducing with a
proprietary stress test, but I gave a try to a generated fio jobfile
associated with the SMT script I shared earlier and I could reproduce
the crash consistently in less than 10 minutes of execution. This was
still ppc64le, though. I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh: http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple of days in the hope that
it rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in
the freeze/unfreeze mechanism. If a request were escaping and going
through the block layer during a freeze, we'd see freeze_depth >= 1.
Before that, I had also tried to keep the q_usage_counter in atomic
mode, in case of a bug in the percpu refcount. No luck, the BUG_ON was
still hit.

Also, I don't see anything special about the request that reaches the
BUG_ON. It's a REQ_TYPE_FS request and, at least the last time I
reproduced, it was a READ that came from the stress test task through
submit_bio. So nothing remarkable about it, as far as I can see.

I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where
the ctx got remapped to another hctx. I'm still learning my way through
the cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't. Some preliminary tests don't suggest it's the case
at play, but I wanna spend a little more time on this theory (maybe for
my lack of better ideas :)

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify. Since the
CPU is dead, no more requests would be coming through its ctx. I think
we could force a queue run in blk_mq_queue_reinit_notify, before
remapping, which would cause the hctx to fetch the remaining requests
from that dead ctx (since it's not unmapped yet). This way, we could
maintain a single hotplug notification hook and simplify the hotplug
path. I haven't written code for it yet, but I'll see if I can come up
with something and send it to the list.

-- 
Gabriel Krisman Bertazi
Re: [PATCH 5/6] target/tcm_fc: Update debugging statements to match libfc usage
On 08/22/2016 01:54 AM, Hannes Reinecke wrote:
> +     char *reason = "no session created";

Please use const char * instead of char * in the above declaration.

Thanks,

Bart.
Re: [GIT PULL] [PATCH v4 00/26] Delete CURRENT_TIME and CURRENT_TIME_SEC macros
On Monday, August 15, 2016 6:23:12 PM CEST Greg KH wrote:
> On Sat, Aug 13, 2016 at 03:48:12PM -0700, Deepa Dinamani wrote:
>> The series is aimed at getting rid of the CURRENT_TIME and
>> CURRENT_TIME_SEC macros. The macros are not y2038 safe, and there is
>> no plan to transition them into being y2038 safe. The ktime_get_*
>> APIs can be used in their place, and these are y2038 safe.
>
> Who are you expecting to pull this huge patch series?

Dave Chinner suggested having Al Viro pick up the whole series.

> Why not just introduce the new api call, wait for that to be merged,
> and then push the individual patches through the different subsystems?
> After half of those get ignored, then provide a single set of patches
> that can go through Andrew or my trees.

That was the original approach for v4.7, but (along with requesting a
number of reworks that Deepa incorporated), Linus preferred having the
API change done in one chunk, see

https://patchwork.kernel.org/patch/9134249/

Arnd
Confirmation of your email address
Hello!

We are pleased that you have signed up for our free newsletter. From
the time of your registration you will regularly receive, completely
free of charge, all the latest news on the topic you selected,
delivered to your email address.

Complete your registration by clicking the link below:

http://rdr.lcp-prios.com/c/6f27d9719e475471a41620e31d160c733497d466cd8ae2ad148e45e40e330acb0aa136118d9386acc4dedb63a5b929e1?sToken=listcompanion

Changed your mind, or didn't sign up for our service yourself? No
problem; in that case you don't need to do anything further. You will
receive no further emails from us.

Your ListCompanion team
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On 23 August 2016 at 10:45, Shaun Tancheff wrote:
> On Tue, Aug 23, 2016 at 5:17 AM, Tom Yan wrote:
>> Wait a minute. I think you missed or misunderstood something when you
>> listened to someone's opinion that we should switch to the sglist
>> buffer.
>
> No, I think I can trust Christoph Hellwig

I didn't say that you cannot trust him, but I just wonder whether you
might have misinterpreted what he said. I haven't really followed your
patch series earlier, though.

>> I think the danger people referred to is exactly what is revealed
>> when the ugly code is removed in this commit (it doesn't mean that
>> the code should be kept, though).
>>
>> The original buffer appears to be open:
>>
>> buf = page_address(sg_page(scsi_sglist(scmd)));
>
> Which is unsafe.

Yup, but the while loop is not the right way to make it safe. It
probably prevents any overflow, but it does not handle / respond to the
request (n_block) properly. It would only be practically unsafe if we
would ever call it with a request that needs a multi-block payload.

>> While the new buffer you adopted in ata_format_sct_write_same() and
>> ata_format_dsm_trim_descr() is of fixed size:
>
> Yes ... it is a temporary response buffer for simulated commands, used
> to copy data to and from the command sg_list so as not to hold irqs
> while modifying the buffer.
>
>> buffer = ((void *)ata_scsi_rbuf);
>>
>> sctpg = ((void *)ata_scsi_rbuf);
>>
>> because:
>>
>> #define ATA_SCSI_RBUF_SIZE 4096
>> ...
>> static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
>>
>> So the sglist buffer is always 4096 bytes.
>
> No. The sglist buffer attached to the write same / trim command

Don't you think it is inappropriate to use ata_scsi_rbuf here, then?

> is always sdp->sector_size

I can hardly see how sdp->sector_size is relevant to the sglist buffer
(especially since the sglist here is facing the ATA device).
Scatter/gather lists can have multiple entries of a certain size (which
is not necessarily, and is in fact unlikely to be, the logical sector
size).

Although I don't really know how scatter/gather lists work, it hardly
seems to make any sense to say that the sglist buffer is always the
logical sector size. AFAIK, for example, the USB Attached SCSI driver
does not limit the maximum number of entries (max_segment), so the
kernel falls back to SG_MAX_SEGMENTS (2048), while it reports a
max_segment_size of 4096 with dma_get_max_seg_size() in spite of the
logical sector size. For AHCI, it reports a max_segment of 168, while
it does not seem to report any max_segment_size, so
dma_get_max_seg_size() falls back to 65536 (though practically it
seems 4096 will still be used in normal writing).

>> And hence you can probably safely use ATA_SCSI_RBUF_SIZE as the
>> buflen param in the sg_copy_from_buffer() calls (at least in the case
>> of ata_format_sct_write_same()).
>
> No. SCT Write Same has a fixed single 512 byte transfer.

Never mind that. I had the presumption that you should always fully
copy ata_scsi_rbuf to the sglist.

>> However, the return value of ata_format_dsm_trim_descr() should
>> still always be used_bytes since that is needed by the ata taskfile
>> construction.
>
> So long as it does not exceed its sglist/sector_size buffer.
>
>> You may want to check (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE). If
>> it is true, then perhaps we may want to return 0, and make the SATL
>> respond with invalid CDB field if we catch that.
>
> No, that is not quite right. You need to check whether you are
> overflowing either RBUF or sdp->sector_size.

Again, this does not make any sense. See above. Apparently you really
have no reason to use ata_scsi_rbuf.

>> Though IMHO this is really NOT a reason that is strong enough to
>> prevent this patch from entering the repo first.
>>
>> On 23 August 2016 at 09:36, Tom Yan wrote:
>>> On 23 August 2016 at 09:18, Shaun Tancheff wrote:
>>>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan wrote:
>>>>> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>>>>
>>>>> If we really want/need to avoid hitting some real buffer limit
>>>>> (e.g. maximum length of scatter/gather list?), then we should in
>>>>> some way check n_block against that. If it is too large we then
>>>>> return used_bytes = 0 (optionally with some follow-up to add a
>>>>> response to such return value or so).
>>>>
>>>> Yes, there is a real buffer limit. I can think of these two options:
>>>>
>>>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>>>    or sd_setup_write_same_cmnd() are providing an sglist of
>>>>    sdp->sector_size via scsi_init_io()
>>>
>>> That sounds completely wrong. The scatter/gather list we are talking
>>> about here has nothing to do with the SCSI or block layer anymore.
>>> The SATL has _already_ parsed the SCSI Write Same (16) command and
>>> is packing ranges/payload according to that in this stage. If there
>>> is any limit it would probably be the max_segment allowed by the
>>> host driver (e.g. ahci).
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On Tue, Aug 23, 2016 at 5:17 AM, Tom Yan wrote:
> Wait a minute. I think you missed or misunderstood something when you
> listened to someone's opinion that we should switch to the sglist
> buffer.

No, I think I can trust Christoph Hellwig

> I think the danger people referred to is exactly what is revealed when
> the ugly code is removed in this commit (it doesn't mean that the code
> should be kept, though).
>
> The original buffer appears to be open:
>
> buf = page_address(sg_page(scsi_sglist(scmd)));

Which is unsafe.

> While the new buffer you adopted in ata_format_sct_write_same() and
> ata_format_dsm_trim_descr() is of fixed size:

Yes ... it is a temporary response buffer for simulated commands, used
to copy data to and from the command sg_list so as not to hold irqs
while modifying the buffer.

> buffer = ((void *)ata_scsi_rbuf);
>
> sctpg = ((void *)ata_scsi_rbuf);
>
> because:
>
> #define ATA_SCSI_RBUF_SIZE 4096
> ...
> static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
>
> So the sglist buffer is always 4096 bytes.

No. The sglist buffer attached to the write same / trim command is
always sdp->sector_size.

> And hence you can probably safely use ATA_SCSI_RBUF_SIZE as the buflen
> param in the sg_copy_from_buffer() calls (at least in the case of
> ata_format_sct_write_same()).

No. SCT Write Same has a fixed single 512 byte transfer.

> However, the return value of ata_format_dsm_trim_descr() should still
> always be used_bytes since that is needed by the ata taskfile
> construction.

So long as it does not exceed its sglist/sector_size buffer.

> You may want to check (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE). If
> it is true, then perhaps we may want to return 0, and make the SATL
> respond with invalid CDB field if we catch that.

No, that is not quite right. You need to check whether you are
overflowing either RBUF or sdp->sector_size.

> Though IMHO this is really NOT a reason that is strong enough to
> prevent this patch from entering the repo first.
>
> On 23 August 2016 at 09:36, Tom Yan wrote:
>> On 23 August 2016 at 09:18, Shaun Tancheff wrote:
>>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan wrote:
>>>> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>>>
>>>> If we really want/need to avoid hitting some real buffer limit
>>>> (e.g. maximum length of scatter/gather list?), then we should in
>>>> some way check n_block against that. If it is too large we then
>>>> return used_bytes = 0 (optionally with some follow-up to add a
>>>> response to such return value or so).
>>>
>>> Yes, there is a real buffer limit. I can think of these two options:
>>>
>>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>>    or sd_setup_write_same_cmnd() are providing an sglist of
>>>    sdp->sector_size via scsi_init_io()
>>
>> That sounds completely wrong. The scatter/gather list we are talking
>> about here has nothing to do with the SCSI or block layer anymore.
>> The SATL has _already_ parsed the SCSI Write Same (16) command and is
>> packing ranges/payload according to that in this stage. If there is
>> any limit it would probably be the max_segment allowed by the host
>> driver (e.g. ahci).
>>
>> It doesn't seem to make sense to me either that we would need to
>> prevent sglist overflow at such a level. Doesn't that mean we would
>> need to do the same checking (specifically, as in hard-coding checks
>> in all kinds of procedures) in every use of scatter/gather lists?
>> That doesn't sound right at all.
>>
>>> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
>>>    sglist and calculate the available buffer size.
>>>
>>> #2 sounds like more fun but I'm not sure it's what people would
>>> prefer to see.
>>
>> No idea if such a thing exists / makes sense at all.

-- 
Shaun Tancheff
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
Wait a minute. I think you missed or misunderstood something when you
listened to someone's opinion that we should switch to the sglist
buffer. I think the danger people referred to is exactly what is
revealed when the ugly code is removed in this commit (it doesn't mean
that the code should be kept, though).

The original buffer appears to be open:

buf = page_address(sg_page(scsi_sglist(scmd)));

While the new buffer you adopted in ata_format_sct_write_same() and
ata_format_dsm_trim_descr() is of fixed size:

buffer = ((void *)ata_scsi_rbuf);

sctpg = ((void *)ata_scsi_rbuf);

because:

#define ATA_SCSI_RBUF_SIZE 4096
...
static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];

So the sglist buffer is always 4096 bytes. And hence you can probably
safely use ATA_SCSI_RBUF_SIZE as the buflen param in the
sg_copy_from_buffer() calls (at least in the case of
ata_format_sct_write_same()). However, the return value of
ata_format_dsm_trim_descr() should still always be used_bytes, since
that is needed by the ata taskfile construction.

You may want to check (n_block / 65535 * 8 > ATA_SCSI_RBUF_SIZE). If it
is true, then perhaps we may want to return 0, and make the SATL
respond with invalid CDB field if we catch that.

Though IMHO this is really NOT a reason that is strong enough to
prevent this patch from entering the repo first.

On 23 August 2016 at 09:36, Tom Yan wrote:
> On 23 August 2016 at 09:18, Shaun Tancheff wrote:
>> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan wrote:
>>> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>>
>>> If we really want/need to avoid hitting some real buffer limit (e.g.
>>> maximum length of scatter/gather list?), then we should in some way
>>> check n_block against that. If it is too large we then return
>>> used_bytes = 0 (optionally with some follow-up to add a response to
>>> such return value or so).
>>
>> Yes, there is a real buffer limit. I can think of these two options:
>>
>> 1- Assume the setups from sd_setup_discard_cmnd() and/
>>    or sd_setup_write_same_cmnd() are providing an sglist of
>>    sdp->sector_size via scsi_init_io()
>
> That sounds completely wrong. The scatter/gather list we are talking
> about here has nothing to do with the SCSI or block layer anymore. The
> SATL has _already_ parsed the SCSI Write Same (16) command and is
> packing ranges/payload according to that in this stage. If there is
> any limit it would probably be the max_segment allowed by the host
> driver (e.g. ahci).
>
> It doesn't seem to make sense to me either that we would need to
> prevent sglist overflow at such a level. Doesn't that mean we would
> need to do the same checking (specifically, as in hard-coding checks
> in all kinds of procedures) in every use of scatter/gather lists? That
> doesn't sound right at all.
>
>> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
>>    sglist and calculate the available buffer size.
>>
>> #2 sounds like more fun but I'm not sure it's what people would
>> prefer to see.
>
> No idea if such a thing exists / makes sense at all.
>
>> --
>> Shaun
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan wrote:
> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>
> If we really want/need to avoid hitting some real buffer limit (e.g.
> maximum length of scatter/gather list?), then we should in some way
> check n_block against that. If it is too large we then return
> used_bytes = 0 (optionally with some follow-up to add a response to
> such return value or so).

Yes, there is a real buffer limit. I can think of these two options:

1- Assume the setups from sd_setup_discard_cmnd() and/
   or sd_setup_write_same_cmnd() are providing an sglist of
   sdp->sector_size via scsi_init_io()

2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
   sglist and calculate the available buffer size.

#2 sounds like more fun but I'm not sure it's what people would prefer
to see.

-- 
Shaun
RE: Observing Softlockup's while running heavy IOs
> -----Original Message-----
> From: Elliott, Robert (Persistent Memory) [mailto:elli...@hpe.com]
> Sent: Saturday, August 20, 2016 2:58 AM
> To: Sreekanth Reddy
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> irqbala...@lists.infradead.org; Kashyap Desai; Sathya Prakash Veerichetty;
> Chaitra Basappa; Suganath Prabu Subramani
> Subject: RE: Observing Softlockup's while running heavy IOs
>
> > -----Original Message-----
> > From: Sreekanth Reddy [mailto:sreekanth.re...@broadcom.com]
> > Sent: Friday, August 19, 2016 6:45 AM
> > To: Elliott, Robert (Persistent Memory)
> > Subject: Re: Observing Softlockup's while running heavy IOs
> >
> ...
> > Yes, I am also observing that all the interrupts are routed to one CPU.
> > But I am still observing softlockups (sometimes hardlockups) even when
> > I set rq_affinity to 2.

How about the below scenario? For simplicity, take an HBA with a single
MSI-x vector. (Whenever an HBA supports fewer MSI-x vectors than the
system has logical CPUs, we can see this issue frequently.)

Assume we have 32 logical CPUs (4 sockets, each with 8 logical CPUs).
CPU-0 is not participating in IO; the remaining CPUs, 1 through 31, are
submitting IO. In such a scenario, rq_affinity=2 and an irqbalance that
honors the *exact* smp_affinity_hint will not help. We may still see a
soft/hard lockup on CPU-0.

Are we going to resolve such an issue, or is it too rare to happen in
the field?

> That'll ensure the block layer's completion handling is done there, but
> not your driver's interrupt handler (which precedes the block layer
> completion handling).
>
> > Is there any way to route the interrupts to the same CPUs which
> > submitted the corresponding IOs?
> > or
> > Is there any way/option in irqbalance/the kernel which can route
> > interrupts to the CPUs (enabled in affinity_hint) in a round-robin
> > manner after a specific time period?
> Ensure your driver creates one MSI-X interrupt per CPU core, uses that
> interrupt for all submissions from that core, and reports that it would
> like that interrupt to be serviced by that core in
> /proc/irq/nnn/affinity_hint.
>
> Even with hyperthreading, this needs to be based on the logical CPU
> cores, not just the physical core or the physical socket. You can swamp
> a logical CPU core as easily as a physical CPU core.
>
> Then, provide an irqbalance policy script that honors the affinity_hint
> for your driver, or turn off irqbalance and manually set
> /proc/irq/nnn/smp_affinity to match the affinity_hint.
>
> Some versions of irqbalance honor the hints; some purposely don't and
> need to be overridden with a policy script.
>
> ---
> Robert Elliott, HPE Persistent Memory
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On 23 August 2016 at 09:18, Shaun Tancheff wrote:
> On Tue, Aug 23, 2016 at 3:37 AM, Tom Yan wrote:
>> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>>
>> If we really want/need to avoid hitting some real buffer limit (e.g.
>> maximum length of the scatter/gather list?), then we should in some way
>> check n_block against that. If it is too large we then return
>> used_bytes = 0 (optionally with some follow-up to add a response to
>> such a return value or so).
>
> Yes, there is a real buffer limit. I can think of these two options:
> 1- Assume the setups from sd_setup_discard_cmnd() and/or
>    sd_setup_write_same_cmnd() are providing an sglist of
>    sdp->sector_size via scsi_init_io()

That sounds completely wrong. The scatter/gather list we are talking
about here has nothing to do with the SCSI or block layer anymore. The
SATL has _already_ parsed the SCSI Write Same (16) command and is
packing ranges/payload accordingly at this stage. If there is any limit,
it would probably be the max_segment allowed by the host driver (e.g.
ahci).

It doesn't seem to make sense to me either that we would need to prevent
sglist overflow at such a level. Doesn't that mean we would need to do
the same checking (specifically, hard-coding checks in all kinds of
procedures) in every use of a scatter/gather list? That doesn't sound
right at all.

> 2- Find (or write) a suitable sg_get_size(sgl, nents) to walk the
>    sglist and calculate the available buffer size.
>
> #2 sounds like more fun but I'm not sure it's what people would prefer
> to see.

No idea if such a thing exists / makes sense at all.

> --
> Shaun
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On 23 August 2016 at 08:37, Tom Yan wrote:
> On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>>
>> But the "humanized" limit is the one you just added and proceeded to
>> change ata_set_lba_range_entries(). You changed from a buffer size
>> to use "num" instead and now you want to remove the protection
>> entirely?
>
> I am not sure what you are talking about here. What I did is just to
> avoid setting an (inappropriate and unnecessary) hard-coded limit
> (with the original while-condition) for i (the unaligned buffer size)
> in this general function -- general in the sense that the TRIM payload
> is _not always_ one 512-byte block, even if we may want to forever
> practice that in our kernel.
>
> If we really want/need to avoid hitting some real buffer limit (e.g.
> maximum length of the scatter/gather list?), then we should in some
> way check n_block against that. If it is too large we then return
> used_bytes = 0 (optionally with some follow-up to add a response to
> such a return value or so).
>
> We advertise 4194240 as Maximum Write Same Length so that the
> SCSI/block layer will know how to split the discard request, and then
> we make sure that the SATL rejects SCSI commands (those not issued
> through the discard / write same ioctl but with SG_IO /
> sg_write_same). That's all we really need to do, end of story.
>
>>
>> Why not just change to put this in front of ata_set_lba_range_entries()
>>
>> if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
>>         fp = 2;
>>         goto invalid_fld;
>> }
>
> Well, you can change the if-else clause to that. Apparently you've
> been doing that in your patch series. But that has nothing to do with
> what I am fixing here. Neither does it affect the actual behavior.
>
>>
>> And then restore ata_set_lba_range_entries() to how it looked
>> before you changed it in commit:
>>
>> 2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)
>
> I don't see how it is relevant at all. Though I should have introduced
> _this_ patch before that one to save some work. Unfortunately I wasn't
> aware how ugly ata_set_lba_range_entries() was.
>
>>
>> Then you can have ata_set_lba_range_entries() take the buffer size ...
>> something like the following would be fine:
>>
>> size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
>>                                  block, n_block);
>>
>> Now things are happily protected against both exceeding the b0 limit(s)
>> and overflowing the sglist buffer.
>
> We have no reason at all to bring the logical sector size in here.
> Only if future ACS standards state that payload blocks are counted in
> the logical block size instead of statically 512 would we need to make
> the now-static "512" in ALIGN() into 4096.

Typo: I meant from the static "512" to the logical sector size, and I
really think we should do it ONLY if a later ACS revision changes its
statement.

> For possible sglist buffer overflow, see what I mentioned above about
> checking n_block and returning 0. We would not want to do it the
> original while-loop way anyway. All it would do is truncate a larger
> request and partially complete it silently.
>
> Can you tell exactly how the size of the sglist buffer is limited?
> AFAIK it has nothing to do with the logical sector size. I doubt that
> we will ever cause an overflow, given how conservative we have been on
> the size of the TRIM payload. We would probably ignore the reported
> value once it is larger than 16 or so, if we ever enable multi-block
> payload in the future.
>
> Realistically speaking, I sent this patch not really to get the
> function ready for multi-block payload (neither have I mentioned that
> in the commit message), but just to make the code proper and avoid the
> silly effort (and confusion) you and I spent in our patches.
>
>> --
>> Shaun
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On 23 August 2016 at 07:30, Shaun Tancheff wrote:
>
> But the "humanized" limit is the one you just added and proceeded to
> change ata_set_lba_range_entries(). You changed from a buffer size
> to use "num" instead and now you want to remove the protection
> entirely?

I am not sure what you are talking about here. What I did is just to
avoid setting an (inappropriate and unnecessary) hard-coded limit (with
the original while-condition) for i (the unaligned buffer size) in this
general function -- general in the sense that the TRIM payload is _not
always_ one 512-byte block, even if we may want to forever practice that
in our kernel.

If we really want/need to avoid hitting some real buffer limit (e.g.
maximum length of the scatter/gather list?), then we should in some way
check n_block against that. If it is too large we then return
used_bytes = 0 (optionally with some follow-up to add a response to such
a return value or so).

We advertise 4194240 as Maximum Write Same Length so that the SCSI/block
layer will know how to split the discard request, and then we make sure
that the SATL rejects SCSI commands (those not issued through the
discard / write same ioctl but with SG_IO / sg_write_same). That's all
we really need to do, end of story.

> Why not just change to put this in front of ata_set_lba_range_entries()
>
> if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
>         fp = 2;
>         goto invalid_fld;
> }

Well, you can change the if-else clause to that. Apparently you've been
doing that in your patch series. But that has nothing to do with what I
am fixing here. Neither does it affect the actual behavior.

> And then restore ata_set_lba_range_entries() to how it looked
> before you changed it in commit:
>
> 2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)

I don't see how it is relevant at all. Though I should have introduced
_this_ patch before that one to save some work. Unfortunately I wasn't
aware how ugly ata_set_lba_range_entries() was.

> Then you can have ata_set_lba_range_entries() take the buffer size ...
> something like the following would be fine:
>
> size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
>                                  block, n_block);
>
> Now things are happily protected against both exceeding the b0 limit(s)
> and overflowing the sglist buffer.

We have no reason at all to bring the logical sector size in here. Only
if future ACS standards state that payload blocks are counted in the
logical block size instead of statically 512 would we need to make the
now-static "512" in ALIGN() into 4096.

For possible sglist buffer overflow, see what I mentioned above about
checking n_block and returning 0. We would not want to do it the
original while-loop way anyway. All it would do is truncate a larger
request and partially complete it silently.

Can you tell exactly how the size of the sglist buffer is limited? AFAIK
it has nothing to do with the logical sector size. I doubt that we will
ever cause an overflow, given how conservative we have been on the size
of the TRIM payload. We would probably ignore the reported value once it
is larger than 16 or so, if we ever enable multi-block payload in the
future.

Realistically speaking, I sent this patch not really to get the function
ready for multi-block payload (neither have I mentioned that in the
commit message), but just to make the code proper and avoid the silly
effort (and confusion) you and I spent in our patches.

> --
> Shaun
Re: [PATCH] ata: do not hard code limit in ata_set_lba_range_entries()
On Mon, Aug 22, 2016 at 3:53 PM, Tom Yan wrote:
> On 22 August 2016 at 20:32, Shaun Tancheff wrote:
>> On Mon, Aug 22, 2016 at 3:07 PM, Tom Yan wrote:
>>> I don't see how that's possible. count / n_block will always be
>>> smaller than 65535 * ATA_MAX_TRIM_RNUM (64) = 4194240. Not to mention
>>> that isn't even a "buffer limit" anyway. By SG_IO do you mean SCSI
>>> Write Same commands issued with sg_write_same or so? If that's the
>>> case, that's exactly what commit 5c79097a28c2 ("libata-scsi: reject
>>> WRITE SAME (16) with n_block that exceeds limit") is for.
>>
>> Ah, I see. You are guarding the only user of ata_set_lba_range_entries().
>
> Yup. It is the only right thing to do anyway: we leave the function
> "open" and guard per context when we use it. Say
> ata_set_lba_range_entries() becomes a function shared by others; that
> would only make this commit more important. As I said, we did not
> guard it with a certain buffer limit, but merely redundantly guarded
> it with a ("humanized") limit that applies to TRIM only.

But the "humanized" limit is the one you just added and proceeded to
change ata_set_lba_range_entries(). You changed from a buffer size to
use "num" instead and now you want to remove the protection entirely?

Why not just change to put this in front of ata_set_lba_range_entries()

if (n_block > 65535 * ATA_MAX_TRIM_RNUM) {
        fp = 2;
        goto invalid_fld;
}

And then restore ata_set_lba_range_entries() to how it looked before you
changed it in commit:

2983860c7 (libata-scsi: avoid repeated calculation of number of TRIM ranges)

Then you can have ata_set_lba_range_entries() take the buffer size ...
something like the following would be fine:

size = ata_set_lba_range_entries(buf, scmd->device->sector_size,
                                 block, n_block);

Now things are happily protected against both exceeding the b0 limit(s)
and overflowing the sglist buffer.
--
Shaun