Re: KMSAN: kernel-infoleak in sg_scsi_ioctl
Hi, See below.

On 2021-04-12 9:02 a.m., Hao Sun wrote:

Hi,

When using Healer (https://github.com/SunHao-0/healer/tree/dev) to fuzz the Linux kernel, I found the following bug report.

commit: 4ebaab5fb428374552175aa39832abf5cedb916a
version: linux 5.12
git tree: kmsan
kernel config and full log can be found in the attached file.

=====================================================
BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
CPU: 2 PID: 23939 Comm: executor Not tainted 5.12.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x1ff/0x275 lib/dump_stack.c:120
 kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
 kmsan_internal_check_memory+0x48c/0x520 mm/kmsan/kmsan.c:437
 kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
 instrument_copy_to_user ./include/linux/instrumented.h:121 [inline]
 _copy_to_user+0x112/0x1d0 lib/usercopy.c:33
 copy_to_user ./include/linux/uaccess.h:209 [inline]
 sg_scsi_ioctl+0xfa9/0x1180 block/scsi_ioctl.c:507
 sg_ioctl_common+0x2713/0x4930 drivers/scsi/sg.c:1108
 sg_ioctl+0x166/0x2d0 drivers/scsi/sg.c:1162
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl+0x2c2/0x400 fs/ioctl.c:739
 __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:739
 do_syscall_64+0xa2/0x120 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x47338d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fe31ab90c58 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 0059c128 RCX: 0047338d
RDX: 2040 RSI: 0001 RDI: 0003
RBP: 004e8e5d R08: R09: R10:
R11: 0246 R12: 0059c128 R13: 7ffe2284af2f
R14: 7ffe2284b0d0 R15: 7fe31ab90dc0
Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 kmsan_memcpy_memmove_metadata+0x25b/0x290 mm/kmsan/kmsan.c:226
 kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:246
 __msan_memcpy+0x46/0x60 mm/kmsan/kmsan_instr.c:110
 bio_copy_kern_endio_read+0x3ee/0x560 block/blk-map.c:443
 bio_endio+0xa1a/0xcc0 block/bio.c:1453
 req_bio_endio block/blk-core.c:265 [inline]
 blk_update_request+0xd4f/0x2190 block/blk-core.c:1456
 scsi_end_request+0x111/0xc50 drivers/scsi/scsi_lib.c:570
 scsi_io_completion+0x276/0x2840 drivers/scsi/scsi_lib.c:970
 scsi_finish_command+0x6fc/0x720 drivers/scsi/scsi.c:214
 scsi_softirq_done+0x205/0xa40 drivers/scsi/scsi_lib.c:1450
 blk_complete_reqs block/blk-mq.c:576 [inline]
 blk_done_softirq+0x133/0x1e0 block/blk-mq.c:581
 __do_softirq+0x271/0x782 kernel/softirq.c:345

Uninit was created at:
 kmsan_save_stack_with_flags+0x3c/0x90
 kmsan_alloc_page+0xc4/0x1b0
 __alloc_pages_nodemask+0xdb0/0x54a0
 alloc_pages_current+0x671/0x990
 blk_rq_map_kern+0xb8e/0x1310
 sg_scsi_ioctl+0xc94/0x1180
 sg_ioctl_common+0x2713/0x4930
 sg_ioctl+0x166/0x2d0
 __se_sys_ioctl+0x2c2/0x400
 __x64_sys_ioctl+0x4a/0x70
 do_syscall_64+0xa2/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Byte 0 of 1 is uninitialized
Memory access of size 1 starts at 99e033fb9360
Data copied to user address 2048

The following system call sequence (Syzlang format) can reproduce the crash:

# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Slowdown:1 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 Leak:false NetInjection:true NetDevices:true NetReset:false Cgroups:false BinfmtMisc:true CloseFDs:true KCSAN:false DevlinkPCI:true USB:true VhciInjection:true Wifi:true IEEE802154:true Sysctl:true UseTmpDir:true HandleSegv:true Repro:false Trace:false}
r0 = syz_open_dev$sg(&(0x7f00)='/dev/sg#\x00', 0x0, 0x2094b402)
ioctl$SG_GET_LOW_DMA(r0, 0x227a, &(0x7f40))
ioctl$SCSI_IOCTL_SEND_COMMAND(r0, 0x1, &(0x7f40)={0x0, 0x1, 0x1})

Since the code opens an sg device node, the sg driver, which is a pass-through driver, is invoked.
However, instead of using sg's pass-through facilities, that call to ioctl(SCSI_IOCTL_SEND_COMMAND) invokes the long-deprecated SCSI mid-level pass-through. So if there is an infoleak bug, you should flag sg_scsi_ioctl() in block/scsi_ioctl.c. See the notes associated with that function, which imply it cannot be protected from certain types of abuse due to its interface design; that is why it is deprecated. Also, the equivalent of root permissions is required to execute those functions.

That reproducer looks strange: ioctl(SG_GET_LOW_DMA) reads the host->unchecked_isa_dma value (now always 0 ??) into an int at 0x7f40. That same address is then used for the SCSI_IOCTL_SEND_COMMAND buffer.
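For context, the deprecated interface takes a single user buffer that mirrors the kernel's struct scsi_ioctl_command: two 32-bit lengths followed by the CDB (and any write data), with reply bytes copied back over the start of the same buffer, which is how an uninitialized reply byte can leak straight to user space. A userspace sketch of building that buffer follows; the struct and field names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the legacy SCSI_IOCTL_SEND_COMMAND buffer layout
 * (cf. struct scsi_ioctl_command): inlen/outlen words, then the CDB.
 * The fixed-size cdb[] array is for illustration only. */
struct legacy_send_cmd {
	uint32_t inlen;		/* bytes of write data following the CDB */
	uint32_t outlen;	/* reply bytes copied back over the buffer */
	uint8_t  cdb[16];	/* SCSI command descriptor block */
};

/* Build the buffer shape used by the reproducer: no write data and a
 * single requested reply byte -- the byte KMSAN flagged as uninitialized.
 * Returns the number of meaningful bytes assuming a 6-byte CDB group. */
static size_t build_legacy_buf(struct legacy_send_cmd *p, uint8_t opcode)
{
	memset(p, 0, sizeof(*p));
	p->inlen = 0;
	p->outlen = 1;
	p->cdb[0] = opcode;
	return offsetof(struct legacy_send_cmd, cdb) + 6;
}
```

The same pointer is then handed to ioctl(fd, SCSI_IOCTL_SEND_COMMAND, buf); after the call the first outlen bytes of the buffer hold the reply data.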
Re: [scsi_debug] 20b58d1e6b: blktests.block.001.fail
On 2021-03-23 9:26 a.m., kernel test robot wrote:

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 20b58d1e6b9cda142cd142a0a2f94c0d04b0a5a0 ("[RFC] scsi_debug: add hosts initialization --> worker")
url: https://github.com/0day-ci/linux/commits/Douglas-Gilbert/scsi_debug-add-hosts-initialization-worker/20210319-230817
base: https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git for-next

in testcase: blktests
version: blktests-x86_64-a210761-1_20210124
with following parameters:
 disk: 1SSD
 test: block-group-00
 ucode: 0xe2

on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

This RFC was proposed for Luis Chamberlain to consider for this report:
https://bugzilla.kernel.org/show_bug.cgi?id=212337

Luis predicted that this change would trip up some blktests, which is exactly what has happened here. The question here is whether it is reasonable (i.e. a correct simulation of what real hardware does) to assume that, as soon as the loading of scsi_debug is complete, _all_ LUNs (devices) specified in its parameters are ready for media access. If yes, then this RFC can be dropped or relegated to only occur when a driver parameter is set to a non-default value. If no, then those blktest scripts need to be fixed to reflect that after an HBA is loaded, all the targets and LUNs connected to it do _not_ immediately become available.

Doug Gilbert

If you fix the issue, kindly add following tag
Reported-by: kernel test robot

2021-03-21 02:40:23 sed "s:^:block/:" /lkp/benchmarks/blktests/tests/block-group-00
2021-03-21 02:40:23 ./check block/001
block/001 (stress device hotplugging)
block/001 (stress device hotplugging) [failed]
runtime ...
30.370s
--- tests/block/001.out 2021-01-24 06:04:08.0 +
+++ /lkp/benchmarks/blktests/results/nodev/block/001.out.bad 2021-03-21 02:40:53.652003261 +
@@ -1,4 +1,7 @@
 Running block/001
 Stressing sd
+ls: cannot access '/sys/class/scsi_device/4:0:0:0/device/block': No such file or directory
+ls: cannot access '/sys/class/scsi_device/5:0:0:0/device/block': No such file or directory
 Stressing sr
+ls: cannot access '/sys/class/scsi_device/4:0:0:0/device/block': No such file or directory
 Test complete

To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml   # job file is attached in this email
 bin/lkp split-job --compatible job.yaml
 bin/lkp run compatible-job.yaml

---
0DAY/LKP+ Test Infrastructure
Open Source Technology Center
https://lists.01.org/hyperkitty/list/l...@lists.01.org
Intel Corporation

Thanks,
Oliver Sang
Re: [syzbot] KASAN: invalid-free in sg_finish_scsi_blk_rq
On 2021-03-15 9:59 p.m., syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit: d98f554b Add linux-next specific files for 20210312
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1189318ad0
kernel config: https://syzkaller.appspot.com/x/.config?x=e362835d2e58cef6
dashboard link: https://syzkaller.appspot.com/bug?extid=0a0e8ecea895d38332e6

Unfortunately, I don't have any reproducer for this issue yet.

No need, I think I can see how it happens. A particular type of resource error from the block layer, together with a 32 byte (or larger) SCSI command. I'm testing a patch.

Doug Gilbert

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0a0e8ecea895d3833...@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: double-free or invalid-free in slab_free mm/slub.c:3161 [inline]
BUG: KASAN: double-free or invalid-free in kfree+0xe5/0x7f0 mm/slub.c:4215
CPU: 0 PID: 10481 Comm: syz-executor.5 Not tainted 5.12.0-rc2-next-20210312-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
 kasan_report_invalid_free+0x51/0x80 mm/kasan/report.c:357
 kasan_slab_free mm/kasan/common.c:340 [inline]
 __kasan_slab_free+0x118/0x130 mm/kasan/common.c:367
 kasan_slab_free include/linux/kasan.h:200 [inline]
 slab_free_hook mm/slub.c:1562 [inline]
 slab_free_freelist_hook+0x92/0x210 mm/slub.c:1600
 slab_free mm/slub.c:3161 [inline]
 kfree+0xe5/0x7f0 mm/slub.c:4215
 scsi_req_free_cmd include/scsi/scsi_request.h:28 [inline]
 sg_finish_scsi_blk_rq+0x690/0x810 drivers/scsi/sg.c:3224
 sg_common_write+0xa07/0xe70 drivers/scsi/sg.c:1132
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x465f69
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7f8413efa188 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 0056bf60 RCX: 00465f69
RDX: 20001780 RSI: 2285 RDI: 0003
RBP: 004bfa8f R08: R09: R10:
R11: 0246 R12: 0056bf60 R13: 7ffe20e16e2f
R14: 7f8413efa300 R15: 00022000

Allocated by task 10481:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:427 [inline]
 kasan_kmalloc mm/kasan/common.c:506 [inline]
 kasan_kmalloc mm/kasan/common.c:465 [inline]
 __kasan_kmalloc+0x99/0xc0 mm/kasan/common.c:515
 kmalloc include/linux/slab.h:561 [inline]
 kzalloc include/linux/slab.h:686 [inline]
 sg_start_req+0x16f/0x24e0 drivers/scsi/sg.c:3044
 sg_common_write+0x5fd/0xe70 drivers/scsi/sg.c:1109
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 10481:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357
 kasan_slab_free mm/kasan/common.c:360 [inline]
 kasan_slab_free mm/kasan/common.c:325 [inline]
 __kasan_slab_free+0xf5/0x130 mm/kasan/common.c:367
 kasan_slab_free include/linux/kasan.h:200 [inline]
 slab_free_hook mm/slub.c:1562 [inline]
 slab_free_freelist_hook+0x92/0x210 mm/slub.c:1600
 slab_free mm/slub.c:3161 [inline]
 kfree+0xe5/0x7f0 mm/slub.c:4215
 sg_start_req+0x1b33/0x24e0 drivers/scsi/sg.c:3106
 sg_common_write+0x5fd/0xe70 drivers/scsi/sg.c:1109
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200
Re: [PATCH][next] scsi: sg: return -ENOMEM on out of memory error
On 2021-03-11 6:33 p.m., Colin King wrote:

From: Colin Ian King

The sg_proc_seq_show_debug() function should return -ENOMEM on an out of memory error rather than -1. Fix this.

Fixes: 94cda6cf2e44 ("scsi: sg: Rework debug info")
Signed-off-by: Colin Ian King

Acked-by: Douglas Gilbert

Thanks.
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 79f05afa4407..85e86cbc6891 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -4353,7 +4353,7 @@ sg_proc_seq_show_debug(struct seq_file *s, void *v)
 	if (!bp) {
 		seq_printf(s, "%s: Unable to allocate %d on heap, finish\n",
 			   __func__, bp_len);
-		return -1;
+		return -ENOMEM;
 	}
 	read_lock_irqsave(&sg_index_lock, iflags);
 	sdp = it ? sg_lookup_dev(it->index) : NULL;
Re: [PATCH][next] scsi: sg: Fix use of pointer sfp after it has been kfree'd
On 2021-03-11 5:37 a.m., Colin King wrote:

From: Colin Ian King

Currently SG_LOG is referencing sfp after it has been kfree'd, which is probably a bad thing to do. Fix this by kfree'ing sfp after SG_LOG.

Addresses-Coverity: ("Use after free")
Fixes: af1fc95db445 ("scsi: sg: Replace rq array with xarray")
Signed-off-by: Colin Ian King

Acked-by: Douglas Gilbert

Thanks.
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 2d4bbc1a1727..79f05afa4407 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -3799,10 +3799,10 @@ sg_add_sfp(struct sg_device *sdp)
 	if (rbuf_len > 0) {
 		srp = sg_build_reserve(sfp, rbuf_len);
 		if (IS_ERR(srp)) {
-			kfree(sfp);
 			err = PTR_ERR(srp);
 			SG_LOG(1, sfp, "%s: build reserve err=%ld\n",
 			       __func__, -err);
+			kfree(sfp);
 			return ERR_PTR(err);
 		}
 		if (srp->sgat_h.buflen < rbuf_len) {
Re: linux-next: build failure after merge of the scsi-mkp tree
On 2021-01-27 2:01 a.m., Stephen Rothwell wrote:

Hi all,

On Mon, 25 Jan 2021 00:53:59 -0500 Douglas Gilbert wrote:

On 2021-01-24 11:13 p.m., Stephen Rothwell wrote:

After merging the scsi-mkp tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2908:4: error: expected '}' before 'else'
 2908 |    else
      |    ^~~~
drivers/scsi/sg.c:2902:16: warning: unused variable 'cptp' [-Wunused-variable]
 2902 |    const char *cptp = "pack_id=";
      |                ^~~~
drivers/scsi/sg.c:2896:5: error: label 'good' used but not defined
 2896 |     goto good;
      |     ^~~~
drivers/scsi/sg.c: At top level:
drivers/scsi/sg.c:2913:2: error: expected identifier or '(' before 'return'
 2913 |  return NULL;
      |  ^~
drivers/scsi/sg.c:2914:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before ':' token
 2914 | good:
      |     ^
drivers/scsi/sg.c:2917:2: error: expected identifier or '(' before 'return'
 2917 |  return srp;
      |  ^~
drivers/scsi/sg.c:2918:1: error: expected identifier or '(' before '}' token
 2918 | }
      | ^
drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2912:2: error: control reaches end of non-void function [-Werror=return-type]
 2912 | }
      | ^

Caused by commit 7323ad3618b6 ("scsi: sg: Replace rq array with xarray"). SG_LOG() degenerates to "{}" in some configs ...

I have used the scsi-mkp tree from next-20210122 for today.

I sent a new patchset to the linux-scsi list about 4 hours ago to fix that.

Doug Gilbert

I am still getting this build failure.

Hi,

I resent the original patch set, with fixes, to the linux-scsi list yesterday, but that was not the form that Martin Petersen needs it in. That was against his 5.12/scsi-queue branch, which is roughly lk 5.11.0-rc2. He has referred me to his 5.12/scsi-staging branch, which looks half applied from the 45 patch set that I have been sending to the linux-scsi list. Trying to find out if that was the intention or a mistake.
The other issue is a large patchset that removes the first function argument from blk_execute_rq_nowait(), which is used by the sg driver.

Doug Gilbert
Re: linux-next: build failure after merge of the scsi-mkp tree
On 2021-01-24 11:13 p.m., Stephen Rothwell wrote:

Hi all,

After merging the scsi-mkp tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2908:4: error: expected '}' before 'else'
 2908 |    else
      |    ^~~~
drivers/scsi/sg.c:2902:16: warning: unused variable 'cptp' [-Wunused-variable]
 2902 |    const char *cptp = "pack_id=";
      |                ^~~~
drivers/scsi/sg.c:2896:5: error: label 'good' used but not defined
 2896 |     goto good;
      |     ^~~~
drivers/scsi/sg.c: At top level:
drivers/scsi/sg.c:2913:2: error: expected identifier or '(' before 'return'
 2913 |  return NULL;
      |  ^~
drivers/scsi/sg.c:2914:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before ':' token
 2914 | good:
      |     ^
drivers/scsi/sg.c:2917:2: error: expected identifier or '(' before 'return'
 2917 |  return srp;
      |  ^~
drivers/scsi/sg.c:2918:1: error: expected identifier or '(' before '}' token
 2918 | }
      | ^
drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2912:2: error: control reaches end of non-void function [-Werror=return-type]
 2912 | }
      | ^

Caused by commit 7323ad3618b6 ("scsi: sg: Replace rq array with xarray"). SG_LOG() degenerates to "{}" in some configs ...

I have used the scsi-mkp tree from next-20210122 for today.

Hi,

I sent a new patchset to the linux-scsi list about 4 hours ago to fix that.

Doug Gilbert
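The "SG_LOG() degenerates to '{}'" failure mode above can be reproduced in isolation: when a logging macro conditionally expands to an empty braced block, `if (x) SG_LOG(...); else ...` becomes `if (x) {}; else ...`, and the stray semicolon detaches the `else`, producing exactly the "expected '}' before 'else'" cascade seen in the build log. The conventional fix is the do { } while (0) idiom. A minimal sketch (illustrative only, not the sg driver's actual macro):

```c
#include <assert.h>

/* A logging macro that is compiled out. If it expanded to "{}", then
 * "SAFE_LOG(...);" below would terminate the if statement and the
 * following 'else' would fail to parse -- the error cascade seen in
 * sg_find_srp_by_id(). Expanding to do { } while (0) instead swallows
 * the trailing ';' and keeps the if/else pairing intact. */
#define SAFE_LOG(fmt) do { } while (0)

static int classify(int x)
{
	if (x > 0)
		SAFE_LOG("positive");	/* with a bare "{}" expansion, 'else' breaks */
	else
		return -1;
	return 1;
}
```

The same reasoning explains why the compiler reported a dangling label and top-level returns: once the if/else parse failed, the function body was cut short and the remaining statements fell outside any function.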
[PATCH 0/3] scatterlist: sgl-sgl ops: copy, equal
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example, the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. This patchset adds sgl_copy_sgl(), sgl_equal_sgl() and sgl_memset().

Changes since v6 [posted 20210118]:
 - restarted with new patchset name; was "scatterlist: add new capabilities"
 - drop correction patch "sgl_alloc_order: remove 4 GiB limit, sgl_free() warning"; could be sent separately as a fix
 - rename sgl_compare_sgl() to sgl_equal_sgl() and the helper to sgl_equal_sgl_idx()

Changes since v5 [posted 20201228]:
 - incorporate review requests from Jason Gunthorpe
 - replace integer overflow detection code in sgl_alloc_order() with a pre-condition statement
 - rebase on lk 5.11.0-rc4

Changes since v4 [posted 20201105]:
 - rebase on lk 5.10.0-rc2

Changes since v3 [posted 20201019]:
 - re-instate check on integer overflow of nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB
 - introduce sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare
 - add Reviewed-by tags from Bodo Stroesser
 - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
 - remove unneeded lines from sgl_memset() definition
 - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
 - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code
 - the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly", with a relatively small performance hit
 - the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned as it may be useful to the caller

This patchset is against lk 5.11.0-rc4.

Douglas Gilbert (3):
 scatterlist: add sgl_copy_sgl() function
 scatterlist: add sgl_equal_sgl() function
 scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  32 -
 lib/scatterlist.c           | 233
 2 files changed, 243 insertions(+), 22 deletions(-)

--
2.25.1
[PATCH 2/3] scatterlist: add sgl_equal_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage-related operation is to compare two sgl_s for equality. This new function is designed to partially implement NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command.

Like memcmp(), this function begins scanning at the start (of each sgl), returns false on the first miscompare, and stops comparing. The sgl_equal_sgl_idx() function additionally yields the index (i.e. byte position) of the first miscompare. Its additional parameter, miscompare_idx, is a pointer. If it is non-NULL and a miscompare is detected (i.e. the function returns false), then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to properly implement the SCSI COMPARE AND WRITE command.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |   8 +++
 lib/scatterlist.c           | 110
 2 files changed, 118 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 22111ee21383..40449ce96a18 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -324,6 +324,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
 		    size_t n_bytes);
 
+bool sgl_equal_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		   struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		   size_t n_bytes);
+
+bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		       size_t n_bytes, size_t *miscompare_idx);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 782bcfe72c60..a8672bc6d883 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1132,3 +1132,113 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_equal_sgl_idx - check if x and y (both sgl_s) compare equal, report
+ *		       index for first unequal bytes
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If return
+ *   is false and miscompare_idx is non-NULL, then index of first miscompared
+ *   byte written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		       size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
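When a segment does miscompare, memcmp() only reports inequality; translating that into the byte position written to *miscompare_idx requires a byte scan within the mismatching segment. A userspace sketch of that step, using plain buffers in place of the kernel's sg_mapping_iter segments:

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first byte that differs between x and y, or
 * len if the buffers are equal over the whole range. This mirrors the
 * step a function like sgl_equal_sgl_idx() must perform after memcmp()
 * flags a segment, to turn "not equal" into a byte position (which the
 * caller then offsets by the bytes already matched in earlier segments). */
static size_t first_miscompare(const unsigned char *x,
			       const unsigned char *y, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (x[i] != y[i])
			break;
	}
	return i;
}
```

A two-pass approach (fast memcmp() first, byte scan only on the rare miscompare) keeps the equal-path cost close to a plain comparison.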
[PATCH 3/3] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. Plus it returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h | 20 +-
 lib/scatterlist.c           | 79 +
 2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 40449ce96a18..04be80d1a07c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -317,8 +317,6 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
-size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
-		      size_t buflen, off_t skip);
 
 size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
@@ -332,6 +330,24 @@ bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_
 		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		       size_t n_bytes, size_t *miscompare_idx);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
+/**
+ * sg_zero_buffer - Zero-out a part of a SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries
+ * @buflen: The number of bytes to zero out
+ * @skip: Number of bytes to skip before zeroing
+ *
+ * Returns the number of bytes zeroed.
+ **/
+static inline size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
+				    size_t buflen, off_t skip)
+{
+	return sgl_memset(sgl, nents, skip, 0, buflen);
+}
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a8672bc6d883..cb4d59111c78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1024,41 +1024,6 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
 
-/**
- * sg_zero_buffer - Zero-out a part of a SG list
- * @sgl: The SG list
- * @nents: Number of SG entries
- * @buflen: The number of bytes to zero out
- * @skip: Number of bytes to skip before zeroing
- *
- * Returns the number of bytes zeroed.
- **/
-size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
-		      size_t buflen, off_t skip)
-{
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
-}
-EXPORT_SYMBOL(sg_zero_buffer);
-
 /**
  * sgl_copy_sgl - Copy over a destination sgl from a source sgl
@@ -1242,3 +1207,47 @@ bool sgl_equal_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip
 	return sgl_equal_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_equal_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
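The skip-then-fill walk that sgl_memset() performs can be sketched in userspace with plain {base, len} segments standing in for scatterlist entries and mapping iterators (illustrative only, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain-memory stand-in for one scatterlist entry. */
struct seg {
	unsigned char *base;
	size_t len;
};

/* Write 'val' to up to n_bytes bytes across the segment list, after
 * skipping 'skip' bytes, mirroring sgl_memset()'s semantics: stop when
 * either the segments or n_bytes are exhausted, and return the number
 * of bytes actually written. */
static size_t seg_memset(struct seg *segs, unsigned int nsegs, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t written = 0;
	unsigned int i;

	for (i = 0; i < nsegs && written < n_bytes; i++) {
		size_t off, len = segs[i].len;

		if (skip >= len) {	/* this whole segment is skipped */
			skip -= len;
			continue;
		}
		off = skip;		/* partial skip lands mid-segment */
		skip = 0;
		len -= off;
		if (len > n_bytes - written)
			len = n_bytes - written;
		memset(segs[i].base + off, val, len);
		written += len;
	}
	return written;
}
```

Passing SIZE_MAX as n_bytes reproduces the "fill to the end of the list" behaviour described in the kernel-doc notes, since the segment list is then always the exhausted side.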
[PATCH 1/3] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl-to-sgl operations in the kernel. Start with an sgl-to-sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 74 +
 2 files changed, 78 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f70572b2938..22111ee21383 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -320,6 +320,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..782bcfe72c60 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1058,3 +1058,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl: Destination sgl
+ * @d_nents: Number of SG entries in destination sgl
+ * @d_skip: Number of bytes to skip in destination before starting
+ * @s_sgl: Source sgl
+ * @s_nents: Number of SG entries in source sgl
+ * @s_skip: Number of bytes to skip in source before starting
+ * @n_bytes: The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is started
+ *   with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic()
+ *   [in sg_miter_stop()]. This means pre-emption is inhibited for relatively
+ *   short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents, then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example,
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments, facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ * + **/ +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes) +{ + size_t len; + size_t offset = 0; + struct sg_mapping_iter d_iter, s_iter; + + if (n_bytes == 0) + return 0; + sg_miter_start(_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG); + if (!sg_miter_skip(_iter, s_skip)) + goto fini; + if (!sg_miter_skip(_iter, d_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(_iter)) + break; + if (!sg_miter_next(_iter)) + break; + len = min3(d_iter.length, s_iter.length, n_bytes - offset); + + memcpy(d_iter.addr, s_iter.addr, len); + offset += len; + /* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */ + d_iter.consumed = len; + sg_miter_stop(_iter); + s_iter.consumed = len; + sg_miter_stop(_iter); + } +fini: + sg_miter_stop(_iter); + sg_miter_stop(_iter); + return offset; +} +EXPORT_SYMBOL(sgl_copy_sgl); -- 2.25.1
Re: [PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-18 6:48 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 10:22:56PM +0100, Bodo Stroesser wrote: On 18.01.21 21:24, Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 03:08:51PM -0500, Douglas Gilbert wrote: On 2021-01-18 1:28 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 11:30:03AM -0500, Douglas Gilbert wrote: After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). That doesn't help, the point of the overflow check is similar to overflow checks in kcalloc: to prevent the routine from allocating less memory than the caller might assume. For instance ipr_store_update_fw() uses request_firmware() (which is controlled by userspace) to drive the length argument to sgl_alloc_order(). If userspace gives too large a value this will corrupt kernel memory. So this math: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); But that check itself overflows if order is too large (e.g. 65). I don't really care about order. It is always controlled by the kernel and it is fine to just require it be low enough to not overflow. length is the data under userspace control so math on it must be checked for overflow. Also note there is another pre-condition statement in that function's definition, namely that length cannot be 0. I don't see callers checking for that either; if it is true length 0 can't be allowed, it should be blocked in the function. Jason. As already said, I also think there should be a check for length or rather nent overflow.
I like the easy to understand check in your proposed code: if (length >> (PAGE_SHIFT + order) >= UINT_MAX) return NULL; But I don't understand, why you open-coded the nent calculation: nent = length >> (PAGE_SHIFT + order); if (length & ((1ULL << (PAGE_SHIFT + order)) - 1)) nent++; It is necessary to properly check for overflow, because the easy to understand check doesn't prove that round_up will work, only that >> results in something that fits in an int and that +1 won't overflow the int. Wouldn't it be better to keep the original line instead: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); This can overflow inside the round_up To protect against the "unsigned long long" length being too big why not pick a large power of two and if someone can justify a larger value, they can send a patch. if (length > 64ULL * 1024 * 1024 * 1024) return NULL; So 64 GiB or a similar calculation involving PAGE_SIZE. Compiler does the multiplication and at run time there is only a 64 bit comparison. I tested 6 one GiB ramdisks on an 8 GiB machine, worked fine until firefox was started. Then came the OOM killer ... Doug Gilbert
Re: [PATCH v6 3/4] scatterlist: add sgl_compare_sgl() function
On 2021-01-18 6:27 p.m., David Disseldorp wrote: On Mon, 18 Jan 2021 11:30:05 -0500, Douglas Gilbert wrote: After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); This patch looks good and works fine as a replacement for compare_and_write_do_cmp(). One minor suggestion would be to name it sgl_equal() or similar, to perhaps better reflect the bool return and avoid memcmp() confusion. Either way: Reviewed-by: David Disseldorp Thanks.
NVMe calls the command that does this Compare and SCSI uses COMPARE AND WRITE (and VERIFY(BYTCHK=1) ) but "equal" is fine with me. There will be another patchset version (at least) so there is time to change. Do you want: - sgl_equal(...), or - sgl_equal_sgl(...) ? Doug Gilbert
Re: [PATCH] checkpatch: Improve TYPECAST_INT_CONSTANT test message
On 2021-01-18 12:19 p.m., Joe Perches wrote: Improve the TYPECAST_INT_CONSTANT test by showing the suggested conversion for various type of uses like (unsigned int)1 to 1U. The questionable code snippet was: unsigned int nent, nalloc; if (check_add_overflow(nent, (unsigned int)1, &nalloc)) where the check_add_overflow() macro [include/linux/overflow.h] uses typeof to check its first and second arguments have the same type. So it is likely others could meet this issue. Doug Gilbert Signed-off-by: Joe Perches --- Douglas Gilbert sent me a private email (and in that email said he 'loves to hate checkpatch' ;) complaining that checkpatch warned on the use of the cast of '(unsigned int)1' so make it more obvious why the message is emitted by always showing the suggested conversion. scripts/checkpatch.pl | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 016115a62a9f..4f8494527139 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -6527,18 +6527,18 @@ sub process { if ($line =~ /(\(\s*$C90_int_types\s*\)\s*)($Constant)\b/) { my $cast = $1; my $const = $2; + my $suffix = ""; + my $newconst = $const; + $newconst =~ s/${Int_type}$//; + $suffix .= 'U' if ($cast =~ /\bunsigned\b/); + if ($cast =~ /\blong\s+long\b/) { + $suffix .= 'LL'; + } elsif ($cast =~ /\blong\b/) { + $suffix .= 'L'; + } if (WARN("TYPECAST_INT_CONSTANT", -"Unnecessary typecast of c90 int constant\n" . $herecurr) && +"Unnecessary typecast of c90 int constant - '$cast$const' could be '$const$suffix'\n" . $herecurr) && $fix) { - my $suffix = ""; - my $newconst = $const; - $newconst =~ s/${Int_type}$//; - $suffix .= 'U' if ($cast =~ /\bunsigned\b/); - if ($cast =~ /\blong\s+long\b/) { - $suffix .= 'LL'; - } elsif ($cast =~ /\blong\b/) { - $suffix .= 'L'; - } $fixed[$fixlinenr] =~ s/\Q$cast\E$const\b/$newconst$suffix/; } }
Re: [PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-18 1:28 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 11:30:03AM -0500, Douglas Gilbert wrote: After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). That doesn't help, the point of the overflow check is similar to overflow checks in kcalloc: to prevent the routine from allocating less memory than the caller might assume. For instance ipr_store_update_fw() uses request_firmware() (which is controlled by userspace) to drive the length argument to sgl_alloc_order(). If userspace gives too large a value this will corrupt kernel memory. So this math: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); But that check itself overflows if order is too large (e.g. 65). A pre-condition says that the caller must know or check a value is sane, and if the user space can have a hand in the value passed the caller _must_ check pre-conditions IMO. A pre-condition also implies that the function's implementation will not have code to check the pre-condition. My "log of both sides" proposal at least got around the overflowing left shift problem. And one reviewer, Bodo Stroesser, liked it. Needs to be checked, add a precondition to order does not help. I already proposed a straightforward algorithm you can use. It does help, it stops your proposed check from being flawed :-) Giving a false sense of security seems more dangerous than a pre-condition statement IMO. Bart's original overflow check (in the mainline) limits length to 4 GiB (due to wrapping inside a 32 bit unsigned). Also note there is another pre-condition statement in that function's definition, namely that length cannot be 0. So perhaps you, Bart Van Assche and Bodo Stroesser, should compare notes and come up with a solution that you are _all_ happy with. The pre-condition works for me and is the fastest.
The 'length' argument might be large, say > 1 GB [I use 1 GB in testing but did try 4GB and found the bug I'm trying to fix] but having individual elements greater than say 32 MB each does not seem very practical (and fails on the systems that I test with). In my testing the largest element size is 4 MB. Doug Gilbert
[PATCH v6 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c06f8caaff91..e3182de753d0 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1131,3 +1131,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski return offset; } EXPORT_SYMBOL(sgl_copy_sgl); + +/** + * sgl_compare_sgl_idx - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents: Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents: Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes: The (maximum) number of bytes to compare + * @miscompare_idx: if return is false, index of first miscompare written + * to this pointer (if non-NULL). Value will be < n_bytes + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing). If return + * is false and miscompare_idx is non-NULL, then index of first miscompared + * byte written to *miscompare_idx. + * + * Notes: + * x and y are symmetrical: they can be swapped and the result is the same. + * + * Implementation is based on memcmp(). x and y segments may overlap. + * + * The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ * + **/ +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx) +{ + bool equ = true; + size_t len; + size_t offset = 0; + struct sg_mapping_iter x_iter, y_iter; + + if (n_bytes == 0) + return true; + sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + if (!sg_miter_skip(&x_iter, x_skip)) + goto fini; + if (!sg_miter_skip(&y_iter, y_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(&x_iter)) + break; + if (!sg_miter_next(&y_iter)) + break; + len = min3(x_iter.length, y_iter.length, n_bytes - offset); + + equ = !memcmp(x_iter.addr, y_iter.addr, len); + if (!equ) + goto fini; + offset += len; + /* LIFO order is important when SG_MITER_ATOMIC is used */ + y_iter.consumed = len; + sg_miter_stop(&y_iter); + x_iter.consumed = len; + sg_miter_stop(&x_iter); + } +fini: + if (miscompare_idx && !equ) { + u8 *xp = x_iter.addr; + u8 *yp = y_iter.addr; + u8 *x_endp; + + fo
[PATCH v6 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 20 +- lib/scatterlist.c | 79 + 2 files changed, 62 insertions(+), 37 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..69e87280b44d 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -318,8 +318,6 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen, off_t skip); size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen, off_t skip); -size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, - size_t buflen, off_t skip); size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, @@ -333,6 +331,24 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + +/** + * sg_zero_buffer - Zero-out a part of a SG list + * @sgl: The SG list + * @nents: Number of SG entries + * @buflen:The number of bytes to zero out + * @skip: Number of bytes to skip before zeroing + * + * Returns the number of bytes zeroed. 
+ **/ +static inline size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, + size_t buflen, off_t skip) +{ + return sgl_memset(sgl, nents, skip, 0, buflen); +} + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index e3182de753d0..7e6acc67e9f6 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1023,41 +1023,6 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, } EXPORT_SYMBOL(sg_pcopy_to_buffer); -/** - * sg_zero_buffer - Zero-out a part of a SG list - * @sgl: The SG list - * @nents: Number of SG entries - * @buflen: The number of bytes to zero out - * @skip: Number of bytes to skip before zeroing - * - * Returns the number of bytes zeroed. - **/ -size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, - size_t buflen, off_t skip) -{ - unsigned int offset = 0; - struct sg_mapping_iter miter; - unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG; - - sg_miter_start(&miter, sgl, nents, sg_flags); - - if (!sg_miter_skip(&miter, skip)) - return false; - - while (offset < buflen && sg_miter_next(&miter)) { - unsigned int len; - - len = min(miter.length, buflen - offset); - memset(miter.addr, 0, len); - - offset += len; - } - - sg_miter_stop(&miter); - return offset; -} -EXPORT_SYMBOL(sg_zero_buffer); - /** * sgl_copy_sgl - Copy over a destination sgl from a source sgl * @d_sgl: Destination sgl @@ -1240,3 +1205,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL); } EXPORT_SYMBOL(sgl_compare_sgl); + +/** + * sgl_memset - set byte 'val' up to n_bytes times on SG list + * @sgl: The SG list + * @nents: Number of SG entries in sgl + * @skip: Number of bytes to skip before starting + * @val: byte value to write to sgl + * @n_bytes: The (maximum) number of bytes to modify + * + *
Returns: + * The number of bytes written. + * + * Notes: + * Stops writing if either sgl or n_bytes is exhausted. If n_bytes is + * set to SIZE_MAX then val will be written to each byte until the end + * of sgl. + * + * The notes in sgl_copy_sgl() about large sgl_s apply here as well. + * + **/ +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes) +{ + size_t offset = 0; + size_t len; + struct sg_mapping_iter miter; + + if (n
[PATCH v6 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a sgl to sgl copy. Stops when the first of the number of requested bytes to copy, or the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 8adff41f7cfa..3f836a3246aa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 24ea2d31a405..c06f8caaff91 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1057,3 +1057,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents: Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents: Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes: The (maximum) number of bytes to copy + * + * Returns: + * The number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ * + **/ +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes) +{ + size_t len; + size_t offset = 0; + struct sg_mapping_iter d_iter, s_iter; + + if (n_bytes == 0) + return 0; + sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG); + if (!sg_miter_skip(&s_iter, s_skip)) + goto fini; + if (!sg_miter_skip(&d_iter, d_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(&s_iter)) + break; + if (!sg_miter_next(&d_iter)) + break; + len = min3(d_iter.length, s_iter.length, n_bytes - offset); + + memcpy(d_iter.addr, s_iter.addr, len); + offset += len; + /* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */ + d_iter.consumed = len; + sg_miter_stop(&d_iter); + s_iter.consumed = len; + sg_miter_stop(&s_iter); + } +fini: + sg_miter_stop(&d_iter); + sg_miter_stop(&s_iter); + return offset; +} +EXPORT_SYMBOL(sgl_copy_sgl); -- 2.25.1
[PATCH v6 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>. The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl(), sgl_compare_sgl() and sgl_memset(). The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction. Changes since v5 [posted 20201228]: - incorporate review requests from Jason Gunthorpe - replace integer overflow detection code in sgl_alloc_order() with a pre-condition statement - rebase on lk 5.11.0-rc4 Changes since v4 [posted 20201105]: - rebase on lk 5.10.0-rc2 Changes since v3 [posted 20201019]: - re-instate check on integer overflow of nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB - introduce sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare. - add Reviewed-by tags from Bodo Stroesser - rebase on lk 5.10.0-rc2 [was on lk 5.9.0] Changes since v2 [posted 20201018]: - remove unneeded lines from sgl_memset() definition. - change sg_zero_buffer() to call sgl_memset() as the former is a subset. Changes since v1 [posted 20201016]: - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" with a relatively small performance hit. - sgl_memset return type changed from void to size_t and is the number of bytes actually (over)written. That number is needed anyway internally so may as well return it as it may be useful to the caller. This patchset is against lk 5.11.0-rc4 Douglas Gilbert (4): sgl_alloc_order: remove 4 GiB limit, sgl_free() warning scatterlist: add sgl_copy_sgl() function scatterlist: add sgl_compare_sgl() function scatterlist: add sgl_memset() include/linux/scatterlist.h | 33 - lib/scatterlist.c | 253 +++- 2 files changed, 253 insertions(+), 33 deletions(-) -- 2.25.1
[PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow" but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as u32 so it could not exceed UINT32_MAX (4 GiB) which means 'length' could not exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) return NULL; After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc has no limit at 4 GiB so it seems unreasonable that: sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic. Revise some of this function's description to more accurately reflect what this function is doing. An earlier patch fixed a memory leak in sg_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 21 ++--- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f70572b2938..8adff41f7cfa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..24ea2d31a405 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -554,13 +554,16 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages); #ifdef CONFIG_SGL_ALLOC /** - * sgl_alloc_order - allocate a scatterlist and its pages + * sgl_alloc_order - allocate a scatterlist with equally sized elements each + * of which has 2^@order contiguous pages * @length: Length in bytes of the scatterlist. Must be at least one - * @order: Second argument for alloc_pages() + * @order: Second argument for alloc_pages(). Each sgl element size will + * be (PAGE_SIZE*2^@order) bytes. @order must not exceed 16. * @chainable: Whether or not to allocate an extra element in the scatterlist - * for scatterlist chaining purposes + * for scatterlist chaining purposes * @gfp: Memory allocation flags - * @nent_p: [out] Number of entries in the scatterlist that have pages + * @nent_p: [out] Number of entries in the scatterlist that have pages. + * Ignored if NULL is given. * * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
*/ @@ -574,15 +577,11 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; - nalloc = nent; if (chainable) { - /* Check for integer overflow */ - if (nalloc + 1 < nalloc) + if (check_add_overflow(nent, 1U, &nalloc)) return NULL; - nalloc++; + } else { + nalloc = nent; } sgl = kmalloc_array(nalloc, sizeof(struct scatterlist), gfp & ~GFP_DMA); -- 2.25.1
Re: [PATCH v5 4/4] scatterlist: add sgl_memset()
On 2021-01-07 12:46 p.m., Jason Gunthorpe wrote: On Mon, Dec 28, 2020 at 06:49:55PM -0500, Douglas Gilbert wrote: The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert include/linux/scatterlist.h | 3 ++ lib/scatterlist.c | 65 + 2 files changed, 48 insertions(+), 20 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..70d3f1f73df1 100644 +++ b/include/linux/scatterlist.h @@ -333,6 +333,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 9332365e7eb6..f06614a880c8 100644 +++ b/lib/scatterlist.c @@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer); size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip) { - unsigned int offset = 0; - struct sg_mapping_iter miter; - unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG; - - sg_miter_start(&miter, sgl, nents, sg_flags); - - if (!sg_miter_skip(&miter, skip)) - return false; - - while (offset < buflen && sg_miter_next(&miter)) { - unsigned int len; - - len = min(miter.length, buflen - offset); - memset(miter.addr, 0, len); - - offset += len; - } - - sg_miter_stop(&miter); - return offset; + return sgl_memset(sgl, nents, skip, 0, buflen); } EXPORT_SYMBOL(sg_zero_buffer); May as well make this one liner a static inline in the header. Just rename this function to sgl_memset so the diff is clearer Yes, fine. I can roll a new version. Doug Gilbert
Re: [PATCH v5 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-07 12:44 p.m., Jason Gunthorpe wrote:
On Mon, Dec 28, 2020 at 06:49:52PM -0500, Douglas Gilbert wrote:

diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..4986545beef9 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages);
 #ifdef CONFIG_SGL_ALLOC
 /**
- * sgl_alloc_order - allocate a scatterlist and its pages
+ * sgl_alloc_order - allocate a scatterlist with equally sized elements
  * @length: Length in bytes of the scatterlist. Must be at least one
- * @order: Second argument for alloc_pages()
+ * @order: Second argument for alloc_pages(). Each sgl element size will
+ *	   be (PAGE_SIZE*2^order) bytes
  * @chainable: Whether or not to allocate an extra element in the scatterlist
- *	for scatterlist chaining purposes
+ *	   for scatterlist chaining purposes
  * @gfp: Memory allocation flags
- * @nent_p: [out] Number of entries in the scatterlist that have pages
+ * @nent_p: [out] Number of entries in the scatterlist that have pages.
+ *	   Ignored if NULL is given.
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
@@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
+	/* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */
+	if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order)
 		return NULL;
 	nalloc = nent;
 	if (chainable) {

This is a little bit too tortured now, how about this:

	if (length >> (PAGE_SHIFT + order) >= UINT_MAX)
		return NULL;
	nent = length >> (PAGE_SHIFT + order);
	if (length & ((1ULL << (PAGE_SHIFT + order)) - 1))
		nent++;

	if (chainable) {
		if (check_add_overflow(nent, 1, &nalloc))
			return NULL;
	} else
		nalloc = nent;

And your proposal is less <tortured> ?
I'm looking at performance, not elegance. And I'm betting that two ilog2() calls [which boil down to fls()] are faster than two right-shifts and one left-shift. Perhaps an extra comment could help my code by noting that, mathematically:

	/* if n > m for positive n and m then: log(n) > log(m) */

My original preference was to drop the check altogether but Bart Van Assche (who wrote that function) wanted me to keep it. Any function that takes 'order' (i.e. an exponent) can blow up given a silly value. The chainable check_add_overflow() call is new and an improvement.

Doug Gilbert
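The difference between the two checks is easy to see in a user-space model. Everything below is an illustrative sketch, not the kernel code: MODEL_PAGE_SHIFT assumes 4 KiB pages, model_ilog2() re-implements the kernel's ilog2() as a loop, and old_check_rejects()/new_check_rejects() are invented names. With a 32-bit nent, the original right-hand side `nent << (PAGE_SHIFT + order)` wraps at 4 GiB, which is exactly the spurious limit being removed:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assume 4 KiB pages for this sketch */

/* floor(log2(v)); a stand-in for the kernel's ilog2() */
static int model_ilog2(unsigned long long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Original check: the shift is evaluated in 32 bits when nent is a u32,
 * so the right hand side wraps to 0 once it reaches 2^32 (4 GiB). */
static int old_check_rejects(unsigned long long length, uint32_t nent,
			     unsigned int order)
{
	return length > (nent << (MODEL_PAGE_SHIFT + order));
}

/* Patched check: compare floor-logs of both sides; no shift can wrap */
static int new_check_rejects(unsigned long long length, uint32_t nent,
			     unsigned int order)
{
	return model_ilog2(length) >
	       model_ilog2(nent) + MODEL_PAGE_SHIFT + (int)order;
}
```

For a 4 GiB request at order 0, nent is 1 << 20: the old right-hand side wraps to 0 so the old check falsely rejects the request, while the log-based check accepts it yet still fires when length is genuinely too large for the computed nent.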
[PATCH v5 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 3 ++ lib/scatterlist.c | 65 + 2 files changed, 48 insertions(+), 20 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..70d3f1f73df1 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -333,6 +333,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 9332365e7eb6..f06614a880c8 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
@@ -1243,3 +1224,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);

-- 
2.25.1
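For readers without a kernel tree handy, the skip-then-fill behaviour of sgl_memset() can be modelled in ordinary user-space C. Everything here is a hypothetical stand-in, not the kernel API: struct seg replaces a scatterlist segment and seg_memset()/seg_memset_demo() are invented names; the kmap/atomic concerns of sg_miter do not arise:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for one scatterlist segment */
struct seg {
	unsigned char *addr;
	size_t length;
};

/* Mirrors the sgl_memset() loop: skip 'skip' bytes, then write 'val'
 * until n_bytes or the segment list is exhausted; returns bytes written */
static size_t seg_memset(struct seg *segs, unsigned int nsegs, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t offset = 0;
	unsigned int i;

	for (i = 0; i < nsegs && offset < n_bytes; i++) {
		size_t len = segs[i].length;
		size_t start;

		if (skip >= len) {	/* whole segment skipped */
			skip -= len;
			continue;
		}
		start = skip;
		skip = 0;
		len -= start;
		if (len > n_bytes - offset)
			len = n_bytes - offset;
		memset(segs[i].addr + start, val, len);
		offset += len;
	}
	return offset;
}

/* Tiny self-check: two 4-byte segments, skip 3, write 3 bytes of 0xff */
static int seg_memset_demo(void)
{
	unsigned char a[4] = {0}, b[4] = {0};
	struct seg s[2] = { { a, sizeof(a) }, { b, sizeof(b) } };
	size_t n = seg_memset(s, 2, 3, 0xff, 3);

	return n == 3 && a[2] == 0 && a[3] == 0xff &&
	       b[0] == 0xff && b[1] == 0xff && b[2] == 0;
}
```

The skip crossing from the tail of the first segment into the second is the case that the sg_miter_skip() call handles in the kernel version.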
[PATCH v5 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow" but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as u32 so it could not exceed UINT32_MAX (4 GiB), which means 'length' could not exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as:

	if (length > UINT32_MAX)
		return NULL;

Get around the integer overflow problem in the rhs of the original check by taking ilog2() of both sides.

This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc() has no limit at 4 GiB so it seems unreasonable that:

	sgl_alloc_order(unsigned long long length, ...)

does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic.

Revise some of this function's description to more accurately reflect what this function is doing.

An earlier patch fixed a memory leak in sgl_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 14 -- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f70572b2938..8adff41f7cfa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..4986545beef9 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages); #ifdef CONFIG_SGL_ALLOC /** - * sgl_alloc_order - allocate a scatterlist and its pages + * sgl_alloc_order - allocate a scatterlist with equally sized elements * @length: Length in bytes of the scatterlist. Must be at least one - * @order: Second argument for alloc_pages() + * @order: Second argument for alloc_pages(). Each sgl element size will + *be (PAGE_SIZE*2^order) bytes * @chainable: Whether or not to allocate an extra element in the scatterlist - * for scatterlist chaining purposes + *for scatterlist chaining purposes * @gfp: Memory allocation flags - * @nent_p: [out] Number of entries in the scatterlist that have pages + * @nent_p: [out] Number of entries in the scatterlist that have pages. + * Ignored if NULL is given. * * Returns: A pointer to an initialized scatterlist or %NULL upon failure. 
*/ @@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) + /* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */ + if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order) return NULL; nalloc = nent; if (chainable) { -- 2.25.1
[PATCH v5 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy, which stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 8adff41f7cfa..3f836a3246aa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 4986545beef9..af9cd7b9dc19 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1060,3 +1060,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents:Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents:Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes:The (maximum) number of bytes to copy + * + * Returns: + * The number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&s_iter))
+			break;
+		if (!sg_miter_next(&d_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);

-- 
2.25.1
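The min3()-per-iteration walk over two segment lists can also be modelled in user space. This is an invented sketch, not the kernel API: struct cp_seg and the cursor helpers stand in for scatterlist plus sg_mapping_iter, and the atomic-kmap ordering concerns vanish in user space:

```c
#include <stddef.h>
#include <string.h>

struct cp_seg {
	unsigned char *addr;
	size_t length;
};

/* Cursor over a segment list: segment index plus offset within it */
struct cp_cur {
	struct cp_seg *segs;
	unsigned int nsegs;
	unsigned int i;
	size_t off;
};

static size_t cur_avail(const struct cp_cur *c)
{
	return (c->i < c->nsegs) ? c->segs[c->i].length - c->off : 0;
}

static void cur_advance(struct cp_cur *c, size_t n)
{
	c->off += n;
	while (c->i < c->nsegs && c->off >= c->segs[c->i].length) {
		c->off -= c->segs[c->i].length;
		c->i++;
	}
}

/* Mirrors sgl_copy_sgl(): skip, then copy min3(d, s, remaining) chunks */
static size_t seg_copy(struct cp_seg *d, unsigned int dn, size_t d_skip,
		       struct cp_seg *s, unsigned int sn, size_t s_skip,
		       size_t n_bytes)
{
	struct cp_cur dc = { d, dn, 0, 0 }, sc = { s, sn, 0, 0 };
	size_t copied = 0;

	cur_advance(&dc, d_skip);
	cur_advance(&sc, s_skip);
	while (copied < n_bytes) {
		size_t len = cur_avail(&dc);
		size_t sl = cur_avail(&sc);

		if (sl < len)
			len = sl;
		if (n_bytes - copied < len)
			len = n_bytes - copied;
		if (len == 0)
			break;	/* one side exhausted */
		memcpy(dc.segs[dc.i].addr + dc.off,
		       sc.segs[sc.i].addr + sc.off, len);
		cur_advance(&dc, len);
		cur_advance(&sc, len);
		copied += len;
	}
	return copied;
}

/* src "abc"+"def" skip 2 -> "cdef"; dst is a 2-byte then 4-byte segment */
static int seg_copy_demo(void)
{
	unsigned char s1[3] = "abc", s2[3] = "def";
	unsigned char d1[2] = {0}, d2[4] = {0};
	struct cp_seg src[2] = { { s1, 3 }, { s2, 3 } };
	struct cp_seg dst[2] = { { d1, 2 }, { d2, 4 } };
	size_t n = seg_copy(dst, 2, 1, src, 2, 2, 4);

	return n == 4 && d1[1] == 'c' &&
	       d2[0] == 'd' && d2[1] == 'e' && d2[2] == 'f';
}
```

Each loop pass copies the largest chunk that fits inside the current destination segment, the current source segment and the remaining byte count, which is exactly what the min3() in the patch selects.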
[PATCH v5 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index af9cd7b9dc19..9332365e7eb6 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1134,3 +1134,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl_idx - Compare x and y (both sgl_s)
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If
+ *   return is false and miscompare_idx is non-NULL, then the index of the
+ *   first miscompared byte is written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	if (miscompare_idx && !equ) {
+		u8 *xp = x_iter.addr;
+		u8 *yp = y_iter.addr;
+		u8 *x_endp;
+
+		fo
[PATCH v5 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl(), sgl_compare_sgl() and sgl_memset().

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v4 [posted 20201105]:
  - rebase on lk 5.11.0-rc2

Changes since v3 [posted 20201019]:
  - re-instate the check on integer overflow of the nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB
  - introduce the sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare
  - add Reviewed-by tags from Bodo Stroesser
  - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
  - remove unneeded lines from the sgl_memset() definition
  - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
  - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code
  - the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" with a relatively small performance hit
- sgl_memset return type changed from void to size_t and is the number of bytes actually (over)written. That number is needed anyway internally so may as well return it as it may be useful to the caller. This patchset is against lk 5.10.0-rc2 Douglas Gilbert (4): sgl_alloc_order: remove 4 GiB limit, sgl_free() warning scatterlist: add sgl_copy_sgl() function scatterlist: add sgl_compare_sgl() function scatterlist: add sgl_memset() include/linux/scatterlist.h | 16 +++ lib/scatterlist.c | 244 +--- 2 files changed, 243 insertions(+), 17 deletions(-) -- 2.25.1
Re: [PATCH] [v2] scsi: scsi_debug: Fix memleak in scsi_debug_init
On 2020-12-26 1:15 a.m., Dinghao Liu wrote: When sdeb_zbc_model does not match BLK_ZONED_NONE, BLK_ZONED_HA or BLK_ZONED_HM, we should free sdebug_q_arr to prevent memleak. Also there is no need to execute sdebug_erase_store() on failure of sdeb_zbc_model_str(). Signed-off-by: Dinghao Liu Acked-by: Douglas Gilbert Thanks. --- Changelog: v2: - Add missed assignment statement for ret. --- drivers/scsi/scsi_debug.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 24c0f7ec0351..4a08c450b756 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -6740,7 +6740,7 @@ static int __init scsi_debug_init(void) k = sdeb_zbc_model_str(sdeb_zbc_model_s); if (k < 0) { ret = k; - goto free_vm; + goto free_q_arr; } sdeb_zbc_model = k; switch (sdeb_zbc_model) { @@ -6753,7 +6753,8 @@ static int __init scsi_debug_init(void) break; default: pr_err("Invalid ZBC model\n"); - return -EINVAL; + ret = -EINVAL; + goto free_q_arr; } } if (sdeb_zbc_model != BLK_ZONED_NONE) {
Re: [PATCH v1 0/6] no-copy bvec
On 2020-12-24 1:41 a.m., Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
On 23/12/2020 20:23, Douglas Gilbert wrote:
On 2020-12-23 11:04 a.m., James Bottomley wrote:
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:

Can scatterlist have 0-len entries? Those are directly translated into bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c. I've audited most of others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither. I think the same is true for the SCSI target code, but please double check.

Right, no-one ever wants to see a 0-len scatter list entry. The reason is that every driver uses the sgl to program the device DMA engine in the way NVMe does. A 0 length sgl would be a dangerous corner case: some DMA engines would ignore it and others would go haywire, so if we ever let a 0 length list down into the driver, they'd have to understand the corner case behaviour of their DMA engine and filter it accordingly, which is why we disallow them in the upper levels, since they're effective nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly allow the "number of logical blocks" in their sgl_s to be zero and state that it is _not_ to be considered an error.

It's fine for my case unless it leaks them out of the device driver to the net/block layer/etc. Is it?

None of the SCSI commands mentioned above are supported by Linux, never mind mapped to struct scatterlist.

The POPULATE TOKEN / WRITE USING TOKEN pair can be viewed as a subset of EXTENDED COPY (SPC-4) which also supports "range descriptors". It is not clear if target_core_xcopy.c supports these range descriptors but if it did, it would be trying to map them to struct scatterlist objects.
That said, it would be easy to skip the "number of logical blocks" == 0 case when translating range descriptors to sgl_s. In my ddpt utility (a dd clone) I have generalized skip= and seek= to optionally take sgl_s. If the last element in one of those sgl_s is LBAn,0 then it is interpreted as "until the end of that device" which is further restricted if the other sgl has a "hard" length or count= is given. The point being a length of 0 can have meaning, a benefit lost with NVMe's 0-based counts. Doug Gilbert
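The "LBAn,0 means until the end of the device" convention is easy to state precisely in code. The sketch below is an invented user-space model, not part of ddpt or the kernel: struct delem, total_blocks() and total_demo() are hypothetical names.

```c
/* Hypothetical element of a ddpt-style sgl: a starting LBA plus a
 * number of logical blocks (1-based counts, unlike NVMe's 0-based) */
struct delem {
	unsigned long long lba;
	unsigned int num;
};

/* Models the convention described above: a trailing (LBA, 0) element
 * means "from that LBA until the end of the device" */
static unsigned long long total_blocks(const struct delem *sgl, int n,
				       unsigned long long dev_blocks)
{
	unsigned long long sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (i == n - 1 && sgl[i].num == 0)
			sum += dev_blocks - sgl[i].lba;	/* to the end */
		else
			sum += sgl[i].num;
	}
	return sum;
}

/* 4 blocks at LBA 0, then LBA 100 to the end of a 110-block device */
static unsigned long long total_demo(void)
{
	const struct delem sgl[2] = { { 0, 4 }, { 100, 0 } };

	return total_blocks(sgl, 2, 110);
}
```

A caller that also has a "hard" length on the other sgl, or a count= argument, would further clamp the result, as the text describes.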
Re: [PATCH v1 0/6] no-copy bvec
On 2020-12-23 11:04 a.m., James Bottomley wrote:
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:

Can scatterlist have 0-len entries? Those are directly translated into bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c. I've audited most of others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither. I think the same is true for the SCSI target code, but please double check.

Right, no-one ever wants to see a 0-len scatter list entry. The reason is that every driver uses the sgl to program the device DMA engine in the way NVME does. A 0 length sgl would be a dangerous corner case: some DMA engines would ignore it and others would go haywire, so if we ever let a 0 length list down into the driver, they'd have to understand the corner case behaviour of their DMA engine and filter it accordingly, which is why we disallow them in the upper levels, since they're effective nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly allow the "number of logical blocks" in their sgl_s to be zero and state that it is _not_ to be considered an error.

Doug Gilbert
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
On 12/7/20 3:11 PM, Christoph Hellwig wrote:

So, I'm really worried about:

a) a good use case. GC in f2fs or btrfs seem like good use cases, as does accelerating dm-kcopyd. I agree with Damien that lifting dm-kcopyd to common code would also be really nice. I'm not 100% sure it should be a requirement, but it sure would be nice to have. I don't think just adding an ioctl is enough of a use case for complex kernel infrastructure.

b) We had a bunch of different attempts at SCSI XCOPY support from, IIRC, Martin, Bart and Mikulas. I think we need to pull them into this discussion, and make sure whatever we do covers the SCSI needs.

And we shouldn't forget that the main issue which killed all previous implementations was a missing QoS guarantee. It's nice to have simple copy, but if the implementation is _slower_ than doing it by hand from the OS there is very little point in even attempting to do so. I can't see any provisions for that in the TPAR, leading me to the assumption that NVMe simple copy will suffer from the same issue. So if we can't address this I guess this attempt will fail, too.

I have been doing quite a lot of work and testing in my sg driver rewrite in the copy and compare area. The baselines for performance are dd and io_uring-cp (in liburing). There are lots of ways to improve on them. Here are some:

- the user data need never pass through user space (it could be mmap-ed out during the READ if there is a good reason). Only the metadata (e.g. NVMe or SCSI commands) needs to come from the user space, and errors, if any, are reported back to the user space.
- break a large copy (or compare) into segments, with each segment a "comfortable" size for the OS to handle, say 256 KB
- there is one constraint: the READ in each segment must complete before its paired WRITE can commence
  - extra constraint for some zoned disks: WRITEs must be issued in order (assuming they are applied in that order; if not, need to wait until each WRITE completes)
- arrange for the READ-WRITE pair in each segment to share the same bio
- have multiple slots, each holding a segment (i.e. a bio and the metadata to process a READ-WRITE pair)
- re-use each slot's bio for the following READ-WRITE pair
- issue the READs in each slot asynchronously and do an interleaved (io)poll for completion. Then issue the paired WRITE asynchronously
- the above "slot" algorithm runs in one thread; there can be multiple threads running the same algorithm. The segment manager needs to be locked (or use atomics) so that each segment (identified by its starting LBA) is issued once and only once when the next thread wants a segment to copy

Running multiple threads gives diminishing or even worsening returns. Runtime metrics on lock contention and storage bus capacity may help in choosing the number of threads. A simpler approach might be to add more threads until the combined throughput increase is less than, say, 10%.

The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command (or the NVMe NVM Compare command). Using dd logic, a disk to disk compare can be implemented with not much more work than changing the WRITE to a VERIFY command. This is a different approach to the Linux cmp utility which READs in both sides and does a memcmp() type operation.

Using ramdisks (from the scsi_debug driver) the compare operation (max ~ 10 GB/s) was actually faster than the copy (max ~ 7 GB/s). I put this down to WRITE operations taking a write lock over the store while the VERIFY only needs a read lock, so many VERIFY operations can co-exist on the same store.
Unfortunately on real SAS and NVMe SSDs that I tested the performance of the VERIFY and NVM Compare commands is underwhelming. For comparison, using scsi_debug ramdisks, dd copy throughput was < 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen 3600 based. Doug Gilbert
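The segment hand-out described above (lock the segment manager, or use atomics, so each segment is issued exactly once) can be sketched with C11 atomics. This is an invented user-space model, not the sg driver code: SEG_SZ, claim_segment() and claim_demo() are hypothetical names, and only the hand-out logic is shown, not the READ/WRITE issuing.

```c
#include <stdatomic.h>

#define SEG_SZ (256 * 1024)	/* the "comfortable" segment size above */

/* Model of the segment manager: each worker thread atomically claims the
 * next segment's starting byte offset, so every segment is handed out
 * exactly once even with several copy threads running */
static atomic_ullong next_off;

/* Returns the claimed segment's starting offset, or total when the whole
 * copy has already been handed out (i.e. this thread should stop) */
static unsigned long long claim_segment(unsigned long long total)
{
	unsigned long long off = atomic_fetch_add(&next_off, SEG_SZ);

	return off < total ? off : total;
}

/* Single-threaded self-check: a copy of 1.5 segments needs two claims */
static int claim_demo(void)
{
	const unsigned long long total = 3ULL * SEG_SZ / 2;
	int n = 0;

	atomic_store(&next_off, 0);
	while (claim_segment(total) < total)
		n++;
	return n;
}
```

With atomic_fetch_add() there is no lock to contend on; the trade-off versus a mutex-protected manager is that a thread may claim one segment past the end, which the `off < total` test absorbs.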
Re: [PATCH] scsi: ses: Fix crash caused by kfree an invalid pointer
On 2020-11-28 6:27 p.m., James Bottomley wrote:
On Sat, 2020-11-28 at 20:23 +0800, Ding Hui wrote:

We can get a crash when disconnecting the iSCSI session. The call trace is like this:

 [2a00fb70] kfree at 0830e224
 [2a00fba0] ses_intf_remove at 01f200e4
 [2a00fbd0] device_del at 086b6a98
 [2a00fc50] device_unregister at 086b6d58
 [2a00fc70] __scsi_remove_device at 0870608c
 [2a00fca0] scsi_remove_device at 08706134
 [2a00fcc0] __scsi_remove_target at 087062e4
 [2a00fd10] scsi_remove_target at 087064c0
 [2a00fd70] __iscsi_unbind_session at 01c872c4
 [2a00fdb0] process_one_work at 0810f35c
 [2a00fe00] worker_thread at 0810f648
 [2a00fe70] kthread at 08116e98

In ses_intf_add(), the components count can be 0, in which case a 0-size scomp is kcalloc()ed but not saved in edev->component[i].scratch. In this situation edev->component[0].scratch is an invalid pointer, and when it is kfree()d in ses_intf_remove_enclosure() a crash like the above can happen. The call trace could also take other random forms when kfree() cannot catch the invalid pointer. We should not use the edev->component[] array when the components count is 0. We also need to check the index when using the edev->component[] array in ses_enclosure_data_process().

Tested-by: Zeng Zhicong
Cc: stable # 2.6.25+
Signed-off-by: Ding Hui

This doesn't really look to be the right thing to do: an enclosure which has no components can't usefully be controlled by the driver since there's nothing for it to do, so what we should do in this situation is refuse to attach, like the proposed patch below. It does seem a bit odd that someone would build an enclosure that doesn't enclose anything, so would you mind running sg_ses -e

'-e' is the short form of '--enumerate'. That will report the names and abbreviations of the diagnostic pages that the utility itself knows about (and supports). It won't show anything specific about the environment that sg_ses is executed in.
You probably meant:

    sg_ses <device>

Examples of the likely forms are:

    sg_ses /dev/bsg/1:0:0:0
    sg_ses /dev/sg2
    sg_ses /dev/ses0

This from a nearby machine:

$ lsscsi -gs
[3:0:0:0]  disk     ATA      Samsung SSD 850   1B6Q  /dev/sda  /dev/sg0    120GB
[4:0:0:0]  disk     IBM-207x HUSMM8020ASS20    J4B6  /dev/sdc  /dev/sg2    200GB
[4:0:1:0]  disk     ATA      INTEL SSDSC2KW25  003C  /dev/sdd  /dev/sg3    256GB
[4:0:2:0]  disk     SEAGATE  ST1NM0096         E005  /dev/sde  /dev/sg4   10.0TB
[4:0:3:0]  enclosu  Areca Te ARC-802801.37.69  0137  -         /dev/sg5        -
[4:0:4:0]  enclosu  Intel    RES2SV240         0d00  -         /dev/sg6        -
[7:0:0:0]  disk     Kingston DataTravelerMini  PMAP  /dev/sdb  /dev/sg1   1.03GB
[N:0:0:1]  disk     WDC WDS256G1X0C-00ENX0__1        /dev/nvme0n1  -       256GB

# sg_ses /dev/sg5
  Areca Te  ARC-802801.37.69  0137
Supported diagnostic pages:
  Supported Diagnostic Pages [sdp] [0x0]
  Configuration (SES) [cf] [0x1]
  Enclosure Status/Control (SES) [ec,es] [0x2]
  String In/Out (SES) [str] [0x4]
  Threshold In/Out (SES) [th] [0x5]
  Element Descriptor (SES) [ed] [0x7]
  Additional Element Status (SES-2) [aes] [0xa]
  Supported SES Diagnostic Pages (SES-2) [ssp] [0xd]
  Download Microcode (SES-2) [dm] [0xe]
  Subenclosure Nickname (SES-2) [snic] [0xf]
  Protocol Specific (SAS transport) [] [0x3f]

# sg_ses -p cf /dev/sg5
  Areca Te  ARC-802801.37.69  0137
Configuration diagnostic page:
  number of secondary subenclosures: 0
  generation code: 0x0
  enclosure descriptor list
    Subenclosure identifier: 0 [primary]
      relative ES process id: 1, number of ES processes: 1
      number of type descriptor headers: 9
      enclosure logical identifier (hex): d5b401503fc0ec16
      enclosure vendor: Areca Te  product: ARC-802801.37.69  rev: 0137
      vendor-specific data:
        11 22 33 44 55 00 00 00    ."3DU...
  type descriptor header and text list
    Element type: Array device slot, subenclosure id: 0
      number of possible elements: 24
      text: ArrayDevicesInSubEnclsr0
    Element type: Enclosure, subenclosure id: 0
      number of possible elements: 1
      text: EnclosureElementInSubEnclsr0
    Element type: SAS expander, subenclosure id: 0
      number of possible elements: 1
      text: SAS Expander
    Element type: Cooling, subenclosure id: 0
      number of possible elements: 5
      text: CoolingElementInSubEnclsr0
    Element type: Temperature sensor, subenclosure id: 0
      number of possible elements: 2
      text: TempSensorsInSubEnclsr0
    Element type: Voltage sensor, subenclosure id: 0
      number of possible elements: 2
      text: VoltageSensorsInSubEnclsr0
    Element type: SAS connector, subenclosure id: 0
      number of possible elements: 3
      text: ConnectorsInSubEnclsr0
    Element type: Power supply, subenclosure id: 0
      number of possible
[PATCH v4 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. It also returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 ++
 lib/scatterlist.c           | 65 ++++++++++++++++++++++++-------------
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 0f6d59bf66cb..8e4c050e6237 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -339,6 +339,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t
 		struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		size_t n_bytes, size_t *miscompare_idx);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 9332365e7eb6..f06614a880c8 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
 
@@ -1243,3 +1224,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents,
 				   y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl:	The SG list
+ * @nents:	Number of SG entries in sgl
+ * @skip:	Number of bytes to skip before starting
+ * @val:	byte value to write to sgl
+ * @n_bytes:	The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
-- 
2.25.1
[PATCH v4 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow", but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as:

    if (length > UINT32_MAX)
        <>;

Get around the integer overflow problem in the rhs of the original check by taking ilog2() of both sides.

This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc has no limit at 4 GiB, so it seems unreasonable that:

    sgl_alloc_order(unsigned long long length, ...)

does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic.

Revise some of this function's description to more accurately reflect what the function is doing.

An earlier patch fixed a memory leak in sgl_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  1 +
 lib/scatterlist.c           | 14 ++++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 36c47e7e66a2..d9443ebd0a8e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -308,6 +308,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
 			      unsigned int *nent_p);
 void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
+/* Only use sgl_free() when order is 0 */
 void sgl_free(struct scatterlist *sgl);
 
 #endif /* CONFIG_SGL_ALLOC */

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..4986545beef9 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages);
 #ifdef CONFIG_SGL_ALLOC
 
 /**
- * sgl_alloc_order - allocate a scatterlist and its pages
+ * sgl_alloc_order - allocate a scatterlist with equally sized elements
  * @length: Length in bytes of the scatterlist. Must be at least one
- * @order: Second argument for alloc_pages()
+ * @order: Second argument for alloc_pages(). Each sgl element size will
+ *	   be (PAGE_SIZE*2^order) bytes
  * @chainable: Whether or not to allocate an extra element in the scatterlist
- *	for scatterlist chaining purposes
+ *	       for scatterlist chaining purposes
  * @gfp: Memory allocation flags
- * @nent_p: [out] Number of entries in the scatterlist that have pages
+ * @nent_p: [out] Number of entries in the scatterlist that have pages.
+ *	    Ignored if NULL is given.
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
@@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
+	/* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */
+	if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order)
 		return NULL;
 	nalloc = nent;
 	if (chainable) {
-- 
2.25.1
[PATCH v4 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp(), this function returns false on the first miscompare and stops comparing.

A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |   8 +++
 lib/scatterlist.c           | 109 ++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index f2922a34b140..0f6d59bf66cb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -331,6 +331,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
 		    size_t n_bytes);
 
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes);
+
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index af9cd7b9dc19..9332365e7eb6 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1134,3 +1134,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl_idx - Compare x and y (both sgl_s)
+ * @x_sgl:	x (left) sgl
+ * @x_nents:	Number of SG entries in x (left) sgl
+ * @x_skip:	Number of bytes to skip in x (left) before starting
+ * @y_sgl:	y (right) sgl
+ * @y_nents:	Number of SG entries in y (right) sgl
+ * @y_skip:	Number of bytes to skip in y (right) before starting
+ * @n_bytes:	The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If
+ *   return is false and miscompare_idx is non-NULL, then the index of the
+ *   first miscompared byte is written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	if (miscompare_idx && !equ) {
+		u8 *xp = x_iter.addr;
+		u8 *yp = y_iter.addr;
+		u8 *x_endp;
+
+		fo
[PATCH v4 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 74 +++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index d9443ebd0a8e..f2922a34b140 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -327,6 +327,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 4986545beef9..af9cd7b9dc19 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1060,3 +1060,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl:	Destination sgl
+ * @d_nents:	Number of SG entries in destination sgl
+ * @d_skip:	Number of bytes to skip in destination before starting
+ * @s_sgl:	Source sgl
+ * @s_nents:	Number of SG entries in source sgl
+ * @s_skip:	Number of bytes to skip in source before starting
+ * @n_bytes:	The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with
+ *   memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is
+ *   started with kmap_atomic() [in sg_miter_next()] and completed with
+ *   kunmap_atomic() [in sg_miter_stop()]. This means pre-emption is
+ *   inhibited for relatively short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&d_iter))
+			break;
+		if (!sg_miter_next(&s_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
-- 
2.25.1
[PATCH v4 0/4] scatterlist: add new capabilities
This patchset was sent to the linux-block and linux-scsi lists a few hours ago. If it is accepted, that will probably be via the linux-block maintainer. It has potential users in the target sub-system and the scsi_debug driver. Other parts of the kernel that use sgl_s may be interested, which is why it is now being sent to the linux-kernel list.

Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example, the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>. The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be in the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v3 [posted 20201019]:
  - re-instate the check on integer overflow of the nent calculation in
    sgl_alloc_order(). Do it in such a way as to not limit the overall
    sgl size to 4 GiB
  - introduce the sgl_compare_sgl_idx() helper function that, if requested
    and if a miscompare is detected, will yield the byte index of the
    first miscompare
  - add Reviewed-by tags from Bodo Stroesser
  - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
  - remove unneeded lines from the sgl_memset() definition
  - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
  - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic()
    [called via sg_miter_next()] and kunmap_atomic() calls [called via
    sg_miter_stop()] and proposed a solution that simplifies the previous
    code
  - the new implementation of the three functions has shorter periods when
    pre-emption is disabled (but has more of them). This should make
    operations on large sgl_s more pre-emption "friendly" with a relatively
    small performance hit
  - the sgl_memset() return type changed from void to size_t and is the
    number of bytes actually (over)written. That number is needed anyway
    internally, so it may as well be returned as it may be useful to the
    caller

This patchset is against lk 5.10.0-rc2.

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  16 +++
 lib/scatterlist.c           | 244 +++++++++++++++++++++++++++++++++---
 2 files changed, 243 insertions(+), 17 deletions(-)

-- 
2.25.1
Re: [PATCH v3 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2020-11-03 7:54 a.m., Bodo Stroesser wrote:
> Am 19.10.20 um 21:19 schrieb Douglas Gilbert:
>> This patch removes a check done by sgl_alloc_order() before it starts
>> any allocations. The comment before the removed code says: "Check for
>> integer overflow", which arguably gives a false sense of security. The
>> right hand side of the expression in the condition is resolved as a u32,
>> so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot
>> exceed that amount. If that was the intention then the comment above it
>> could be dropped and the condition rewritten more clearly as:
>>     if (length > UINT32_MAX)
>>         <>;
>
> I think the intention of the check is to reject calls where length is so
> high that the calculation of nent overflows the unsigned int nent/nalloc.
> Consistently, a similar check is done a few lines later before
> incrementing nalloc due to chainable = true. So I think the code tries
> to allow length values up to 4G << (PAGE_SHIFT + order).
>
> That said, I think instead of removing the check it better should be
> fixed, e.g. by adding an unsigned long long cast before nent.
>
> BTW: I don't know why there are two checks. I think one check after
> conditionally incrementing nalloc would be enough.

Okay, I'm working on a "v4" patchset. Apart from the above, my plan is to extend sgl_compare_sgl() with a helper that additionally yields the byte index of the first miscompare.

Doug Gilbert

>> The author's intention is to use sgl_alloc_order() to replace
>> vmalloc(unsigned long) for a large allocation (debug ramdisk). vmalloc
>> has no limit at 4 GiB, so it seems unreasonable that:
>>     sgl_alloc_order(unsigned long long length, ...)
>> does. sgl_s made with sgl_alloc_order(chainable=false) have equally
>> sized segments placed in a scatter gather array. That allows O(1)
>> navigation around a big sgl using some simple integer maths.
>>
>> Having previously sent a patch to fix a memory leak in sgl_alloc_order(),
>> take the opportunity to put a one line comment above sgl_free()'s
>> declaration that it is not suitable when order > 0.
>> The mis-use of sgl_free() when order > 0 was the reason for the memory
>> leak. The other users of sgl_alloc_order() in the kernel were checked
>> and found to handle freeing properly.
>>
>> Signed-off-by: Douglas Gilbert
>> ---
>>  include/linux/scatterlist.h | 1 +
>>  lib/scatterlist.c           | 3 ---
>>  2 files changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 45cf7b69d852..80178afc2a4a 100644
>> --- a/include/linux/scatterlist.h
>> +++ b/include/linux/scatterlist.h
>> @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
>>  			      unsigned int *nent_p);
>>  void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
>>  void sgl_free_order(struct scatterlist *sgl, int order);
>> +/* Only use sgl_free() when order is 0 */
>>  void sgl_free(struct scatterlist *sgl);
>>
>>  #endif /* CONFIG_SGL_ALLOC */
>>
>> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
>> index c448642e0f78..d5770e7f1030 100644
>> --- a/lib/scatterlist.c
>> +++ b/lib/scatterlist.c
>> @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
>>  	u32 elem_len;
>>
>>  	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
>> -	/* Check for integer overflow */
>> -	if (length > (nent << (PAGE_SHIFT + order)))
>> -		return NULL;
>>  	nalloc = nent;
>>  	if (chainable) {
>>  		/* Check for integer overflow */
tools/perf: noise from check-headers.sh
Executing that script in linux-stable [lk 5.10.0-rc1] gives the following output:

Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/fscrypt.h' differs from latest version at 'include/uapi/linux/fscrypt.h'
diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/mount.h' differs from latest version at 'include/uapi/linux/mount.h'
diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/prctl.h' differs from latest version at 'include/uapi/linux/prctl.h'
diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h'
diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/required-features.h' differs from latest version at 'arch/x86/include/asm/required-features.h'
diff -u tools/arch/x86/include/asm/required-features.h arch/x86/include/asm/required-features.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
Warning:
Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h'
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/kvm.h' differs from latest version at 'arch/x86/include/uapi/asm/kvm.h'
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/svm.h' differs from latest version at 'arch/x86/include/uapi/asm/svm.h'
diff -u tools/arch/x86/include/uapi/asm/svm.h arch/x86/include/uapi/asm/svm.h
Warning: Kernel ABI header at 'tools/arch/s390/include/uapi/asm/sie.h' differs from latest version at 'arch/s390/include/uapi/asm/sie.h'
diff -u tools/arch/s390/include/uapi/asm/sie.h arch/s390/include/uapi/asm/sie.h
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/mman.h' differs from latest version at 'include/uapi/linux/mman.h'
diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
Warning: Kernel ABI header at 'tools/perf/util/hashmap.h' differs from latest version at 'tools/lib/bpf/hashmap.h'
diff -u tools/perf/util/hashmap.h tools/lib/bpf/hashmap.h
Warning: Kernel ABI header at 'tools/perf/util/hashmap.c' differs from latest version at
'tools/lib/bpf/hashmap.c'
diff -u tools/perf/util/hashmap.c tools/lib/bpf/hashmap.c

There was a bit of noise in lk 5.9.0-rc1 but it is considerably worse now.

Doug Gilbert
[PATCH v3 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 75 +++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 80178afc2a4a..6649414c0749 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5770e7f1030..1f9e093ad7da 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -974,3 +974,78 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl:	Destination sgl
+ * @d_nents:	Number of SG entries in destination sgl
+ * @d_skip:	Number of bytes to skip in destination before starting
+ * @s_sgl:	Source sgl
+ * @s_nents:	Number of SG entries in source sgl
+ * @s_skip:	Number of bytes to skip in source before starting
+ * @n_bytes:	The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with
+ *   memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is
+ *   started with kmap_atomic() [in sg_miter_next()] and completed with
+ *   kunmap_atomic() [in sg_miter_stop()]. This means pre-emption is
+ *   inhibited for relatively short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&d_iter))
+			break;
+		if (!sg_miter_next(&s_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
+
-- 
2.25.1
[PATCH v3 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. It also returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 ++
 lib/scatterlist.c           | 65 ++++++++++++++++++++++++-------------
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ae260dc5fedb..a40012c8a4e6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		     size_t n_bytes);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 49185536acba..6b430f7293e0 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -952,26 +952,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
@@ -1110,3 +1091,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
+
-- 
2.25.1
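As with the other functions in this series, the walk sgl_memset() performs is easy to model outside the kernel. The userspace sketch below is a hypothetical analogue for illustration only — struct seg and seg_memset() are invented names, not kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a scatterlist element. */
struct seg {
	unsigned char *addr;
	size_t length;
};

/* Userspace analogue of sgl_memset(): skip 'skip' bytes, then write 'val'
 * up to n_bytes times, stopping early if the segment list runs out.
 * Returns the number of bytes actually written, as the patch does. */
static size_t seg_memset(struct seg *segs, unsigned int nents, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t offset = 0;
	unsigned int i;

	for (i = 0; i < nents && offset < n_bytes; i++) {
		size_t len = segs[i].length;

		if (skip >= len) {	/* this element is skipped entirely */
			skip -= len;
			continue;
		}
		len -= skip;
		if (len > n_bytes - offset)
			len = n_bytes - offset;
		memset(segs[i].addr + skip, val, len);
		skip = 0;
		offset += len;
	}
	return offset;
}
```

Passing n_bytes = SIZE_MAX fills from skip to the end of the list and returns the number of bytes written, matching the SIZE_MAX note in the kernel-doc above.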
[PATCH v3 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v2 [posted 20201018]:
- remove unneeded lines from the sgl_memset() definition.
- change sg_zero_buffer() to call sgl_memset() as the former is a subset of the latter.

Changes since v1 [posted 20201016]:
- Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() [called via sg_miter_stop()] calls and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" at a relatively small performance cost.
- the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned since it may be useful to the caller.

This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 186 +---
 2 files changed, 184 insertions(+), 14 deletions(-)

-- 
2.25.1
[PATCH v3 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 +++ lib/scatterlist.c | 61 + 2 files changed, 65 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 1f9e093ad7da..49185536acba 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1049,3 +1049,64 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski } EXPORT_SYMBOL(sgl_copy_sgl); +/** + * sgl_compare_sgl - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents:Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents:Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes:The (maximum) number of bytes to compare + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing). 
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (equ && offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	sg_miter_stop(&y_iter);
+	sg_miter_stop(&x_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
-- 
2.25.1
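The compare walk above — chunk both lists by min3(x_len, y_len, bytes-remaining) and stop at the first miscompare — can be sketched in userspace as well. Everything below (struct seg, seg_equal()) is a hypothetical stand-in for the scatterlist machinery, not kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a scatterlist element. */
struct seg {
	const unsigned char *addr;
	size_t length;
};

/* Userspace analogue of sgl_compare_sgl(): compare two segmented buffers
 * chunk by chunk (each chunk bounded as in the patch's min3() step) and
 * stop at the first miscompare. Returns 1 on equality over n_bytes, 0
 * otherwise (including when either list is exhausted early). */
static int seg_equal(const struct seg *x, unsigned int x_n,
		     const struct seg *y, unsigned int y_n, size_t n_bytes)
{
	unsigned int xi = 0, yi = 0;
	size_t x_off = 0, y_off = 0, done = 0;

	while (done < n_bytes && xi < x_n && yi < y_n) {
		size_t x_len = x[xi].length - x_off;
		size_t y_len = y[yi].length - y_off;
		size_t len = x_len < y_len ? x_len : y_len;

		if (len > n_bytes - done)
			len = n_bytes - done;
		if (memcmp(x[xi].addr + x_off, y[yi].addr + y_off, len))
			return 0;	/* first miscompare stops the walk */
		done += len;
		x_off += len;
		y_off += len;
		if (x_off == x[xi].length) {
			xi++;
			x_off = 0;
		}
		if (y_off == y[yi].length) {
			yi++;
			y_off = 0;
		}
	}
	return done >= n_bytes;
}
```

Note that, as the kernel-doc says of sgl_compare_sgl(), the arguments are symmetrical: swapping x and y gives the same result, and a mismatch lying beyond n_bytes is never examined.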
[PATCH v3 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The "Check for integer overflow" comment above the removed code arguably gives a false sense of security. The right hand side of the expression in the condition resolves to a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount either. If that was the intention then the comment could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) <>;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB, so it seems unreasonable that sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for that memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 3 --- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 45cf7b69d852..80178afc2a4a 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c448642e0f78..d5770e7f1030 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; nalloc = nent; if (chainable) { /* Check for integer overflow */ -- 2.25.1
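The commit message's O(1) navigation claim can be made concrete: with chainable=false every element of the sgl holds elem_len = PAGE_SIZE << order bytes, so a byte offset maps to an (element index, intra-element offset) pair with a single division. A small userspace sketch, with hypothetical names and page_size passed in as a parameter rather than taken from the kernel's PAGE_SIZE:

```c
#include <stddef.h>

/* Position inside an sgl whose elements are all the same size. */
struct sgl_pos {
	size_t index;	/* which scatterlist element */
	size_t offset;	/* byte offset inside that element */
};

/* O(1) mapping from a byte offset into an sgl whose elements each hold
 * page_size << order bytes, as sgl_alloc_order(chainable=false) produces. */
static struct sgl_pos sgl_locate(size_t byte_off, size_t page_size,
				 unsigned int order)
{
	size_t elem_len = page_size << order;
	struct sgl_pos pos = { byte_off / elem_len, byte_off % elem_len };

	return pos;
}
```

With 4 KiB pages and order 2, each element holds 16384 bytes, so byte offset 40000 lands in element 2 at intra-element offset 7232 — no walk over the preceding elements is needed.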
[PATCH v2 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 80178afc2a4a..6649414c0749 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d5770e7f1030..a0a86059c10e 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -974,3 +974,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents:Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents:Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes:The (maximum) number of bytes to copy + * + * Returns the number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). 
+ * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&s_iter))
+			break;
+		if (!sg_miter_next(&d_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
+
-- 
2.25.1
[PATCH v2 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v1 [posted 20201016]:
- Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() [called via sg_miter_stop()] calls and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" at a relatively small performance cost.
- the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned since it may be useful to the caller.
This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 204 +++-
 2 files changed, 213 insertions(+), 3 deletions(-)

-- 
2.25.1
[PATCH v2 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 3 +++ lib/scatterlist.c | 54 ++--- 2 files changed, 54 insertions(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index ae260dc5fedb..a40012c8a4e6 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d910776a4c96..a704039ab54d 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -985,7 +985,8 @@ EXPORT_SYMBOL(sg_zero_buffer); * @s_skip: Number of bytes to skip in source before starting * @n_bytes:The (maximum) number of bytes to copy * - * Returns the number of copied bytes. + * Returns: + * The number of copied bytes. * * Notes: * Destination arguments appear before the source arguments, as with memcpy(). @@ -1058,8 +1059,9 @@ EXPORT_SYMBOL(sgl_copy_sgl); * @y_skip: Number of bytes to skip in y (right) before starting * @n_bytes:The (maximum) number of bytes to compare * - * Returns true if x and y compare equal before x, y or n_bytes is exhausted. - * Otherwise on a miscompare, returns false (and stops comparing). + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. 
+ *   Otherwise on a miscompare, returns false (and stops comparing).
  *
  * Notes:
  * x and y are symmetrical: they can be swapped and the result is the same.
@@ -1108,3 +1110,49 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+		miter.consumed = len;
+		sg_miter_stop(&miter);
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
+
-- 
2.25.1
[PATCH v2 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The "Check for integer overflow" comment above the removed code arguably gives a false sense of security. The right hand side of the expression in the condition resolves to a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount either. If that was the intention then the comment could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) <>;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB, so it seems unreasonable that sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for that memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 3 --- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 45cf7b69d852..80178afc2a4a 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c448642e0f78..d5770e7f1030 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; nalloc = nent; if (chainable) { /* Check for integer overflow */ -- 2.25.1
[PATCH v2 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 +++ lib/scatterlist.c | 60 + 2 files changed, 64 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a0a86059c10e..d910776a4c96 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1048,3 +1048,63 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski } EXPORT_SYMBOL(sgl_copy_sgl); +/** + * sgl_compare_sgl - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents:Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents:Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes:The (maximum) number of bytes to compare + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing).
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (equ && offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	sg_miter_stop(&y_iter);
+	sg_miter_stop(&x_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
-- 
2.25.1
Re: [PATCH 2/4] scatterlist: add sgl_copy_sgl() function
On 2020-10-16 7:17 a.m., Bodo Stroesser wrote: Hi Douglas, AFAICS this patch - and also patch 3 - are not correct. When started with SG_MITER_ATOMIC, sg_miter_next and sg_miter_stop use the k(un)map_atomic calls. But these have to be used strictly nested according to docu and code. The below code uses the atomic mappings in overlapping mode. That being the case, I'll add d_flags and s_flags arguments that are expected to take either 0 or SG_MITER_ATOMIC and re-test. There probably should be a warning in the notes not to set both d_flags and s_flags to SG_MITER_ATOMIC. My testing to date has not been in irq or soft interrupt state. I should be able to rig a test for the latter. Thanks Doug Gilbert Am 16.10.20 um 06:52 schrieb Douglas Gilbert: Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 86 + 2 files changed, 90 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 80178afc2a4a..6649414c0749 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d5770e7f1030..1ec2c909c8d4 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -974,3 +974,89 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents: Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before copying + * @s_sgl: Source sgl + * @s_nents: Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before copying + * @n_bytes: The number of bytes to copy + * + * Returns the number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if the end of d_sgl or s_sgl is reached. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t d_off, s_off, len, d_len, s_len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter;
+	struct sg_mapping_iter s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+
+	for (d_off = 0, s_off = 0; true ; ) {
+		/* Assume d_iter.length and s_iter.length can never be 0 */
+		if (d_off == 0) {
+			if (!sg_miter_next(&d_iter))
+				break;
+			d_len = d_iter.length;
+		} else {
+			d_len = d_iter.length - d_off;
+		}
+		if (s_off == 0) {
+			if (!sg_miter_next(&s_iter))
+				break;
+			s_len = s_iter.length;
+		} else {
+			s_len = s_iter.length - s_off;
+		}
+		len = min3(d_len, s_len, n_bytes - offset);
+
+		memcpy(d_iter.addr + d_off, s_iter.addr + s_off, len);
+		offset += len;
+		if (offset >= n_bytes)
+			break;
+		if (d_len == s_len) {
+			d_off = 0;
+			s_off = 0;
+
[PATCH 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 84 - 2 files changed, 86 insertions(+), 2 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 1ec2c909c8d4..344725990b9d 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -979,10 +979,10 @@ EXPORT_SYMBOL(sg_zero_buffer); * sgl_copy_sgl - Copy over a destination sgl from a source sgl * @d_sgl: Destination sgl * @d_nents:Number of SG entries in destination sgl - * @d_skip: Number of bytes to skip in destination before copying + * @d_skip: Number of bytes to skip in destination before starting * @s_sgl: Source sgl * @s_nents:Number of SG entries in source sgl - * @s_skip: Number of bytes to skip in source before copying + * @s_skip: Number of bytes to skip in source before starting * @n_bytes:The number of bytes to copy * * Returns the number of copied bytes.
@@ -1060,3 +1060,83 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl - Compare x and y (both sgl_s)
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The number of bytes to compare
+ *
+ * Returns true if x and y compare equal before x, y or n_bytes is exhausted.
+ * Otherwise on a miscompare, returns false (and stops comparing).
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   Same comment from sgl_copy_sgl() about large _skip arguments applies
+ *   here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t x_off, y_off, len, x_len, y_len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter;
+	struct sg_mapping_iter y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	for (x_off = 0, y_off = 0; true ; ) {
+		/* Assume x_iter.length and y_iter.length can never be 0 */
+		if (x_off == 0) {
+			if (!sg_miter_next(&x_iter))
+				break;
+			x_len = x_iter.length;
+		} else {
+			x_len = x_iter.length - x_off;
+		}
+		if (y_off == 0) {
+			if (!sg_miter_next(&y_iter))
+				break;
+			y_len = y_iter.length;
+		} else {
+			y_len = y_iter.length - y_off;
+		}
+		len = min3(x_len, y_len, n_bytes - offset);
+
+		equ = memcmp(x_iter.addr + x_off, y_iter.addr + y_off, len) == 0;
+		offset += len;
+		if (!equ || offset >= n_bytes)
+			break;
+		if (x_len == y_len) {
+			x_off = 0;
+			y_off = 0;
+		} else if (x_len < y_len) {
+			x_off = 0;
+			y_off += len;
+		} else {
+			x_off += len;
+			y_off = 0;
+		}
+	}
+fini:
+	sg_miter_stop(&x_iter);
+	sg_miter_stop(&y_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
[PATCH 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 +++
 lib/scatterlist.c           | 39 +++++++++++++++++++++++++++++++++++---
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ae260dc5fedb..e50dc9a6d887 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
 		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		     size_t n_bytes);
 
+void sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 344725990b9d..3ca66f0c949f 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1083,8 +1083,8 @@ EXPORT_SYMBOL(sgl_copy_sgl);
  *
  **/
 bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
-struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
-size_t n_bytes)
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
 {
 	bool equ = true;
 	size_t x_off, y_off, len, x_len, y_len;
@@ -1140,3 +1140,38 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The number of bytes to modify
+ *
+ * Notes:
+ *   Writes val n_bytes times or until sgl is exhausted.
+ *
+ **/
+void sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
+
+	if (n_bytes == 0)
+		return;
+	sg_miter_start(&miter, sgl, nents, sg_flags);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+}
--
2.25.1
[PATCH 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. This patchset adds sgl_copy_sgl() and a few other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 204 +++++++++++++++++++++++++++++++++++-
 2 files changed, 213 insertions(+), 3 deletions(-)

--
2.25.1
[PATCH 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The comment before the removed code, "Check for integer overflow", arguably gives a false sense of security. The right hand side of the expression in the condition is resolved as u32 so cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount. If that was the intention then the comment could be dropped and the condition rewritten more clearly as:

    if (length > UINT32_MAX)
        return NULL;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB so it seems unreasonable that:

    sgl_alloc_order(unsigned long long length, )

does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for the memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h | 1 +
 lib/scatterlist.c           | 3 ---
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..80178afc2a4a 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
 			      unsigned int *nent_p);
 void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
+/* Only use sgl_free() when order is 0 */
 void sgl_free(struct scatterlist *sgl);
 #endif /* CONFIG_SGL_ALLOC */
 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c448642e0f78..d5770e7f1030 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
-		return NULL;
 	nalloc = nent;
 	if (chainable) {
 		/* Check for integer overflow */
--
2.25.1
[PATCH 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 86 +++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 80178afc2a4a..6649414c0749 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5770e7f1030..1ec2c909c8d4 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -974,3 +974,89 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl: Destination sgl
+ * @d_nents: Number of SG entries in destination sgl
+ * @d_skip: Number of bytes to skip in destination before copying
+ * @s_sgl: Source sgl
+ * @s_nents: Number of SG entries in source sgl
+ * @s_skip: Number of bytes to skip in source before copying
+ * @n_bytes: The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with memcpy().
+ *
+ *   Stops copying if the end of d_sgl or s_sgl is reached.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t d_off, s_off, len, d_len, s_len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter;
+	struct sg_mapping_iter s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+
+	for (d_off = 0, s_off = 0; true ; ) {
+		/* Assume d_iter.length and s_iter.length can never be 0 */
+		if (d_off == 0) {
+			if (!sg_miter_next(&d_iter))
+				break;
+			d_len = d_iter.length;
+		} else {
+			d_len = d_iter.length - d_off;
+		}
+		if (s_off == 0) {
+			if (!sg_miter_next(&s_iter))
+				break;
+			s_len = s_iter.length;
+		} else {
+			s_len = s_iter.length - s_off;
+		}
+		len = min3(d_len, s_len, n_bytes - offset);
+
+		memcpy(d_iter.addr + d_off, s_iter.addr + s_off, len);
+		offset += len;
+		if (offset >= n_bytes)
+			break;
+		if (d_len == s_len) {
+			d_off = 0;
+			s_off = 0;
+		} else if (d_len < s_len) {
+			d_off = 0;
+			s_off += len;
+		} else {
+			d_off += len;
+			s_off = 0;
+		}
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
--
2.25.1
[RESEND PATCH] sgl_alloc_order: fix memory leak
sgl_alloc_order() can fail when 'length' is large on a memory constrained system. When order > 0 it will potentially be making several multi-page allocations with the later ones more likely to fail than the earlier ones. So it is important that sgl_alloc_order() frees up any pages it has obtained before returning NULL. In the case when order > 0 it calls the wrong free page function and leaks. In testing the leak was sufficient to bring down my 8 GiB laptop with OOM.

Reviewed-by: Bart Van Assche
Signed-off-by: Douglas Gilbert
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..c448642e0f78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -514,7 +514,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 		elem_len = min_t(u64, length, PAGE_SIZE << order);
 		page = alloc_pages(gfp, order);
 		if (!page) {
-			sgl_free(sgl);
+			sgl_free_order(sgl, order);
 			return NULL;
 		}
--
2.25.1
Re: [question] What happens when dd writes data to a missing device?
On 2020-10-11 3:46 p.m., Mikhail Gavrilov wrote:

Hi folks! I have a question. What happens when dd writes data to a missing device? For example:
# dd if=/home/mikhail/Downloads/Fedora-Workstation-Live-x86_64-Rawhide-20201010.n.0.iso of=/dev/adb
Today I wrongly entered /dev/adb instead of /dev/sdb, and what my surprise was when the data began to be written to the /dev/adb device without errors. But my surprise was even greater when cat /dev/adb started to display the written data. I have a question: Where was the data written, and could it damage the data stored in memory or on disk?

Others have answered your direct question. You may find 'oflag=nocreat' helpful if you (or others) do _not_ want a regular file created in /dev ; for example: if you have misspelt a device name. That flag may also be helpful on unstable systems (e.g. where device nodes are disappearing and re-appearing) as it can be a real pain if you manage to create a regular file with a name like /dev/sdc when the disk usually occupying that node is temporarily offline. When that disk comes back online, the regular file '/dev/sdc' will stop device node '/dev/sdc' from being created. The solution is to remove the regular file /dev/sdc, and you probably need to power cycle that disk. If this becomes a regular event then 'oflag=nocreat' is your friend [see 'man dd' for a little more information; it really should be expanded].

Doug Gilbert
Re: [PATCH] lib/scatterlist: Fix memory leak in sgl_alloc_order()
On 2020-09-20 4:11 p.m., Markus Elfring wrote:

Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

* Would an imperative wording become helpful for the change description?

… … and the term "imperative wording" rings no bells in my grammatical education. …

I suggest to take another look at the published Linux development documentation. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n151

* How do you think about to add the tag “Fixes” to the commit message?

In the workflow I'm used to, others (closer to LT) make that decision. Why waste my time?

I find another bit of guidance relevant. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n183

* Will an other patch subject be more appropriate?

Twas testing a 6 GB allocation with said function on my 8 GB laptop. It failed and free told me 5 GB had disappeared …

… Have we got any different expectations for the canonical patch subject line? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n684

I am curious how the software will evolve further also according to your system test experiences.

Sorry, I didn't come down in the last shower, it's not my first bug fix. Try consulting 'git log' and look for my name, or the MAINTAINERS file. The culprits are usually happy, as was the case with this patch. It's ack-ed and I would be very surprised if Jens Axboe doesn't accept it. It is an obvious flaw. Fix it and move on. Alternatively supply your own patch that ticks all the above boxes.
If you want to talk about something substantial, then why do we have a function named sgl_free() that only works properly if, for example, the sgl_alloc_order() function creating the sgl used order==0 ? IMO sgl_free() should be removed or renamed. Doug Gilbert BTW The "imperative mood" stuff in that document is nonsense, at least in English. Wikipedia maps that term back to "the imperative" as in "Get thee to a nunnery" and "Et tu, Brute".
Re: [PATCH] lib/scatterlist: Fix memory leak in sgl_alloc_order()
On 2020-09-20 1:09 p.m., Markus Elfring wrote:

Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

* Would an imperative wording become helpful for the change description?

No passive tense there. Or do you mean usage like: "Go to hell" or "Fix memory leak in ..."? I studied French and Latin at school; at a guess, my mother tongue got its grammar from the former. My mother taught English grammar and the term "imperative wording" rings no bells in my grammatical education. Google agrees with me. Please define: "imperative wording".

* How do you think about to add the tag “Fixes” to the commit message?

In the workflow I'm used to, others (closer to LT) make that decision. Why waste my time?

* Will an other patch subject be more appropriate?

Twas testing a 6 GB allocation with said function on my 8 GB laptop. It failed and free told me 5 GB had disappeared (and 'cat /sys/kernel/debug/kmemleak' told me _nothing_). Umm, it is potentially a HUGE f@#$ing memory LEAK! Best to call a spade a spade.

Doug Gilbert
[PATCH] sgl_alloc_order: memory leak
Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

Signed-off-by: Douglas Gilbert
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..c448642e0f78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -514,7 +514,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 		elem_len = min_t(u64, length, PAGE_SIZE << order);
 		page = alloc_pages(gfp, order);
 		if (!page) {
-			sgl_free(sgl);
+			sgl_free_order(sgl, order);
 			return NULL;
 		}
--
2.25.1
[PATCH] tools/io_uring: fix compile breakage
It would seem none of the kernel continuous integration does this:

    $ cd tools/io_uring
    $ make

Otherwise it may have noticed:

cc -Wall -Wextra -g -D_GNU_SOURCE -c -o io_uring-bench.o io_uring-bench.c
io_uring-bench.c:133:12: error: static declaration of ‘gettid’ follows non-static declaration
  133 | static int gettid(void)
      |            ^~~~~~
In file included from /usr/include/unistd.h:1170,
                 from io_uring-bench.c:27:
/usr/include/x86_64-linux-gnu/bits/unistd_ext.h:34:16: note: previous declaration of ‘gettid’ was here
   34 | extern __pid_t gettid (void) __THROW;
      |                ^~~~~~
make: *** [<builtin>: io_uring-bench.o] Error 1

The problem on Ubuntu 20.04 (with lk 5.9.0-rc5) is that unistd.h already declares gettid(). So prefix the local definition with "lk_".

Signed-off-by: Douglas Gilbert
---
 tools/io_uring/io_uring-bench.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/io_uring/io_uring-bench.c b/tools/io_uring/io_uring-bench.c
index 0f257139b003..7703f0118385 100644
--- a/tools/io_uring/io_uring-bench.c
+++ b/tools/io_uring/io_uring-bench.c
@@ -130,7 +130,7 @@ static int io_uring_register_files(struct submitter *s)
 					s->nr_files);
 }
 
-static int gettid(void)
+static int lk_gettid(void)
 {
 	return syscall(__NR_gettid);
 }
@@ -281,7 +281,7 @@ static void *submitter_fn(void *data)
 	struct io_sq_ring *ring = &s->sq_ring;
 	int ret, prepped;
 
-	printf("submitter=%d\n", gettid());
+	printf("submitter=%d\n", lk_gettid());
 
 	srand48_r(pthread_self(), &s->rand);
--
2.25.1
Re: [PATCH] scsi: clear UAC before sending SG_IO
On 2020-09-10 6:15 a.m., Randall Huang wrote:

Make sure UAC is clear before sending SG_IO.
Signed-off-by: Randall Huang

This patch just looks wrong. Imagine if every LLD front loaded some LLD specific code before each invocation of ioctl(SG_IO). Is UAC Unit Attention Condition? If so, the mid-level notes them as they fly past. Haven't looked at the rest of the patchset but I suspect the "wlun_clr_uac" work needs a rethink. If that is the REPORT LUNS Well known LUN then perhaps it could be handled in the mid-level scanning code. Otherwise it should be handled in the LLD/UFS.

Also users of ioctl(SG_IO) should be capable of handling UAs, even if they are irrelevant, and repeating the invocation. Finally ioctl(sg_dev, SG_IO) is not the only way to send a pass-through command; there are also:
  - write(sg_dev, ...)
  - ioctl(bsg_dev, SG_IO, ...)
  - ioctl(most_blk_devs, SG_IO, ...)
  - ioctl(st_dev, SG_IO, ...)

Hopefully I have convinced you by now not to take this route.

Doug Gilbert

---
 drivers/scsi/sg.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 20472aaaf630..ad11bca47ae8 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -922,6 +922,7 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 	int result, val, read_only;
 	Sg_request *srp;
 	unsigned long iflags;
+	int _cmd;
 
 	SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp,
 				      "sg_ioctl: cmd=0x%x\n", (int) cmd_in));
@@ -933,6 +934,13 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 		return -ENODEV;
 	if (!scsi_block_when_processing_errors(sdp->device))
 		return -ENXIO;
+
+	_cmd = SCSI_UFS_REQUEST_SENSE;
+	if (sdp->device->host->wlun_clr_uac) {
+		sdp->device->host->hostt->ioctl(sdp->device, _cmd, NULL);
+		sdp->device->host->wlun_clr_uac = false;
+	}
+
 	result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR,
 			      1, read_only, 1, &srp);
 	if (result < 0)
Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
On 2020-08-19 11:20 a.m., John Garry wrote:

Hi all, Here is v8 of the patchset. In this version of the series, we keep the shared sbitmap for driver tags, and introduce changes to fix up the tag budgeting across request queues. We also have a change to count requests per-hctx for when an elevator is enabled, as an optimisation. I also dropped the debugfs changes - more on that below.

Some performance figures: Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included, but it is not always an appropriate scheduler to use.

Tag depth                       4000 (default)   260**

Baseline (v5.9-rc1):
none sched:                     2094K IOPS       513K
mq-deadline sched:              2145K IOPS       1336K

Final, host_tagset=0 in LLDD *, ***:
none sched:                     2120K IOPS       550K
mq-deadline sched:              2121K IOPS       1309K

Final ***:
none sched:                     2132K IOPS       1185
mq-deadline sched:              2145K IOPS       2097

* this is relevant as this is the performance in supporting but not enabling the feature
** depth=260 is relevant as some point where we are regularly waiting for tags to be available. Figures were a bit unstable here.
*** Included "[PATCH V4] scsi: core: only re-run queue in scsi_end_request() if device queue is busy"

A copy of the patches can be found here: https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8

The hpsa patch depends on: https://lore.kernel.org/linux-scsi/20200430131904.5847-1-h...@suse.de/ And the smartpqi patch is not to be accepted. Comments (and testing) welcome, thanks!

I tested this v8 patchset on MKP's 5.10/scsi-queue branch together with my rewritten sg driver on my laptop and a Ryzen 5 3600 machine.
Since I don't have the same hardware, I use the scsi_debug driver as the target:

    modprobe scsi_debug dev_size_mb=1024 sector_size=512 add_host=7 per_host_store=1 ndelay=1000 random=1 submit_queues=12

My test is a script which runs these three commands many times with differing parameters:

    sg_mrq_dd iflag=random bs=512 of=/dev/sg8 thr=64 time=2
    time to transfer data was 0.312705 secs, 3433.72 MB/sec
    2097152+0 records in
    2097152+0 records out

    sg_mrq_dd bpt=256 thr=64 mrq=36 time=2 if=/dev/sg8 bs=512 of=/dev/sg9
    time to transfer data was 0.212090 secs, 5062.67 MB/sec
    2097152+0 records in
    2097152+0 records out

    sg_mrq_dd --verify if=/dev/sg8 of=/dev/sg9 bs=512 bpt=256 thr=64 mrq=36 time=2
    Doing verify/cmp rather than copy
    time to transfer data was 0.184563 secs, 5817.75 MB/sec
    2097152+0 records in
    2097152+0 records verified

The above is the output from the last section of my script run on the Ryzen 5. So the three steps are: 1) produce random data on /dev/sg8; 2) copy /dev/sg8 to /dev/sg9; 3) verify /dev/sg8 and /dev/sg9 are the same. The latter step is done with a sequence of READ(/dev/sg8) and VERIFY(BYTCHK=1 on /dev/sg9) commands. The "mrq" stands for multiple requests (in one invocation); the bsg driver did that before its write(2) command was removed.
The SCSI devices on the Ryzen 5 machine are:

# lsscsi -gs
[2:0:0:0]  disk     IBM-207x HUSMM8020ASS20  J4B6  /dev/sda   /dev/sg0  200GB
[2:0:1:0]  disk     SEAGATE  ST200FM0073     0007  /dev/sdb   /dev/sg1  200GB
[2:0:2:0]  enclosu  Areca Te ARC-802801.37.69 0137  -         /dev/sg2  -
[3:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdc   /dev/sg3  1.07GB
[4:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdd   /dev/sg4  1.07GB
[5:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sde   /dev/sg5  1.07GB
[6:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdf   /dev/sg6  1.07GB
[7:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdg   /dev/sg7  1.07GB
[8:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdh   /dev/sg8  1.07GB
[9:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdi   /dev/sg9  1.07GB
[N:0:1:1]  disk     WDC WDS250G2B0C-00PXH0__1     /dev/nvme0n1  -       250GB

My script took 17m12 and the highest throughput (on a copy) was 7.5 GB/sec. Then I reloaded the scsi_debug module, this time with an additional 'host_max_queue=128' parameter. The script run time was 5 seconds shorter and the maximum throughput was around 7.6 GB/sec. [Average throughput is around 4 GB/sec.] For comparison:

# time liburing/examples/io_uring-cp /dev/sdh /dev/sdi
real    0m1.542s
user    0m0.004s
sys     0m1.027s

Umm, that's less than 1 GB/sec. In its defence, io_uring-cp is an extremely simple, single threaded, proof-of-concept copy program, at least compared to sg_mrq_dd. As used by sg_mrq_dd, the rewritten sg driver bypasses moving 1 GB to and from _user_ space while doing the above copy and verify steps. So:

Tested-by: Douglas Gilbert

Differences to v7:
- Add null_blk and scsi_debug support
- Drop debugfs tags patch - it's too difficult to be the same between hostw
Re: rework check_disk_change()
On 2020-09-02 10:11 a.m., Christoph Hellwig wrote:

Hi Jens, this series replaced the not very nice check_disk_change() function with a new bdev_media_changed that avoids having the ->revalidate_disk call at its end. As a result ->revalidate_disk can be removed from a lot of drivers.

For over 20 years the sg driver has been carrying this snippet that hangs off the completion callback:

	if (driver_stat & DRIVER_SENSE) {
		struct scsi_sense_hdr ssh;

		if (scsi_normalize_sense(sbp, sense_len, &ssh)) {
			if (!scsi_sense_is_deferred(&ssh)) {
				if (ssh.sense_key == UNIT_ATTENTION) {
					if (sdp->device->removable)
						sdp->device->changed = 1;
				}
			}
		}
	}

Is it needed? The unit attention (UA) may not be associated with the device changing. Shouldn't the SCSI mid-level monitor UAs if they impact the state of a scsi_device object?

Doug Gilbert
Re: [PATCH] scsi: sd: add runtime pm to open / release
On 2020-07-29 10:32 a.m., Alan Stern wrote: On Wed, Jul 29, 2020 at 04:12:22PM +0200, Martin Kepplinger wrote: On 28.07.20 22:02, Alan Stern wrote: On Tue, Jul 28, 2020 at 09:02:44AM +0200, Martin Kepplinger wrote: Hi Alan, Any API cleanup is of course welcome. I just wanted to remind you that the underlying problem: broken block device runtime pm. Your initial proposed fix "almost" did it and mounting works but during file access, it still just looks like a runtime_resume is missing somewhere. Well, I have tested that proposed fix several times, and on my system it's working perfectly. When I stop accessing a drive it autosuspends, and when I access it again it gets resumed and works -- as you would expect. that's weird. when I mount, everything looks good, "sda1". But as soon as I cd to the mountpoint and do "ls" (on another SD card "ls" works but actual file reading leads to the exact same errors), I get: [ 77.474632] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ 77.474647] sd 0:0:0:0: [sda] tag#0 Sense Key : 0x6 [current] [ 77.474655] sd 0:0:0:0: [sda] tag#0 ASC=0x28 ASCQ=0x0 [ 77.474667] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 60 40 00 00 01 00 This error report comes from the SCSI layer, not the block layer. SCSI's first 11 byte command! I'm guessing the first byte is being repeated and it's actually: 28 00 00 00 60 40 00 00 01 00 [READ(10)] That should be fixed. It should be something like: "...CDB in hex: 28 00 ...". 
Doug Gilbert [ 77.474678] blk_update_request: I/O error, dev sda, sector 24640 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.485836] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.491628] blk_update_request: I/O error, dev sda, sector 24641 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.502275] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.508051] blk_update_request: I/O error, dev sda, sector 24642 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.518651] sd 0:0:0:0: [sda] tag#0 device offline or changed (...) [ 77.947653] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.953434] FAT-fs (sda1): Directory bread(block 16448) failed [ 77.959333] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.965118] FAT-fs (sda1): Directory bread(block 16449) failed [ 77.971014] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.976802] FAT-fs (sda1): Directory bread(block 16450) failed [ 77.982698] sd 0:0:0:0: [sda] tag#0 device offline or changed (...) [ 78.384929] FAT-fs (sda1): Filesystem has been set read-only [ 103.070973] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 103.076751] print_req_error: 118 callbacks suppressed [ 103.076760] blk_update_request: I/O error, dev sda, sector 9748 op 0x1:(WRITE) flags 0x10 phys_seg 1 prio class 0 [ 103.087428] Buffer I/O error on dev sda1, logical block 1556, lost async page write [ 103.095309] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 103.101123] blk_update_request: I/O error, dev sda, sector 17162 op 0x1:(WRITE) flags 0x10 phys_seg 1 prio class 0 [ 103.111883] Buffer I/O error on dev sda1, logical block 8970, lost async page write I can't tell why you're getting that error. In one of my tests the device returned the same kind of error status (Sense Key = 6, ASC = 0x28) but the operation was then retried successfully. Perhaps the problem lies in the device you are testing. 
As we need to have that working at some point, I might look into it, but someone who has experience in the block layer can surely do it more efficiently. I suspect that any problems you still face are caused by something else. I then formatted sda1 to ext2 (on the runtime suspend system testing your patch) and that seems to have worked! Again accessing the mountpoint then yield the very same "device offline or changed" errors. What kind of device are you testing? You should be easily able to reproduce this using an "sd" device. I tested two devices: a SanDisk Cruzer USB flash drive and a g-mass-storage gadget running under dummy-hcd. They each showed up as /dev/sdb on my system. I haven't tried testing with an SD card. If you have any specific sequence of commands you would like me to run, let me know. The problems must lie in the different other drivers we use I guess. Or the devices. Have you tried testing with a USB flash drive? Alan Stern
Re: [RFC][PATCHES] drivers/scsi/sg.c uaccess cleanups/fixes
On 2019-10-17 9:36 p.m., Al Viro wrote: On Wed, Oct 16, 2019 at 09:25:40PM +0100, Al Viro wrote: FWIW, callers of __copy_from_user() remaining in the generic code: 6) drivers/scsi/sg.c nest: sg_read() ones are memdup_user() in disguise (i.e. fold with immediately preceding kmalloc()s). sg_new_write() - fold with access_ok() into copy_from_user() (for both call sites). sg_write() - lose access_ok(), use copy_from_user() (both call sites) and get_user() (instead of the solitary __get_user() there). Turns out that there'd been outright redundant access_ok() calls (not even warranted by __copy_...) *and* several __put_user()/__get_user() with no checking of return value (access_ok() was there, handling of unmapped addresses wasn't). The latter go back at least to 2.1.early... I've got a series that presumably fixes and cleans the things up in that area; it didn't get any serious testing (the kernel builds and boots, smartctl works as well as it used to, but that's not worth much - all it says is that SG_IO doesn't fail terribly; I don't have any test setup for really working with /dev/sg*). IOW, it needs more review and testing - this is _not_ a pull request. It's in vfs.git#work.sg; individual patches are in followups. Shortlog/diffstat:

Al Viro (8):
  sg_ioctl(): fix copyout handling
  sg_new_write(): replace access_ok() + __copy_from_user() with copy_from_user()
  sg_write(): __get_user() can fail...
  sg_read(): simplify reading ->pack_id of userland sg_io_hdr_t
  sg_new_write(): don't bother with access_ok
  sg_read(): get rid of access_ok()/__copy_..._user()
  sg_write(): get rid of access_ok()/__copy_from_user()/__get_user()
  SG_IO: get rid of access_ok()

 drivers/scsi/sg.c | 98
 1 file changed, 32 insertions(+), 66 deletions(-)

Al, I am aware of these and have a 23 part patchset on the linux-scsi list for review (see https://marc.info/?l=linux-scsi&m=157052102631490&w=2 ) that amongst other things fixes all of these. 
It also re-adds the functionality removed from the bsg driver last year. Unfortunately that review process is going very slowly, so I have no objections if you apply these now. It is unlikely that these changes will introduce any bugs (they didn't in my testing). If you want to do more testing you may find the sg3_utils package helpful, especially in the testing directory: https://github.com/hreinecke/sg3_utils Doug Gilbert
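The fold Al describes (a separate access_ok() range check followed by an unchecked __copy_from_user(), collapsed into a single checked copy_from_user()) can be sketched as a self-contained userspace model. Everything here is hypothetical shorthand for the kernel pattern, not actual sg.c or kernel API: "user memory" is a fixed window, and the point is that one helper both validates and copies, so a caller cannot forget the check or check a different length than it copies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "user address space": a fixed window of addresses. */
#define USER_BASE  0x1000u
#define USER_LIMIT 0x2000u

static unsigned char user_mem[USER_LIMIT - USER_BASE];

/* access_ok() analogue: is [uaddr, uaddr+len) inside the window?
 * Note the overflow-safe form: len is compared against the space
 * remaining, never uaddr + len against the limit. */
static int access_ok_sim(uint32_t uaddr, size_t len)
{
    return uaddr >= USER_BASE && uaddr <= USER_LIMIT &&
           len <= USER_LIMIT - uaddr;
}

/* copy_from_user() analogue: check *and* copy in one call.  The buggy
 * shape this replaces did the memcpy unconditionally and relied on
 * every caller having done access_ok_sim() first. */
static int copy_from_user_sim(void *dst, uint32_t uaddr, size_t len)
{
    if (!access_ok_sim(uaddr, len))
        return -EFAULT;
    memcpy(dst, user_mem + (uaddr - USER_BASE), len);
    return 0;
}
```

The same reasoning applies to get_user() versus __get_user(): the checked form cannot silently read an unmapped address, which is exactly the class of bug ("access_ok() was there, handling of unmapped addresses wasn't") the series closes.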
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-07-08 2:01 a.m., Hannes Reinecke wrote: On 7/5/19 7:53 PM, Douglas Gilbert wrote: On 2019-07-05 3:22 a.m., Hannes Reinecke wrote: [ .. ] As mentioned, rescan-scsi-bus.sh is keeping references to /proc/scsi as a fall back only, as it's meant to work kernel independent. Per default it'll be using /sys, and will happily work without /proc/scsi. So it's really only /proc/scsi/sg which carries some meaningful information; maybe we should move/copy it to somewhere else. I personally like getting rid of /proc/scsi. /proc/scsi/device_info doesn't seem to be in sysfs. Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Currently that directory only has symlinks to the sg devices. The sg parameters are already available in /sys/module/sg/parameters; so from that perspective I feel we're good.

# ls /sys/module/sg/parameters/
allow_dio  def_reserved_size  scatter_elem_sz
# ls /proc/scsi/sg/
allow_dio  debug  def_reserved_size  device_hdr  devices  device_strs  red_debug  version

So that doesn't work: what is in 'parameters' was passed in at module/driver initialization. Back to my original question: Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Problem is /proc/scsi/device_info, for which we currently don't have any other location to store it at. Hmm. Doug Gilbert
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-07-05 3:22 a.m., Hannes Reinecke wrote: On 6/18/19 7:43 PM, Elliott, Robert (Servers) wrote: -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Bart Van Assche Sent: Monday, June 17, 2019 10:28 PM To: dgilb...@interlog.com; Marc Gonzalez ; James Bottomley ; Martin Petersen Cc: SCSI ; LKML ; Christoph Hellwig Subject: Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default On 6/17/19 5:35 PM, Douglas Gilbert wrote: For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Hi Doug, This is the command I ran: $ git grep /proc/scsi | wc -l 38 I think your query excludes scripts/rescan-scsi-bus.sh. Bart. Here's the full list to ensure the discussion doesn't overlook anything: sg3_utils-1.44$ grep -R /proc/scsi . ./src/sg_read.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./scripts/rescan-scsi-bus.sh:# Return hosts. /proc/scsi/HOSTADAPTER/? 
must exist ./scripts/rescan-scsi-bus.sh: for driverdir in /proc/scsi/*; do ./scripts/rescan-scsi-bus.sh:driver=${driverdir#/proc/scsi/} ./scripts/rescan-scsi-bus.sh: name=${hostdir#/proc/scsi/*/} ./scripts/rescan-scsi-bus.sh:# Get /proc/scsi/scsi info for device $host:$channel:$id:$lun ./scripts/rescan-scsi-bus.sh:SCSISTR=$(grep -A "$LN" -e "$grepstr" /proc/scsi/scsi) ./scripts/rescan-scsi-bus.sh:DRV=`grep 'Attached drivers:' /proc/scsi/scsi 2>/dev/null` ./scripts/rescan-scsi-bus.sh: echo "scsi report-devs 1" >/proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: DRV=`grep 'Attached drivers:' /proc/scsi/scsi 2>/dev/null` ./scripts/rescan-scsi-bus.sh: echo "scsi report-devs 0" >/proc/scsi/scsi ./scripts/rescan-scsi-bus.sh:# Outputs description from /proc/scsi/scsi (unless arg passed) ./scripts/rescan-scsi-bus.sh:echo "scsi remove-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $host $channel $id $SCAN_WILD_CARD" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh:if test ! -d /sys/class/scsi_host/ -a ! -d /proc/scsi/; then ./ChangeLog:/proc/scsi/sg/allow_dio is '0' ./ChangeLog: - change sg_debug to call system("cat /proc/scsi/sg/debug"); ./suse/sg3_utils.changes: * Support systems without /proc/scsi ./examples/sgq_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./doc/sg_read.8:If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_read.8:"echo 1 > /proc/scsi/sg/allow_dio". An alternate way to avoid the ./doc/sg_map.8:observing the output of the command: "cat /proc/scsi/scsi". ./doc/sgp_dd.8:at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sgp_dd.8:this at completion. 
If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sgp_dd.8:mapping to SCSI block devices should be checked with 'cat /proc/scsi/scsi' ./doc/sg_dd.8:notes this at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_dd.8:this at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_dd.8:with 'echo 1 > /proc/scsi/sg/allow_dio'. ./doc/sg_dd.8:mapping to SCSI block devices should be checked with 'cat /proc/scsi/scsi', As mentioned, rescan-scsi-bus.sh is keeping references to /proc/scsi as a fall back only, as it's meant to work kernel independent. Per default it'll be using /sys, and will happily work without /proc/scsi. So it's really only /proc/scsi/sg which carries some meaningful information; maybe we should move/copy it to somewhere else. I personally like getting rid of /proc/scsi. /proc/scsi/device_info doesn't seem to be in sysfs. Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Currently that directory only has symlinks to the sg devices. Doug Gilbert
Re: [PATCH 0/2] scsi: add support for request batching
On 2019-06-26 9:51 a.m., Paolo Bonzini wrote: On 30/05/19 13:28, Paolo Bonzini wrote: This allows a list of requests to be issued, with the LLD only writing the hardware doorbell when necessary, after the last request was prepared. This is more efficient if we have lists of requests to issue, particularly on virtualized hardware, where writing the doorbell is more expensive than on real hardware. This applies to any HBA, either singlequeue or multiqueue; the second patch implements it for virtio-scsi. Paolo Paolo Bonzini (2): scsi_host: add support for request batching virtio_scsi: implement request batching drivers/scsi/scsi_lib.c| 37 ++--- drivers/scsi/virtio_scsi.c | 55 +++--- include/scsi/scsi_cmnd.h | 1 + include/scsi/scsi_host.h | 16 +-- 4 files changed, 89 insertions(+), 20 deletions(-) Ping? Are there any more objections? I have no objections, just a few questions. To implement this is the scsi_debug driver, a per device queue would need to be added, correct? Then a 'commit_rqs' call would be expected at some later point and it would drain that queue and submit each command. Or is the queue draining ongoing in the LLD and 'commit_rqs' means: don't return until that queue is empty? So does that mean in the normal (i.e. non request batching) case there are two calls to the LLD for each submitted command? Or is 'commit_rqs' optional, a sync-ing type command? Doug Gilbert
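The queue/commit split being asked about can be modelled in miniature: requests are staged, and the expensive doorbell write happens once per batch rather than once per request. The sketch below is a toy userspace model of that idea under my own naming (it is not the scsi_host or virtio-scsi API); the 'last' flag stands in for the hint that no further requests follow, which is when the LLD can no longer defer the doorbell.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BATCH 32

/* Hypothetical mini-HBA: staged requests plus a doorbell counter that
 * stands in for the cost (an MMIO write, or a VM exit on virtual HW)
 * the batching is trying to amortize. */
struct toy_hba {
    int staged[MAX_BATCH];
    size_t nr_staged;
    unsigned int doorbell_writes;
};

/* commit_rqs analogue: make everything staged visible to the device
 * with a single doorbell write. */
static void toy_commit(struct toy_hba *h)
{
    if (h->nr_staged) {
        h->doorbell_writes++;   /* one write covers the whole batch */
        h->nr_staged = 0;
    }
}

/* queuecommand analogue: stage the request; only ring the doorbell
 * when the batch is full or the caller says no more are coming. */
static int toy_queue(struct toy_hba *h, int req, int last)
{
    if (h->nr_staged == MAX_BATCH)
        toy_commit(h);          /* batch full: flush early */
    h->staged[h->nr_staged++] = req;
    if (last)
        toy_commit(h);
    return 0;
}
```

In this model the answer to the "two calls per command" question is visible: in the non-batched case each request is queued with last set, so queue and commit collapse into one call path and the doorbell is written per request, as before.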
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-19 5:42 a.m., Marc Gonzalez wrote: On 18/06/2019 17:31, Douglas Gilbert wrote: On 2019-06-18 3:29 a.m., Marc Gonzalez wrote: Please note that I am _in no way_ suggesting that we remove any code. I just think it might be time to stop forcing CONFIG_SCSI_PROC_FS into every config, and instead require one to explicitly request the aging feature (which makes CONFIG_SCSI_PROC_FS show up in a defconfig). Maybe we could add CONFIG_SCSI_PROC_FS to arch/x86/configs/foo ? (For which foo? In a separate patch or squashed with this one?) Since current sg driver usage seems to depend more on SCSI_PROC_FS being "y" than other parts of the SCSI subsystem then if SCSI_PROC_FS is to default to "n" in the future then a new CONFIG_SG_PROC_FS variable could be added. If CONFIG_CHR_DEV_SG is "*" or "m" then default CONFIG_SG_PROC_FS to "y"; if CONFIG_SCSI_PROC_FS is "y" then default CONFIG_SG_PROC_FS to "y"; else default CONFIG_SG_PROC_FS to "n". Obviously the sg driver would need to be changed to use CONFIG_SG_PROC_FS instead of CONFIG_SCSI_PROC_FS . I like your idea, and I think it might even be made slightly simpler. I assume sg3_utils requires CHR_DEV_SG. Is it the case? If so, we would just need to enable SCSI_PROC_FS when CHR_DEV_SG is enabled. diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..642ca0e7d363 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,12 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS - default y + default CHR_DEV_SG ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Would that work for you? I checked that SCSI_PROC_FS=y whether CHR_DEV_SG=y or m I can spin a v2, with a blurb about how sg3_utils relies on SCSI_PROC_FS. 
Yes, but (see below) ... Does that defeat the whole purpose of your proposal or could it be seen as a partial step in that direction? What is the motivation for this proposal? The rationale was just to look for "special-purpose" options that are enabled by default, and change the default wherever possible, as a matter of uniformity. BTW We still have the non-sg related 'cat /proc/scsi/scsi' usage and 'cat /proc/scsi/device_info'. And I believe the latter one is writable even though its permissions say otherwise. Any relation between SG and BSG? Only in the sense that writing to /proc/scsi/device_info changes the way the SCSI mid-level handles the identified device. So that is in common with, and hence the same relation as, sd, sr, st, ses, etc have with the identified device (e.g. a specialized USB dongle). Example of use of /proc/scsi/scsi:

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Id: 00 Lun: 01
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Id: 00 Lun: 02
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07

Which can be replaced by:

$ lsscsi
[0:0:0:0]    disk    Linux   scsi_debug       0188  /dev/sda
[0:0:0:1]    disk    Linux   scsi_debug       0188  /dev/sdb
[0:0:0:2]    disk    Linux   scsi_debug       0188  /dev/sdc
[N:0:1:1]    disk    INTEL SSDPEKKF256G7L    __1   /dev/nvme0n1

Or if one really likes the "classic" look:

$ lsscsi -c
Attached devices:
Host: scsi0 Channel: 00 Target: 00 Lun: 00
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Target: 00 Lun: 01
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Target: 00 Lun: 02
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07

Now looking at /proc/scsi/device_info IMO unless there is a replacement for
/proc/scsi/device_info then your patch should not go ahead. If it does, any reasonable distro should override it.

$ cat /proc/scsi/device_info
'Aashima' 'IMAGERY 2400SP' 0x1
'CHINON' 'CD-ROM CDS-431' 0x1
'CHINON' 'CD-ROM CDS-535' 0x1
'DENON' 'DRD-25X' 0x1
...
'XYRATEX' 'RS' 0x240
'Zzyzx' 'RocketStor 500S' 0x40
'Zzyzx' 'RocketStor 2000' 0x40

That is a black (or quirks) list that can be added to by writing an entry to /proc/scsi/device_info. So if
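Each device_info entry pairs a quoted vendor and model with a hexadecimal flag word (the kernel's own handling lives in drivers/scsi/scsi_devinfo.c). As an illustration of the entry format only, a minimal parser might look like this; the field widths follow the SCSI INQUIRY limits of 8 bytes for vendor and 16 for model, and the function name is my own:

```c
#include <assert.h>
#include <stdio.h>

/* Parse one quirk entry of the form:  'VENDOR' 'MODEL' 0xFLAGS
 * vendor must hold >= 9 bytes, model >= 17.  Returns 1 on success.
 * %x accepts an optional 0x prefix, matching the listed entries. */
static int parse_devinfo(const char *line, char *vendor, char *model,
                         unsigned int *flags)
{
    return sscanf(line, " '%8[^']' '%16[^']' %x",
                  vendor, model, flags) == 3;
}
```

A vendor or model longer than the SCSI field width fails to match its closing quote, so an oversized entry is rejected rather than truncated silently.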
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-18 3:29 a.m., Marc Gonzalez wrote: On 18/06/2019 03:08, Finn Thain wrote: On Mon, 17 Jun 2019, Douglas Gilbert wrote: On 2019-06-17 5:11 p.m., Bart Van Assche wrote: On 6/12/19 6:59 AM, Marc Gonzalez wrote: According to the option's help message, SCSI_PROC_FS has been superseded for ~15 years. Don't select it by default anymore. Signed-off-by: Marc Gonzalez --- drivers/scsi/Kconfig | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..8c95e9ad6470 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,11 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS -default y ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Hi Doug, If I run grep "/proc/scsi" over the sg3_utils source code then grep reports 38 matches for that string. Does sg3_utils break with SCSI_PROC_FS=n? First, the sg driver. If placing #undef CONFIG_SCSI_PROC_FS prior to the includes in sg.c is a valid way to test that then the answer is no. Ah, but you are talking about sg3_utils . Or are you? For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Those 6 are all for direct IO (see below) which is off by default. 
I suspect old scanning utilities like sg_scan and sg_map might also use /proc/scsi/* . That is one reason why I wrote lsscsi. However I can't force folks to use lsscsi. As a related example, I still get bug reports for sginfo which I inherited from Eric Youngdale. If I was asked to debug a problem with the sg driver in a system without CONFIG_SCSI_PROC_FS defined, I would decline. The absence of /proc/scsi/sg/debug would be my issue. Can this be set up to do the same thing: cat /sys/class/scsi_generic/debug Is that breaking any sysfs rules? Also folks who rely on this to work:

cat /proc/scsi/sg/devices
0	0	0	0	0	1	255	0	1
0	0	0	1	0	1	255	0	1
0	0	0	2	0	1	255	0	1

would be disappointed. Further I note that setting allow_dio via /proc/scsi/sg/allow_dio can also be done via /sys/module/sg/allow_dio. So that would be an interface breakage, but with an alternative. You can grep for /proc/scsi/ across all Debian packages: https://codesearch.debian.net/ This reveals that /proc/scsi/sg/ appears in smartmontools and other packages, for example. Hello everyone, Please note that I am _in no way_ suggesting that we remove any code. I just think it might be time to stop forcing CONFIG_SCSI_PROC_FS into every config, and instead require one to explicitly request the aging feature (which makes CONFIG_SCSI_PROC_FS show up in a defconfig). Maybe we could add CONFIG_SCSI_PROC_FS to arch/x86/configs/foo ? (For which foo? In a separate patch or squashed with this one?) Marc, Since current sg driver usage seems to depend more on SCSI_PROC_FS being "y" than other parts of the SCSI subsystem then if SCSI_PROC_FS is to default to "n" in the future then a new CONFIG_SG_PROC_FS variable could be added. If CONFIG_CHR_DEV_SG is "*" or "m" then default CONFIG_SG_PROC_FS to "y"; if CONFIG_SCSI_PROC_FS is "y" then default CONFIG_SG_PROC_FS to "y"; else default CONFIG_SG_PROC_FS to "n". Obviously the sg driver would need to be changed to use CONFIG_SG_PROC_FS instead of CONFIG_SCSI_PROC_FS . 
Does that defeat the whole purpose of your proposal or could it be seen as a partial step in that direction? What is the motivation for this proposal? Doug Gilbert BTW We still have the non-sg related 'cat /proc/scsi/scsi' usage and 'cat /proc/scsi/device_info'. And I believe the latter one is writable even though its permissions say otherwise.
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-17 5:11 p.m., Bart Van Assche wrote: On 6/12/19 6:59 AM, Marc Gonzalez wrote: According to the option's help message, SCSI_PROC_FS has been superseded for ~15 years. Don't select it by default anymore. Signed-off-by: Marc Gonzalez --- drivers/scsi/Kconfig | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..8c95e9ad6470 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,11 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS - default y ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Hi Doug, If I run grep "/proc/scsi" over the sg3_utils source code then grep reports 38 matches for that string. Does sg3_utils break with SCSI_PROC_FS=n? First, the sg driver. If placing #undef CONFIG_SCSI_PROC_FS prior to the includes in sg.c is a valid way to test that then the answer is no. Ah, but you are talking about sg3_utils . Or are you? For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Those 6 are all for direct IO (see below) which is off by default. I suspect old scanning utilities like sg_scan and sg_map might also use /proc/scsi/* . That is one reason why I wrote lsscsi. 
However I can't force folks to use lsscsi. As a related example, I still get bug reports for sginfo which I inherited from Eric Youngdale. If I was asked to debug a problem with the sg driver in a system without CONFIG_SCSI_PROC_FS defined, I would decline. The absence of /proc/scsi/sg/debug would be my issue. Can this be set up to do the same thing: cat /sys/class/scsi_generic/debug ? Is that breaking any sysfs rules? Also folks who rely on this to work:

cat /proc/scsi/sg/devices
0	0	0	0	0	1	255	0	1
0	0	0	1	0	1	255	0	1
0	0	0	2	0	1	255	0	1

would be disappointed. Further I note that setting allow_dio via /proc/scsi/sg/allow_dio can also be done via /sys/module/sg/allow_dio. So that would be an interface breakage, but with an alternative. Doug Gilbert
Re: [PATCH] sg: Fix a double-fetch bug in drivers/scsi/sg.c
On 2019-06-05 2:00 a.m., Jiri Slaby wrote: On 23. 05. 19, 4:38, Gen Zhang wrote: In sg_write(), the opcode of the command is fetched the first time from the userspace by __get_user(). Then the whole command, the opcode included, is fetched again from userspace by __copy_from_user(). However, a malicious user can change the opcode between the two fetches. This can cause inconsistent data and potential errors as cmnd is used in the following codes. Thus we should check opcode between the two fetches to prevent this. Signed-off-by: Gen Zhang --- diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index d3f1531..a2971b8 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -694,6 +694,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; + if (opcode != cmnd[0]) + return -EINVAL; Isn't it too early to check cmnd which is copied only here: if (__copy_from_user(cmnd, buf, cmd_size)) return -EFAULT; /* --- Hi, Yes, it is too early. It needs to be after that __copy_from_user(cmnd, buf, cmd_size) call. To put this in context, this is a very old interface; dating from 1992 and deprecated for almost 20 years. The fact that the first byte of the SCSI cdb needs to be read first to work out that size of the following SCSI command and optionally the offset of a data-out buffer that may follow the command; is one reason why that interface was replaced. Also the implementation did not handle SCSI variable length cdb_s. Then there is the question of whether this double-fetch is exploitable? I cannot think of an example, but there might be (e.g. turning a READ command into a WRITE). But the "double-fetch" issue may be more wide spread. The replacement interface passes the command and data-in/-out as pointers while their corresponding lengths are placed in the newer interface structure. 
This assumes that the cdb and data-out won't change in the user space between when the write(2) is called and before or while the driver, using those pointers, reads the data. All drivers that use pointers to pass data have this "feature". Also I'm looking at this particular double-fetch from the point of view of the driver rewrite I have done and is currently in the early stages of review [linux-scsi list: "[PATCH 00/19] sg: v4 interface, rq sharing + multiple rqs"] and this problem is more difficult to fix since the full cdb read is delayed to a common point further along the submit processing path. To detect a change in cbd[0] my current code would need to be altered to carry cdb[0] through to that common point. So is it worth it for such an old, deprecated and replaced interface?? What cdb/user_permissions checking that is done, is done _after_ the full cdb is read. So trying to get around a user exclusion of say WRITE(10) by first using the first byte of READ(10), won't succeed. Doug Gilbert
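The time-of-check/time-of-use shape under discussion, fetch one byte to size the command, then fetch the whole command again while another thread may have rewritten that byte, can be shown in a self-contained userspace model. As noted above, the re-check belongs after the second fetch. Names here are illustrative, not the sg.c code:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in for a user buffer that another thread could rewrite
 * between the driver's two fetches. */
static unsigned char user_buf[16];

/* Bounds-checked copy from the "user" buffer. */
static int fetch(void *dst, size_t off, size_t len)
{
    if (len > sizeof(user_buf) - off || off > sizeof(user_buf))
        return -EFAULT;
    memcpy(dst, user_buf + off, len);
    return 0;
}

/* First fetch reads only the opcode (needed to work out cmd_size);
 * second fetch reads the full command.  Comparing the opcode against
 * cmnd[0] *after* the second fetch detects a racing rewrite, which is
 * where the originally proposed check had to move to. */
static int submit_cmd(unsigned char *cmnd, size_t cmd_size)
{
    unsigned char opcode;

    if (fetch(&opcode, 0, 1))
        return -EFAULT;
    if (fetch(cmnd, 0, cmd_size))
        return -EFAULT;
    if (opcode != cmnd[0])
        return -EINVAL;     /* opcode changed between the fetches */
    return 0;
}
```

In a single-threaded test the race cannot fire, so the -EINVAL path is exercised only under a concurrent writer; what can be tested directly is that both fetches stay inside the buffer.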
Re: [PATCH] scsi: ses: Fix out-of-bounds memory access in ses_enclosure_data_process()
On 2019-05-20 12:05 p.m., Martin K. Petersen wrote: James, Please. What I'm interested in is whether this is simply a bug in the array firmware, in which case the fix is sufficient, or whether there's some problem with the parser, like mismatched expectations over added trailing nulls or something. Our support folks have been looking at this for a while. We have seen problems with devices from several vendors. To the extent that I gave up the idea of blacklisting all of them. I am collecting "bad" SES pages from these devices. I have added support for RECEIVE DIAGNOSTICS to scsi_debug and added a bunch of deliberately broken SES pages so we could debug this Patches ?? It appears to be very common for devices to return inconsistent or invalid data. So pretty much all of the ses.c parsing needs to have sanity checking heuristics added to prevent KASAN hiccups. And it is not just SES device implementations that were broken. The relationship between Additional Element Status diagnostic page (dpage) and the Enclosure Status dpage was under-specified in SES-2 and that led to the EIIOE field being introduced during the SES-3 revisions. And the meaning of EIIOE was tweaked several times *** before SES-3 was standardized. Anyone interested in the adventures of EIIOE can see the code of sg_ses.c in sg3_utils. The sg_ses utility is many times more complex than anything else in the sg3_utils package. And that complexity led me to suspect that the Linux SES driver was broken. It should be 3 or 4 times larger than it is! It simply doesn't do enough checking. So yes Martin, you are on the right track. Doug Gilbert BTW the NVME Management Interface folks have decided to use SES-3 for NVME enclosure management rather than invent their own can of worms :-) *** For example EIIOE started life as a 1 bit field, but two cases wasn't enough, so it became a 2 bit field and now uses all four possibilities.
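The kind of sanity checking being argued for can be sketched generically: treat every length field in a received diagnostic page as hostile and clamp it against the bytes actually transferred, so a short or lying page produces an error instead of an out-of-bounds read. The descriptor layout below is a hypothetical TLV, not the real SES page format:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Walk descriptors laid out as [type:1][len:1][payload:len]...
 * Every advance is validated against 'avail', the byte count the
 * device actually returned, which may be less than it claimed. */
static int count_descriptors(const uint8_t *page, size_t avail)
{
    size_t off = 0;
    int n = 0;

    while (off < avail) {
        if (avail - off < 2)
            return -EINVAL;              /* truncated header */
        size_t len = page[off + 1];
        if (len > avail - off - 2)
            return -EINVAL;              /* descriptor overruns page */
        off += 2 + len;
        n++;
    }
    return n;
}
```

The point is structural: parsers of device-supplied pages need a bound check before every dereference, which is why ses.c "should be 3 or 4 times larger than it is".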
Re: [PATCH 21/24] sg: switch to SPDX tags
On 2019-05-01 6:14 p.m., Christoph Hellwig wrote: Use the the GPLv2+ SPDX tag instead of verbose boilerplate text. IOWs replace 3.5 lines with 1. Signed-off-by: Christoph Hellwig Acked-by: Douglas Gilbert --- drivers/scsi/sg.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index d3f15319b9b3..bcdc28e5ede7 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * History: * Started: Aug 9 by Lawrence Foard (entr...@world.std.com), @@ -8,12 +9,6 @@ *Copyright (C) 1992 Lawrence Foard * Version 2 and 3 extensions to driver: *Copyright (C) 1998 - 2014 Douglas Gilbert - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2, or (at your option) - * any later version. - * */ static int sg_version_num = 30536; /* 2 digits for each component */
Re: Recent removal of bsg read/write support
Updated reply, see below. On 2018-09-03 4:34 a.m., Dror Levin wrote: On Sun, Sep 2, 2018 at 8:55 PM Linus Torvalds wrote: On Sun, Sep 2, 2018 at 4:44 AM Richard Weinberger wrote: CC'ing relevant people. Otherwise your mail might get lost. Indeed. Sorry for that. On Sun, Sep 2, 2018 at 1:37 PM Dror Levin wrote: We have an internal tool that uses the bsg read/write interface to issue SCSI commands as part of a test suite for a storage device. After recently reading on LWN that this interface is to be removed we tried porting our code to use sg instead. However, that raises new issues - mainly getting ENOMEM over iSCSI for unknown reasons. Is there any chance that you can make more data available? Sure, I can try. We use writev() to send up to SG_MAX_QUEUE tasks at a time. Occasionally not all tasks are written at which point we wait for tasks to return before sending more, but then writev() fails with ENOMEM and we see this in the syslog: Sep 1 20:58:14 gdc-qa-io-017 kernel: sd 441:0:0:5: [sg73] sg_common_write: start_req err=-12 Failing tasks are reads of 128KiB. This is the block layer running out of resources. The sg driver is a relatively thin shim and when it gets a "no can do" from the layers below it, the driver has little option than to return said errno. I'd rather fix the sg interface (which while also broken garbage, we can't get rid of) than re-surrect the bsg interface. That said, the removed bsg code looks a hell of a lot prettier than the nasty sg interface code does, although it also lacks ansolutely _any_ kind of security checking. For us the bsg interface also has several advantages over sg: 1. The device name is its HCTL which is nicer than an arbitrary integer. Not much the sg driver can do about that. The minor number the sg driver uses and HCT are all arbitrary integers (with the L coming from the storage device), but I agree the HCTL is more widely used. The ioctl(, SG_GET_SCSI_ID) fills a structure which includes HCTL. 
In my sg v4 driver rewrite the L (LUN) has been tweaked to additionally send back the 8 byte T10 LUN representation. The lsscsi utility will show the relationship between HCTL and sg driver device name with 'lsscsi -g'. It uses sysfs datamining. 2. write() supports writing more than one sg_io_v4 struct so we don't have to resort to writev(). In my sg v4 rewrite the sg_io_v4 interface can only be sent through ioctl(SG_IO) [for sync usage] and ioctl(SG_IOSUBMIT) [for async usage]. So it can't be sent through write(2). SG_IOSUBMIT is new and uses the _IOWR macro which encodes the expected length into the SG_IOSUBMIT value and that is the size of sg_io_v4. So you can't send an arbitrary number of sg_io_v4 objects through that ioctl directly. If need be, that can be cured with another level of indirection (e.g. with a new flag the data-out can be interpreted as an array sg_io_v4 objects). 3. Queue size is the device's queue depth and not SG_MAX_QUEUE which is 16. That limit is gone in the sg v4 driver rewrite. Because of this we would like to continue using the bsg interface, even if some changes are required to meet security concerns. I wonder if we could at least try to unify the bsg/sg code - possibly by making sg use the prettier bsg code (but definitely have to add all the security measures). And dammit, the SCSI people need to get their heads out of their arses. This whole "stream random commands over read/write" needs to go the f*ck away. Could we perhaps extend the SG_IO interace to have an async mode? Instead of "read/write", have "SG_IOSUBMIT" and "SG_IORECEIVE" and have the SG_IO ioctl just be a shorthand of "both". Done. Just my two cents - having an interface other than read/write won't allow users to treat this fd as a regular file with epoll() and read(). This is a major bonus for this interface - an sg/bsg device can be used just like a socket or pipe in any reactor (we use boost asio for example). 
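The size-encoding point above rests on how Linux builds ioctl numbers: an _IOWR() value packs the expected argument size into the number itself, so the kernel can reject a mismatched payload. Below is a self-contained rendition of that encoding; the bit layout follows asm-generic/ioctl.h (a few architectures differ), and the struct and the 'S'/0x41 magic are stand-ins of my own, not the real sg_io_v4 or SG_IOSUBMIT definition:

```c
#include <assert.h>
#include <stdint.h>

/* Generic Linux ioctl number layout: nr(8) | type(8) | size(14) | dir(2). */
#define IOC_NRSHIFT   0
#define IOC_TYPESHIFT 8
#define IOC_SIZESHIFT 16
#define IOC_DIRSHIFT  30
#define IOC_WRITE     1u
#define IOC_READ      2u

/* _IOWR analogue: direction read+write, with sizeof(argtype) baked in. */
#define MY_IOWR(type, nr, argtype) \
    (((IOC_READ | IOC_WRITE) << IOC_DIRSHIFT) | \
     ((uint32_t)(type) << IOC_TYPESHIFT) | \
     ((uint32_t)(nr) << IOC_NRSHIFT) | \
     ((uint32_t)sizeof(argtype) << IOC_SIZESHIFT))

/* _IOC_SIZE analogue: recover the encoded struct size. */
#define IOC_SIZE(cmd) (((cmd) >> IOC_SIZESHIFT) & 0x3fffu)

/* Stand-in for struct sg_io_v4 (the real one is in <linux/bsg.h>). */
struct toy_io_v4 { uint64_t guard_and_proto; uint64_t ptrs[6]; };

#define TOY_IOSUBMIT MY_IOWR('S', 0x41, struct toy_io_v4)
```

This is also why "an arbitrary number of sg_io_v4 objects" cannot be pushed through such an ioctl directly: the number admits exactly one struct's worth of data, and anything more needs the extra level of indirection mentioned above.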
Well poll() certainly works (see sg3_utils beta rev 809 testing/sgs_dd.c and testing/sgh_dd.c) and I can't see why epoll() won't work. These calls work against the file descriptor and the sg driver keeps the same context around sg device file descriptors as it has always done. [And that is the major design flaw in the bsg driver: it doesn't keep proper file descriptor context.] It is the security folks who don't like the sg inspired (there in lk 1.0.0 from 1992) write(2)/read(2) asynchronous interface. Also, ideally we need two streams: one for metadata (e.g. commands and responses (status and sense data)) and another for user data. Protection information could be a third stream, between the other two. Jamming that all into one stream is a bit ugly. References: sg v3 driver rewrite, description and downloads: http://sg.danny.cz/sg/sg_v40.html sg3_utils version 1.45 beta, rev 809, link at the top of this page: http://sg.danny.cz/sg Doug Gilbert
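The readiness model Doug is pointing at is the ordinary poll()/epoll() contract: a reactor waits for POLLIN on the fd and only then collects a completion (with read(2) or, in the v4 driver, ioctl(SG_IORECEIVE)). A minimal demonstration on a pipe, since any pollable fd behaves the same way from the reactor's side (the helper name is my own):

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Block up to timeout_ms until fd is readable.
 * Returns 1 when ready, 0 on timeout, negative on error. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);

    if (r <= 0)
        return r;                       /* timeout or error */
    return (p.revents & POLLIN) ? 1 : 0;
}
```

Because the sg driver keeps its per-fd context, each open fd is an independent completion stream to the reactor, which is the design point being contrasted with bsg here.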
Re: [ANNOUNCE] v4 sg driver: ready for testing
There is an update to the SCSI Generic (sg) v4 driver adding synchronous and asynchronous bidi command support. Plus lots of fixes and some minor improvements. See: http://sg.danny.cz/sg/sg_v40.html The kernel code is split in two in the tarball below, one targeting lk 5.0 and the other targeting lk 4.20 and earlier ***. Each section contains the 3 files that represent the sg v4 driver plus a meandering 17 part patchset. Those patchsets reflect the driver's rewrite rather than a logical progression. http://sg.danny.cz/sg/p/sgv4_20190116.tgz Plus there are updated testing utilities in sg3_utils-1.45 (beta, revision 807) at the top of this page: http://sg.danny.cz/sg/index.html Doug Gilbert *** the reason for the split is the tree wide change to the access_ok() function. On 2018-12-25 2:39 a.m., Douglas Gilbert wrote: There is an update to the sg v4 driver with some error fixes, SIGIO and RT signals work plus single READ, multiple WRITE sharing support. See: http://sg.danny.cz/sg/sg_v40.html with testing utilities in sg3_utils-1.45 (beta, revision 802) on the main page: http://sg.danny.cz/sg/index.html Doug Gilbert On 2018-12-18 6:41 p.m., Douglas Gilbert wrote: After an underwhelming response to my intermediate level patchsets to modernize the sg driver in October this year (see "[PATCH 0/8] sg: major cleanup, remove max_queue limit" followed by v2 and v3 between 20181019 and 20181028), I decided to move ahead and add the functionality proposed for the version 4 sg driver. That means accepting interface objects of type 'struct sg_io_v4' (as found in include/uapi/linux/bsg) plus two new ioctls: SG_IOSUBMIT and SG_IORECEIVE as proposed by Linus Torvalds to replace the unloved write(2)/read(2) asynchronous interface . There is a new feature called "sharing" explained in the web page (see below). 
Yes, there is a patchset available (14 part and growing) but even without explanatory comments at the top of each patch, that patchset is 4 times larger than the v4 sg driver (i.e. the finished product) and over 6 times larger than the original v3 sg driver! Part of the reason for the patchset size is the multiple backtracks and rewrites associated with a real development process. The cleanest patchset would have 3 parts: 1) split the current include/scsi/sg.h into the end product headers: include/uapi/scsi/sg.h and include/scsi/sg.h 2) delete drivers/scsi/sg.c 3) add the v4 drivers/scsi/sg.c After part 2) you could build a kernel and I can guarantee that no-one will be able to find any sg driver bugs but some users might get upset (but not the Linux security folks). So there is a working v4 sg driver discussed here, with a download: http://sg.danny.cz/sg/sg_v40.html I will keep that page up to date while the driver is in this phase. There is a sg3_utils beta of 1.45 (revision 799) package in the News section at the top of the main page: http://sg.danny.cz/sg/index.html That sg3_utils beta package will use the v4 sg interface via sg devices if the v4 driver is detected. There are also three test utilities in the 'testing' directory designed to exercise the v4 extensions. The degree of backward compatibility with the v3 driver should be high but there are limits to backward compatibility. As an example, it is possible that there are user apps that depend on hitting the 16 outstanding command limit (per fd) in the v3 driver and go "wild" when v4 removes that ceiling. If so, a "high_v3_compat" driver option could be added to put that ceiling back. The only way to find out is for folks to try and if there is a failure, contact me, or send mail to this list. Code reviews welcome as well. Doug Gilbert I felt this was a better use of my time than trying to invent a new debug/trace mechanism for the whole SCSI subsystem. 
That is what _SCSI_ system maintainers are for, I'll stick to the sg driver (and scsi_debug). Add user space tools and there is more than enough work there ...
Re: [PATCH v2] rbtree: fix the red root
On 2019-01-14 12:58 p.m., Qian Cai wrote: Unfortunately, I could not trigger any of those here on both bare-metal and virtual machines. All I triggered were hung tasks and soft-lockup due to fork bomb. The only other thing I can think of is to set up kdump to capture a vmcore when either GPF or BUG() happens, and then share the vmcore somewhere, so I might poke around to see what the memory corruption looks like. Another question that I forgot to ask, what type of device is /dev/sg0 ? On a prior occasion (KASAN, throw spaghetti ...) it was a SATA device and the problem was in libata. Doug Gilbert
Re: [PATCH] scsi: wd719x Replace GFP_KERNEL with GFP_ATOMIC in wd719x_chip_init
On 2019-01-14 10:29 a.m., Christoph Hellwig wrote: On Mon, Jan 14, 2019 at 11:24:49PM +0800, wangbo wrote: wd719x_host_reset gets the spinlock first then calls wd719x_chip_init, so replace GFP_KERNEL with GFP_ATOMIC in wd719x_chip_init. Please move the allocation outside the lock instead. GFP_ATOMIC DMA allocations are generally a bad idea and should be avoided where we can. More importantly we should never actually trigger the allocation under the lock, as fw_virt will always be set already in that case. So I think you can safely move the request firmware + allocation + memcpy from wd719x_chip_init to wd719x_board_found, but I'd rather have Ondrej review that plan. Further to this, the result of holding a lock (probably with _irqsave() tacked onto it) during a GFP_KERNEL allocation is a message like this in the log: hrtimer: interrupt took 1084 ns It is not always easy to find since it is a "_once" message. The sg v3 driver (the one in production) produces these. I have been able to stamp them out by taking care in the sg v4 driver (in testing) around allocations. It also meant adding a new state in my state machine to fend off "bad things" happening to that object while it is unlocked. So there may be a cost to dropping the lock. Doug Gilbert
Re: [PATCH v2] rbtree: fix the red root
On 2019-01-13 10:59 p.m., Esme wrote: ‐‐‐ Original Message ‐‐‐ On Sunday, January 13, 2019 10:52 PM, Douglas Gilbert wrote: On 2019-01-13 10:07 p.m., Esme wrote: ‐‐‐ Original Message ‐‐‐ On Sunday, January 13, 2019 9:33 PM, Qian Cai c...@lca.pw wrote: On 1/13/19 9:20 PM, David Lechner wrote: On 1/11/19 8:58 PM, Michel Lespinasse wrote: On Fri, Jan 11, 2019 at 3:47 PM David Lechner da...@lechnology.com wrote: On 1/11/19 2:58 PM, Qian Cai wrote: A GPF was reported, kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] SMP KASAN kasan_die_handler.cold.22+0x11/0x31 notifier_call_chain+0x17b/0x390 atomic_notifier_call_chain+0xa7/0x1b0 notify_die+0x1be/0x2e0 do_general_protection+0x13e/0x330 general_protection+0x1e/0x30 rb_insert_color+0x189/0x1480 create_object+0x785/0xca0 kmemleak_alloc+0x2f/0x50 kmem_cache_alloc+0x1b9/0x3c0 getname_flags+0xdb/0x5d0 getname+0x1e/0x20 do_sys_open+0x3a1/0x7d0 __x64_sys_open+0x7e/0xc0 do_syscall_64+0x1b3/0x820 entry_SYSCALL_64_after_hwframe+0x49/0xbe It turned out, gparent = rb_red_parent(parent); tmp = gparent->rb_right; <-- GPF was triggered here. Apparently, "gparent" is NULL which indicates "parent" is rbtree's root which is red. Otherwise, it will be treated properly a few lines above. /* * If there is a black parent, we are done. * Otherwise, take some corrective action as, * per 4), we don't want a red root or two * consecutive red nodes. */ if(rb_is_black(parent)) break; Hence, it violates the rule #1 (the root can't be red) and need a fix up, and also add a regression test for it. This looks like was introduced by 6d58452dc06 where it no longer always paint the root as black. 
Fixes: 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only when necessary) Reported-by: Esme espl...@protonmail.ch Tested-by: Joey Pabalinas joeypabali...@gmail.com Signed-off-by: Qian Cai c...@lca.pw Tested-by: David Lechner da...@lechnology.com FWIW, this fixed the following crash for me: Unable to handle kernel NULL pointer dereference at virtual address 0004 Just to clarify, do you have a way to reproduce this crash without the fix ? I am starting to suspect that my crash was caused by some new code in the drm-misc-next tree that might be causing a memory corruption. It threw me off that the stack trace didn't contain anything related to drm. See: https://patchwork.freedesktop.org/patch/276719/ It may be useful for those who could reproduce this issue to turn on those memory corruption debug options to narrow down a bit. CONFIG_DEBUG_PAGEALLOC=y CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y CONFIG_KASAN=y CONFIG_KASAN_GENERIC=y CONFIG_SLUB_DEBUG_ON=y I have been on SLAB, I configured SLAB DEBUG with a fresh pull from github. Linux syzkaller 5.0.0-rc2 #9 SMP Sun Jan 13 21:57:40 EST 2019 x86_64 ... In an effort to get a different stack into the kernel, I felt that nothing works better than fork bomb? :) Let me know if that helps. root@syzkaller:~# gcc -o test3 test3.c root@syzkaller:~# while : ; do ./test3 & done And is test3 the same multi-threaded program that enters the kernel via /dev/sg0 and then calls SCSI_IOCTL_SEND_COMMAND which goes to the SCSI mid-level and thence to the block layer? And please remind me, does it also fail on lk 4.20.2 ? Doug Gilbert Yes, the same C repro from the earlier thread. It was a 4.20.0 kernel where it was first detected. I can move to 4.20.2 and see if that changes anything. Hi, I don't think there is any need to check lk 4.20.2 (as it would be very surprising if it didn't also have this "feature"). More interesting might be: has "test3" been run on lk 4.19 or any earlier kernel? Doug Gilbert
Re: [PATCH] scsi: associate bio write hint with WRITE CDB
On 2019-01-03 4:47 a.m., Randall Huang wrote: On Wed, Jan 02, 2019 at 11:51:33PM -0800, Christoph Hellwig wrote: On Wed, Dec 26, 2018 at 12:15:04PM +0800, Randall Huang wrote: In SPC-3, WRITE(10)/(16) support the grouping function. Let's associate bio write hint with group number for enabling StreamID or Turbo Write feature. Signed-off-by: Randall Huang --- drivers/scsi/sd.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 4b49cb67617e..28bfa9ed2b54 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1201,7 +1201,12 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) SCpnt->cmnd[11] = (unsigned char) (this_count >> 16) & 0xff; SCpnt->cmnd[12] = (unsigned char) (this_count >> 8) & 0xff; SCpnt->cmnd[13] = (unsigned char) this_count & 0xff; - SCpnt->cmnd[14] = SCpnt->cmnd[15] = 0; + if (rq_data_dir(rq) == WRITE) { + SCpnt->cmnd[14] = rq->bio->bi_write_hint & 0x3f; + } else { + SCpnt->cmnd[14] = 0; + } No need for braces here. Already sent a new version But what I'm more worried about is devices not recognizing the feature throwing up in the field. Can you check what SBC version first references these or come up with some other decently smart conditional? My reference is SCSI Block Commands – 3 (SBC-3) Revision 25. Section 5.32 WRITE (10) and 5.34 WRITE (16) Maybe Martin has a good idea, too. That is the GROUP NUMBER field. Also found in READ(16) at the same location within its cdb. The proposed code deserves at least an explanatory comment. Since it is relatively recent, perhaps the above should only be done iff: - the REPORT SUPPORTED OPERATION CODES (RSOC) command is supported, and - in the RSOC entry for WRITE(16), the CDB USAGE DATA field (a bit mask) indicates the GROUP NUMBER field is supported That check can be done once, at disk attachment time where there is already code to fetch RSOC. Is there a bi_read_hint? If not then the bi_write_hint should also be applied to READ(16). 
Makes that variable naming look pretty silly though. Doug Gilbert
Re: [PATCH] scsi: avoid a double-fetch and a redundant copy
On 2018-12-25 3:15 p.m., Kangjie Lu wrote: What we need is only "pack_id", so do not create a heap object or copy the whole object in. The fix efficiently copies "pack_id" only. Now this looks like a worthwhile optimization, in some pretty tricky code. I can't see a security angle in it. Did you test it? Well the code as presented doesn't compile and the management takes a dim view of that. Signed-off-by: Kangjie Lu --- drivers/scsi/sg.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index c6ad00703c5b..4dacbfffd113 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -446,16 +446,8 @@ sg_read(struct file *filp, char __user *buf, size_t count, loff_t * ppos) } if (old_hdr->reply_len < 0) { if (count >= SZ_SG_IO_HDR) { - sg_io_hdr_t *new_hdr; - new_hdr = kmalloc(SZ_SG_IO_HDR, GFP_KERNEL); - if (!new_hdr) { - retval = -ENOMEM; - goto free_old_hdr; - } - retval =__copy_from_user - (new_hdr, buf, SZ_SG_IO_HDR); - req_pack_id = new_hdr->pack_id; - kfree(new_hdr); + retval = get_user(req_pack_id, + &((sg_io_hdr_t *)buf->pack_id)); The '->' binds more tightly than the cast and since buf is a 'char *' it doesn't have a member called pack_id. Hopefully your drive to remove redundancy went a little too far and removed the required (but missing) parentheses binding the cast to 'buf'. if (retval) { retval = -EFAULT; goto free_old_hdr; Good work, silly mistake, but it's got me thinking, the heap allocation can be replaced by stack since it's short. The code in this area is more tricky in the v4 driver because I want to specifically exclude the sg_io_v4 (aka v4) interface being sent through write(2)/read(2). The way to do that is to read the first 32 bit integer which should be 'S' for v3, 'Q' for v4. Hmm, just looking further along my mailer I see the kbuild test robot has picked up the error and you have presented another patch which also won't compile. 
Please stop doing that; apply your patch to kernel source and compile it _before_ sending it to this list. Doug Gilbert
Re: [PATCH] scsi: fix a double-fetch bug in sg_write
On 2018-12-25 3:24 p.m., Kangjie Lu wrote: "opcode" has been copied in from user space and checked. We should not copy it in again, as it may have been modified by malicious multi-threaded user programs through race conditions. The fix uses the opcode fetched in the first copy. Signed-off-by: Kangjie Lu Acked-by: Douglas Gilbert Also applied to my sg v4 driver code. The v1 and v2 interfaces (based on struct sg_header) did not provide a command length field. The sg driver needed to read the first byte of the command (the "opcode") to determine the full command's length prior to actually reading it in full. Hard to think of an example of an exploit based on this double read. --- drivers/scsi/sg.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 4dacbfffd113..41774e4f9508 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -686,7 +686,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; - if (__copy_from_user(cmnd, buf, cmd_size)) + cmnd[0] = opcode; + if (__copy_from_user(cmnd + 1, buf + 1, cmd_size - 1)) return -EFAULT; /* * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV,
Re: remove exofs, the T10 OSD code and block/scsi bidi support V3
On 2018-12-19 9:43 a.m., Christoph Hellwig wrote: On Mon, Nov 26, 2018 at 07:11:10PM +0200, Boaz Harrosh wrote: On 11/11/18 15:32, Christoph Hellwig wrote: The only real user of the T10 OSD protocol, the pNFS object layout driver never went to the point of having shipping products, and we removed it 1.5 years ago. Exofs is just a simple example without real life users. You have failed to say what is your motivation for this patchset? What is it you are trying to fix/improve. Drop basically unused support, which allows us to 1) reduce the size of every kernel with block layer support, and even more for every kernel with scsi support By proposing the removal of bidi support from the block layer, it isn't just the SCSI subsystem that will be impacted. In those NVMe documents that you referred me to earlier in the year, have you noticed the 2 bit direction field in the command tables in 1.3c and earlier, and what 11b means? Even if there aren't any bidi NVMe commands *** yet, the fact remains that NVMe's 64 byte command format has provision for 4 (not 2) independent data transfers (data + meta, for each direction). Surely NVMe will sooner or later take advantage of those ... a command like READ GATHERED comes to mind. 2) reduce the size of the critical struct request structure by 128 bits, thus reducing the memory used by every blk-mq driver significantly, never mind the cache effects Hmm, one pointer (that is null in the non-bidi case) should be enough, that's 64 or 32 bits. 3) stop having the maintainance overhead for this code in the block layer, which has been rather painful at times You won't get any sympathy from me :-) The sg driver is trying to inject _SCSI_ commands into the SCSI mid-level for onward processing by SCSI LLDs. So WTF does it have to deal with the block layer. 
While on the subject of bidi, the order of transfers: is the data-out (to the target) always before the data-in or is it the target device that decides (depending on the semantics of the command) who is first? Doug Gilbert *** there could already be vendor specific bidi NVMe commands out there (ditto for SCSI)
Re: Recent removal of bsg read/write support
On 2018-09-03 10:34 AM, Dror Levin wrote: On Sun, Sep 2, 2018 at 8:55 PM Linus Torvalds wrote: On Sun, Sep 2, 2018 at 4:44 AM Richard Weinberger wrote: CC'ing relevant people. Otherwise your mail might get lost. Indeed. Sorry for that. On Sun, Sep 2, 2018 at 1:37 PM Dror Levin wrote: We have an internal tool that uses the bsg read/write interface to issue SCSI commands as part of a test suite for a storage device. After recently reading on LWN that this interface is to be removed we tried porting our code to use sg instead. However, that raises new issues - mainly getting ENOMEM over iSCSI for unknown reasons. Is there any chance that you can make more data available? Sure, I can try. We use writev() to send up to SG_MAX_QUEUE tasks at a time. Occasionally not all tasks are written at which point we wait for tasks to return before sending more, but then writev() fails with ENOMEM and we see this in the syslog: Sep 1 20:58:14 gdc-qa-io-017 kernel: sd 441:0:0:5: [sg73] sg_common_write: start_req err=-12 Failing tasks are reads of 128KiB. I'd rather fix the sg interface (which while also broken garbage, we can't get rid of) than resurrect the bsg interface. That said, the removed bsg code looks a hell of a lot prettier than the nasty sg interface code does, although it also lacks absolutely _any_ kind of security checking. For us the bsg interface also has several advantages over sg: 1. The device name is its HCTL which is nicer than an arbitrary integer. 2. write() supports writing more than one sg_io_v4 struct so we don't have to resort to writev(). 3. Queue size is the device's queue depth and not SG_MAX_QUEUE which is 16. Because of this we would like to continue using the bsg interface, even if some changes are required to meet security concerns. I wonder if we could at least try to unify the bsg/sg code - possibly by making sg use the prettier bsg code (but definitely have to add all the security measures). 
And dammit, the SCSI people need to get their heads out of their arses. This whole "stream random commands over read/write" needs to go the f*ck away. Could we perhaps extend the SG_IO interface to have an async mode? Instead of "read/write", have "SG_IOSUBMIT" and "SG_IORECEIVE" and have the SG_IO ioctl just be a shorthand of "both". Just my two cents - having an interface other than read/write won't allow users to treat this fd as a regular file with epoll() and read(). This is a major bonus for this interface - an sg/bsg device can be used just like a socket or pipe in any reactor (we use boost asio for example). The advantage of having two ioctls is that they can both pass (meta-)data bidirectionally. That is hard to do with standard read() and write() calls. The command tag is the piece of meta-data that goes against the flow: returned from SG_IOSUBMIT, optionally given to SG_IORECEIVE (which might have a 'cancel command' flag). The sg v1, v2 and v3 interfaces could keep their write()/read() interfaces for backward compatibility (to Linux 1.0.0, March 1994 for sg v1). New, clean submit and receive paths could be added to the sg driver for the v3 and v4 twin ioctl interface. Previously the sg v4 interface was only supported by the bsg driver. One advantage of sg v4 over v3 is support for bidi commands. Not sure if epoll/poll works with an ioctl, if not we could add a "dummy" read() call that notionally returned SCSI status. The SG_IORECEIVE ioctl would still be needed to "clean up" the command, and optionally transfer the data-in buffer. Tony Battersby has also requested twin ioctls saying that it is extremely tedious ploughing through logs full of SG_IO calls and that clearly separating submits from receives would make things somewhat better. Doug Gilbert
Re: 4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Re: 4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
On 2018-08-30 02:03 PM, Ulf Hansson wrote:
On 28 August 2018 at 23:47, Douglas Gilbert wrote:

I usually boot my Lenovo X270 with a SD card in its:

# lspci
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
...

In lk 4.19.0-rc1 the boot locks up solid, almost immediately and nothing in the logs. If I remove the SD card my machine boots and works okay until I insert the SD card. Then:

Aug 28 23:30:38 xtwo70 kernel: mmc0: cannot verify signal voltage switch
Aug 28 23:30:38 xtwo70 kernel: mmc0: new ultra high speed SDR104 SDXC card at address
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: mmc0: ACLCE 59.5 GiB
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: p1 p2
Aug 28 23:30:38 xtwo70 kernel: rtsx_pci_sdmmc rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Aug 28 23:30:38 xtwo70 kernel: BUG: unable to handle kernel NULL pointer dereference at 0018
Aug 28 23:30:38 xtwo70 kernel: PGD 0 P4D 0
Aug 28 23:30:38 xtwo70 kernel: Oops: [#1] SMP
Aug 28 23:30:38 xtwo70 kernel: CPU: 3 PID: 1571 Comm: kworker/3:2 Not tainted 4.19.0-rc1 #78
Aug 28 23:30:38 xtwo70 kernel: Hardware name: LENOVO 20HNCTO1WW/20HNCTO1WW, BIOS R0IET53W (1.31 ) 05/22/2018
Aug 28 23:30:38 xtwo70 kernel: Workqueue: events sd_request [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: RIP: 0010:rtsx_pci_dma_transfer+0x6e/0x260 [rtsx_pci]
Aug 28 23:30:38 xtwo70 kernel: Code: 49 89 fe 45 89 c5 c7 87 90 00 00 00 00 00 00 00 8d 6a ff 81 c9 00 00 00 88 31 d2 41 89 cc 45 31 ff eb 07 41 8b 96 90 00 00 00 <8b> 78 18 48 63 ca 31 f6 44 39 fd 48 8b 50 10 40 0f 94 c6 41 83 c7
Aug 28 23:30:38 xtwo70 kernel: RSP: 0018:c9217d78 EFLAGS: 00010202
Aug 28 23:30:38 xtwo70 kernel: RAX: RBX: 0003 RCX:
Aug 28 23:30:38 xtwo70 kernel: RDX: 0001 RSI: 0021 RDI: 8801b6328000
Aug 28 23:30:38 xtwo70 kernel: RBP: 0002 R08: 880036000400 R09:
Aug 28 23:30:38 xtwo70 kernel: R10: R11: R12: a800
Aug 28 23:30:38 xtwo70 kernel: R13: 2710 R14: 88021fd35400 R15: 0001
Aug 28 23:30:38 xtwo70 kernel: FS: () GS:88022738() knlGS:
Aug 28 23:30:38 xtwo70 kernel: CS: 0010 DS: ES: CR0: 80050033
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018 CR3: 0400f006 CR4: 003606e0
Aug 28 23:30:38 xtwo70 kernel: Call Trace:
Aug 28 23:30:38 xtwo70 kernel: ? mark_held_locks+0x50/0x80
Aug 28 23:30:38 xtwo70 kernel: ? _raw_spin_unlock_irqrestore+0x2d/0x40
Aug 28 23:30:38 xtwo70 kernel: sd_request+0x385/0x81a [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: process_one_work+0x287/0x5e0
Aug 28 23:30:38 xtwo70 kernel: worker_thread+0x28/0x3d0
Aug 28 23:30:38 xtwo70 kernel: ? process_one_work+0x5e0/0x5e0
Aug 28 23:30:38 xtwo70 kernel: kthread+0x10e/0x130
Aug 28 23:30:38 xtwo70 kernel: ? kthread_park+0x80/0x80
Aug 28 23:30:38 xtwo70 kernel: ret_from_fork+0x3a/0x50
Aug 28 23:30:38 xtwo70 kernel: Modules linked in: mmc_block fuse msr bnep ccm btusb btrtl btbcm btintel bluetooth squashfs ecdh_generic binfmt_misc intel_rapl nls_iso8859_1 nls_cp437 x86_pkg_temp_thermal vfat intel_powerclamp fat coretemp kvm_intel arc4 snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iwlmvm aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mac80211 intel_cstate intel_uncore intel_rapl_perf snd_hda_intel joydev snd_hda_codec mousedev snd_hwdep iwlwifi snd_hda_core input_leds efi_pstore snd_pcm serio_raw efivars cfg80211 rtsx_pci_ms memstick mei_me idma64 virt_dma mei intel_lpss_pci intel_lpss intel_pch_thermal thinkpad_acpi nvram snd_seq_dummy tps6598x snd_seq_oss typec snd_seq_midi snd_rawmidi snd_seq_midi_event
Aug 28 23:30:38 xtwo70 kernel: snd_seq snd_seq_device snd_timer snd soundcore rfkill tpm_crb tpm_tis tpm_tis_core tpm evdev mac_hid pcc_cpufreq ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc mmc_core i915 nvme e1000e i2c_algo_bit nvme_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm xhci_hcd video drm_panel_orientation_quirks usbcore intel_gtt agpgart usb_common rtsx_pci
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018
Aug 28 23:30:38 xtwo70 kernel: ---[ end trace bb8ce18072d22d51 ]---
Aug 28 23:30:38 xtwo70 dbus-daemon[2110]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostn
4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
I usually boot my Lenovo X270 with a SD card in its:

# lspci
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
...

In lk 4.19.0-rc1 the boot locks up solid, almost immediately and nothing in the logs. If I remove the SD card my machine boots and works okay until I insert the SD card. Then:

Aug 28 23:30:38 xtwo70 kernel: mmc0: cannot verify signal voltage switch
Aug 28 23:30:38 xtwo70 kernel: mmc0: new ultra high speed SDR104 SDXC card at address
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: mmc0: ACLCE 59.5 GiB
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: p1 p2
Aug 28 23:30:38 xtwo70 kernel: rtsx_pci_sdmmc rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Aug 28 23:30:38 xtwo70 kernel: BUG: unable to handle kernel NULL pointer dereference at 0018
Aug 28 23:30:38 xtwo70 kernel: PGD 0 P4D 0
Aug 28 23:30:38 xtwo70 kernel: Oops: [#1] SMP
Aug 28 23:30:38 xtwo70 kernel: CPU: 3 PID: 1571 Comm: kworker/3:2 Not tainted 4.19.0-rc1 #78
Aug 28 23:30:38 xtwo70 kernel: Hardware name: LENOVO 20HNCTO1WW/20HNCTO1WW, BIOS R0IET53W (1.31 ) 05/22/2018
Aug 28 23:30:38 xtwo70 kernel: Workqueue: events sd_request [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: RIP: 0010:rtsx_pci_dma_transfer+0x6e/0x260 [rtsx_pci]
Aug 28 23:30:38 xtwo70 kernel: Code: 49 89 fe 45 89 c5 c7 87 90 00 00 00 00 00 00 00 8d 6a ff 81 c9 00 00 00 88 31 d2 41 89 cc 45 31 ff eb 07 41 8b 96 90 00 00 00 <8b> 78 18 48 63 ca 31 f6 44 39 fd 48 8b 50 10 40 0f 94 c6 41 83 c7
Aug 28 23:30:38 xtwo70 kernel: RSP: 0018:c9217d78 EFLAGS: 00010202
Aug 28 23:30:38 xtwo70 kernel: RAX: RBX: 0003 RCX:
Aug 28 23:30:38 xtwo70 kernel: RDX: 0001 RSI: 0021 RDI: 8801b6328000
Aug 28 23:30:38 xtwo70 kernel: RBP: 0002 R08: 880036000400 R09:
Aug 28 23:30:38 xtwo70 kernel: R10: R11: R12: a800
Aug 28 23:30:38 xtwo70 kernel: R13: 2710 R14: 88021fd35400 R15: 0001
Aug 28 23:30:38 xtwo70 kernel: FS: () GS:88022738() knlGS:
Aug 28 23:30:38 xtwo70 kernel: CS: 0010 DS: ES: CR0: 80050033
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018 CR3: 0400f006 CR4: 003606e0
Aug 28 23:30:38 xtwo70 kernel: Call Trace:
Aug 28 23:30:38 xtwo70 kernel: ? mark_held_locks+0x50/0x80
Aug 28 23:30:38 xtwo70 kernel: ? _raw_spin_unlock_irqrestore+0x2d/0x40
Aug 28 23:30:38 xtwo70 kernel: sd_request+0x385/0x81a [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: process_one_work+0x287/0x5e0
Aug 28 23:30:38 xtwo70 kernel: worker_thread+0x28/0x3d0
Aug 28 23:30:38 xtwo70 kernel: ? process_one_work+0x5e0/0x5e0
Aug 28 23:30:38 xtwo70 kernel: kthread+0x10e/0x130
Aug 28 23:30:38 xtwo70 kernel: ? kthread_park+0x80/0x80
Aug 28 23:30:38 xtwo70 kernel: ret_from_fork+0x3a/0x50
Aug 28 23:30:38 xtwo70 kernel: Modules linked in: mmc_block fuse msr bnep ccm btusb btrtl btbcm btintel bluetooth squashfs ecdh_generic binfmt_misc intel_rapl nls_iso8859_1 nls_cp437 x86_pkg_temp_thermal vfat intel_powerclamp fat coretemp kvm_intel arc4 snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iwlmvm aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mac80211 intel_cstate intel_uncore intel_rapl_perf snd_hda_intel joydev snd_hda_codec mousedev snd_hwdep iwlwifi snd_hda_core input_leds efi_pstore snd_pcm serio_raw efivars cfg80211 rtsx_pci_ms memstick mei_me idma64 virt_dma mei intel_lpss_pci intel_lpss intel_pch_thermal thinkpad_acpi nvram snd_seq_dummy tps6598x snd_seq_oss typec snd_seq_midi snd_rawmidi snd_seq_midi_event
Aug 28 23:30:38 xtwo70 kernel: snd_seq snd_seq_device snd_timer snd soundcore rfkill tpm_crb tpm_tis tpm_tis_core tpm evdev mac_hid pcc_cpufreq ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc mmc_core i915 nvme e1000e i2c_algo_bit nvme_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm xhci_hcd video drm_panel_orientation_quirks usbcore intel_gtt agpgart usb_common rtsx_pci
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018
Aug 28 23:30:38 xtwo70 kernel: ---[ end trace bb8ce18072d22d51 ]---
Aug 28 23:30:38 xtwo70 dbus-daemon[2110]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.77' (uid=1000 pid=3518
Re: [PATCH] scsi: sg: fix a missing-check bug
On 2018-05-05 11:21 PM, Wenwen Wang wrote:

In sg_write(), the opcode of the command is first copied from the userspace pointer 'buf' and saved to the kernel variable 'opcode', using the __get_user() function. The size of the command, i.e. 'cmd_size', is then calculated based on the 'opcode'. After that, the whole command, including the opcode, is copied again from 'buf' using the __copy_from_user() function and saved to 'cmnd'. Finally, the function sg_common_write() is invoked to process 'cmnd'.

Given that the 'buf' pointer resides in userspace, a malicious userspace process can race to change the opcode of the command between the two copies. That means the opcode indicated by the variable 'opcode' could be different from the opcode in 'cmnd'. This can cause inconsistent data in 'cmnd' and potential logical errors in the function sg_common_write(), as it needs to work on 'cmnd'. This patch reuses the opcode obtained in the first copy and only copies the remaining part of the command from userspace.

Signed-off-by: Wenwen Wang
---
 drivers/scsi/sg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index c198b963..0ad8106 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -657,7 +657,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos)
 	hp->flags = input_size;	/* structure abuse ... */
 	hp->pack_id = old_hdr.pack_id;
 	hp->usr_ptr = NULL;
-	if (__copy_from_user(cmnd, buf, cmd_size))
+	cmnd[0] = opcode;
+	if (__copy_from_user(cmnd + 1, buf + 1, cmd_size - 1))
 		return -EFAULT;
 	/*
 	 * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV,

That is in the deprecated "v2" part of the sg driver (deprecated for around 15 years). There are lots more interesting races with that interface than the one described above.

My guess is that all system calls would be susceptible to a thread other than the one doing the system call playing around with a buffer being passed to or from the OS during that call. Surely no Unix-like OS gives any security guarantees to a thread being attacked by a malevolent thread in the same process! My question is: did this actually cause the program to fail, or is it something that a sanity checker flagged? Also, wouldn't it be better just to return an error such as EINVAL if opcode != cmnd[0]?

Doug Gilbert
Re: usercopy whitelist woe in scsi_sense_cache
On 2018-04-04 04:32 PM, Kees Cook wrote:

On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko wrote:

[ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)!

I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling smartctl in a loop and doing some usual background I/O. The warning is triggered within 3 minutes or so (not instantly).

Also: Can you send me your .config? What SCSI drivers are you using in the VM and on the real server? Are you able to see what ioctl()s smartctl is issuing? I'll try to reproduce this on my end...

smartctl -r scsiioctl,3
Re: usercopy whitelist woe in scsi_sense_cache
On 2018-04-04 04:21 PM, Kees Cook wrote:

On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko wrote:

With v4.16 I get the following dump while using smartctl:

[...]
[ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)!
[...]
[ 261.345976] Call Trace:
[ 261.350620] __check_object_size+0x130/0x1a0
[ 261.355775] sg_io+0x269/0x3f0
[ 261.360729] ? path_lookupat+0xaa/0x1f0
[ 261.364027] ? current_time+0x18/0x70
[ 261.366684] scsi_cmd_ioctl+0x257/0x410
[ 261.369871] ? xfs_bmapi_read+0x1c3/0x340 [xfs]
[ 261.372231] sd_ioctl+0xbf/0x1a0 [sd_mod]
[ 261.375456] blkdev_ioctl+0x8ca/0x990
[ 261.381156] ? read_null+0x10/0x10
[ 261.384984] block_ioctl+0x39/0x40
[ 261.388739] do_vfs_ioctl+0xa4/0x630
[ 261.392624] ? vfs_write+0x164/0x1a0
[ 261.396658] SyS_ioctl+0x74/0x80
[ 261.399563] do_syscall_64+0x74/0x190
[ 261.402685] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

This is:

sg_io+0x269/0x3f0:
blk_complete_sghdr_rq at block/scsi_ioctl.c:280
(inlined by) sg_io at block/scsi_ioctl.c:376

which is:

	if (req->sense_len && hdr->sbp) {
		int len = min((unsigned int) hdr->mx_sb_len, req->sense_len);

		if (!copy_to_user(hdr->sbp, req->sense, len))
			hdr->sb_len_wr = len;
		else
			ret = -EFAULT;
	}

[...]

I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling smartctl in a loop and doing some usual background I/O. The warning is triggered within 3 minutes or so (not instantly). Initially, it was produced on my server after a kernel update (because disks are monitored with smartctl via Zabbix). Looks like the thing was introduced with 0afe76e88c57d91ef5697720aed380a339e3df70. Any idea how to deal with this please? If needed, I can provide any additional info, and also I'm happy/ready to test any proposed patches.

Interesting, and a little confusing.

So, what's strange here is that the scsi_sense_cache already has a full whitelist:

	kmem_cache_create_usercopy("scsi_sense_cache",
		SCSI_SENSE_BUFFERSIZE, 0, SLAB_HWCACHE_ALIGN,
		0, SCSI_SENSE_BUFFERSIZE, NULL);

Arg 2 is the buffer size, arg 5 is the whitelist offset (0), and the whitelist size (same as arg 2). In other words, the entire buffer should be whitelisted. include/scsi/scsi_cmnd.h says:

	#define SCSI_SENSE_BUFFERSIZE 96

That means scsi_sense_cache should be 96 bytes in size? But a 22 byte read starting at offset 94 happened? That seems like a 20 byte read beyond the end of the SLUB object? Though if it were reading past the actual end of the object, I'd expect the hardened usercopy BUG (rather than the WARN) to kick in. Ah, it looks like /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes of actual allocation, so the 20 bytes doesn't strictly overlap another object (hence no BUG):

	/sys/kernel/slab/scsi_sense_cache# grep . object_size usersize slab_size
	object_size:96
	usersize:96
	slab_size:128

Ah, right, due to SLAB_HWCACHE_ALIGN, the allocation is rounded up to the next cache line size, so there's 32 bytes of padding to reach 128. James or Martin, is this over-read "expected" behavior? i.e. does the sense cache buffer usage ever pull the ugly trick of silently expanding its allocation into the space the slab allocator has given it? If not, this looks like a real bug. What I don't see is how req->sense is _not_ at offset 0 in the scsi_sense_cache object...

Looking at the smartctl SCSI code, it pulls 32 byte sense buffers. Can't see 22 anywhere relevant in its code. There are two types of sense: fixed and descriptor. With fixed you seldom need more than 18 bytes (but it can only represent 32 bit LBAs). The other type has a header and 0 or more variable length descriptors. If decoding of descriptor sense went wrong you might end up at offset 94. But not with smartctl.

Doug Gilbert
Re: [PATCH] scsi: resolve COMMAND_SIZE at compile time
On 2018-03-10 03:49 PM, James Bottomley wrote:

On Sat, 2018-03-10 at 14:29 +0100, Stephen Kitt wrote:

Hi Bart,

On Fri, 9 Mar 2018 22:47:12 +0000, Bart Van Assche wrote:

On Fri, 2018-03-09 at 23:33 +0100, Stephen Kitt wrote:

+/*
+ * SCSI command sizes are as follows, in bytes, for fixed size commands, per
+ * group: 6, 10, 10, 12, 16, 12, 10, 10. The top three bits of an opcode
+ * determine its group.
+ * The size table is encoded into a 32-bit value by subtracting each value
+ * from 16, resulting in a value of 1715488362
+ * (6 << 28 + 6 << 24 + 4 << 20 + 0 << 16 + 4 << 12 + 6 << 8 + 6 << 4 + 10).
+ * Command group 3 is reserved and should never be used.
+ */
+#define COMMAND_SIZE(opcode) \
+	(16 - (15 & (1715488362 >> (4 * (((opcode) >> 5) & 7)))))

To me this seems hard to read and hard to verify. Could this have been written as a combination of ternary expressions, e.g. using a gcc statement expression to ensure that opcode is evaluated once?

That’s what I’d tried initially, e.g.

#define COMMAND_SIZE(opcode) ({ \
	int index = ((opcode) >> 5) & 7; \
	index == 0 ? 6 : (index == 4 ? 16 : index == 3 || index == 5 ? 12 : 10); \
})

But gcc still reckons that results in a VLA, defeating the initial purpose of the exercise. Does it help if I make the magic value construction clearer?

#define SCSI_COMMAND_SIZE_TBL ( \
	(16 - 6) \
	+ ((16 - 10) << 4) \
	+ ((16 - 10) << 8) \
	+ ((16 - 12) << 12) \
	+ ((16 - 16) << 16) \
	+ ((16 - 12) << 20) \
	+ ((16 - 10) << 24) \
	+ ((16 - 10) << 28))

#define COMMAND_SIZE(opcode) \
	(16 - (15 & (SCSI_COMMAND_SIZE_TBL >> (4 * (((opcode) >> 5) & 7)))))

Couldn't we do the less clever thing of making the array a static const and moving it to a header? That way the compiler should be able to work it out at compile time.

And maybe add a comment that as of now (SPC-5 rev 19), COMMAND_SIZE is not valid for opcodes 0x7e and 0x7f plus everything above and including 0xc0. The latter ones are vendor specific and are loosely constrained, probably all even numbered lengths in the closed range [6,260].

If the SCSI command sets want to keep up with NVMe, they may want to think about how they can gainfully use cdb_s that are > 64 bytes long. WRITE SCATTERED got into SBC-4 but READ GATHERED didn't, due to lack of interest. The READ GATHERED proposed was a bidi command, but it could have been a simpler data-in command with a looong cdb (holding LBA, number_of_blocks pairs).

Doug Gilbert
Re: [PATCH] scsi: resolve COMMAND_SIZE at compile time
On 2018-03-10 03:49 PM, James Bottomley wrote: On Sat, 2018-03-10 at 14:29 +0100, Stephen Kitt wrote: Hi Bart, On Fri, 9 Mar 2018 22:47:12 +, Bart Van Assche wrote: On Fri, 2018-03-09 at 23:33 +0100, Stephen Kitt wrote: +/* + * SCSI command sizes are as follows, in bytes, for fixed size commands, per + * group: 6, 10, 10, 12, 16, 12, 10, 10. The top three bits of an opcode + * determine its group. + * The size table is encoded into a 32-bit value by subtracting each value + * from 16, resulting in a value of 1715488362 + * (6 << 28 + 6 << 24 + 4 << 20 + 0 << 16 + 4 << 12 + 6 << 8 + 6 << 4 + 10). + * Command group 3 is reserved and should never be used. + */ +#define COMMAND_SIZE(opcode) \ + (16 - (15 & (1715488362 >> (4 * (((opcode) >> 5) & 7) To me this seems hard to read and hard to verify. Could this have been written as a combination of ternary expressions, e.g. using a gcc statement expression to ensure that opcode is evaluated once? That’s what I’d tried initially, e.g. #define COMMAND_SIZE(opcode) ({ \ int index = ((opcode) >> 5) & 7; \ index == 0 ? 6 : (index == 4 ? 16 : index == 3 || index == 5 ? 12 : 10); \ }) But gcc still reckons that results in a VLA, defeating the initial purpose of the exercise. Does it help if I make the magic value construction clearer? #define SCSI_COMMAND_SIZE_TBL ( \ (16 - 6)\ + ((16 - 10) << 4) \ + ((16 - 10) << 8) \ + ((16 - 12) << 12) \ + ((16 - 16) << 16) \ + ((16 - 12) << 20) \ + ((16 - 10) << 24) \ + ((16 - 10) << 28)) #define COMMAND_SIZE(opcode) \ (16 - (15 & (SCSI_COMMAND_SIZE_TBL >> (4 * (((opcode) >> 5) & 7) Couldn't we do the less clever thing of making the array a static const and moving it to a header? That way the compiler should be able to work it out at compile time. And maybe add a comment that as of now (SPC-5 rev 19), COMMAND_SIZE is not valid for opcodes 0x7e and 0x7f plus everything above and including 0xc0. 
The latter ones are vendor specific and are loosely constrained, probably all even-numbered lengths in the closed range [6,260]. If the SCSI command sets want to keep up with NVMe, they may want to think about how they can gainfully use CDBs that are > 64 bytes long. WRITE SCATTERED got into SBC-4 but READ GATHERED didn't, due to lack of interest. The READ GATHERED proposed was a bidi command, but it could have been a simpler data-in command with a looong cdb (holding LBA, number_of_blocks pairs).

Doug Gilbert
Re: scsi: sg: assorted memory corruptions
On 2018-01-30 07:22 AM, Dmitry Vyukov wrote:
Uh, I've answered this a week ago, but did not notice that Doug dropped everybody from CC. Reporting to all.

On Mon, Jan 22, 2018 at 8:16 PM, Douglas Gilbert <dgilb...@interlog.com> wrote:
On 2018-01-22 02:06 PM, Dmitry Vyukov wrote:
On Mon, Jan 22, 2018 at 7:57 PM, Douglas Gilbert <dgilb...@interlog.com>

Please show me the output of 'lsscsi -g' on your test machine. /dev/sg0 is often associated with /dev/sda, which is often a SATA SSD (or a virtualized one) that holds the root file system. With the sg pass-through driver it is relatively easy to write random (user provided) data over the root file system, which will almost certainly "root" the system.

This is a pretty standard qemu vm started with:

qemu-system-x86_64 -hda wheezy.img -net user,host=10.0.2.10 -net nic -nographic -kernel arch/x86/boot/bzImage -append "console=ttyS0 root=/dev/sda earlyprintk=serial " -m 2G -smp 4

# lsscsi -g
[0:0:0:0]  disk  ATA  QEMU HARDDISK  0  /dev/sda  /dev/sg0

With lk 4.15.0-rc9 I can run your test program (with some additions, see attachment) for 30 minutes against a scsi_debug simulated disk. You can easily replicate this test: just run 'modprobe scsi_debug' and a third line should appear in your lsscsi output. The new device will most likely be /dev/sg2.

With lk 4.15.0 (release) running against a SAS SSD (SEAGATE ST200FM0073), the test has been running 20 minutes and counting without problems. That is using a LSI HBA with the mpt3sas driver.

[1:0:0:0]  cd/dvd  QEMU  QEMU DVD-ROM  2.0.  /dev/sr0  /dev/sg1
# readlink /sys/class/scsi_generic/sg0
../../devices/pci:00/:00:01.1/ata1/host0/target0:0:0/0:0:0:0/scsi_generic/sg0
# cat /sys/class/scsi_generic/sg0/device/vendor
ATA
^ That subsystem is the culprit IMO, most likely libata.

Until you can show this test failing on something other than an ATA disk, then I will treat this issue as closed.

Doug Gilbert

Perhaps it misbehaves when it gets a SCSI command in the T10 range (i.e.
not vendor specific) with a 9-byte cdb length. As far as I'm aware, T10 (and the ANSI committee before it) have never defined a cdb with an odd length.

For those that are not aware, the sg driver is a relatively thin shim over the block layer, the SCSI mid-level, and a low-level driver which may have another kernel driver stack underneath it (e.g. UAS (USB attached SCSI)). The previous report from syzkaller on the sg driver ("scsi: memory leak in sg_start_req") has resulted in one accepted patch on the block layer, with probably more to come in the same area.

Testing the patch Dmitry gave (with some added error checks which reported no problems) with the scsi_debug driver supplying /dev/sg0, I have not seen any problems running that test program. Again there might be a very slow memory leak, but if there is, I don't believe it is in the sg driver.

Did you run it in a loop? First runs pass just fine for me too.

Is thirty minutes long enough??

Yes, it certainly should be enough. Here is what I see:

# while ./a.out; do echo RUN; done
RUN
RUN
RUN
RUN
RUN
RUN
RUN
[  371.977266] ==
[  371.980158] BUG: KASAN: double-free or invalid-free in __put_task_struct+0x1e7/0x5c0

Here is the full execution trace of the write call if that will be of any help:
https://gist.githubusercontent.com/dvyukov/14ae64c3e753dedf9ab2608676ecf0b9/raw/9803d52bb1e317a9228e362236d042aaf0fa9d69/gistfile1.txt

This is on upstream commit 0d665e7b109d512b7cae3ccef6e8654714887844. Also attaching my config just in case.

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>

#define SG_NEXT_CMD_LEN 0x2283

static const char * usage = "sg_syzk_next_cdb <sg_device>  # (e.g. '/dev/sg3') ";

int main(int argc, const char * argv[])
{
    int res, err;
    int fd;
    long len = 9;
    char* p = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x47\x00\x00\x24\x00"
              "\x00\x00\x00\x00\x00\x1c\xbb\xac\x14\x00\xaa\xe0\x00\x00\x01"
              "\x00\x07\x07\x00\x00\x59\x08\x00\x00\x00\x80\xfe\x7f\x00\x00\x01";
    const char * dev_name;
    struct stat a_stat;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s\n", usage);
        return 1;
    }
    dev_name = argv[1];
    if (0 != stat(dev_name, &a_stat)) {
        err = errno;
        fprintf(stderr, "Unable to stat %s, err: %s\n", dev_name,
                strerror(err));
        return 1;
    }
    if ((a_stat.st_mode & S_IFMT) != S_IFCHR) {
        fprintf(stderr, "Expected %s, to be sg device\n", dev_name);
        return 1;
    }
    fd = open(dev_name, O_RDWR);
    if (fd < 0) {
        err = errno;
        fprintf(stderr, "open(%s) failed: %s [%d]\n", dev_name,
                strerror(err), err);
    }
    res = ioctl(fd, SG_NEXT_CMD