Re: KMSAN: kernel-infoleak in sg_scsi_ioctl
Hi, See below.

On 2021-04-12 9:02 a.m., Hao Sun wrote:

Hi,

When using Healer (https://github.com/SunHao-0/healer/tree/dev) to fuzz the Linux kernel, I found the following bug report.

commit: 4ebaab5fb428374552175aa39832abf5cedb916a
version: linux 5.12
git tree: kmsan
kernel config and full log can be found in the attached file.

=====================================================
BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
CPU: 2 PID: 23939 Comm: executor Not tainted 5.12.0-rc6+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x1ff/0x275 lib/dump_stack.c:120
 kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
 kmsan_internal_check_memory+0x48c/0x520 mm/kmsan/kmsan.c:437
 kmsan_copy_to_user+0x9c/0xb0 mm/kmsan/kmsan_hooks.c:249
 instrument_copy_to_user ./include/linux/instrumented.h:121 [inline]
 _copy_to_user+0x112/0x1d0 lib/usercopy.c:33
 copy_to_user ./include/linux/uaccess.h:209 [inline]
 sg_scsi_ioctl+0xfa9/0x1180 block/scsi_ioctl.c:507
 sg_ioctl_common+0x2713/0x4930 drivers/scsi/sg.c:1108
 sg_ioctl+0x166/0x2d0 drivers/scsi/sg.c:1162
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl+0x2c2/0x400 fs/ioctl.c:739
 __x64_sys_ioctl+0x4a/0x70 fs/ioctl.c:739
 do_syscall_64+0xa2/0x120 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x47338d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fe31ab90c58 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 0059c128 RCX: 0047338d
RDX: 2040 RSI: 0001 RDI: 0003
RBP: 004e8e5d R08: R09: R10:
R11: 0246 R12: 0059c128 R13: 7ffe2284af2f
R14: 7ffe2284b0d0 R15: 7fe31ab90dc0
Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 kmsan_memcpy_memmove_metadata+0x25b/0x290 mm/kmsan/kmsan.c:226
 kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:246
 __msan_memcpy+0x46/0x60 mm/kmsan/kmsan_instr.c:110
 bio_copy_kern_endio_read+0x3ee/0x560 block/blk-map.c:443
 bio_endio+0xa1a/0xcc0 block/bio.c:1453
 req_bio_endio block/blk-core.c:265 [inline]
 blk_update_request+0xd4f/0x2190 block/blk-core.c:1456
 scsi_end_request+0x111/0xc50 drivers/scsi/scsi_lib.c:570
 scsi_io_completion+0x276/0x2840 drivers/scsi/scsi_lib.c:970
 scsi_finish_command+0x6fc/0x720 drivers/scsi/scsi.c:214
 scsi_softirq_done+0x205/0xa40 drivers/scsi/scsi_lib.c:1450
 blk_complete_reqs block/blk-mq.c:576 [inline]
 blk_done_softirq+0x133/0x1e0 block/blk-mq.c:581
 __do_softirq+0x271/0x782 kernel/softirq.c:345

Uninit was created at:
 kmsan_save_stack_with_flags+0x3c/0x90
 kmsan_alloc_page+0xc4/0x1b0
 __alloc_pages_nodemask+0xdb0/0x54a0
 alloc_pages_current+0x671/0x990
 blk_rq_map_kern+0xb8e/0x1310
 sg_scsi_ioctl+0xc94/0x1180
 sg_ioctl_common+0x2713/0x4930
 sg_ioctl+0x166/0x2d0
 __se_sys_ioctl+0x2c2/0x400
 __x64_sys_ioctl+0x4a/0x70
 do_syscall_64+0xa2/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Byte 0 of 1 is uninitialized
Memory access of size 1 starts at 99e033fb9360
Data copied to user address 2048

The following system call sequence (Syzlang format) can reproduce the crash:

# {Threaded:false Collide:false Repeat:false RepeatTimes:0 Procs:1 Slowdown:1 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 Leak:false NetInjection:true NetDevices:true NetReset:false Cgroups:false BinfmtMisc:true CloseFDs:true KCSAN:false DevlinkPCI:true USB:true VhciInjection:true Wifi:true IEEE802154:true Sysctl:true UseTmpDir:true HandleSegv:true Repro:false Trace:false}
r0 = syz_open_dev$sg(&(0x7f00)='/dev/sg#\x00', 0x0, 0x2094b402)
ioctl$SG_GET_LOW_DMA(r0, 0x227a, &(0x7f40))
ioctl$SCSI_IOCTL_SEND_COMMAND(r0, 0x1, &(0x7f40)={0x0, 0x1, 0x1})

Since the code opens an sg device node, the sg driver, which is a pass-through driver, is invoked.
However, instead of using sg's pass-through facilities, that call to ioctl(SCSI_IOCTL_SEND_COMMAND) invokes the long-deprecated SCSI mid-level pass-through. So if there is an infoleak bug, you should flag sg_scsi_ioctl() in block/scsi_ioctl.c. See the notes associated with that function, which imply it cannot be protected from certain types of abuse due to its interface design; that is why it is deprecated. Also, the equivalent of root permissions is required to execute those functions.

That reproducer looks strange: ioctl(SG_GET_LOW_DMA) reads the host->unchecked_isa_dma value (now always 0 ??) into an int at 0x7f40. That same address is then used for the SCSI_IOCTL_SEND_COMMAND buffer.
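For context, the deprecated interface takes a single user buffer that mirrors the kernel's struct scsi_ioctl_command: two 32-bit lengths followed by the CDB (and any write data), with reply bytes copied back over the start of the same buffer, which is how an uninitialized reply byte can leak straight to user space. A userspace sketch of building that buffer follows; the struct and field names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the legacy SCSI_IOCTL_SEND_COMMAND buffer layout
 * (cf. struct scsi_ioctl_command): inlen/outlen words, then the CDB.
 * The fixed-size cdb[] array is for illustration only. */
struct legacy_send_cmd {
	uint32_t inlen;		/* bytes of write data following the CDB */
	uint32_t outlen;	/* reply bytes copied back over the buffer */
	uint8_t  cdb[16];	/* SCSI command descriptor block */
};

/* Build the buffer shape used by the reproducer: no write data and a
 * single requested reply byte -- the byte KMSAN flagged as uninitialized.
 * Returns the number of meaningful bytes assuming a 6-byte CDB group. */
static size_t build_legacy_buf(struct legacy_send_cmd *p, uint8_t opcode)
{
	memset(p, 0, sizeof(*p));
	p->inlen = 0;
	p->outlen = 1;
	p->cdb[0] = opcode;
	return offsetof(struct legacy_send_cmd, cdb) + 6;
}
```

The same pointer is then handed to ioctl(fd, SCSI_IOCTL_SEND_COMMAND, buf); after the call the first outlen bytes of the buffer hold the reply data.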
Re: [scsi_debug] 20b58d1e6b: blktests.block.001.fail
On 2021-03-23 9:26 a.m., kernel test robot wrote:

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 20b58d1e6b9cda142cd142a0a2f94c0d04b0a5a0 ("[RFC] scsi_debug: add hosts initialization --> worker")
url: https://github.com/0day-ci/linux/commits/Douglas-Gilbert/scsi_debug-add-hosts-initialization-worker/20210319-230817
base: https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git for-next

in testcase: blktests
version: blktests-x86_64-a210761-1_20210124
with following parameters:
 disk: 1SSD
 test: block-group-00
 ucode: 0xe2

on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

This RFC was proposed for Luis Chamberlain to consider for this report:
https://bugzilla.kernel.org/show_bug.cgi?id=212337

Luis predicted that this change would trip up some blktests, which is exactly what has happened here. The question here is whether it is reasonable (i.e. a correct simulation of what real hardware does) to assume that, as soon as the loading of scsi_debug is complete, _all_ LUNs (devices) specified in its parameters are ready for media access. If yes, then this RFC can be dropped or relegated to only occur when a driver parameter is set to a non-default value. If no, then those blktest scripts need to be fixed to reflect that after an HBA is loaded, all the targets and LUNs connected to it do _not_ immediately become available.

Doug Gilbert

If you fix the issue, kindly add following tag
Reported-by: kernel test robot

2021-03-21 02:40:23 sed "s:^:block/:" /lkp/benchmarks/blktests/tests/block-group-00
2021-03-21 02:40:23 ./check block/001
block/001 (stress device hotplugging)
block/001 (stress device hotplugging) [failed]
runtime ...
30.370s
--- tests/block/001.out 2021-01-24 06:04:08.0 +
+++ /lkp/benchmarks/blktests/results/nodev/block/001.out.bad 2021-03-21 02:40:53.652003261 +
@@ -1,4 +1,7 @@
 Running block/001
 Stressing sd
+ls: cannot access '/sys/class/scsi_device/4:0:0:0/device/block': No such file or directory
+ls: cannot access '/sys/class/scsi_device/5:0:0:0/device/block': No such file or directory
 Stressing sr
+ls: cannot access '/sys/class/scsi_device/4:0:0:0/device/block': No such file or directory
 Test complete

To reproduce:

 git clone https://github.com/intel/lkp-tests.git
 cd lkp-tests
 bin/lkp install job.yaml   # job file is attached in this email
 bin/lkp split-job --compatible job.yaml
 bin/lkp run compatible-job.yaml

---
0DAY/LKP+ Test Infrastructure
Open Source Technology Center
https://lists.01.org/hyperkitty/list/l...@lists.01.org
Intel Corporation

Thanks,
Oliver Sang
Re: [syzbot] KASAN: invalid-free in sg_finish_scsi_blk_rq
On 2021-03-15 9:59 p.m., syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit: d98f554b Add linux-next specific files for 20210312
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1189318ad0
kernel config: https://syzkaller.appspot.com/x/.config?x=e362835d2e58cef6
dashboard link: https://syzkaller.appspot.com/bug?extid=0a0e8ecea895d38332e6

Unfortunately, I don't have any reproducer for this issue yet.

No need, I think I can see how it happens. A particular type of resource error from the block layer, together with a 32 byte (or larger) SCSI command. I'm testing a patch.

Doug Gilbert

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0a0e8ecea895d3833...@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: double-free or invalid-free in slab_free mm/slub.c:3161 [inline]
BUG: KASAN: double-free or invalid-free in kfree+0xe5/0x7f0 mm/slub.c:4215
CPU: 0 PID: 10481 Comm: syz-executor.5 Not tainted 5.12.0-rc2-next-20210312-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x141/0x1d7 lib/dump_stack.c:120
 print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
 kasan_report_invalid_free+0x51/0x80 mm/kasan/report.c:357
 kasan_slab_free mm/kasan/common.c:340 [inline]
 __kasan_slab_free+0x118/0x130 mm/kasan/common.c:367
 kasan_slab_free include/linux/kasan.h:200 [inline]
 slab_free_hook mm/slub.c:1562 [inline]
 slab_free_freelist_hook+0x92/0x210 mm/slub.c:1600
 slab_free mm/slub.c:3161 [inline]
 kfree+0xe5/0x7f0 mm/slub.c:4215
 scsi_req_free_cmd include/scsi/scsi_request.h:28 [inline]
 sg_finish_scsi_blk_rq+0x690/0x810 drivers/scsi/sg.c:3224
 sg_common_write+0xa07/0xe70 drivers/scsi/sg.c:1132
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x465f69
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:7f8413efa188 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 0056bf60 RCX: 00465f69
RDX: 20001780 RSI: 2285 RDI: 0003
RBP: 004bfa8f R08: R09: R10:
R11: 0246 R12: 0056bf60 R13: 7ffe20e16e2f
R14: 7f8413efa300 R15: 00022000

Allocated by task 10481:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:427 [inline]
 kasan_kmalloc mm/kasan/common.c:506 [inline]
 kasan_kmalloc mm/kasan/common.c:465 [inline]
 __kasan_kmalloc+0x99/0xc0 mm/kasan/common.c:515
 kmalloc include/linux/slab.h:561 [inline]
 kzalloc include/linux/slab.h:686 [inline]
 sg_start_req+0x16f/0x24e0 drivers/scsi/sg.c:3044
 sg_common_write+0x5fd/0xe70 drivers/scsi/sg.c:1109
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 10481:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
 kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:357
 kasan_slab_free mm/kasan/common.c:360 [inline]
 kasan_slab_free mm/kasan/common.c:325 [inline]
 __kasan_slab_free+0xf5/0x130 mm/kasan/common.c:367
 kasan_slab_free include/linux/kasan.h:200 [inline]
 slab_free_hook mm/slub.c:1562 [inline]
 slab_free_freelist_hook+0x92/0x210 mm/slub.c:1600
 slab_free mm/slub.c:3161 [inline]
 kfree+0xe5/0x7f0 mm/slub.c:4215
 sg_start_req+0x1b33/0x24e0 drivers/scsi/sg.c:3106
 sg_common_write+0x5fd/0xe70 drivers/scsi/sg.c:1109
 sg_v3_submit+0x3b1/0x530 drivers/scsi/sg.c:797
 sg_ctl_sg_io drivers/scsi/sg.c:1785 [inline]
 sg_ioctl_common+0x3c86/0x97f0 drivers/scsi/sg.c:2014
 sg_ioctl+0x7c/0x110 drivers/scsi/sg.c:2229
 vfs_ioctl fs/ioctl.c:48 [inline]
 __do_sys_ioctl fs/ioctl.c:753 [inline]
 __se_sys_ioctl fs/ioctl.c:739 [inline]
 __x64_sys_ioctl+0x193/0x200
Re: [PATCH][next] scsi: sg: return -ENOMEM on out of memory error
On 2021-03-11 6:33 p.m., Colin King wrote:

From: Colin Ian King

The sg_proc_seq_show_debug() function should return -ENOMEM on an out of memory error rather than -1. Fix this.

Fixes: 94cda6cf2e44 ("scsi: sg: Rework debug info")
Signed-off-by: Colin Ian King

Acked-by: Douglas Gilbert

Thanks.
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 79f05afa4407..85e86cbc6891 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -4353,7 +4353,7 @@ sg_proc_seq_show_debug(struct seq_file *s, void *v)
 	if (!bp) {
 		seq_printf(s, "%s: Unable to allocate %d on heap, finish\n",
 			   __func__, bp_len);
-		return -1;
+		return -ENOMEM;
 	}
 	read_lock_irqsave(&sg_index_lock, iflags);
 	sdp = it ? sg_lookup_dev(it->index) : NULL;
Re: [PATCH][next] scsi: sg: Fix use of pointer sfp after it has been kfree'd
On 2021-03-11 5:37 a.m., Colin King wrote:

From: Colin Ian King

Currently SG_LOG is referencing sfp after it has been kfree'd, which is probably a bad thing to do. Fix this by kfree'ing sfp after SG_LOG.

Addresses-Coverity: ("Use after free")
Fixes: af1fc95db445 ("scsi: sg: Replace rq array with xarray")
Signed-off-by: Colin Ian King

Acked-by: Douglas Gilbert

Thanks.
---
 drivers/scsi/sg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 2d4bbc1a1727..79f05afa4407 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -3799,10 +3799,10 @@ sg_add_sfp(struct sg_device *sdp)
 	if (rbuf_len > 0) {
 		srp = sg_build_reserve(sfp, rbuf_len);
 		if (IS_ERR(srp)) {
-			kfree(sfp);
 			err = PTR_ERR(srp);
 			SG_LOG(1, sfp, "%s: build reserve err=%ld\n",
 			       __func__, -err);
+			kfree(sfp);
 			return ERR_PTR(err);
 		}
 		if (srp->sgat_h.buflen < rbuf_len) {
Re: linux-next: build failure after merge of the scsi-mkp tree
On 2021-01-27 2:01 a.m., Stephen Rothwell wrote:

Hi all,

On Mon, 25 Jan 2021 00:53:59 -0500 Douglas Gilbert wrote:

On 2021-01-24 11:13 p.m., Stephen Rothwell wrote:

After merging the scsi-mkp tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2908:4: error: expected '}' before 'else'
 2908 |    else
      |    ^~~~
drivers/scsi/sg.c:2902:16: warning: unused variable 'cptp' [-Wunused-variable]
 2902 |    const char *cptp = "pack_id=";
      |                ^~~~
drivers/scsi/sg.c:2896:5: error: label 'good' used but not defined
 2896 |     goto good;
      |     ^~~~
drivers/scsi/sg.c: At top level:
drivers/scsi/sg.c:2913:2: error: expected identifier or '(' before 'return'
 2913 |  return NULL;
      |  ^~
drivers/scsi/sg.c:2914:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before ':' token
 2914 | good:
      |     ^
drivers/scsi/sg.c:2917:2: error: expected identifier or '(' before 'return'
 2917 |  return srp;
      |  ^~
drivers/scsi/sg.c:2918:1: error: expected identifier or '(' before '}' token
 2918 | }
      | ^
drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2912:2: error: control reaches end of non-void function [-Werror=return-type]
 2912 | }
      | ^

Caused by commit 7323ad3618b6 ("scsi: sg: Replace rq array with xarray"). SG_LOG() degenerates to "{}" in some configs ...

I have used the scsi-mkp tree from next-20210122 for today.

I sent a new patchset to the linux-scsi list about 4 hours ago to fix that.

Doug Gilbert

I am still getting this build failure.

Hi,

I resent the original patch set, with fixes, to the linux-scsi list yesterday, but that was not the form that Martin Petersen needs it in. That was against his 5.12/scsi-queue branch, which is roughly lk 5.11.0-rc2. He has referred me to his 5.12/scsi-staging branch, which looks half applied from the 45 patch set that I have been sending to the linux-scsi list. Trying to find out if that was the intention or a mistake.
The other issue is a large patchset that removes the first function argument from blk_execute_rq_nowait(), which is used by the sg driver.

Doug Gilbert
Re: linux-next: build failure after merge of the scsi-mkp tree
On 2021-01-24 11:13 p.m., Stephen Rothwell wrote:

Hi all,

After merging the scsi-mkp tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2908:4: error: expected '}' before 'else'
 2908 |    else
      |    ^~~~
drivers/scsi/sg.c:2902:16: warning: unused variable 'cptp' [-Wunused-variable]
 2902 |    const char *cptp = "pack_id=";
      |                ^~~~
drivers/scsi/sg.c:2896:5: error: label 'good' used but not defined
 2896 |     goto good;
      |     ^~~~
drivers/scsi/sg.c: At top level:
drivers/scsi/sg.c:2913:2: error: expected identifier or '(' before 'return'
 2913 |  return NULL;
      |  ^~
drivers/scsi/sg.c:2914:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before ':' token
 2914 | good:
      |     ^
drivers/scsi/sg.c:2917:2: error: expected identifier or '(' before 'return'
 2917 |  return srp;
      |  ^~
drivers/scsi/sg.c:2918:1: error: expected identifier or '(' before '}' token
 2918 | }
      | ^
drivers/scsi/sg.c: In function 'sg_find_srp_by_id':
drivers/scsi/sg.c:2912:2: error: control reaches end of non-void function [-Werror=return-type]
 2912 | }
      | ^

Caused by commit 7323ad3618b6 ("scsi: sg: Replace rq array with xarray"). SG_LOG() degenerates to "{}" in some configs ...

I have used the scsi-mkp tree from next-20210122 for today.

Hi,

I sent a new patchset to the linux-scsi list about 4 hours ago to fix that.

Doug Gilbert
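The "SG_LOG() degenerates to '{}'" failure mode above can be reproduced in isolation: when a logging macro conditionally expands to an empty braced block, `if (x) SG_LOG(...); else ...` becomes `if (x) {}; else ...`, and the stray semicolon detaches the `else`, producing exactly the "expected '}' before 'else'" cascade seen in the build log. The conventional fix is the do { } while (0) idiom. A minimal sketch (illustrative only, not the sg driver's actual macro):

```c
#include <assert.h>

/* A logging macro that is compiled out. If it expanded to "{}", then
 * "SAFE_LOG(...);" below would terminate the if statement and the
 * following 'else' would fail to parse -- the error cascade seen in
 * sg_find_srp_by_id(). Expanding to do { } while (0) instead swallows
 * the trailing ';' and keeps the if/else pairing intact. */
#define SAFE_LOG(fmt) do { } while (0)

static int classify(int x)
{
	if (x > 0)
		SAFE_LOG("positive");	/* with a bare "{}" expansion, 'else' breaks */
	else
		return -1;
	return 1;
}
```

The same reasoning explains why the compiler reported a dangling label and top-level returns: once the if/else parse failed, the function body was cut short and the remaining statements fell outside any function.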
[PATCH 0/3] scatterlist: sgl-sgl ops: copy, equal
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example, the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. This patchset adds sgl_copy_sgl(), sgl_equal_sgl() and sgl_memset().

Changes since v6 [posted 20210118]:
 - restarted with new patchset name; was "scatterlist: add new capabilities"
 - drop correction patch "sgl_alloc_order: remove 4 GiB limit, sgl_free() warning"; could be sent separately as a fix
 - rename sgl_compare_sgl() to sgl_equal_sgl() and the helper to sgl_equal_sgl_idx()

Changes since v5 [posted 20201228]:
 - incorporate review requests from Jason Gunthorpe
 - replace integer overflow detection code in sgl_alloc_order() with a pre-condition statement
 - rebase on lk 5.11.0-rc4

Changes since v4 [posted 20201105]:
 - rebase on lk 5.10.0-rc2

Changes since v3 [posted 20201019]:
 - re-instate check on integer overflow of nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB
 - introduce sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare
 - add Reviewed-by tags from Bodo Stroesser
 - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
 - remove unneeded lines from sgl_memset() definition
 - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
 - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code
 - the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly", with a relatively small performance hit
 - the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned as it may be useful to the caller

This patchset is against lk 5.11.0-rc4.

Douglas Gilbert (3):
 scatterlist: add sgl_copy_sgl() function
 scatterlist: add sgl_equal_sgl() function
 scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  32 -
 lib/scatterlist.c           | 233
 2 files changed, 243 insertions(+), 22 deletions(-)

--
2.25.1
[PATCH 2/3] scatterlist: add sgl_equal_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage-related operation is to compare two sgl_s for equality. This new function is designed to partially implement NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command.

Like memcmp(), this function begins scanning at the start (of each sgl), returns false on the first miscompare, and stops comparing. The sgl_equal_sgl_idx() function additionally yields the index (i.e. byte position) of the first miscompare. Its additional parameter, miscompare_idx, is a pointer. If it is non-NULL and a miscompare is detected (i.e. the function returns false), then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to properly implement the SCSI COMPARE AND WRITE command.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |   8 +++
 lib/scatterlist.c           | 110
 2 files changed, 118 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 22111ee21383..40449ce96a18 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -324,6 +324,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
 		    size_t n_bytes);
 
+bool sgl_equal_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		   struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		   size_t n_bytes);
+
+bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		       size_t n_bytes, size_t *miscompare_idx);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 782bcfe72c60..a8672bc6d883 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1132,3 +1132,113 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_equal_sgl_idx - check if x and y (both sgl_s) compare equal, report
+ *		       index for first unequal bytes
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If return
+ *   is false and miscompare_idx is non-NULL, then index of first miscompared
+ *   byte written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		       size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
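When a segment does miscompare, memcmp() only reports inequality; translating that into the byte position written to *miscompare_idx requires a byte scan within the mismatching segment. A userspace sketch of that step, using plain buffers in place of the kernel's sg_mapping_iter segments:

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first byte that differs between x and y, or
 * len if the buffers are equal over the whole range. This mirrors the
 * step a function like sgl_equal_sgl_idx() must perform after memcmp()
 * flags a segment, to turn "not equal" into a byte position (which the
 * caller then offsets by the bytes already matched in earlier segments). */
static size_t first_miscompare(const unsigned char *x,
			       const unsigned char *y, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (x[i] != y[i])
			break;
	}
	return i;
}
```

A two-pass approach (fast memcmp() first, byte scan only on the rare miscompare) keeps the equal-path cost close to a plain comparison.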
[PATCH 3/3] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. Plus it returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h | 20 +-
 lib/scatterlist.c           | 79 +
 2 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 40449ce96a18..04be80d1a07c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -317,8 +317,6 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents,
 			    const void *buf, size_t buflen, off_t skip);
 size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 			  void *buf, size_t buflen, off_t skip);
-size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
-		      size_t buflen, off_t skip);
 
 size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
@@ -332,6 +330,24 @@ bool sgl_equal_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_
 		       struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		       size_t n_bytes, size_t *miscompare_idx);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
+/**
+ * sg_zero_buffer - Zero-out a part of a SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries
+ * @buflen: The number of bytes to zero out
+ * @skip: Number of bytes to skip before zeroing
+ *
+ * Returns the number of bytes zeroed.
+ **/
+static inline size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
+				    size_t buflen, off_t skip)
+{
+	return sgl_memset(sgl, nents, skip, 0, buflen);
+}
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a8672bc6d883..cb4d59111c78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1024,41 +1024,6 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 }
 EXPORT_SYMBOL(sg_pcopy_to_buffer);
 
-/**
- * sg_zero_buffer - Zero-out a part of a SG list
- * @sgl: The SG list
- * @nents: Number of SG entries
- * @buflen: The number of bytes to zero out
- * @skip: Number of bytes to skip before zeroing
- *
- * Returns the number of bytes zeroed.
- **/
-size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
-		      size_t buflen, off_t skip)
-{
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
-}
-EXPORT_SYMBOL(sg_zero_buffer);
-
 /**
  * sgl_copy_sgl - Copy over a destination sgl from a source sgl
@@ -1242,3 +1207,47 @@ bool sgl_equal_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip
 	return sgl_equal_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_equal_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
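The skip-then-fill walk that sgl_memset() performs can be sketched in userspace with plain {base, len} segments standing in for scatterlist entries and mapping iterators (illustrative only, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Plain-memory stand-in for one scatterlist entry. */
struct seg {
	unsigned char *base;
	size_t len;
};

/* Write 'val' to up to n_bytes bytes across the segment list, after
 * skipping 'skip' bytes, mirroring sgl_memset()'s semantics: stop when
 * either the segments or n_bytes are exhausted, and return the number
 * of bytes actually written. */
static size_t seg_memset(struct seg *segs, unsigned int nsegs, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t written = 0;
	unsigned int i;

	for (i = 0; i < nsegs && written < n_bytes; i++) {
		size_t off, len = segs[i].len;

		if (skip >= len) {	/* this whole segment is skipped */
			skip -= len;
			continue;
		}
		off = skip;		/* partial skip lands mid-segment */
		skip = 0;
		len -= off;
		if (len > n_bytes - written)
			len = n_bytes - written;
		memset(segs[i].base + off, val, len);
		written += len;
	}
	return written;
}
```

Passing SIZE_MAX as n_bytes reproduces the "fill to the end of the list" behaviour described in the kernel-doc notes, since the segment list is then always the exhausted side.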
[PATCH 1/3] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl-to-sgl operations in the kernel. Start with an sgl-to-sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 74 +
 2 files changed, 78 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f70572b2938..22111ee21383 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -320,6 +320,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..782bcfe72c60 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1058,3 +1058,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl: Destination sgl
+ * @d_nents: Number of SG entries in destination sgl
+ * @d_skip: Number of bytes to skip in destination before starting
+ * @s_sgl: Source sgl
+ * @s_nents: Number of SG entries in source sgl
+ * @s_skip: Number of bytes to skip in source before starting
+ * @n_bytes: The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is started
+ *   with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic()
+ *   [in sg_miter_stop()]. This means pre-emption is inhibited for relatively
+ *   short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents, then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example,
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments, facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ * + **/ +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes) +{ + size_t len; + size_t offset = 0; + struct sg_mapping_iter d_iter, s_iter; + + if (n_bytes == 0) + return 0; + sg_miter_start(_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG); + if (!sg_miter_skip(_iter, s_skip)) + goto fini; + if (!sg_miter_skip(_iter, d_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(_iter)) + break; + if (!sg_miter_next(_iter)) + break; + len = min3(d_iter.length, s_iter.length, n_bytes - offset); + + memcpy(d_iter.addr, s_iter.addr, len); + offset += len; + /* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */ + d_iter.consumed = len; + sg_miter_stop(_iter); + s_iter.consumed = len; + sg_miter_stop(_iter); + } +fini: + sg_miter_stop(_iter); + sg_miter_stop(_iter); + return offset; +} +EXPORT_SYMBOL(sgl_copy_sgl); -- 2.25.1
Re: [PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-18 6:48 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 10:22:56PM +0100, Bodo Stroesser wrote: On 18.01.21 21:24, Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 03:08:51PM -0500, Douglas Gilbert wrote: On 2021-01-18 1:28 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 11:30:03AM -0500, Douglas Gilbert wrote: After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). That doesn't help, the point of the overflow check is similar to overflow checks in kcalloc: to prevent the routine from allocating less memory than the caller might assume. For instance ipr_store_update_fw() uses request_firmware() (which is controlled by userspace) to drive the length argument to sgl_alloc_order(). If userspace gives too large a value this will corrupt kernel memory. So this math: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); But that check itself overflows if order is too large (e.g. 65). I don't really care about order. It is always controlled by the kernel and it is fine to just require it be low enough to not overflow. length is the data under userspace control so math on it must be checked for overflow. Also note there is another pre-condition statement in that function's definition, namely that length cannot be 0. I don't see callers checking for that either; if it is true length 0 can't be allowed, it should be blocked in the function. Jason. As already said, I also think there should be a check for length or rather nent overflow.
I like the easy to understand check in your proposed code: if (length >> (PAGE_SHIFT + order) >= UINT_MAX) return NULL; But I don't understand, why you open-coded the nent calculation: nent = length >> (PAGE_SHIFT + order); if (length & ((1ULL << (PAGE_SHIFT + order)) - 1)) nent++; It is necessary to properly check for overflow, because the easy to understand check doesn't prove that round_up will work, only that >> results in something that fits in an int and that +1 won't overflow the int. Wouldn't it be better to keep the original line instead: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); This can overflow inside the round_up To protect against the "unsigned long long" length being too big why not pick a large power of two and if someone can justify a larger value, they can send a patch. if (length > 64ULL * 1024 * 1024 * 1024) return NULL; So 64 GiB or a similar calculation involving PAGE_SIZE. Compiler does the multiplication and at run time there is only a 64 bit comparison. I tested 6 one GiB ramdisks on an 8 GiB machine, worked fine until firefox was started. Then came the OOM killer ... Doug Gilbert
Re: [PATCH v6 3/4] scatterlist: add sgl_compare_sgl() function
On 2021-01-18 6:27 p.m., David Disseldorp wrote: On Mon, 18 Jan 2021 11:30:05 -0500, Douglas Gilbert wrote: After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); This patch looks good and works fine as a replacement for compare_and_write_do_cmp(). One minor suggestion would be to name it sgl_equal() or similar, to perhaps better reflect the bool return and avoid memcmp() confusion. Either way: Reviewed-by: David Disseldorp Thanks.
NVMe calls the command that does this Compare and SCSI uses COMPARE AND WRITE (and VERIFY(BYTCHK=1) ) but "equal" is fine with me. There will be another patchset version (at least) so there is time to change. Do you want: - sgl_equal(...), or - sgl_equal_sgl(...) ? Doug Gilbert
Re: [PATCH] checkpatch: Improve TYPECAST_INT_CONSTANT test message
On 2021-01-18 12:19 p.m., Joe Perches wrote: Improve the TYPECAST_INT_CONSTANT test by showing the suggested conversion for various type of uses like (unsigned int)1 to 1U. The questionable code snippet was: unsigned int nent, nalloc; if (check_add_overflow(nent, (unsigned int)1, &nalloc)) where the check_add_overflow() macro [include/linux/overflow.h] uses typeof to check its first and second arguments have the same type. So it is likely others could meet this issue. Doug Gilbert Signed-off-by: Joe Perches --- Douglas Gilbert sent me a private email (and in that email said he 'loves to hate checkpatch' ;) complaining that checkpatch warned on the use of the cast of '(unsigned int)1' so make it more obvious why the message is emitted by always showing the suggested conversion. scripts/checkpatch.pl | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 016115a62a9f..4f8494527139 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -6527,18 +6527,18 @@ sub process { if ($line =~ /(\(\s*$C90_int_types\s*\)\s*)($Constant)\b/) { my $cast = $1; my $const = $2; + my $suffix = ""; + my $newconst = $const; + $newconst =~ s/${Int_type}$//; + $suffix .= 'U' if ($cast =~ /\bunsigned\b/); + if ($cast =~ /\blong\s+long\b/) { + $suffix .= 'LL'; + } elsif ($cast =~ /\blong\b/) { + $suffix .= 'L'; + } if (WARN("TYPECAST_INT_CONSTANT", -"Unnecessary typecast of c90 int constant\n" . $herecurr) && +"Unnecessary typecast of c90 int constant - '$cast$const' could be '$const$suffix'\n" . $herecurr) && $fix) { - my $suffix = ""; - my $newconst = $const; - $newconst =~ s/${Int_type}$//; - $suffix .= 'U' if ($cast =~ /\bunsigned\b/); - if ($cast =~ /\blong\s+long\b/) { - $suffix .= 'LL'; - } elsif ($cast =~ /\blong\b/) { - $suffix .= 'L'; - } $fixed[$fixlinenr] =~ s/\Q$cast\E$const\b/$newconst$suffix/; } }
Re: [PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-18 1:28 p.m., Jason Gunthorpe wrote: On Mon, Jan 18, 2021 at 11:30:03AM -0500, Douglas Gilbert wrote: After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). That doesn't help, the point of the overflow check is similar to overflow checks in kcalloc: to prevent the routine from allocating less memory than the caller might assume. For instance ipr_store_update_fw() uses request_firmware() (which is controlled by userspace) to drive the length argument to sgl_alloc_order(). If userspace gives too large a value this will corrupt kernel memory. So this math: nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); But that check itself overflows if order is too large (e.g. 65). A pre-condition says that the caller must know or check a value is sane, and if the user space can have a hand in the value passed the caller _must_ check pre-conditions IMO. A pre-condition also implies that the function's implementation will not have code to check the pre-condition. My "log of both sides" proposal at least got around the overflowing left shift problem. And one reviewer, Bodo Stroesser, liked it. Needs to be checked, add a precondition to order does not help. I already proposed a straightforward algorithm you can use. It does help, it stops your proposed check from being flawed :-) Giving a false sense of security seems more dangerous than a pre-condition statement IMO. Bart's original overflow check (in the mainline) limits length to 4 GiB (due to wrapping inside a 32 bit unsigned). Also note there is another pre-condition statement in that function's definition, namely that length cannot be 0. So perhaps you, Bart Van Assche and Bodo Stroesser, should compare notes and come up with a solution that you are _all_ happy with. The pre-condition works for me and is the fastest.
The 'length' argument might be large, say > 1 GB [I use 1 GB in testing but did try 4GB and found the bug I'm trying to fix] but having individual elements greater than say 32 MB each does not seem very practical (and fails on the systems that I test with). In my testing the largest element size is 4 MB. Doug Gilbert
[PATCH v6 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c06f8caaff91..e3182de753d0 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1131,3 +1131,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski return offset; } EXPORT_SYMBOL(sgl_copy_sgl); + +/** + * sgl_compare_sgl_idx - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents: Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents: Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes: The (maximum) number of bytes to compare + * @miscompare_idx: if return is false, index of first miscompare written + * to this pointer (if non-NULL). Value will be < n_bytes + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing). If return + * is false and miscompare_idx is non-NULL, then index of first miscompared + * byte written to *miscompare_idx. + * + * Notes: + * x and y are symmetrical: they can be swapped and the result is the same. + * + * Implementation is based on memcmp(). x and y segments may overlap. + * + * The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ * + **/ +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx) +{ + bool equ = true; + size_t len; + size_t offset = 0; + struct sg_mapping_iter x_iter, y_iter; + + if (n_bytes == 0) + return true; + sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + if (!sg_miter_skip(&x_iter, x_skip)) + goto fini; + if (!sg_miter_skip(&y_iter, y_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(&x_iter)) + break; + if (!sg_miter_next(&y_iter)) + break; + len = min3(x_iter.length, y_iter.length, n_bytes - offset); + + equ = !memcmp(x_iter.addr, y_iter.addr, len); + if (!equ) + goto fini; + offset += len; + /* LIFO order is important when SG_MITER_ATOMIC is used */ + y_iter.consumed = len; + sg_miter_stop(&y_iter); + x_iter.consumed = len; + sg_miter_stop(&x_iter); + } +fini: + if (miscompare_idx && !equ) { + u8 *xp = x_iter.addr; + u8 *yp = y_iter.addr; + u8 *x_endp; + + fo
[PATCH v6 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 20 +- lib/scatterlist.c | 79 + 2 files changed, 62 insertions(+), 37 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..69e87280b44d 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -318,8 +318,6 @@ size_t sg_pcopy_from_buffer(struct scatterlist *sgl, unsigned int nents, const void *buf, size_t buflen, off_t skip); size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, void *buf, size_t buflen, off_t skip); -size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, - size_t buflen, off_t skip); size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, @@ -333,6 +331,24 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + +/** + * sg_zero_buffer - Zero-out a part of a SG list + * @sgl: The SG list + * @nents: Number of SG entries + * @buflen:The number of bytes to zero out + * @skip: Number of bytes to skip before zeroing + * + * Returns the number of bytes zeroed. 
+ **/ +static inline size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, + size_t buflen, off_t skip) +{ + return sgl_memset(sgl, nents, skip, 0, buflen); +} + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index e3182de753d0..7e6acc67e9f6 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1023,41 +1023,6 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, } EXPORT_SYMBOL(sg_pcopy_to_buffer); -/** - * sg_zero_buffer - Zero-out a part of a SG list - * @sgl: The SG list - * @nents: Number of SG entries - * @buflen: The number of bytes to zero out - * @skip: Number of bytes to skip before zeroing - * - * Returns the number of bytes zeroed. - **/ -size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, - size_t buflen, off_t skip) -{ - unsigned int offset = 0; - struct sg_mapping_iter miter; - unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG; - - sg_miter_start(&miter, sgl, nents, sg_flags); - - if (!sg_miter_skip(&miter, skip)) - return false; - - while (offset < buflen && sg_miter_next(&miter)) { - unsigned int len; - - len = min(miter.length, buflen - offset); - memset(miter.addr, 0, len); - - offset += len; - } - - sg_miter_stop(&miter); - return offset; -} -EXPORT_SYMBOL(sg_zero_buffer); - /** * sgl_copy_sgl - Copy over a destination sgl from a source sgl * @d_sgl: Destination sgl @@ -1240,3 +1205,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL); } EXPORT_SYMBOL(sgl_compare_sgl); + +/** + * sgl_memset - set byte 'val' up to n_bytes times on SG list + * @sgl: The SG list + * @nents: Number of SG entries in sgl + * @skip: Number of bytes to skip before starting + * @val: byte value to write to sgl + * @n_bytes: The (maximum) number of bytes to modify + * + *
Returns: + * The number of bytes written. + * + * Notes: + * Stops writing if either sgl or n_bytes is exhausted. If n_bytes is + * set to SIZE_MAX then val will be written to each byte until the end + * of sgl. + * + * The notes in sgl_copy_sgl() about large sgl_s apply here as well. + * + **/ +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes) +{ + size_t offset = 0; + size_t len; + struct sg_mapping_iter miter; + + if (n
[PATCH v6 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a sgl to sgl copy. Stops when the first of the number of requested bytes to copy, or the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 8adff41f7cfa..3f836a3246aa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 24ea2d31a405..c06f8caaff91 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1057,3 +1057,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents: Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents: Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes: The (maximum) number of bytes to copy + * + * Returns: + * The number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ * + **/ +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes) +{ + size_t len; + size_t offset = 0; + struct sg_mapping_iter d_iter, s_iter; + + if (n_bytes == 0) + return 0; + sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG); + sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG); + if (!sg_miter_skip(&s_iter, s_skip)) + goto fini; + if (!sg_miter_skip(&d_iter, d_skip)) + goto fini; + + while (offset < n_bytes) { + if (!sg_miter_next(&s_iter)) + break; + if (!sg_miter_next(&d_iter)) + break; + len = min3(d_iter.length, s_iter.length, n_bytes - offset); + + memcpy(d_iter.addr, s_iter.addr, len); + offset += len; + /* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */ + d_iter.consumed = len; + sg_miter_stop(&d_iter); + s_iter.consumed = len; + sg_miter_stop(&s_iter); + } +fini: + sg_miter_stop(&d_iter); + sg_miter_stop(&s_iter); + return offset; +} +EXPORT_SYMBOL(sgl_copy_sgl); -- 2.25.1
[PATCH v6 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>. The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl(), sgl_compare_sgl() and sgl_memset(). The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction. Changes since v5 [posted 20201228]: - incorporate review requests from Jason Gunthorpe - replace integer overflow detection code in sgl_alloc_order() with a pre-condition statement - rebase on lk 5.11.0-rc4 Changes since v4 [posted 20201105]: - rebase on lk 5.10.0-rc2 Changes since v3 [posted 20201019]: - re-instate check on integer overflow of nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB - introduce sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare. - add Reviewed-by tags from Bodo Stroesser - rebase on lk 5.10.0-rc2 [was on lk 5.9.0] Changes since v2 [posted 20201018]: - remove unneeded lines from sgl_memset() definition. - change sg_zero_buffer() to call sgl_memset() as the former is a subset. Changes since v1 [posted 20201016]: - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" with a relatively small performance hit. - sgl_memset return type changed from void to size_t and is the number of bytes actually (over)written. That number is needed anyway internally so may as well return it as it may be useful to the caller. This patchset is against lk 5.11.0-rc4 Douglas Gilbert (4): sgl_alloc_order: remove 4 GiB limit, sgl_free() warning scatterlist: add sgl_copy_sgl() function scatterlist: add sgl_compare_sgl() function scatterlist: add sgl_memset() include/linux/scatterlist.h | 33 - lib/scatterlist.c | 253 +++- 2 files changed, 253 insertions(+), 33 deletions(-) -- 2.25.1
[PATCH v6 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow" but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as u32 so it could not exceed UINT32_MAX (4 GiB) which means 'length' could not exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) return NULL; After several flawed attempts to detect overflow, take the fastest route by stating as a pre-condition that the 'order' function argument cannot exceed 16 (2^16 * 4k = 256 MiB). This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc has no limit at 4 GiB so it seems unreasonable that: sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic. Revise some of this function's description to more accurately reflect what this function is doing. An earlier patch fixed a memory leak in sg_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 21 ++--- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f70572b2938..8adff41f7cfa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..24ea2d31a405 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -554,13 +554,16 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages); #ifdef CONFIG_SGL_ALLOC /** - * sgl_alloc_order - allocate a scatterlist and its pages + * sgl_alloc_order - allocate a scatterlist with equally sized elements each + * of which has 2^@order contiguous pages * @length: Length in bytes of the scatterlist. Must be at least one - * @order: Second argument for alloc_pages() + * @order: Second argument for alloc_pages(). Each sgl element size will + * be (PAGE_SIZE*2^@order) bytes. @order must not exceed 16. * @chainable: Whether or not to allocate an extra element in the scatterlist - * for scatterlist chaining purposes + * for scatterlist chaining purposes * @gfp: Memory allocation flags - * @nent_p: [out] Number of entries in the scatterlist that have pages + * @nent_p: [out] Number of entries in the scatterlist that have pages. + * Ignored if NULL is given. * * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
*/ @@ -574,15 +577,11 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; - nalloc = nent; if (chainable) { - /* Check for integer overflow */ - if (nalloc + 1 < nalloc) + if (check_add_overflow(nent, 1U, &nalloc)) return NULL; - nalloc++; + } else { + nalloc = nent; } sgl = kmalloc_array(nalloc, sizeof(struct scatterlist), gfp & ~GFP_DMA); -- 2.25.1
Re: [PATCH v5 4/4] scatterlist: add sgl_memset()
On 2021-01-07 12:46 p.m., Jason Gunthorpe wrote: On Mon, Dec 28, 2020 at 06:49:55PM -0500, Douglas Gilbert wrote: The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert include/linux/scatterlist.h | 3 ++ lib/scatterlist.c | 65 + 2 files changed, 48 insertions(+), 20 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..70d3f1f73df1 100644 +++ b/include/linux/scatterlist.h @@ -333,6 +333,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 9332365e7eb6..f06614a880c8 100644 +++ b/lib/scatterlist.c @@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer); size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip) { - unsigned int offset = 0; - struct sg_mapping_iter miter; - unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG; - - sg_miter_start(&miter, sgl, nents, sg_flags); - - if (!sg_miter_skip(&miter, skip)) - return false; - - while (offset < buflen && sg_miter_next(&miter)) { - unsigned int len; - - len = min(miter.length, buflen - offset); - memset(miter.addr, 0, len); - - offset += len; - } - - sg_miter_stop(&miter); - return offset; + return sgl_memset(sgl, nents, skip, 0, buflen); } EXPORT_SYMBOL(sg_zero_buffer); May as well make this one liner a static inline in the header. Just rename this function to sgl_memset so the diff is clearer Yes, fine. I can roll a new version. Doug Gilbert
Re: [PATCH v5 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2021-01-07 12:44 p.m., Jason Gunthorpe wrote:
On Mon, Dec 28, 2020 at 06:49:52PM -0500, Douglas Gilbert wrote:

diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..4986545beef9 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages);
 #ifdef CONFIG_SGL_ALLOC
 /**
- * sgl_alloc_order - allocate a scatterlist and its pages
+ * sgl_alloc_order - allocate a scatterlist with equally sized elements
  * @length: Length in bytes of the scatterlist. Must be at least one
- * @order: Second argument for alloc_pages()
+ * @order: Second argument for alloc_pages(). Each sgl element size will
+ *	   be (PAGE_SIZE*2^order) bytes
  * @chainable: Whether or not to allocate an extra element in the scatterlist
- *	for scatterlist chaining purposes
+ *	   for scatterlist chaining purposes
  * @gfp: Memory allocation flags
- * @nent_p: [out] Number of entries in the scatterlist that have pages
+ * @nent_p: [out] Number of entries in the scatterlist that have pages.
+ *	   Ignored if NULL is given.
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
@@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
+	/* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */
+	if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order)
 		return NULL;
 	nalloc = nent;
 	if (chainable) {

This is a little bit too tortured now, how about this:

	if (length >> (PAGE_SHIFT + order) >= UINT_MAX)
		return NULL;
	nent = length >> (PAGE_SHIFT + order);
	if (length & ((1ULL << (PAGE_SHIFT + order)) - 1))
		nent++;

	if (chainable) {
		if (check_add_overflow(nent, 1, &nalloc))
			return NULL;
	} else
		nalloc = nent;

And your proposal is less <tortured> ?
I'm looking at performance, not elegance. And I'm betting that two ilog2() calls [which boil down to fls()] are faster than two right-shifts and one left-shift. Perhaps an extra comment could help my code by noting that, mathematically:

	/* if n > m for positive n and m then: log(n) > log(m) */

My original preference was to drop the check altogether but Bart Van Assche (who wrote that function) wanted me to keep it. Any function that takes 'order' (i.e. an exponent) can blow up given a silly value. The chainable check_add_overflow() call is new and an improvement.

Doug Gilbert
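The difference between the two checks is easy to see in a user-space model. Everything below is an illustrative sketch, not the kernel code: MODEL_PAGE_SHIFT assumes 4 KiB pages, model_ilog2() re-implements the kernel's ilog2() as a loop, and old_check_rejects()/new_check_rejects() are invented names. With a 32-bit nent, the original right-hand side `nent << (PAGE_SHIFT + order)` wraps at 4 GiB, which is exactly the spurious limit being removed:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assume 4 KiB pages for this sketch */

/* floor(log2(v)); a stand-in for the kernel's ilog2() */
static int model_ilog2(unsigned long long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Original check: the shift is evaluated in 32 bits when nent is a u32,
 * so the right hand side wraps to 0 once it reaches 2^32 (4 GiB). */
static int old_check_rejects(unsigned long long length, uint32_t nent,
			     unsigned int order)
{
	return length > (nent << (MODEL_PAGE_SHIFT + order));
}

/* Patched check: compare floor-logs of both sides; no shift can wrap */
static int new_check_rejects(unsigned long long length, uint32_t nent,
			     unsigned int order)
{
	return model_ilog2(length) >
	       model_ilog2(nent) + MODEL_PAGE_SHIFT + (int)order;
}
```

For a 4 GiB request at order 0, nent is 1 << 20: the old right-hand side wraps to 0 so the old check falsely rejects the request, while the log-based check accepts it yet still fires when length is genuinely too large for the computed nent.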
[PATCH v5 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Change implementation of sg_zero_buffer() to call this new function. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 3 ++ lib/scatterlist.c | 65 + 2 files changed, 48 insertions(+), 20 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 71be65f9ebb5..70d3f1f73df1 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -333,6 +333,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes, size_t *miscompare_idx); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 9332365e7eb6..f06614a880c8 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
@@ -1243,3 +1224,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents, y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);

-- 
2.25.1
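For readers without a kernel tree handy, the skip-then-fill behaviour of sgl_memset() can be modelled in ordinary user-space C. Everything here is a hypothetical stand-in, not the kernel API: struct seg replaces a scatterlist segment and seg_memset()/seg_memset_demo() are invented names; the kmap/atomic concerns of sg_miter do not arise:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for one scatterlist segment */
struct seg {
	unsigned char *addr;
	size_t length;
};

/* Mirrors the sgl_memset() loop: skip 'skip' bytes, then write 'val'
 * until n_bytes or the segment list is exhausted; returns bytes written */
static size_t seg_memset(struct seg *segs, unsigned int nsegs, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t offset = 0;
	unsigned int i;

	for (i = 0; i < nsegs && offset < n_bytes; i++) {
		size_t len = segs[i].length;
		size_t start;

		if (skip >= len) {	/* whole segment skipped */
			skip -= len;
			continue;
		}
		start = skip;
		skip = 0;
		len -= start;
		if (len > n_bytes - offset)
			len = n_bytes - offset;
		memset(segs[i].addr + start, val, len);
		offset += len;
	}
	return offset;
}

/* Tiny self-check: two 4-byte segments, skip 3, write 3 bytes of 0xff */
static int seg_memset_demo(void)
{
	unsigned char a[4] = {0}, b[4] = {0};
	struct seg s[2] = { { a, sizeof(a) }, { b, sizeof(b) } };
	size_t n = seg_memset(s, 2, 3, 0xff, 3);

	return n == 3 && a[2] == 0 && a[3] == 0xff &&
	       b[0] == 0xff && b[1] == 0xff && b[2] == 0;
}
```

The skip crossing from the tail of the first segment into the second is the case that the sg_miter_skip() call handles in the kernel version.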
[PATCH v5 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow" but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as u32 so it could not exceed UINT32_MAX (4 GiB), which means 'length' could not exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as:

	if (length > UINT32_MAX)
		return NULL;

Get around the integer overflow problem in the rhs of the original check by taking ilog2() of both sides.

This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc() has no limit at 4 GiB so it seems unreasonable that:

	sgl_alloc_order(unsigned long long length, ...)

does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic.

Revise some of this function's description to more accurately reflect what this function is doing.

An earlier patch fixed a memory leak in sgl_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 14 -- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f70572b2938..8adff41f7cfa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a59778946404..4986545beef9 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages); #ifdef CONFIG_SGL_ALLOC /** - * sgl_alloc_order - allocate a scatterlist and its pages + * sgl_alloc_order - allocate a scatterlist with equally sized elements * @length: Length in bytes of the scatterlist. Must be at least one - * @order: Second argument for alloc_pages() + * @order: Second argument for alloc_pages(). Each sgl element size will + *be (PAGE_SIZE*2^order) bytes * @chainable: Whether or not to allocate an extra element in the scatterlist - * for scatterlist chaining purposes + *for scatterlist chaining purposes * @gfp: Memory allocation flags - * @nent_p: [out] Number of entries in the scatterlist that have pages + * @nent_p: [out] Number of entries in the scatterlist that have pages. + * Ignored if NULL is given. * * Returns: A pointer to an initialized scatterlist or %NULL upon failure. 
*/ @@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) + /* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */ + if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order) return NULL; nalloc = nent; if (chainable) { -- 2.25.1
[PATCH v5 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy, which stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 8adff41f7cfa..3f836a3246aa 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 4986545beef9..af9cd7b9dc19 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1060,3 +1060,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents:Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents:Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes:The (maximum) number of bytes to copy + * + * Returns: + * The number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&s_iter))
+			break;
+		if (!sg_miter_next(&d_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);

-- 
2.25.1
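The min3()-per-iteration walk over two segment lists can also be modelled in user space. This is an invented sketch, not the kernel API: struct cp_seg and the cursor helpers stand in for scatterlist plus sg_mapping_iter, and the atomic-kmap ordering concerns vanish in user space:

```c
#include <stddef.h>
#include <string.h>

struct cp_seg {
	unsigned char *addr;
	size_t length;
};

/* Cursor over a segment list: segment index plus offset within it */
struct cp_cur {
	struct cp_seg *segs;
	unsigned int nsegs;
	unsigned int i;
	size_t off;
};

static size_t cur_avail(const struct cp_cur *c)
{
	return (c->i < c->nsegs) ? c->segs[c->i].length - c->off : 0;
}

static void cur_advance(struct cp_cur *c, size_t n)
{
	c->off += n;
	while (c->i < c->nsegs && c->off >= c->segs[c->i].length) {
		c->off -= c->segs[c->i].length;
		c->i++;
	}
}

/* Mirrors sgl_copy_sgl(): skip, then copy min3(d, s, remaining) chunks */
static size_t seg_copy(struct cp_seg *d, unsigned int dn, size_t d_skip,
		       struct cp_seg *s, unsigned int sn, size_t s_skip,
		       size_t n_bytes)
{
	struct cp_cur dc = { d, dn, 0, 0 }, sc = { s, sn, 0, 0 };
	size_t copied = 0;

	cur_advance(&dc, d_skip);
	cur_advance(&sc, s_skip);
	while (copied < n_bytes) {
		size_t len = cur_avail(&dc);
		size_t sl = cur_avail(&sc);

		if (sl < len)
			len = sl;
		if (n_bytes - copied < len)
			len = n_bytes - copied;
		if (len == 0)
			break;	/* one side exhausted */
		memcpy(dc.segs[dc.i].addr + dc.off,
		       sc.segs[sc.i].addr + sc.off, len);
		cur_advance(&dc, len);
		cur_advance(&sc, len);
		copied += len;
	}
	return copied;
}

/* src "abc"+"def" skip 2 -> "cdef"; dst is a 2-byte then 4-byte segment */
static int seg_copy_demo(void)
{
	unsigned char s1[3] = "abc", s2[3] = "def";
	unsigned char d1[2] = {0}, d2[4] = {0};
	struct cp_seg src[2] = { { s1, 3 }, { s2, 3 } };
	struct cp_seg dst[2] = { { d1, 2 }, { d2, 4 } };
	size_t n = seg_copy(dst, 2, 1, src, 2, 2, 4);

	return n == 4 && d1[1] == 'c' &&
	       d2[0] == 'd' && d2[1] == 'e' && d2[2] == 'f';
}
```

Each loop pass copies the largest chunk that fits inside the current destination segment, the current source segment and the remaining byte count, which is exactly what the min3() in the patch selects.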
[PATCH v5 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly. Reviewed-by: Bodo Stroesser Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 8 +++ lib/scatterlist.c | 109 2 files changed, 117 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 3f836a3246aa..71be65f9ebb5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + +bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes, size_t *miscompare_idx); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index af9cd7b9dc19..9332365e7eb6 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1134,3 +1134,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl_idx - Compare x and y (both sgl_s)
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If
+ *   return is false and miscompare_idx is non-NULL, then the index of the
+ *   first miscompared byte is written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	if (miscompare_idx && !equ) {
+		u8 *xp = x_iter.addr;
+		u8 *yp = y_iter.addr;
+		u8 *x_endp;
+
+		fo
[PATCH v5 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl(), sgl_compare_sgl() and sgl_memset().

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v4 [posted 20201105]:
  - rebase on lk 5.11.0-rc2

Changes since v3 [posted 20201019]:
  - re-instate the check on integer overflow of the nent calculation in sgl_alloc_order(). Do it in such a way as to not limit the overall sgl size to 4 GiB
  - introduce the sgl_compare_sgl_idx() helper function that, if requested and if a miscompare is detected, will yield the byte index of the first miscompare
  - add Reviewed-by tags from Bodo Stroesser
  - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
  - remove unneeded lines from the sgl_memset() definition
  - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
  - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() calls [called via sg_miter_stop()] and proposed a solution that simplifies the previous code
  - the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" with a relatively small performance hit
- sgl_memset return type changed from void to size_t and is the number of bytes actually (over)written. That number is needed anyway internally so may as well return it as it may be useful to the caller. This patchset is against lk 5.10.0-rc2 Douglas Gilbert (4): sgl_alloc_order: remove 4 GiB limit, sgl_free() warning scatterlist: add sgl_copy_sgl() function scatterlist: add sgl_compare_sgl() function scatterlist: add sgl_memset() include/linux/scatterlist.h | 16 +++ lib/scatterlist.c | 244 +--- 2 files changed, 243 insertions(+), 17 deletions(-) -- 2.25.1
Re: [PATCH] [v2] scsi: scsi_debug: Fix memleak in scsi_debug_init
On 2020-12-26 1:15 a.m., Dinghao Liu wrote: When sdeb_zbc_model does not match BLK_ZONED_NONE, BLK_ZONED_HA or BLK_ZONED_HM, we should free sdebug_q_arr to prevent memleak. Also there is no need to execute sdebug_erase_store() on failure of sdeb_zbc_model_str(). Signed-off-by: Dinghao Liu Acked-by: Douglas Gilbert Thanks. --- Changelog: v2: - Add missed assignment statement for ret. --- drivers/scsi/scsi_debug.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 24c0f7ec0351..4a08c450b756 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -6740,7 +6740,7 @@ static int __init scsi_debug_init(void) k = sdeb_zbc_model_str(sdeb_zbc_model_s); if (k < 0) { ret = k; - goto free_vm; + goto free_q_arr; } sdeb_zbc_model = k; switch (sdeb_zbc_model) { @@ -6753,7 +6753,8 @@ static int __init scsi_debug_init(void) break; default: pr_err("Invalid ZBC model\n"); - return -EINVAL; + ret = -EINVAL; + goto free_q_arr; } } if (sdeb_zbc_model != BLK_ZONED_NONE) {
Re: [PATCH v1 0/6] no-copy bvec
On 2020-12-24 1:41 a.m., Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 08:32:45PM +0000, Pavel Begunkov wrote:
On 23/12/2020 20:23, Douglas Gilbert wrote:
On 2020-12-23 11:04 a.m., James Bottomley wrote:
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:

Can scatterlist have 0-len entries? Those are directly translated into bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c. I've audited most of others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither. I think the same is true for the SCSI target code, but please double check.

Right, no-one ever wants to see a 0-len scatter list entry. The reason is that every driver uses the sgl to program the device DMA engine in the way NVMe does. A 0 length sgl would be a dangerous corner case: some DMA engines would ignore it and others would go haywire, so if we ever let a 0 length list down into the driver, they'd have to understand the corner case behaviour of their DMA engine and filter it accordingly, which is why we disallow them in the upper levels, since they're effective nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly allow the "number of logical blocks" in their sgl_s to be zero and state that it is _not_ to be considered an error.

It's fine for my case unless it leaks them out of the device driver to the net/block layer/etc. Is it?

None of the SCSI commands mentioned above are supported by Linux, never mind mapped to struct scatterlist.

The POPULATE TOKEN / WRITE USING TOKEN pair can be viewed as a subset of EXTENDED COPY (SPC-4) which also supports "range descriptors". It is not clear if target_core_xcopy.c supports these range descriptors but if it did, it would be trying to map them to struct scatterlist objects.
That said, it would be easy to skip the "number of logical blocks" == 0 case when translating range descriptors to sgl_s. In my ddpt utility (a dd clone) I have generalized skip= and seek= to optionally take sgl_s. If the last element in one of those sgl_s is LBAn,0 then it is interpreted as "until the end of that device" which is further restricted if the other sgl has a "hard" length or count= is given. The point being a length of 0 can have meaning, a benefit lost with NVMe's 0-based counts. Doug Gilbert
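The "LBAn,0 means until the end of the device" convention is easy to state precisely in code. The sketch below is an invented user-space model, not part of ddpt or the kernel: struct delem, total_blocks() and total_demo() are hypothetical names.

```c
/* Hypothetical element of a ddpt-style sgl: a starting LBA plus a
 * number of logical blocks (1-based counts, unlike NVMe's 0-based) */
struct delem {
	unsigned long long lba;
	unsigned int num;
};

/* Models the convention described above: a trailing (LBA, 0) element
 * means "from that LBA until the end of the device" */
static unsigned long long total_blocks(const struct delem *sgl, int n,
				       unsigned long long dev_blocks)
{
	unsigned long long sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (i == n - 1 && sgl[i].num == 0)
			sum += dev_blocks - sgl[i].lba;	/* to the end */
		else
			sum += sgl[i].num;
	}
	return sum;
}

/* 4 blocks at LBA 0, then LBA 100 to the end of a 110-block device */
static unsigned long long total_demo(void)
{
	const struct delem sgl[2] = { { 0, 4 }, { 100, 0 } };

	return total_blocks(sgl, 2, 110);
}
```

A caller that also has a "hard" length on the other sgl, or a count= argument, would further clamp the result, as the text describes.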
Re: [PATCH v1 0/6] no-copy bvec
On 2020-12-23 11:04 a.m., James Bottomley wrote:
On Wed, 2020-12-23 at 15:51 +0000, Christoph Hellwig wrote:
On Wed, Dec 23, 2020 at 12:52:59PM +0000, Pavel Begunkov wrote:

Can scatterlist have 0-len entries? Those are directly translated into bvecs, e.g. in nvme/target/io-cmd-file.c and target/target_core_file.c. I've audited most of others by this moment, they're fine.

For block layer SGLs we should never see them, and for nvme neither. I think the same is true for the SCSI target code, but please double check.

Right, no-one ever wants to see a 0-len scatter list entry. The reason is that every driver uses the sgl to program the device DMA engine in the way NVME does. A 0 length sgl would be a dangerous corner case: some DMA engines would ignore it and others would go haywire, so if we ever let a 0 length list down into the driver, they'd have to understand the corner case behaviour of their DMA engine and filter it accordingly, which is why we disallow them in the upper levels, since they're effective nops anyway.

When using scatter gather lists at the far end (i.e. on the storage device) the T10 examples (WRITE SCATTERED and POPULATE TOKEN in SBC-4) explicitly allow the "number of logical blocks" in their sgl_s to be zero and state that it is _not_ to be considered an error.

Doug Gilbert
Re: [RFC PATCH v2 0/2] add simple copy support
On 2020-12-07 9:56 a.m., Hannes Reinecke wrote:
On 12/7/20 3:11 PM, Christoph Hellwig wrote:

So, I'm really worried about:

a) a good use case. GC in f2fs or btrfs seem like good use cases, as does accelerating dm-kcopyd. I agree with Damien that lifting dm-kcopyd to common code would also be really nice. I'm not 100% sure it should be a requirement, but it sure would be nice to have. I don't think just adding an ioctl is enough of a use case for complex kernel infrastructure.

b) We had a bunch of different attempts at SCSI XCOPY support from, IIRC, Martin, Bart and Mikulas. I think we need to pull them into this discussion, and make sure whatever we do covers the SCSI needs.

And we shouldn't forget that the main issue which killed all previous implementations was a missing QoS guarantee. It's nice to have simple copy, but if the implementation is _slower_ than doing it by hand from the OS there is very little point in even attempting to do so. I can't see any provisions for that in the TPAR, leading me to the assumption that NVMe simple copy will suffer from the same issue. So if we can't address this I guess this attempt will fail, too.

I have been doing quite a lot of work and testing in my sg driver rewrite in the copy and compare area. The baselines for performance are dd and io_uring-cp (in liburing). There are lots of ways to improve on them. Here are some:

- the user data need never pass through user space (it could be mmap-ed out during the READ if there is a good reason). Only the metadata (e.g. NVMe or SCSI commands) needs to come from the user space, and errors, if any, are reported back to the user space.
- break a large copy (or compare) into segments, with each segment a "comfortable" size for the OS to handle, say 256 KB
- there is one constraint: the READ in each segment must complete before its paired WRITE can commence
  - extra constraint for some zoned disks: WRITEs must be issued in order (assuming they are applied in that order; if not, need to wait until each WRITE completes)
- arrange for the READ-WRITE pair in each segment to share the same bio
- have multiple slots, each holding a segment (i.e. a bio and the metadata to process a READ-WRITE pair)
- re-use each slot's bio for the following READ-WRITE pair
- issue the READs in each slot asynchronously and do an interleaved (io)poll for completion. Then issue the paired WRITE asynchronously
- the above "slot" algorithm runs in one thread; there can be multiple threads running the same algorithm. The segment manager needs to be locked (or use atomics) so that each segment (identified by its starting LBA) is issued once and only once when the next thread wants a segment to copy

Running multiple threads gives diminishing or even worsening returns. Runtime metrics on lock contention and storage bus capacity may help in choosing the number of threads. A simpler approach might be to add more threads until the combined throughput increase is less than, say, 10%.

The 'compare' that I mention is based on the SCSI VERIFY(BYTCHK=1) command (or the NVMe NVM Compare command). Using dd logic, a disk to disk compare can be implemented with not much more work than changing the WRITE to a VERIFY command. This is a different approach to the Linux cmp utility which READs in both sides and does a memcmp() type operation.

Using ramdisks (from the scsi_debug driver) the compare operation (max ~ 10 GB/s) was actually faster than the copy (max ~ 7 GB/s). I put this down to WRITE operations taking a write lock over the store while the VERIFY only needs a read lock, so many VERIFY operations can co-exist on the same store.
Unfortunately on real SAS and NVMe SSDs that I tested the performance of the VERIFY and NVM Compare commands is underwhelming. For comparison, using scsi_debug ramdisks, dd copy throughput was < 1 GB/s and io_uring-cp was around 2-3 GB/s. The system was Ryzen 3600 based. Doug Gilbert
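The segment hand-out described above (lock the segment manager, or use atomics, so each segment is issued exactly once) can be sketched with C11 atomics. This is an invented user-space model, not the sg driver code: SEG_SZ, claim_segment() and claim_demo() are hypothetical names, and only the hand-out logic is shown, not the READ/WRITE issuing.

```c
#include <stdatomic.h>

#define SEG_SZ (256 * 1024)	/* the "comfortable" segment size above */

/* Model of the segment manager: each worker thread atomically claims the
 * next segment's starting byte offset, so every segment is handed out
 * exactly once even with several copy threads running */
static atomic_ullong next_off;

/* Returns the claimed segment's starting offset, or total when the whole
 * copy has already been handed out (i.e. this thread should stop) */
static unsigned long long claim_segment(unsigned long long total)
{
	unsigned long long off = atomic_fetch_add(&next_off, SEG_SZ);

	return off < total ? off : total;
}

/* Single-threaded self-check: a copy of 1.5 segments needs two claims */
static int claim_demo(void)
{
	const unsigned long long total = 3ULL * SEG_SZ / 2;
	int n = 0;

	atomic_store(&next_off, 0);
	while (claim_segment(total) < total)
		n++;
	return n;
}
```

With atomic_fetch_add() there is no lock to contend on; the trade-off versus a mutex-protected manager is that a thread may claim one segment past the end, which the `off < total` test absorbs.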
Re: [PATCH] scsi: ses: Fix crash caused by kfree an invalid pointer
On 2020-11-28 6:27 p.m., James Bottomley wrote:
On Sat, 2020-11-28 at 20:23 +0800, Ding Hui wrote:

We can get a crash when disconnecting the iSCSI session. The call trace is like this:

 [2a00fb70] kfree at 0830e224
 [2a00fba0] ses_intf_remove at 01f200e4
 [2a00fbd0] device_del at 086b6a98
 [2a00fc50] device_unregister at 086b6d58
 [2a00fc70] __scsi_remove_device at 0870608c
 [2a00fca0] scsi_remove_device at 08706134
 [2a00fcc0] __scsi_remove_target at 087062e4
 [2a00fd10] scsi_remove_target at 087064c0
 [2a00fd70] __iscsi_unbind_session at 01c872c4
 [2a00fdb0] process_one_work at 0810f35c
 [2a00fe00] worker_thread at 0810f648
 [2a00fe70] kthread at 08116e98

In ses_intf_add(), the components count can be 0, in which case a 0-size scomp is kcalloc()ed but not saved in edev->component[i].scratch. In this situation edev->component[0].scratch is an invalid pointer, and when it is kfree()d in ses_intf_remove_enclosure() a crash like the above can happen. The call trace could also take other random forms when kfree() cannot catch the invalid pointer. We should not use the edev->component[] array when the components count is 0. We also need to check the index when using the edev->component[] array in ses_enclosure_data_process().

Tested-by: Zeng Zhicong
Cc: stable # 2.6.25+
Signed-off-by: Ding Hui

This doesn't really look to be the right thing to do: an enclosure which has no components can't usefully be controlled by the driver since there's nothing for it to do, so what we should do in this situation is refuse to attach, like the proposed patch below. It does seem a bit odd that someone would build an enclosure that doesn't enclose anything, so would you mind running sg_ses -e

'-e' is the short form of '--enumerate'. That will report the names and abbreviations of the diagnostic pages that the utility itself knows about (and supports). It won't show anything specific about the environment that sg_ses is executed in.
You probably meant:

    sg_ses <device>

Examples of the likely forms are:

    sg_ses /dev/bsg/1:0:0:0
    sg_ses /dev/sg2
    sg_ses /dev/ses0

This from a nearby machine:

$ lsscsi -gs
[3:0:0:0]  disk     ATA      Samsung SSD 850   1B6Q  /dev/sda  /dev/sg0    120GB
[4:0:0:0]  disk     IBM-207x HUSMM8020ASS20    J4B6  /dev/sdc  /dev/sg2    200GB
[4:0:1:0]  disk     ATA      INTEL SSDSC2KW25  003C  /dev/sdd  /dev/sg3    256GB
[4:0:2:0]  disk     SEAGATE  ST1NM0096         E005  /dev/sde  /dev/sg4   10.0TB
[4:0:3:0]  enclosu  Areca Te ARC-802801.37.69  0137  -         /dev/sg5        -
[4:0:4:0]  enclosu  Intel    RES2SV240         0d00  -         /dev/sg6        -
[7:0:0:0]  disk     Kingston DataTravelerMini  PMAP  /dev/sdb  /dev/sg1   1.03GB
[N:0:0:1]  disk     WDC WDS256G1X0C-00ENX0__1        /dev/nvme0n1  -       256GB

# sg_ses /dev/sg5
  Areca Te  ARC-802801.37.69  0137
Supported diagnostic pages:
  Supported Diagnostic Pages [sdp] [0x0]
  Configuration (SES) [cf] [0x1]
  Enclosure Status/Control (SES) [ec,es] [0x2]
  String In/Out (SES) [str] [0x4]
  Threshold In/Out (SES) [th] [0x5]
  Element Descriptor (SES) [ed] [0x7]
  Additional Element Status (SES-2) [aes] [0xa]
  Supported SES Diagnostic Pages (SES-2) [ssp] [0xd]
  Download Microcode (SES-2) [dm] [0xe]
  Subenclosure Nickname (SES-2) [snic] [0xf]
  Protocol Specific (SAS transport) [] [0x3f]

# sg_ses -p cf /dev/sg5
  Areca Te  ARC-802801.37.69  0137
Configuration diagnostic page:
  number of secondary subenclosures: 0
  generation code: 0x0
  enclosure descriptor list
    Subenclosure identifier: 0 [primary]
      relative ES process id: 1, number of ES processes: 1
      number of type descriptor headers: 9
      enclosure logical identifier (hex): d5b401503fc0ec16
      enclosure vendor: Areca Te  product: ARC-802801.37.69  rev: 0137
      vendor-specific data:
        11 22 33 44 55 00 00 00    ."3DU...
  type descriptor header and text list
    Element type: Array device slot, subenclosure id: 0
      number of possible elements: 24
      text: ArrayDevicesInSubEnclsr0
    Element type: Enclosure, subenclosure id: 0
      number of possible elements: 1
      text: EnclosureElementInSubEnclsr0
    Element type: SAS expander, subenclosure id: 0
      number of possible elements: 1
      text: SAS Expander
    Element type: Cooling, subenclosure id: 0
      number of possible elements: 5
      text: CoolingElementInSubEnclsr0
    Element type: Temperature sensor, subenclosure id: 0
      number of possible elements: 2
      text: TempSensorsInSubEnclsr0
    Element type: Voltage sensor, subenclosure id: 0
      number of possible elements: 2
      text: VoltageSensorsInSubEnclsr0
    Element type: SAS connector, subenclosure id: 0
      number of possible elements: 3
      text: ConnectorsInSubEnclsr0
    Element type: Power supply, subenclosure id: 0
      number of possible
[PATCH v4 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. It also returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 ++
 lib/scatterlist.c           | 65 ++++++++++++++++++++++++-------------
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 0f6d59bf66cb..8e4c050e6237 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -339,6 +339,9 @@ bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t
 		struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		size_t n_bytes, size_t *miscompare_idx);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 9332365e7eb6..f06614a880c8 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1038,26 +1038,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
 
@@ -1243,3 +1224,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return sgl_compare_sgl_idx(x_sgl, x_nents, x_skip, y_sgl, y_nents,
 				   y_skip, n_bytes, NULL);
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl:	The SG list
+ * @nents:	Number of SG entries in sgl
+ * @skip:	Number of bytes to skip before starting
+ * @val:	byte value to write to sgl
+ * @n_bytes:	The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
-- 
2.25.1
[PATCH v4 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch fixes a check done by sgl_alloc_order() before it starts any allocations. The comment in the original said: "Check for integer overflow", but the check itself contained an integer overflow! The right hand side (rhs) of the expression in the condition is resolved as a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that value. If that was the intention then the comment above it could be dropped and the condition rewritten more clearly as:

    if (length > UINT32_MAX)
        <>;

Get around the integer overflow problem in the rhs of the original check by taking ilog2() of both sides.

This function may be used to replace vmalloc(unsigned long) for a large allocation (e.g. a ramdisk). vmalloc has no limit at 4 GiB, so it seems unreasonable that:

    sgl_alloc_order(unsigned long long length, ...)

does. sgl_s made with sgl_alloc_order() have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer arithmetic.

Revise some of this function's description to more accurately reflect what the function is doing.

An earlier patch fixed a memory leak in sgl_alloc_order() due to the misuse of sgl_free(). Take the opportunity to put a one line comment above sgl_free()'s declaration warning that it is not suitable when order > 0.
Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  1 +
 lib/scatterlist.c           | 14 ++++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 36c47e7e66a2..d9443ebd0a8e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -308,6 +308,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
 			      unsigned int *nent_p);
 void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
+/* Only use sgl_free() when order is 0 */
 void sgl_free(struct scatterlist *sgl);
 
 #endif /* CONFIG_SGL_ALLOC */

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index a59778946404..4986545beef9 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -554,13 +554,15 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages);
 #ifdef CONFIG_SGL_ALLOC
 
 /**
- * sgl_alloc_order - allocate a scatterlist and its pages
+ * sgl_alloc_order - allocate a scatterlist with equally sized elements
  * @length: Length in bytes of the scatterlist. Must be at least one
- * @order: Second argument for alloc_pages()
+ * @order: Second argument for alloc_pages(). Each sgl element size will
+ *	   be (PAGE_SIZE*2^order) bytes
  * @chainable: Whether or not to allocate an extra element in the scatterlist
- *	for scatterlist chaining purposes
+ *	       for scatterlist chaining purposes
  * @gfp: Memory allocation flags
- * @nent_p: [out] Number of entries in the scatterlist that have pages
+ * @nent_p: [out] Number of entries in the scatterlist that have pages.
+ *	    Ignored if NULL is given.
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
@@ -574,8 +576,8 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
+	/* Integer overflow if: length > nent*2^(PAGE_SHIFT+order) */
+	if (ilog2(length) > ilog2(nent) + PAGE_SHIFT + order)
 		return NULL;
 	nalloc = nent;
 	if (chainable) {
-- 
2.25.1
[PATCH v4 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp(), this function returns false on the first miscompare and stops comparing.

A helper function called sgl_compare_sgl_idx() is added. It takes an additional parameter (miscompare_idx) which is a pointer. If that pointer is non-NULL and a miscompare is detected (i.e. the function returns false) then the byte index of the first miscompare is written to *miscompare_idx. Knowing the location of the first miscompare is needed to implement the SCSI COMPARE AND WRITE command properly.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |   8 +++
 lib/scatterlist.c           | 109 ++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index f2922a34b140..0f6d59bf66cb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -331,6 +331,14 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
 		    size_t n_bytes);
 
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes);
+
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index af9cd7b9dc19..9332365e7eb6 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1134,3 +1134,112 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski
 	return offset;
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl_idx - Compare x and y (both sgl_s)
+ * @x_sgl:	x (left) sgl
+ * @x_nents:	Number of SG entries in x (left) sgl
+ * @x_skip:	Number of bytes to skip in x (left) before starting
+ * @y_sgl:	y (right) sgl
+ * @y_nents:	Number of SG entries in y (right) sgl
+ * @y_skip:	Number of bytes to skip in y (right) before starting
+ * @n_bytes:	The (maximum) number of bytes to compare
+ * @miscompare_idx: if return is false, index of first miscompare written
+ *		    to this pointer (if non-NULL). Value will be < n_bytes
+ *
+ * Returns:
+ *   true if x and y compare equal before x, y or n_bytes is exhausted.
+ *   Otherwise on a miscompare, returns false (and stops comparing). If
+ *   return is false and miscompare_idx is non-NULL, then the index of the
+ *   first miscompared byte is written to *miscompare_idx.
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl_idx(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+			 struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+			 size_t n_bytes, size_t *miscompare_idx)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		if (!equ)
+			goto fini;
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	if (miscompare_idx && !equ) {
+		u8 *xp = x_iter.addr;
+		u8 *yp = y_iter.addr;
+		u8 *x_endp;
+
+		fo
[PATCH v4 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Reviewed-by: Bodo Stroesser
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 74 +++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index d9443ebd0a8e..f2922a34b140 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -327,6 +327,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 4986545beef9..af9cd7b9dc19 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1060,3 +1060,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl:	Destination sgl
+ * @d_nents:	Number of SG entries in destination sgl
+ * @d_skip:	Number of bytes to skip in destination before starting
+ * @s_sgl:	Source sgl
+ * @s_nents:	Number of SG entries in source sgl
+ * @s_skip:	Number of bytes to skip in source before starting
+ * @n_bytes:	The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with
+ *   memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is
+ *   started with kmap_atomic() [in sg_miter_next()] and completed with
+ *   kunmap_atomic() [in sg_miter_stop()]. This means pre-emption is
+ *   inhibited for relatively short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&d_iter))
+			break;
+		if (!sg_miter_next(&s_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
-- 
2.25.1
[PATCH v4 0/4] scatterlist: add new capabilities
This patchset was sent to the linux-block and linux-scsi lists a few hours ago. If it is accepted, that will probably be via the linux-block maintainer. It has potential users in the target sub-system and the scsi_debug driver. Other parts of the kernel that use sgl_s may be interested, which is why it is now being sent to the linux-kernel list.

Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example, the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>. The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be in the target subsystem. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v3 [posted 20201019]:
  - re-instate the check on integer overflow of the nent calculation in
    sgl_alloc_order(). Do it in such a way as to not limit the overall
    sgl size to 4 GiB
  - introduce the sgl_compare_sgl_idx() helper function that, if requested
    and if a miscompare is detected, will yield the byte index of the
    first miscompare
  - add Reviewed-by tags from Bodo Stroesser
  - rebase on lk 5.10.0-rc2 [was on lk 5.9.0]

Changes since v2 [posted 20201018]:
  - remove unneeded lines from the sgl_memset() definition
  - change sg_zero_buffer() to call sgl_memset() as the former is a subset

Changes since v1 [posted 20201016]:
  - Bodo Stroesser pointed out a problem with the nesting of kmap_atomic()
    [called via sg_miter_next()] and kunmap_atomic() calls [called via
    sg_miter_stop()] and proposed a solution that simplifies the previous
    code
  - the new implementation of the three functions has shorter periods when
    pre-emption is disabled (but has more of them). This should make
    operations on large sgl_s more pre-emption "friendly" with a relatively
    small performance hit
  - the sgl_memset() return type changed from void to size_t and is the
    number of bytes actually (over)written. That number is needed anyway
    internally, so it may as well be returned as it may be useful to the
    caller

This patchset is against lk 5.10.0-rc2.

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  16 +++
 lib/scatterlist.c           | 244 +++++++++++++++++++++++++++++++++---
 2 files changed, 243 insertions(+), 17 deletions(-)

-- 
2.25.1
Re: [PATCH v3 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
On 2020-11-03 7:54 a.m., Bodo Stroesser wrote:
> Am 19.10.20 um 21:19 schrieb Douglas Gilbert:
>> This patch removes a check done by sgl_alloc_order() before it starts
>> any allocations. The comment before the removed code says: "Check for
>> integer overflow", which arguably gives a false sense of security. The
>> right hand side of the expression in the condition is resolved as a u32,
>> so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot
>> exceed that amount. If that was the intention then the comment above it
>> could be dropped and the condition rewritten more clearly as:
>>     if (length > UINT32_MAX)
>>         <>;
>
> I think the intention of the check is to reject calls where length is so
> high that the calculation of nent overflows the unsigned int nent/nalloc.
> Consistently, a similar check is done a few lines later before
> incrementing nalloc due to chainable = true. So I think the code tries
> to allow length values up to 4G << (PAGE_SHIFT + order).
>
> That said, I think instead of removing the check it better should be
> fixed, e.g. by adding an unsigned long long cast before nent.
>
> BTW: I don't know why there are two checks. I think one check after
> conditionally incrementing nalloc would be enough.

Okay, I'm working on a "v4" patchset. Apart from the above, my plan is to extend sgl_compare_sgl() with a helper that additionally yields the byte index of the first miscompare.

Doug Gilbert

>> The author's intention is to use sgl_alloc_order() to replace
>> vmalloc(unsigned long) for a large allocation (debug ramdisk). vmalloc
>> has no limit at 4 GiB, so it seems unreasonable that:
>>     sgl_alloc_order(unsigned long long length, ...)
>> does. sgl_s made with sgl_alloc_order(chainable=false) have equally
>> sized segments placed in a scatter gather array. That allows O(1)
>> navigation around a big sgl using some simple integer maths.
>>
>> Having previously sent a patch to fix a memory leak in sgl_alloc_order(),
>> take the opportunity to put a one line comment above sgl_free()'s
>> declaration that it is not suitable when order > 0.
>> The mis-use of sgl_free() when order > 0 was the reason for the memory
>> leak. The other users of sgl_alloc_order() in the kernel were checked
>> and found to handle freeing properly.
>>
>> Signed-off-by: Douglas Gilbert
>> ---
>>  include/linux/scatterlist.h | 1 +
>>  lib/scatterlist.c           | 3 ---
>>  2 files changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 45cf7b69d852..80178afc2a4a 100644
>> --- a/include/linux/scatterlist.h
>> +++ b/include/linux/scatterlist.h
>> @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
>>  			      unsigned int *nent_p);
>>  void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
>>  void sgl_free_order(struct scatterlist *sgl, int order);
>> +/* Only use sgl_free() when order is 0 */
>>  void sgl_free(struct scatterlist *sgl);
>>
>>  #endif /* CONFIG_SGL_ALLOC */
>>
>> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
>> index c448642e0f78..d5770e7f1030 100644
>> --- a/lib/scatterlist.c
>> +++ b/lib/scatterlist.c
>> @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
>>  	u32 elem_len;
>>
>>  	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
>> -	/* Check for integer overflow */
>> -	if (length > (nent << (PAGE_SHIFT + order)))
>> -		return NULL;
>>  	nalloc = nent;
>>  	if (chainable) {
>>  		/* Check for integer overflow */
tools/perf: noise from check-headers.sh
Executing that script in linux-stable [lk 5.10.0-rc1] gives the following output:

Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/fscrypt.h' differs from latest version at 'include/uapi/linux/fscrypt.h'
diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from latest version at 'include/uapi/linux/kvm.h'
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/mount.h' differs from latest version at 'include/uapi/linux/mount.h'
diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/prctl.h' differs from latest version at 'include/uapi/linux/prctl.h'
diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h'
diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/required-features.h' differs from latest version at 'arch/x86/include/asm/required-features.h'
diff -u tools/arch/x86/include/asm/required-features.h arch/x86/include/asm/required-features.h
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
Warning:
Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h'
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/kvm.h' differs from latest version at 'arch/x86/include/uapi/asm/kvm.h'
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/arch/x86/include/uapi/asm/svm.h' differs from latest version at 'arch/x86/include/uapi/asm/svm.h'
diff -u tools/arch/x86/include/uapi/asm/svm.h arch/x86/include/uapi/asm/svm.h
Warning: Kernel ABI header at 'tools/arch/s390/include/uapi/asm/sie.h' differs from latest version at 'arch/s390/include/uapi/asm/sie.h'
diff -u tools/arch/s390/include/uapi/asm/sie.h arch/s390/include/uapi/asm/sie.h
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/mman.h' differs from latest version at 'include/uapi/linux/mman.h'
diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
Warning: Kernel ABI header at 'tools/perf/util/hashmap.h' differs from latest version at 'tools/lib/bpf/hashmap.h'
diff -u tools/perf/util/hashmap.h tools/lib/bpf/hashmap.h
Warning: Kernel ABI header at 'tools/perf/util/hashmap.c' differs from latest version at
'tools/lib/bpf/hashmap.c'
diff -u tools/perf/util/hashmap.c tools/lib/bpf/hashmap.c

There was a bit of noise in lk 5.9.0-rc1 but it is considerably worse now.

Doug Gilbert
[PATCH v3 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl_s), which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data, then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with an sgl to sgl copy. Copying stops when the first of the requested number of bytes, the source sgl, or the destination sgl is exhausted. So the destination sgl will _not_ grow.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 75 +++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 80178afc2a4a..6649414c0749 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5770e7f1030..1f9e093ad7da 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -974,3 +974,78 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl:	Destination sgl
+ * @d_nents:	Number of SG entries in destination sgl
+ * @d_skip:	Number of bytes to skip in destination before starting
+ * @s_sgl:	Source sgl
+ * @s_nents:	Number of SG entries in source sgl
+ * @s_skip:	Number of bytes to skip in source before starting
+ * @n_bytes:	The (maximum) number of bytes to copy
+ *
+ * Returns:
+ *   The number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with
+ *   memcpy().
+ *
+ *   Stops copying if either d_sgl, s_sgl or n_bytes is exhausted.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   Large copies are broken into copy segments whose sizes may vary. Those
+ *   copy segment sizes are chosen by the min3() statement in the code below.
+ *   Since SG_MITER_ATOMIC is used for both sides, each copy segment is
+ *   started with kmap_atomic() [in sg_miter_next()] and completed with
+ *   kunmap_atomic() [in sg_miter_stop()]. This means pre-emption is
+ *   inhibited for relatively short periods even in very large copies.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&d_iter))
+			break;
+		if (!sg_miter_next(&s_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
+
-- 
2.25.1
[PATCH v3 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example, protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int. It also returns the number of bytes (over)written. Change the implementation of sg_zero_buffer() to call this new function.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 ++
 lib/scatterlist.c           | 65 ++++++++++++++++++++++++-------------
 2 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ae260dc5fedb..a40012c8a4e6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		     size_t n_bytes);
 
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 49185536acba..6b430f7293e0 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -952,26 +952,7 @@ EXPORT_SYMBOL(sg_pcopy_to_buffer);
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip)
 {
-	unsigned int offset = 0;
-	struct sg_mapping_iter miter;
-	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
-
-	sg_miter_start(&miter, sgl, nents, sg_flags);
-
-	if (!sg_miter_skip(&miter, skip))
-		return false;
-
-	while (offset < buflen && sg_miter_next(&miter)) {
-		unsigned int len;
-
-		len = min(miter.length, buflen - offset);
-		memset(miter.addr, 0, len);
-
-		offset += len;
-	}
-
-	sg_miter_stop(&miter);
-	return offset;
+	return sgl_memset(sgl, nents, skip, 0, buflen);
 }
 EXPORT_SYMBOL(sg_zero_buffer);
@@ -1110,3 +1091,47 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
+
-- 
2.25.1
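As with the other functions in this series, the walk sgl_memset() performs is easy to model outside the kernel. The userspace sketch below is a hypothetical analogue for illustration only — struct seg and seg_memset() are invented names, not kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a scatterlist element. */
struct seg {
	unsigned char *addr;
	size_t length;
};

/* Userspace analogue of sgl_memset(): skip 'skip' bytes, then write 'val'
 * up to n_bytes times, stopping early if the segment list runs out.
 * Returns the number of bytes actually written, as the patch does. */
static size_t seg_memset(struct seg *segs, unsigned int nents, size_t skip,
			 unsigned char val, size_t n_bytes)
{
	size_t offset = 0;
	unsigned int i;

	for (i = 0; i < nents && offset < n_bytes; i++) {
		size_t len = segs[i].length;

		if (skip >= len) {	/* this element is skipped entirely */
			skip -= len;
			continue;
		}
		len -= skip;
		if (len > n_bytes - offset)
			len = n_bytes - offset;
		memset(segs[i].addr + skip, val, len);
		skip = 0;
		offset += len;
	}
	return offset;
}
```

Passing n_bytes = SIZE_MAX fills from skip to the end of the list and returns the number of bytes written, matching the SIZE_MAX note in the kernel-doc above.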
[PATCH v3 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v2 [posted 20201018]:
- remove unneeded lines from the sgl_memset() definition.
- change sg_zero_buffer() to call sgl_memset() as the former is a subset of the latter.

Changes since v1 [posted 20201016]:
- Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() [called via sg_miter_stop()] calls and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" at a relatively small performance cost.
- the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned since it may be useful to the caller.

This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 186 +---
 2 files changed, 184 insertions(+), 14 deletions(-)

-- 
2.25.1
[PATCH v3 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 +++ lib/scatterlist.c | 61 + 2 files changed, 65 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 1f9e093ad7da..49185536acba 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1049,3 +1049,64 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski } EXPORT_SYMBOL(sgl_copy_sgl); +/** + * sgl_compare_sgl - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents:Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents:Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes:The (maximum) number of bytes to compare + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing). 
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (equ && offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	sg_miter_stop(&y_iter);
+	sg_miter_stop(&x_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
-- 
2.25.1
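The compare walk above — chunk both lists by min3(x_len, y_len, bytes-remaining) and stop at the first miscompare — can be sketched in userspace as well. Everything below (struct seg, seg_equal()) is a hypothetical stand-in for the scatterlist machinery, not kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for a scatterlist element. */
struct seg {
	const unsigned char *addr;
	size_t length;
};

/* Userspace analogue of sgl_compare_sgl(): compare two segmented buffers
 * chunk by chunk (each chunk bounded as in the patch's min3() step) and
 * stop at the first miscompare. Returns 1 on equality over n_bytes, 0
 * otherwise (including when either list is exhausted early). */
static int seg_equal(const struct seg *x, unsigned int x_n,
		     const struct seg *y, unsigned int y_n, size_t n_bytes)
{
	unsigned int xi = 0, yi = 0;
	size_t x_off = 0, y_off = 0, done = 0;

	while (done < n_bytes && xi < x_n && yi < y_n) {
		size_t x_len = x[xi].length - x_off;
		size_t y_len = y[yi].length - y_off;
		size_t len = x_len < y_len ? x_len : y_len;

		if (len > n_bytes - done)
			len = n_bytes - done;
		if (memcmp(x[xi].addr + x_off, y[yi].addr + y_off, len))
			return 0;	/* first miscompare stops the walk */
		done += len;
		x_off += len;
		y_off += len;
		if (x_off == x[xi].length) {
			xi++;
			x_off = 0;
		}
		if (y_off == y[yi].length) {
			yi++;
			y_off = 0;
		}
	}
	return done >= n_bytes;
}
```

Note that, as the kernel-doc says of sgl_compare_sgl(), the arguments are symmetrical: swapping x and y gives the same result, and a mismatch lying beyond n_bytes is never examined.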
[PATCH v3 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The "Check for integer overflow" comment above the removed code arguably gives a false sense of security. The right hand side of the expression in the condition resolves to a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount either. If that was the intention then the comment could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) <>;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB, so it seems unreasonable that sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for that memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 3 --- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 45cf7b69d852..80178afc2a4a 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c448642e0f78..d5770e7f1030 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; nalloc = nent; if (chainable) { /* Check for integer overflow */ -- 2.25.1
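The commit message's O(1) navigation claim can be made concrete: with chainable=false every element of the sgl holds elem_len = PAGE_SIZE << order bytes, so a byte offset maps to an (element index, intra-element offset) pair with a single division. A small userspace sketch, with hypothetical names and page_size passed in as a parameter rather than taken from the kernel's PAGE_SIZE:

```c
#include <stddef.h>

/* Position inside an sgl whose elements are all the same size. */
struct sgl_pos {
	size_t index;	/* which scatterlist element */
	size_t offset;	/* byte offset inside that element */
};

/* O(1) mapping from a byte offset into an sgl whose elements each hold
 * page_size << order bytes, as sgl_alloc_order(chainable=false) produces. */
static struct sgl_pos sgl_locate(size_t byte_off, size_t page_size,
				 unsigned int order)
{
	size_t elem_len = page_size << order;
	struct sgl_pos pos = { byte_off / elem_len, byte_off % elem_len };

	return pos;
}
```

With 4 KiB pages and order 2, each element holds 16384 bytes, so byte offset 40000 lands in element 2 at intra-element offset 7232 — no walk over the preceding elements is needed.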
[PATCH v2 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 74 + 2 files changed, 78 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 80178afc2a4a..6649414c0749 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d5770e7f1030..a0a86059c10e 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -974,3 +974,77 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents:Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before starting + * @s_sgl: Source sgl + * @s_nents:Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before starting + * @n_bytes:The (maximum) number of bytes to copy + * + * Returns the number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). 
+ * + * Stops copying if either d_sgl, s_sgl or n_bytes is exhausted. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * Large copies are broken into copy segments whose sizes may vary. Those + * copy segment sizes are chosen by the min3() statement in the code below. + * Since SG_MITER_ATOMIC is used for both sides, each copy segment is started + * with kmap_atomic() [in sg_miter_next()] and completed with kunmap_atomic() + * [in sg_miter_stop()]. This means pre-emption is inhibited for relatively + * short periods even in very large copies. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter, s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+
+	while (offset < n_bytes) {
+		if (!sg_miter_next(&s_iter))
+			break;
+		if (!sg_miter_next(&d_iter))
+			break;
+		len = min3(d_iter.length, s_iter.length, n_bytes - offset);
+
+		memcpy(d_iter.addr, s_iter.addr, len);
+		offset += len;
+		/* LIFO order (stop d_iter before s_iter) needed with SG_MITER_ATOMIC */
+		d_iter.consumed = len;
+		sg_miter_stop(&d_iter);
+		s_iter.consumed = len;
+		sg_miter_stop(&s_iter);
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
+
-- 
2.25.1
[PATCH v2 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. The patchset adds sgl_copy_sgl() and two other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

Changes since v1 [posted 20201016]:
- Bodo Stroesser pointed out a problem with the nesting of kmap_atomic() [called via sg_miter_next()] and kunmap_atomic() [called via sg_miter_stop()] calls and proposed a solution that simplifies the previous code.
- the new implementation of the three functions has shorter periods when pre-emption is disabled (but has more of them). This should make operations on large sgl_s more pre-emption "friendly" at a relatively small performance cost.
- the sgl_memset() return type changed from void to size_t: the number of bytes actually (over)written. That number is needed internally anyway, so it may as well be returned since it may be useful to the caller.
This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 204 +++-
 2 files changed, 213 insertions(+), 3 deletions(-)

-- 
2.25.1
[PATCH v2 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests sgl_memset() is modelled on memset(). One difference is the type of the val argument which is u8 rather than int. Plus it returns the number of bytes (over)written. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 3 +++ lib/scatterlist.c | 54 ++--- 2 files changed, 54 insertions(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index ae260dc5fedb..a40012c8a4e6 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, size_t n_bytes); +size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip, + u8 val, size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d910776a4c96..a704039ab54d 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -985,7 +985,8 @@ EXPORT_SYMBOL(sg_zero_buffer); * @s_skip: Number of bytes to skip in source before starting * @n_bytes:The (maximum) number of bytes to copy * - * Returns the number of copied bytes. + * Returns: + * The number of copied bytes. * * Notes: * Destination arguments appear before the source arguments, as with memcpy(). @@ -1058,8 +1059,9 @@ EXPORT_SYMBOL(sgl_copy_sgl); * @y_skip: Number of bytes to skip in y (right) before starting * @n_bytes:The (maximum) number of bytes to compare * - * Returns true if x and y compare equal before x, y or n_bytes is exhausted. - * Otherwise on a miscompare, returns false (and stops comparing). + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. 
+ *   Otherwise on a miscompare, returns false (and stops comparing).
  *
  * Notes:
  * x and y are symmetrical: they can be swapped and the result is the same.
@@ -1108,3 +1110,49 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_sk
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' up to n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The (maximum) number of bytes to modify
+ *
+ * Returns:
+ *   The number of bytes written.
+ *
+ * Notes:
+ *   Stops writing if either sgl or n_bytes is exhausted. If n_bytes is
+ *   set to SIZE_MAX then val will be written to each byte until the end
+ *   of sgl.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+size_t sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		  u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&miter, sgl, nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+		miter.consumed = len;
+		sg_miter_stop(&miter);
+	}
+fini:
+	sg_miter_stop(&miter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_memset);
+
-- 
2.25.1
[PATCH v2 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The "Check for integer overflow" comment above the removed code arguably gives a false sense of security. The right hand side of the expression in the condition resolves to a u32, so it cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount either. If that was the intention then the comment could be dropped and the condition rewritten more clearly as: if (length > UINT32_MAX) <>;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB, so it seems unreasonable that sgl_alloc_order(unsigned long long length, ) does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for that memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 1 + lib/scatterlist.c | 3 --- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 45cf7b69d852..80178afc2a4a 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp, unsigned int *nent_p); void sgl_free_n_order(struct scatterlist *sgl, int nents, int order); void sgl_free_order(struct scatterlist *sgl, int order); +/* Only use sgl_free() when order is 0 */ void sgl_free(struct scatterlist *sgl); #endif /* CONFIG_SGL_ALLOC */ diff --git a/lib/scatterlist.c b/lib/scatterlist.c index c448642e0f78..d5770e7f1030 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length, u32 elem_len; nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order); - /* Check for integer overflow */ - if (length > (nent << (PAGE_SHIFT + order))) - return NULL; nalloc = nent; if (chainable) { /* Check for integer overflow */ -- 2.25.1
[PATCH v2 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 +++ lib/scatterlist.c | 60 + 2 files changed, 64 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index a0a86059c10e..d910776a4c96 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -1048,3 +1048,63 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski } EXPORT_SYMBOL(sgl_copy_sgl); +/** + * sgl_compare_sgl - Compare x and y (both sgl_s) + * @x_sgl: x (left) sgl + * @x_nents:Number of SG entries in x (left) sgl + * @x_skip: Number of bytes to skip in x (left) before starting + * @y_sgl: y (right) sgl + * @y_nents:Number of SG entries in y (right) sgl + * @y_skip: Number of bytes to skip in y (right) before starting + * @n_bytes:The (maximum) number of bytes to compare + * + * Returns: + * true if x and y compare equal before x, y or n_bytes is exhausted. + * Otherwise on a miscompare, returns false (and stops comparing).
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   The notes in sgl_copy_sgl() about large sgl_s apply here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter, y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	while (equ && offset < n_bytes) {
+		if (!sg_miter_next(&x_iter))
+			break;
+		if (!sg_miter_next(&y_iter))
+			break;
+		len = min3(x_iter.length, y_iter.length, n_bytes - offset);
+
+		equ = !memcmp(x_iter.addr, y_iter.addr, len);
+		offset += len;
+		/* LIFO order is important when SG_MITER_ATOMIC is used */
+		y_iter.consumed = len;
+		sg_miter_stop(&y_iter);
+		x_iter.consumed = len;
+		sg_miter_stop(&x_iter);
+	}
+fini:
+	sg_miter_stop(&y_iter);
+	sg_miter_stop(&x_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
-- 
2.25.1
Re: [PATCH 2/4] scatterlist: add sgl_copy_sgl() function
On 2020-10-16 7:17 a.m., Bodo Stroesser wrote: Hi Douglas, AFAICS this patch - and also patch 3 - are not correct. When started with SG_MITER_ATOMIC, sg_miter_next and sg_miter_stop use the k(un)map_atomic calls. But these have to be used strictly nested according to docu and code. The below code uses the atomic mappings in overlapping mode. That being the case, I'll add d_flags and s_flags arguments that are expected to take either 0 or SG_MITER_ATOMIC and re-test. There probably should be a warning in the notes not to set both d_flags and s_flags to SG_MITER_ATOMIC. My testing to date has not been in irq or soft interrupt state. I should be able to rig a test for the latter. Thanks Doug Gilbert Am 16.10.20 um 06:52 schrieb Douglas Gilbert: Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 86 + 2 files changed, 90 insertions(+) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 80178afc2a4a..6649414c0749 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents, size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, size_t buflen, off_t skip); +size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip, + struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, + size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c index d5770e7f1030..1ec2c909c8d4 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -974,3 +974,89 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents, return offset; } EXPORT_SYMBOL(sg_zero_buffer); + +/** + * sgl_copy_sgl - Copy over a destination sgl from a source sgl + * @d_sgl: Destination sgl + * @d_nents: Number of SG entries in destination sgl + * @d_skip: Number of bytes to skip in destination before copying + * @s_sgl: Source sgl + * @s_nents: Number of SG entries in source sgl + * @s_skip: Number of bytes to skip in source before copying + * @n_bytes: The number of bytes to copy + * + * Returns the number of copied bytes. + * + * Notes: + * Destination arguments appear before the source arguments, as with memcpy(). + * + * Stops copying if the end of d_sgl or s_sgl is reached. + * + * Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong + * to the same sgl and the copy regions overlap) are not supported. + * + * If d_skip is large, potentially spanning multiple d_nents then some + * integer arithmetic to adjust d_sgl may improve performance. For example + * if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl + * will be an array with equally sized segments facilitating that + * arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well. 
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t d_off, s_off, len, d_len, s_len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter;
+	struct sg_mapping_iter s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+
+	for (d_off = 0, s_off = 0; true ; ) {
+		/* Assume d_iter.length and s_iter.length can never be 0 */
+		if (d_off == 0) {
+			if (!sg_miter_next(&d_iter))
+				break;
+			d_len = d_iter.length;
+		} else {
+			d_len = d_iter.length - d_off;
+		}
+		if (s_off == 0) {
+			if (!sg_miter_next(&s_iter))
+				break;
+			s_len = s_iter.length;
+		} else {
+			s_len = s_iter.length - s_off;
+		}
+		len = min3(d_len, s_len, n_bytes - offset);
+
+		memcpy(d_iter.addr + d_off, s_iter.addr + s_off, len);
+		offset += len;
+		if (offset >= n_bytes)
+			break;
+		if (d_len == s_len) {
+			d_off = 0;
+			s_off = 0;
+
[PATCH 3/4] scatterlist: add sgl_compare_sgl() function
After enabling copies between scatter gather lists (sgl_s), another storage related operation is to compare two sgl_s. This new function is modelled on NVMe's Compare command and the SCSI VERIFY(BYTCHK=1) command. Like memcmp() this function returns false on the first miscompare and stops comparing. Signed-off-by: Douglas Gilbert --- include/linux/scatterlist.h | 4 ++ lib/scatterlist.c | 84 - 2 files changed, 86 insertions(+), 2 deletions(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6649414c0749..ae260dc5fedb 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -325,6 +325,10 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_ski struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip, size_t n_bytes); +bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip, +struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip, +size_t n_bytes); + /* * Maximum number of entries that will be allocated in one piece, if * a list larger than this is required then chaining will be utilized. diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 1ec2c909c8d4..344725990b9d 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -979,10 +979,10 @@ EXPORT_SYMBOL(sg_zero_buffer); * sgl_copy_sgl - Copy over a destination sgl from a source sgl * @d_sgl: Destination sgl * @d_nents:Number of SG entries in destination sgl - * @d_skip: Number of bytes to skip in destination before copying + * @d_skip: Number of bytes to skip in destination before starting * @s_sgl: Source sgl * @s_nents:Number of SG entries in source sgl - * @s_skip: Number of bytes to skip in source before copying + * @s_skip: Number of bytes to skip in source before starting * @n_bytes:The number of bytes to copy * * Returns the number of copied bytes.
@@ -1060,3 +1060,83 @@ size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
 }
 EXPORT_SYMBOL(sgl_copy_sgl);
+
+/**
+ * sgl_compare_sgl - Compare x and y (both sgl_s)
+ * @x_sgl: x (left) sgl
+ * @x_nents: Number of SG entries in x (left) sgl
+ * @x_skip: Number of bytes to skip in x (left) before starting
+ * @y_sgl: y (right) sgl
+ * @y_nents: Number of SG entries in y (right) sgl
+ * @y_skip: Number of bytes to skip in y (right) before starting
+ * @n_bytes: The number of bytes to compare
+ *
+ * Returns true if x and y compare equal before x, y or n_bytes is exhausted.
+ * Otherwise on a miscompare, returns false (and stops comparing).
+ *
+ * Notes:
+ *   x and y are symmetrical: they can be swapped and the result is the same.
+ *
+ *   Implementation is based on memcmp(). x and y segments may overlap.
+ *
+ *   Same comment from sgl_copy_sgl() about large _skip arguments applies
+ *   here as well.
+ *
+ **/
+bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
+{
+	bool equ = true;
+	size_t x_off, y_off, len, x_len, y_len;
+	size_t offset = 0;
+	struct sg_mapping_iter x_iter;
+	struct sg_mapping_iter y_iter;
+
+	if (n_bytes == 0)
+		return true;
+	sg_miter_start(&x_iter, x_sgl, x_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	sg_miter_start(&y_iter, y_sgl, y_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&x_iter, x_skip))
+		goto fini;
+	if (!sg_miter_skip(&y_iter, y_skip))
+		goto fini;
+
+	for (x_off = 0, y_off = 0; true ; ) {
+		/* Assume x_iter.length and y_iter.length can never be 0 */
+		if (x_off == 0) {
+			if (!sg_miter_next(&x_iter))
+				break;
+			x_len = x_iter.length;
+		} else {
+			x_len = x_iter.length - x_off;
+		}
+		if (y_off == 0) {
+			if (!sg_miter_next(&y_iter))
+				break;
+			y_len = y_iter.length;
+		} else {
+			y_len = y_iter.length - y_off;
+		}
+		len = min3(x_len, y_len, n_bytes - offset);
+
+		equ = memcmp(x_iter.addr + x_off, y_iter.addr + y_off, len) == 0;
+		offset += len;
+		if (!equ || offset >= n_bytes)
+			break;
+		if (x_len == y_len) {
+			x_off = 0;
+			y_off = 0;
+		} else if (x_len < y_len) {
+			x_off = 0;
+			y_off += len;
+		} else {
+			x_off += len;
+			y_off = 0;
+		}
+	}
+fini:
+	sg_miter_stop(&x_iter);
+	sg_miter_stop(&y_iter);
+	return equ;
+}
+EXPORT_SYMBOL(sgl_compare_sgl);
[PATCH 4/4] scatterlist: add sgl_memset()
The existing sg_zero_buffer() function is a bit restrictive. For example protection information (PI) blocks are usually initialized to 0xff bytes. As its name suggests, sgl_memset() is modelled on memset(). One difference is the type of the val argument, which is u8 rather than int.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  3 +++
 lib/scatterlist.c           | 39 +++++++++++++++++++++++++++++++++++---
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ae260dc5fedb..e50dc9a6d887 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -329,6 +329,9 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
 		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
 		     size_t n_bytes);
 
+void sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		u8 val, size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 344725990b9d..3ca66f0c949f 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -1083,8 +1083,8 @@ EXPORT_SYMBOL(sgl_copy_sgl);
  *
  **/
 bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
-struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
-size_t n_bytes)
+		     struct scatterlist *y_sgl, unsigned int y_nents, off_t y_skip,
+		     size_t n_bytes)
 {
 	bool equ = true;
 	size_t x_off, y_off, len, x_len, y_len;
@@ -1140,3 +1140,38 @@ bool sgl_compare_sgl(struct scatterlist *x_sgl, unsigned int x_nents, off_t x_skip,
 	return equ;
 }
 EXPORT_SYMBOL(sgl_compare_sgl);
+
+/**
+ * sgl_memset - set byte 'val' n_bytes times on SG list
+ * @sgl: The SG list
+ * @nents: Number of SG entries in sgl
+ * @skip: Number of bytes to skip before starting
+ * @val: byte value to write to sgl
+ * @n_bytes: The number of bytes to modify
+ *
+ * Notes:
+ *   Writes val n_bytes times or until sgl is exhausted.
+ *
+ **/
+void sgl_memset(struct scatterlist *sgl, unsigned int nents, off_t skip,
+		u8 val, size_t n_bytes)
+{
+	size_t offset = 0;
+	size_t len;
+	struct sg_mapping_iter miter;
+	unsigned int sg_flags = SG_MITER_ATOMIC | SG_MITER_TO_SG;
+
+	if (n_bytes == 0)
+		return;
+	sg_miter_start(&miter, sgl, nents, sg_flags);
+	if (!sg_miter_skip(&miter, skip))
+		goto fini;
+
+	while ((offset < n_bytes) && sg_miter_next(&miter)) {
+		len = min(miter.length, n_bytes - offset);
+		memset(miter.addr, val, len);
+		offset += len;
+	}
+fini:
+	sg_miter_stop(&miter);
+}
--
2.25.1
[PATCH 0/4] scatterlist: add new capabilities
Scatter-gather lists (sgl_s) are frequently used as data carriers in the block layer. For example the SCSI and NVMe subsystems interchange data with the block layer using sgl_s. The sgl API is declared in <linux/scatterlist.h>.

The author has extended these transient sgl use cases to a store (i.e. a ramdisk) in the scsi_debug driver. Other new potential uses of sgl_s could be for caches. When this extra step is taken, the need to copy between sgl_s becomes apparent. This patchset adds sgl_copy_sgl() and a few other sgl operations.

The existing sgl_alloc_order() function can be seen as a replacement for vmalloc() for large, long-term allocations. For what seems like no good reason, sgl_alloc_order() currently restricts its total allocation to less than or equal to 4 GiB. vmalloc() has no such restriction.

This patchset is against lk 5.9.0

Douglas Gilbert (4):
  sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
  scatterlist: add sgl_copy_sgl() function
  scatterlist: add sgl_compare_sgl() function
  scatterlist: add sgl_memset()

 include/linux/scatterlist.h |  12 +++
 lib/scatterlist.c           | 204 +++++++++++++++++++++++++++++++++++-
 2 files changed, 213 insertions(+), 3 deletions(-)

--
2.25.1
[PATCH 1/4] sgl_alloc_order: remove 4 GiB limit, sgl_free() warning
This patch removes a check done by sgl_alloc_order() before it starts any allocations. The comment before the removed code, "Check for integer overflow", arguably gives a false sense of security. The right hand side of the expression in the condition is resolved as u32 so cannot exceed UINT32_MAX (4 GiB), which means 'length' cannot exceed that amount. If that was the intention then the comment could be dropped and the condition rewritten more clearly as:

    if (length > UINT32_MAX)
        return NULL;

The author's intention is to use sgl_alloc_order() to replace vmalloc(unsigned long) for a large allocation (a debug ramdisk). vmalloc() has no limit at 4 GiB so it seems unreasonable that:

    sgl_alloc_order(unsigned long long length, )

does. sgl_s made with sgl_alloc_order(chainable=false) have equally sized segments placed in a scatter gather array. That allows O(1) navigation around a big sgl using some simple integer maths.

Having previously sent a patch to fix a memory leak in sgl_alloc_order(), take the opportunity to put a one line comment above sgl_free()'s declaration noting that it is not suitable when order > 0. The misuse of sgl_free() when order > 0 was the reason for the memory leak. The other users of sgl_alloc_order() in the kernel were checked and found to handle freeing properly.
Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h | 1 +
 lib/scatterlist.c           | 3 ---
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..80178afc2a4a 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -302,6 +302,7 @@ struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
 			      unsigned int *nent_p);
 void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
+/* Only use sgl_free() when order is 0 */
 void sgl_free(struct scatterlist *sgl);
 #endif /* CONFIG_SGL_ALLOC */
 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c448642e0f78..d5770e7f1030 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -493,9 +493,6 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 	u32 elem_len;
 
 	nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
-	/* Check for integer overflow */
-	if (length > (nent << (PAGE_SHIFT + order)))
-		return NULL;
 	nalloc = nent;
 	if (chainable) {
 		/* Check for integer overflow */
--
2.25.1
[PATCH 2/4] scatterlist: add sgl_copy_sgl() function
Both the SCSI and NVMe subsystems receive user data from the block layer in scatterlist_s (aka scatter gather lists (sgl) which are often arrays). If drivers in those subsystems represent storage (e.g. a ramdisk) or cache "hot" user data then they may also choose to use scatterlist_s. Currently there are no sgl to sgl operations in the kernel. Start with a copy.

Signed-off-by: Douglas Gilbert
---
 include/linux/scatterlist.h |  4 ++
 lib/scatterlist.c           | 86 +++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 80178afc2a4a..6649414c0749 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -321,6 +321,10 @@ size_t sg_pcopy_to_buffer(struct scatterlist *sgl, unsigned int nents,
 size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 		      size_t buflen, off_t skip);
 
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes);
+
 /*
  * Maximum number of entries that will be allocated in one piece, if
  * a list larger than this is required then chaining will be utilized.
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5770e7f1030..1ec2c909c8d4 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -974,3 +974,89 @@ size_t sg_zero_buffer(struct scatterlist *sgl, unsigned int nents,
 	return offset;
 }
 EXPORT_SYMBOL(sg_zero_buffer);
+
+/**
+ * sgl_copy_sgl - Copy over a destination sgl from a source sgl
+ * @d_sgl: Destination sgl
+ * @d_nents: Number of SG entries in destination sgl
+ * @d_skip: Number of bytes to skip in destination before copying
+ * @s_sgl: Source sgl
+ * @s_nents: Number of SG entries in source sgl
+ * @s_skip: Number of bytes to skip in source before copying
+ * @n_bytes: The number of bytes to copy
+ *
+ * Returns the number of copied bytes.
+ *
+ * Notes:
+ *   Destination arguments appear before the source arguments, as with memcpy().
+ *
+ *   Stops copying if the end of d_sgl or s_sgl is reached.
+ *
+ *   Since memcpy() is used, overlapping copies (where d_sgl and s_sgl belong
+ *   to the same sgl and the copy regions overlap) are not supported.
+ *
+ *   If d_skip is large, potentially spanning multiple d_nents then some
+ *   integer arithmetic to adjust d_sgl may improve performance. For example
+ *   if d_sgl is built using sgl_alloc_order(chainable=false) then the sgl
+ *   will be an array with equally sized segments facilitating that
+ *   arithmetic. The suggestion applies to s_skip, s_sgl and s_nents as well.
+ *
+ **/
+size_t sgl_copy_sgl(struct scatterlist *d_sgl, unsigned int d_nents, off_t d_skip,
+		    struct scatterlist *s_sgl, unsigned int s_nents, off_t s_skip,
+		    size_t n_bytes)
+{
+	size_t d_off, s_off, len, d_len, s_len;
+	size_t offset = 0;
+	struct sg_mapping_iter d_iter;
+	struct sg_mapping_iter s_iter;
+
+	if (n_bytes == 0)
+		return 0;
+	sg_miter_start(&d_iter, d_sgl, d_nents, SG_MITER_ATOMIC | SG_MITER_TO_SG);
+	sg_miter_start(&s_iter, s_sgl, s_nents, SG_MITER_ATOMIC | SG_MITER_FROM_SG);
+	if (!sg_miter_skip(&d_iter, d_skip))
+		goto fini;
+	if (!sg_miter_skip(&s_iter, s_skip))
+		goto fini;
+
+	for (d_off = 0, s_off = 0; true ; ) {
+		/* Assume d_iter.length and s_iter.length can never be 0 */
+		if (d_off == 0) {
+			if (!sg_miter_next(&d_iter))
+				break;
+			d_len = d_iter.length;
+		} else {
+			d_len = d_iter.length - d_off;
+		}
+		if (s_off == 0) {
+			if (!sg_miter_next(&s_iter))
+				break;
+			s_len = s_iter.length;
+		} else {
+			s_len = s_iter.length - s_off;
+		}
+		len = min3(d_len, s_len, n_bytes - offset);
+
+		memcpy(d_iter.addr + d_off, s_iter.addr + s_off, len);
+		offset += len;
+		if (offset >= n_bytes)
+			break;
+		if (d_len == s_len) {
+			d_off = 0;
+			s_off = 0;
+		} else if (d_len < s_len) {
+			d_off = 0;
+			s_off += len;
+		} else {
+			d_off += len;
+			s_off = 0;
+		}
+	}
+fini:
+	sg_miter_stop(&d_iter);
+	sg_miter_stop(&s_iter);
+	return offset;
+}
+EXPORT_SYMBOL(sgl_copy_sgl);
--
2.25.1
[RESEND PATCH] sgl_alloc_order: fix memory leak
sgl_alloc_order() can fail when 'length' is large on a memory constrained system. When order > 0 it will potentially be making several multi-page allocations with the later ones more likely to fail than the earlier ones. So it is important that sgl_alloc_order() frees up any pages it has obtained before returning NULL. In the case when order > 0 it calls the wrong free page function and leaks. In testing the leak was sufficient to bring down my 8 GiB laptop with OOM.

Reviewed-by: Bart Van Assche
Signed-off-by: Douglas Gilbert
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..c448642e0f78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -514,7 +514,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 		elem_len = min_t(u64, length, PAGE_SIZE << order);
 		page = alloc_pages(gfp, order);
 		if (!page) {
-			sgl_free(sgl);
+			sgl_free_order(sgl, order);
 			return NULL;
 		}
--
2.25.1
Re: [question] What happens when dd writes data to a missing device?
On 2020-10-11 3:46 p.m., Mikhail Gavrilov wrote:

Hi folks! I have a question. What happens when dd writes data to a missing device? For example:
# dd if=/home/mikhail/Downloads/Fedora-Workstation-Live-x86_64-Rawhide-20201010.n.0.iso of=/dev/adb
Today I wrongly entered /dev/adb instead of /dev/sdb, and what my surprise was when the data began to be written to the /dev/adb device without errors. But my surprise was even greater when cat /dev/adb started to display the written data. I have a question: Where was the data written, and could it damage the data stored in memory or on disk?

Others have answered your direct question. You may find 'oflag=nocreat' helpful if you (or others) do _not_ want a regular file created in /dev ; for example: if you have misspelt a device name. That flag may also be helpful on unstable systems (e.g. where device nodes are disappearing and re-appearing) as it can be a real pain if you manage to create a regular file with a name like /dev/sdc when the disk usually occupying that node is temporarily offline. When that disk comes back online, the regular file '/dev/sdc' will stop device node '/dev/sdc' from being created. The solution is to remove the regular file /dev/sdc, and you probably need to power cycle that disk. If this becomes a regular event then 'oflag=nocreat' is your friend [see 'man dd' for a little more information; it really should be expanded].

Doug Gilbert
Re: [PATCH] lib/scatterlist: Fix memory leak in sgl_alloc_order()
On 2020-09-20 4:11 p.m., Markus Elfring wrote:

Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

* Would an imperative wording become helpful for the change description?

… … and the term "imperative wording" rings no bells in my grammatical education. …

I suggest to take another look at the published Linux development documentation. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n151

* How do you think about to add the tag “Fixes” to the commit message?

In the workflow I'm used to, others (closer to LT) make that decision. Why waste my time?

I find another bit of guidance relevant. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n183

* Will an other patch subject be more appropriate?

Twas testing a 6 GB allocation with said function on my 8 GB laptop. It failed and free told me 5 GB had disappeared …

… Have we got any different expectations for the canonical patch subject line? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=bdcf11de8f776152c82d2197b255c2d04603f976#n684

I am curious how the software will evolve further also according to your system test experiences.

Sorry, I didn't come down in the last shower, it's not my first bug fix. Try consulting 'git log' and look for my name, or the MAINTAINERS file. The culprits are usually happy, as was the case with this patch. It's ack-ed and I would be very surprised if Jens Axboe doesn't accept it. It is an obvious flaw. Fix it and move on. Alternatively supply your own patch that ticks all the above boxes.
If you want to talk about something substantial, then why do we have a function named sgl_free() that only works properly if, for example, the sgl_alloc_order() function creating the sgl used order==0 ? IMO sgl_free() should be removed or renamed. Doug Gilbert BTW The "imperative mood" stuff in that document is nonsense, at least in English. Wikipedia maps that term back to "the imperative" as in "Get thee to a nunnery" and "Et tu, Brute".
Re: [PATCH] lib/scatterlist: Fix memory leak in sgl_alloc_order()
On 2020-09-20 1:09 p.m., Markus Elfring wrote:

Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

* Would an imperative wording become helpful for the change description?

No passive tense there. Or do you mean usage like: "Go to hell" or "Fix memory leak in ..."? I studied French and Latin at school; at a guess, my mother tongue got its grammar from the former. My mother taught English grammar and the term "imperative wording" rings no bells in my grammatical education. Google agrees with me. Please define: "imperative wording".

* How do you think about to add the tag “Fixes” to the commit message?

In the workflow I'm used to, others (closer to LT) make that decision. Why waste my time?

* Will an other patch subject be more appropriate?

Twas testing a 6 GB allocation with said function on my 8 GB laptop. It failed and free told me 5 GB had disappeared (and 'cat /sys/kernel/debug/kmemleak' told me _nothing_). Umm, it is potentially a HUGE f@#$ing memory LEAK! Best to call a spade a spade.

Doug Gilbert
[PATCH] sgl_alloc_order: memory leak
Noticed that when sgl_alloc_order() failed with order > 0 that free memory on my machine shrank. That function shouldn't call sgl_free() on its error path since that is only correct when order==0 .

Signed-off-by: Douglas Gilbert
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..c448642e0f78 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -514,7 +514,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long length,
 		elem_len = min_t(u64, length, PAGE_SIZE << order);
 		page = alloc_pages(gfp, order);
 		if (!page) {
-			sgl_free(sgl);
+			sgl_free_order(sgl, order);
 			return NULL;
 		}
--
2.25.1
[PATCH] tools/io_uring: fix compile breakage
It would seem none of the kernel continuous integration does this:

    $ cd tools/io_uring
    $ make

Otherwise it may have noticed:

cc -Wall -Wextra -g -D_GNU_SOURCE -c -o io_uring-bench.o io_uring-bench.c
io_uring-bench.c:133:12: error: static declaration of ‘gettid’ follows non-static declaration
  133 | static int gettid(void)
      |            ^~~~~~
In file included from /usr/include/unistd.h:1170,
                 from io_uring-bench.c:27:
/usr/include/x86_64-linux-gnu/bits/unistd_ext.h:34:16: note: previous declaration of ‘gettid’ was here
   34 | extern __pid_t gettid (void) __THROW;
      |                ^~~~~~
make: *** [<builtin>: io_uring-bench.o] Error 1

The problem on Ubuntu 20.04 (with lk 5.9.0-rc5) is that unistd.h already declares gettid(). So prefix the local definition with "lk_".

Signed-off-by: Douglas Gilbert
---
 tools/io_uring/io_uring-bench.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/io_uring/io_uring-bench.c b/tools/io_uring/io_uring-bench.c
index 0f257139b003..7703f0118385 100644
--- a/tools/io_uring/io_uring-bench.c
+++ b/tools/io_uring/io_uring-bench.c
@@ -130,7 +130,7 @@ static int io_uring_register_files(struct submitter *s)
 					s->nr_files);
 }
 
-static int gettid(void)
+static int lk_gettid(void)
 {
 	return syscall(__NR_gettid);
 }
@@ -281,7 +281,7 @@ static void *submitter_fn(void *data)
 	struct io_sq_ring *ring = &s->sq_ring;
 	int ret, prepped;
 
-	printf("submitter=%d\n", gettid());
+	printf("submitter=%d\n", lk_gettid());
 
 	srand48_r(pthread_self(), &s->rand);
--
2.25.1
Re: [PATCH] scsi: clear UAC before sending SG_IO
On 2020-09-10 6:15 a.m., Randall Huang wrote:

Make sure UAC is clear before sending SG_IO.
Signed-off-by: Randall Huang

This patch just looks wrong. Imagine if every LLD front loaded some LLD specific code before each invocation of ioctl(SG_IO). Is UAC Unit Attention Condition? If so, the mid-level notes them as they fly past. Haven't looked at the rest of the patchset but I suspect the "wlun_clr_uac" work needs a rethink. If that is the REPORT LUNS Well known LUN then perhaps it could be handled in the mid-level scanning code. Otherwise it should be handled in the LLD/UFS.

Also users of ioctl(SG_IO) should be capable of handling UAs, even if they are irrelevant, and repeating the invocation. Finally ioctl(sg_dev, SG_IO) is not the only way to send a pass-through command; there are also:
  - write(sg_dev, ...)
  - ioctl(bsg_dev, SG_IO, ...)
  - ioctl(most_blk_devs, SG_IO, ...)
  - ioctl(st_dev, SG_IO, ...)

Hopefully I have convinced you by now not to take this route.

Doug Gilbert

---
 drivers/scsi/sg.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 20472aaaf630..ad11bca47ae8 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -922,6 +922,7 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 	int result, val, read_only;
 	Sg_request *srp;
 	unsigned long iflags;
+	int _cmd;
 
 	SCSI_LOG_TIMEOUT(3, sg_printk(KERN_INFO, sdp,
 				      "sg_ioctl: cmd=0x%x\n", (int) cmd_in));
@@ -933,6 +934,13 @@ sg_ioctl_common(struct file *filp, Sg_device *sdp, Sg_fd *sfp,
 		return -ENODEV;
 	if (!scsi_block_when_processing_errors(sdp->device))
 		return -ENXIO;
+
+	_cmd = SCSI_UFS_REQUEST_SENSE;
+	if (sdp->device->host->wlun_clr_uac) {
+		sdp->device->host->hostt->ioctl(sdp->device, _cmd, NULL);
+		sdp->device->host->wlun_clr_uac = false;
+	}
+
 	result = sg_new_write(sfp, filp, p, SZ_SG_IO_HDR,
 			      1, read_only, 1, &srp);
 	if (result < 0)
Re: [PATCH v8 00/18] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
On 2020-08-19 11:20 a.m., John Garry wrote:

Hi all, Here is v8 of the patchset. In this version of the series, we keep the shared sbitmap for driver tags, and introduce changes to fix up the tag budgeting across request queues. We also have a change to count requests per-hctx for when an elevator is enabled, as an optimisation. I also dropped the debugfs changes - more on that below.

Some performance figures: Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included, but it is not always an appropriate scheduler to use.

Tag depth                       4000 (default)   260**

Baseline (v5.9-rc1):
none sched:                     2094K IOPS       513K
mq-deadline sched:              2145K IOPS       1336K

Final, host_tagset=0 in LLDD *, ***:
none sched:                     2120K IOPS       550K
mq-deadline sched:              2121K IOPS       1309K

Final ***:
none sched:                     2132K IOPS       1185
mq-deadline sched:              2145K IOPS       2097

* this is relevant as this is the performance in supporting but not enabling the feature
** depth=260 is relevant as some point where we are regularly waiting for tags to be available. Figures were a bit unstable here.
*** Included "[PATCH V4] scsi: core: only re-run queue in scsi_end_request() if device queue is busy"

A copy of the patches can be found here: https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-v8

The hpsa patch depends on: https://lore.kernel.org/linux-scsi/20200430131904.5847-1-h...@suse.de/ And the smartpqi patch is not to be accepted. Comments (and testing) welcome, thanks!

I tested this v8 patchset on MKP's 5.10/scsi-queue branch together with my rewritten sg driver on my laptop and a Ryzen 5 3600 machine.
Since I don't have the same hardware, I use the scsi_debug driver as the target:

    modprobe scsi_debug dev_size_mb=1024 sector_size=512 add_host=7 per_host_store=1 ndelay=1000 random=1 submit_queues=12

My test is a script which runs these three commands many times with differing parameters:

    sg_mrq_dd iflag=random bs=512 of=/dev/sg8 thr=64 time=2
    time to transfer data was 0.312705 secs, 3433.72 MB/sec
    2097152+0 records in
    2097152+0 records out

    sg_mrq_dd bpt=256 thr=64 mrq=36 time=2 if=/dev/sg8 bs=512 of=/dev/sg9
    time to transfer data was 0.212090 secs, 5062.67 MB/sec
    2097152+0 records in
    2097152+0 records out

    sg_mrq_dd --verify if=/dev/sg8 of=/dev/sg9 bs=512 bpt=256 thr=64 mrq=36 time=2
    Doing verify/cmp rather than copy
    time to transfer data was 0.184563 secs, 5817.75 MB/sec
    2097152+0 records in
    2097152+0 records verified

The above is the output from the last section of my script run on the Ryzen 5. So the three steps are: 1) produce random data on /dev/sg8; 2) copy /dev/sg8 to /dev/sg9; 3) verify /dev/sg8 and /dev/sg9 are the same. The latter step is done with a sequence of READ(/dev/sg8) and VERIFY(BYTCHK=1 on /dev/sg9) commands. The "mrq" stands for multiple requests (in one invocation); the bsg driver did that before its write(2) command was removed.
The SCSI devices on the Ryzen 5 machine are:

# lsscsi -gs
[2:0:0:0]  disk     IBM-207x HUSMM8020ASS20  J4B6  /dev/sda   /dev/sg0  200GB
[2:0:1:0]  disk     SEAGATE  ST200FM0073     0007  /dev/sdb   /dev/sg1  200GB
[2:0:2:0]  enclosu  Areca Te ARC-802801.37.69 0137  -         /dev/sg2  -
[3:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdc   /dev/sg3  1.07GB
[4:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdd   /dev/sg4  1.07GB
[5:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sde   /dev/sg5  1.07GB
[6:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdf   /dev/sg6  1.07GB
[7:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdg   /dev/sg7  1.07GB
[8:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdh   /dev/sg8  1.07GB
[9:0:0:0]  disk     Linux    scsi_debug      0190  /dev/sdi   /dev/sg9  1.07GB
[N:0:1:1]  disk     WDC WDS250G2B0C-00PXH0__1     /dev/nvme0n1  -       250GB

My script took 17m12 and the highest throughput (on a copy) was 7.5 GB/sec. Then I reloaded the scsi_debug module, this time with an additional 'host_max_queue=128' parameter. The script run time was 5 seconds shorter and the maximum throughput was around 7.6 GB/sec. [Average throughput is around 4 GB/sec.] For comparison:

# time liburing/examples/io_uring-cp /dev/sdh /dev/sdi
real    0m1.542s
user    0m0.004s
sys     0m1.027s

Umm, that's less than 1 GB/sec. In its defence, io_uring-cp is an extremely simple, single threaded, proof-of-concept copy program, at least compared to sg_mrq_dd. As used by sg_mrq_dd, the rewritten sg driver bypasses moving 1 GB to and from _user_ space while doing the above copy and verify steps. So:

Tested-by: Douglas Gilbert

Differences to v7:
- Add null_blk and scsi_debug support
- Drop debugfs tags patch - it's too difficult to be the same between hostw
Re: rework check_disk_change()
On 2020-09-02 10:11 a.m., Christoph Hellwig wrote:

Hi Jens, this series replaced the not very nice check_disk_change() function with a new bdev_media_changed that avoids having the ->revalidate_disk call at its end. As a result ->revalidate_disk can be removed from a lot of drivers.

For over 20 years the sg driver has been carrying this snippet that hangs off the completion callback:

	if (driver_stat & DRIVER_SENSE) {
		struct scsi_sense_hdr ssh;

		if (scsi_normalize_sense(sbp, sense_len, &ssh)) {
			if (!scsi_sense_is_deferred(&ssh)) {
				if (ssh.sense_key == UNIT_ATTENTION) {
					if (sdp->device->removable)
						sdp->device->changed = 1;
				}
			}
		}
	}

Is it needed? The unit attention (UA) may not be associated with the device changing. Shouldn't the SCSI mid-level monitor UAs if they impact the state of a scsi_device object?

Doug Gilbert
Re: [PATCH] scsi: sd: add runtime pm to open / release
On 2020-07-29 10:32 a.m., Alan Stern wrote: On Wed, Jul 29, 2020 at 04:12:22PM +0200, Martin Kepplinger wrote: On 28.07.20 22:02, Alan Stern wrote: On Tue, Jul 28, 2020 at 09:02:44AM +0200, Martin Kepplinger wrote: Hi Alan, Any API cleanup is of course welcome. I just wanted to remind you that the underlying problem: broken block device runtime pm. Your initial proposed fix "almost" did it and mounting works but during file access, it still just looks like a runtime_resume is missing somewhere. Well, I have tested that proposed fix several times, and on my system it's working perfectly. When I stop accessing a drive it autosuspends, and when I access it again it gets resumed and works -- as you would expect. that's weird. when I mount, everything looks good, "sda1". But as soon as I cd to the mountpoint and do "ls" (on another SD card "ls" works but actual file reading leads to the exact same errors), I get: [ 77.474632] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s [ 77.474647] sd 0:0:0:0: [sda] tag#0 Sense Key : 0x6 [current] [ 77.474655] sd 0:0:0:0: [sda] tag#0 ASC=0x28 ASCQ=0x0 [ 77.474667] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x28 28 00 00 00 60 40 00 00 01 00 This error report comes from the SCSI layer, not the block layer. SCSI's first 11 byte command! I'm guessing the first byte is being repeated and it's actually: 28 00 00 00 60 40 00 00 01 00 [READ(10)] That should be fixed. It should be something like: "...CDB in hex: 28 00 ...". 
Doug Gilbert [ 77.474678] blk_update_request: I/O error, dev sda, sector 24640 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.485836] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.491628] blk_update_request: I/O error, dev sda, sector 24641 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.502275] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.508051] blk_update_request: I/O error, dev sda, sector 24642 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 [ 77.518651] sd 0:0:0:0: [sda] tag#0 device offline or changed (...) [ 77.947653] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.953434] FAT-fs (sda1): Directory bread(block 16448) failed [ 77.959333] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.965118] FAT-fs (sda1): Directory bread(block 16449) failed [ 77.971014] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 77.976802] FAT-fs (sda1): Directory bread(block 16450) failed [ 77.982698] sd 0:0:0:0: [sda] tag#0 device offline or changed (...) [ 78.384929] FAT-fs (sda1): Filesystem has been set read-only [ 103.070973] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 103.076751] print_req_error: 118 callbacks suppressed [ 103.076760] blk_update_request: I/O error, dev sda, sector 9748 op 0x1:(WRITE) flags 0x10 phys_seg 1 prio class 0 [ 103.087428] Buffer I/O error on dev sda1, logical block 1556, lost async page write [ 103.095309] sd 0:0:0:0: [sda] tag#0 device offline or changed [ 103.101123] blk_update_request: I/O error, dev sda, sector 17162 op 0x1:(WRITE) flags 0x10 phys_seg 1 prio class 0 [ 103.111883] Buffer I/O error on dev sda1, logical block 8970, lost async page write I can't tell why you're getting that error. In one of my tests the device returned the same kind of error status (Sense Key = 6, ASC = 0x28) but the operation was then retried successfully. Perhaps the problem lies in the device you are testing. 
As we need to have that working at some point, I might look into it, but someone who has experience in the block layer can surely do it more efficiently. I suspect that any problems you still face are caused by something else. I then formatted sda1 to ext2 (on the runtime suspend system testing your patch) and that seems to have worked! Again accessing the mountpoint then yield the very same "device offline or changed" errors. What kind of device are you testing? You should be easily able to reproduce this using an "sd" device. I tested two devices: a SanDisk Cruzer USB flash drive and a g-mass-storage gadget running under dummy-hcd. They each showed up as /dev/sdb on my system. I haven't tried testing with an SD card. If you have any specific sequence of commands you would like me to run, let me know. The problems must lie in the different other drivers we use I guess. Or the devices. Have you tried testing with a USB flash drive? Alan Stern
Re: [RFC][PATCHES] drivers/scsi/sg.c uaccess cleanups/fixes
On 2019-10-17 9:36 p.m., Al Viro wrote: On Wed, Oct 16, 2019 at 09:25:40PM +0100, Al Viro wrote: FWIW, callers of __copy_from_user() remaining in the generic code: 6) drivers/scsi/sg.c nest: sg_read() ones are memdup_user() in disguise (i.e. fold with immediately preceding kmalloc()s). sg_new_write() - fold with access_ok() into copy_from_user() (for both call sites). sg_write() - lose access_ok(), use copy_from_user() (both call sites) and get_user() (instead of the solitary __get_user() there). Turns out that there'd been outright redundant access_ok() calls (not even warranted by __copy_...) *and* several __put_user()/__get_user() with no checking of return value (access_ok() was there, handling of unmapped addresses wasn't). The latter go back at least to 2.1.early... I've got a series that presumably fixes and cleans the things up in that area; it didn't get any serious testing (the kernel builds and boots, smartctl works as well as it used to, but that's not worth much - all it says is that SG_IO doesn't fail terribly; I don't have any test setup for really working with /dev/sg*). IOW, it needs more review and testing - this is _not_ a pull request. It's in vfs.git#work.sg; individual patches are in followups. Shortlog/diffstat:

Al Viro (8):
  sg_ioctl(): fix copyout handling
  sg_new_write(): replace access_ok() + __copy_from_user() with copy_from_user()
  sg_write(): __get_user() can fail...
  sg_read(): simplify reading ->pack_id of userland sg_io_hdr_t
  sg_new_write(): don't bother with access_ok
  sg_read(): get rid of access_ok()/__copy_..._user()
  sg_write(): get rid of access_ok()/__copy_from_user()/__get_user()
  SG_IO: get rid of access_ok()

 drivers/scsi/sg.c | 98
 1 file changed, 32 insertions(+), 66 deletions(-)

Al, I am aware of these and have a 23 part patchset on the linux-scsi list for review (see https://marc.info/?l=linux-scsi&m=157052102631490&w=2 ) that amongst other things fixes all of these. 
It also re-adds the functionality removed from the bsg driver last year. Unfortunately that review process is going very slowly, so I have no objections if you apply these now. It is unlikely that these changes will introduce any bugs (they didn't in my testing). If you want to do more testing you may find the sg3_utils package helpful, especially in the testing directory: https://github.com/hreinecke/sg3_utils Doug Gilbert
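The fold Al describes (a separate access_ok() range check followed by an unchecked __copy_from_user(), collapsed into a single checked copy_from_user()) can be sketched as a self-contained userspace model. Everything here is hypothetical shorthand for the kernel pattern, not actual sg.c or kernel API: "user memory" is a fixed window, and the point is that one helper both validates and copies, so a caller cannot forget the check or check a different length than it copies.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "user address space": a fixed window of addresses. */
#define USER_BASE  0x1000u
#define USER_LIMIT 0x2000u

static unsigned char user_mem[USER_LIMIT - USER_BASE];

/* access_ok() analogue: is [uaddr, uaddr+len) inside the window?
 * Note the overflow-safe form: len is compared against the space
 * remaining, never uaddr + len against the limit. */
static int access_ok_sim(uint32_t uaddr, size_t len)
{
    return uaddr >= USER_BASE && uaddr <= USER_LIMIT &&
           len <= USER_LIMIT - uaddr;
}

/* copy_from_user() analogue: check *and* copy in one call.  The buggy
 * shape this replaces did the memcpy unconditionally and relied on
 * every caller having done access_ok_sim() first. */
static int copy_from_user_sim(void *dst, uint32_t uaddr, size_t len)
{
    if (!access_ok_sim(uaddr, len))
        return -EFAULT;
    memcpy(dst, user_mem + (uaddr - USER_BASE), len);
    return 0;
}
```

The same reasoning applies to get_user() versus __get_user(): the checked form cannot silently read an unmapped address, which is exactly the class of bug ("access_ok() was there, handling of unmapped addresses wasn't") the series closes.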
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-07-08 2:01 a.m., Hannes Reinecke wrote: On 7/5/19 7:53 PM, Douglas Gilbert wrote: On 2019-07-05 3:22 a.m., Hannes Reinecke wrote: [ .. ] As mentioned, rescan-scsi-bus.sh is keeping references to /proc/scsi as a fall back only, as it's meant to work kernel independent. Per default it'll be using /sys, and will happily work without /proc/scsi. So it's really only /proc/scsi/sg which carries some meaningful information; maybe we should move/copy it to somewhere else. I personally like getting rid of /proc/scsi. /proc/scsi/device_info doesn't seem to be in sysfs. Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Currently that directory only has symlinks to the sg devices. The sg parameters are already available in /sys/module/sg/parameters; so from that perspective I feel we're good.

# ls /sys/module/sg/parameters/
allow_dio  def_reserved_size  scatter_elem_sz
# ls /proc/scsi/sg/
allow_dio  debug  def_reserved_size  device_hdr  devices  device_strs  red_debug  version

So that doesn't work: what is in 'parameters' was passed in at module/driver initialization. Back to my original question: Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Problem is /proc/scsi/device_info, for which we currently don't have any other location to store it at. Hmm. Doug Gilbert
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-07-05 3:22 a.m., Hannes Reinecke wrote: On 6/18/19 7:43 PM, Elliott, Robert (Servers) wrote: -Original Message- From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Bart Van Assche Sent: Monday, June 17, 2019 10:28 PM To: dgilb...@interlog.com; Marc Gonzalez ; James Bottomley ; Martin Petersen Cc: SCSI ; LKML ; Christoph Hellwig Subject: Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default On 6/17/19 5:35 PM, Douglas Gilbert wrote: For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Hi Doug, This is the command I ran: $ git grep /proc/scsi | wc -l 38 I think your query excludes scripts/rescan-scsi-bus.sh. Bart. Here's the full list to ensure the discussion doesn't overlook anything: sg3_utils-1.44$ grep -R /proc/scsi . ./src/sg_read.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./scripts/rescan-scsi-bus.sh:# Return hosts. /proc/scsi/HOSTADAPTER/? 
must exist ./scripts/rescan-scsi-bus.sh: for driverdir in /proc/scsi/*; do ./scripts/rescan-scsi-bus.sh:driver=${driverdir#/proc/scsi/} ./scripts/rescan-scsi-bus.sh: name=${hostdir#/proc/scsi/*/} ./scripts/rescan-scsi-bus.sh:# Get /proc/scsi/scsi info for device $host:$channel:$id:$lun ./scripts/rescan-scsi-bus.sh:SCSISTR=$(grep -A "$LN" -e "$grepstr" /proc/scsi/scsi) ./scripts/rescan-scsi-bus.sh:DRV=`grep 'Attached drivers:' /proc/scsi/scsi 2>/dev/null` ./scripts/rescan-scsi-bus.sh: echo "scsi report-devs 1" >/proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: DRV=`grep 'Attached drivers:' /proc/scsi/scsi 2>/dev/null` ./scripts/rescan-scsi-bus.sh: echo "scsi report-devs 0" >/proc/scsi/scsi ./scripts/rescan-scsi-bus.sh:# Outputs description from /proc/scsi/scsi (unless arg passed) ./scripts/rescan-scsi-bus.sh:echo "scsi remove-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $devnr" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh: echo "scsi add-single-device $host $channel $id $SCAN_WILD_CARD" > /proc/scsi/scsi ./scripts/rescan-scsi-bus.sh:if test ! -d /sys/class/scsi_host/ -a ! -d /proc/scsi/; then ./ChangeLog:/proc/scsi/sg/allow_dio is '0' ./ChangeLog: - change sg_debug to call system("cat /proc/scsi/sg/debug"); ./suse/sg3_utils.changes: * Support systems without /proc/scsi ./examples/sgq_dd.c:static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./doc/sg_read.8:If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_read.8:"echo 1 > /proc/scsi/sg/allow_dio". An alternate way to avoid the ./doc/sg_map.8:observing the output of the command: "cat /proc/scsi/scsi". ./doc/sgp_dd.8:at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sgp_dd.8:this at completion. 
If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sgp_dd.8:mapping to SCSI block devices should be checked with 'cat /proc/scsi/scsi' ./doc/sg_dd.8:notes this at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_dd.8:this at completion. If direct IO is selected and /proc/scsi/sg/allow_dio ./doc/sg_dd.8:with 'echo 1 > /proc/scsi/sg/allow_dio'. ./doc/sg_dd.8:mapping to SCSI block devices should be checked with 'cat /proc/scsi/scsi', As mentioned, rescan-scsi-bus.sh is keeping references to /proc/scsi as a fall back only, as it's meant to work kernel independent. Per default it'll be using /sys, and will happily work without /proc/scsi. So it's really only /proc/scsi/sg which carries some meaningful information; maybe we should move/copy it to somewhere else. I personally like getting rid of /proc/scsi. /proc/scsi/device_info doesn't seem to be in sysfs. Could the contents of /proc/scsi/sg/* be placed in /sys/class/scsi_generic/* ? Currently that directory only has symlinks to the sg devices. Doug Gilbert
Re: [PATCH 0/2] scsi: add support for request batching
On 2019-06-26 9:51 a.m., Paolo Bonzini wrote: On 30/05/19 13:28, Paolo Bonzini wrote: This allows a list of requests to be issued, with the LLD only writing the hardware doorbell when necessary, after the last request was prepared. This is more efficient if we have lists of requests to issue, particularly on virtualized hardware, where writing the doorbell is more expensive than on real hardware. This applies to any HBA, either singlequeue or multiqueue; the second patch implements it for virtio-scsi. Paolo Paolo Bonzini (2): scsi_host: add support for request batching virtio_scsi: implement request batching drivers/scsi/scsi_lib.c| 37 ++--- drivers/scsi/virtio_scsi.c | 55 +++--- include/scsi/scsi_cmnd.h | 1 + include/scsi/scsi_host.h | 16 +-- 4 files changed, 89 insertions(+), 20 deletions(-) Ping? Are there any more objections? I have no objections, just a few questions. To implement this is the scsi_debug driver, a per device queue would need to be added, correct? Then a 'commit_rqs' call would be expected at some later point and it would drain that queue and submit each command. Or is the queue draining ongoing in the LLD and 'commit_rqs' means: don't return until that queue is empty? So does that mean in the normal (i.e. non request batching) case there are two calls to the LLD for each submitted command? Or is 'commit_rqs' optional, a sync-ing type command? Doug Gilbert
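The queue/commit split being asked about can be modelled in miniature: requests are staged, and the expensive doorbell write happens once per batch rather than once per request. The sketch below is a toy userspace model of that idea under my own naming (it is not the scsi_host or virtio-scsi API); the 'last' flag stands in for the hint that no further requests follow, which is when the LLD can no longer defer the doorbell.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BATCH 32

/* Hypothetical mini-HBA: staged requests plus a doorbell counter that
 * stands in for the cost (an MMIO write, or a VM exit on virtual HW)
 * the batching is trying to amortize. */
struct toy_hba {
    int staged[MAX_BATCH];
    size_t nr_staged;
    unsigned int doorbell_writes;
};

/* commit_rqs analogue: make everything staged visible to the device
 * with a single doorbell write. */
static void toy_commit(struct toy_hba *h)
{
    if (h->nr_staged) {
        h->doorbell_writes++;   /* one write covers the whole batch */
        h->nr_staged = 0;
    }
}

/* queuecommand analogue: stage the request; only ring the doorbell
 * when the batch is full or the caller says no more are coming. */
static int toy_queue(struct toy_hba *h, int req, int last)
{
    if (h->nr_staged == MAX_BATCH)
        toy_commit(h);          /* batch full: flush early */
    h->staged[h->nr_staged++] = req;
    if (last)
        toy_commit(h);
    return 0;
}
```

In this model the answer to the "two calls per command" question is visible: in the non-batched case each request is queued with last set, so queue and commit collapse into one call path and the doorbell is written per request, as before.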
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-19 5:42 a.m., Marc Gonzalez wrote: On 18/06/2019 17:31, Douglas Gilbert wrote: On 2019-06-18 3:29 a.m., Marc Gonzalez wrote: Please note that I am _in no way_ suggesting that we remove any code. I just think it might be time to stop forcing CONFIG_SCSI_PROC_FS into every config, and instead require one to explicitly request the aging feature (which makes CONFIG_SCSI_PROC_FS show up in a defconfig). Maybe we could add CONFIG_SCSI_PROC_FS to arch/x86/configs/foo ? (For which foo? In a separate patch or squashed with this one?) Since current sg driver usage seems to depend more on SCSI_PROC_FS being "y" than other parts of the SCSI subsystem then if SCSI_PROC_FS is to default to "n" in the future then a new CONFIG_SG_PROC_FS variable could be added. If CONFIG_CHR_DEV_SG is "*" or "m" then default CONFIG_SG_PROC_FS to "y"; if CONFIG_SCSI_PROC_FS is "y" then default CONFIG_SG_PROC_FS to "y"; else default CONFIG_SG_PROC_FS to "n". Obviously the sg driver would need to be changed to use CONFIG_SG_PROC_FS instead of CONFIG_SCSI_PROC_FS . I like your idea, and I think it might even be made slightly simpler. I assume sg3_utils requires CHR_DEV_SG. Is it the case? If so, we would just need to enable SCSI_PROC_FS when CHR_DEV_SG is enabled. diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..642ca0e7d363 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,12 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS - default y + default CHR_DEV_SG ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Would that work for you? I checked that SCSI_PROC_FS=y whether CHR_DEV_SG=y or m I can spin a v2, with a blurb about how sg3_utils relies on SCSI_PROC_FS. 
Yes, but (see below) ... Does that defeat the whole purpose of your proposal or could it be seen as a partial step in that direction? What is the motivation for this proposal? The rationale was just to look for "special-purpose" options that are enabled by default, and change the default wherever possible, as a matter of uniformity. BTW We still have the non-sg related 'cat /proc/scsi/scsi' usage and 'cat /proc/scsi/device_info'. And I believe the latter one is writable even though its permissions say otherwise. Any relation between SG and BSG? Only in the sense that writing to /proc/scsi/device_info changes the way the SCSI mid-level handles the identified device. So that is in common with, and hence the same relation as, sd, sr, st, ses, etc have with the identified device (e.g. a specialized USB dongle). Example of use of /proc/scsi/scsi:

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Id: 00 Lun: 01
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Id: 00 Lun: 02
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07

Which can be replaced by:

$ lsscsi
[0:0:0:0]    disk    Linux   scsi_debug       0188  /dev/sda
[0:0:0:1]    disk    Linux   scsi_debug       0188  /dev/sdb
[0:0:0:2]    disk    Linux   scsi_debug       0188  /dev/sdc
[N:0:1:1]    disk    INTEL SSDPEKKF256G7L    __1   /dev/nvme0n1

Or if one really likes the "classic" look:

$ lsscsi -c
Attached devices:
Host: scsi0 Channel: 00 Target: 00 Lun: 00
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Target: 00 Lun: 01
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07
Host: scsi0 Channel: 00 Target: 00 Lun: 02
  Vendor: Linux    Model: scsi_debug       Rev: 0188
  Type:   Direct-Access                    ANSI SCSI revision: 07

Now looking at /proc/scsi/device_info IMO unless there is a replacement for
/proc/scsi/device_info then your patch should not go ahead. If it does, any reasonable distro should override it.

$ cat /proc/scsi/device_info
'Aashima' 'IMAGERY 2400SP' 0x1
'CHINON' 'CD-ROM CDS-431' 0x1
'CHINON' 'CD-ROM CDS-535' 0x1
'DENON' 'DRD-25X' 0x1
...
'XYRATEX' 'RS' 0x240
'Zzyzx' 'RocketStor 500S' 0x40
'Zzyzx' 'RocketStor 2000' 0x40

That is a black (or quirks) list that can be added to by writing an entry to /proc/scsi/device_info. So if
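Each device_info entry pairs a quoted vendor and model with a hexadecimal flag word (the kernel's own handling lives in drivers/scsi/scsi_devinfo.c). As an illustration of the entry format only, a minimal parser might look like this; the field widths follow the SCSI INQUIRY limits of 8 bytes for vendor and 16 for model, and the function name is my own:

```c
#include <assert.h>
#include <stdio.h>

/* Parse one quirk entry of the form:  'VENDOR' 'MODEL' 0xFLAGS
 * vendor must hold >= 9 bytes, model >= 17.  Returns 1 on success.
 * %x accepts an optional 0x prefix, matching the listed entries. */
static int parse_devinfo(const char *line, char *vendor, char *model,
                         unsigned int *flags)
{
    return sscanf(line, " '%8[^']' '%16[^']' %x",
                  vendor, model, flags) == 3;
}
```

A vendor or model longer than the SCSI field width fails to match its closing quote, so an oversized entry is rejected rather than truncated silently.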
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-18 3:29 a.m., Marc Gonzalez wrote: On 18/06/2019 03:08, Finn Thain wrote: On Mon, 17 Jun 2019, Douglas Gilbert wrote: On 2019-06-17 5:11 p.m., Bart Van Assche wrote: On 6/12/19 6:59 AM, Marc Gonzalez wrote: According to the option's help message, SCSI_PROC_FS has been superseded for ~15 years. Don't select it by default anymore. Signed-off-by: Marc Gonzalez --- drivers/scsi/Kconfig | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..8c95e9ad6470 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,11 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS -default y ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Hi Doug, If I run grep "/proc/scsi" over the sg3_utils source code then grep reports 38 matches for that string. Does sg3_utils break with SCSI_PROC_FS=n? First, the sg driver. If placing #undef CONFIG_SCSI_PROC_FS prior to the includes in sg.c is a valid way to test that then the answer is no. Ah, but you are talking about sg3_utils . Or are you? For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Those 6 are all for direct IO (see below) which is off by default. 
I suspect old scanning utilities like sg_scan and sg_map might also use /proc/scsi/* . That is one reason why I wrote lsscsi. However I can't force folks to use lsscsi. As a related example, I still get bug reports for sginfo which I inherited from Eric Youngdale. If I was asked to debug a problem with the sg driver in a system without CONFIG_SCSI_PROC_FS defined, I would decline. The absence of /proc/scsi/sg/debug would be my issue. Can this be set up to do the same thing: cat /sys/class/scsi_generic/debug Is that breaking any sysfs rules? Also folks who rely on this to work:

cat /proc/scsi/sg/devices
0	0	0	0	0	1	255	0	1
0	0	0	1	0	1	255	0	1
0	0	0	2	0	1	255	0	1

would be disappointed. Further I note that setting allow_dio via /proc/scsi/sg/allow_dio can also be done via /sys/module/sg/allow_dio. So that would be an interface breakage, but with an alternative. You can grep for /proc/scsi/ across all Debian packages: https://codesearch.debian.net/ This reveals that /proc/scsi/sg/ appears in smartmontools and other packages, for example. Hello everyone, Please note that I am _in no way_ suggesting that we remove any code. I just think it might be time to stop forcing CONFIG_SCSI_PROC_FS into every config, and instead require one to explicitly request the aging feature (which makes CONFIG_SCSI_PROC_FS show up in a defconfig). Maybe we could add CONFIG_SCSI_PROC_FS to arch/x86/configs/foo ? (For which foo? In a separate patch or squashed with this one?) Marc, Since current sg driver usage seems to depend more on SCSI_PROC_FS being "y" than other parts of the SCSI subsystem then if SCSI_PROC_FS is to default to "n" in the future then a new CONFIG_SG_PROC_FS variable could be added. If CONFIG_CHR_DEV_SG is "*" or "m" then default CONFIG_SG_PROC_FS to "y"; if CONFIG_SCSI_PROC_FS is "y" then default CONFIG_SG_PROC_FS to "y"; else default CONFIG_SG_PROC_FS to "n". Obviously the sg driver would need to be changed to use CONFIG_SG_PROC_FS instead of CONFIG_SCSI_PROC_FS . 
Does that defeat the whole purpose of your proposal or could it be seen as a partial step in that direction? What is the motivation for this proposal? Doug Gilbert BTW We still have the non-sg related 'cat /proc/scsi/scsi' usage and 'cat /proc/scsi/device_info'. And I believe the latter one is writable even though its permissions say otherwise.
Re: [PATCH v1] scsi: Don't select SCSI_PROC_FS by default
On 2019-06-17 5:11 p.m., Bart Van Assche wrote: On 6/12/19 6:59 AM, Marc Gonzalez wrote: According to the option's help message, SCSI_PROC_FS has been superseded for ~15 years. Don't select it by default anymore. Signed-off-by: Marc Gonzalez --- drivers/scsi/Kconfig | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 73bce9b6d037..8c95e9ad6470 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -54,14 +54,11 @@ config SCSI_NETLINK config SCSI_PROC_FS bool "legacy /proc/scsi/ support" depends on SCSI && PROC_FS - default y ---help--- This option enables support for the various files in /proc/scsi. In Linux 2.6 this has been superseded by files in sysfs but many legacy applications rely on this. - If unsure say Y. - comment "SCSI support type (disk, tape, CD-ROM)" depends on SCSI Hi Doug, If I run grep "/proc/scsi" over the sg3_utils source code then grep reports 38 matches for that string. Does sg3_utils break with SCSI_PROC_FS=n? First, the sg driver. If placing #undef CONFIG_SCSI_PROC_FS prior to the includes in sg.c is a valid way to test that then the answer is no. Ah, but you are talking about sg3_utils . Or are you? For sg3_utils: $ find . -name '*.c' -exec grep "/proc/scsi" {} \; -print static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_read.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgp_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sgm_dd.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./src/sg_dd.c "'echo 1 > /proc/scsi/sg/allow_dio'\n", q_len, dirio_count); ./testing/sg_tst_bidi.c static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; ./examples/sgq_dd.c That is 6 (not 38) by my count. Those 6 are all for direct IO (see below) which is off by default. I suspect old scanning utilities like sg_scan and sg_map might also use /proc/scsi/* . That is one reason why I wrote lsscsi. 
However I can't force folks to use lsscsi. As a related example, I still get bug reports for sginfo which I inherited from Eric Youngdale. If I was asked to debug a problem with the sg driver in a system without CONFIG_SCSI_PROC_FS defined, I would decline. The absence of /proc/scsi/sg/debug would be my issue. Can this be set up to do the same thing: cat /sys/class/scsi_generic/debug ? Is that breaking any sysfs rules? Also folks who rely on this to work:

cat /proc/scsi/sg/devices
0	0	0	0	0	1	255	0	1
0	0	0	1	0	1	255	0	1
0	0	0	2	0	1	255	0	1

would be disappointed. Further I note that setting allow_dio via /proc/scsi/sg/allow_dio can also be done via /sys/module/sg/allow_dio. So that would be an interface breakage, but with an alternative. Doug Gilbert
Re: [PATCH] sg: Fix a double-fetch bug in drivers/scsi/sg.c
On 2019-06-05 2:00 a.m., Jiri Slaby wrote: On 23. 05. 19, 4:38, Gen Zhang wrote: In sg_write(), the opcode of the command is fetched the first time from the userspace by __get_user(). Then the whole command, the opcode included, is fetched again from userspace by __copy_from_user(). However, a malicious user can change the opcode between the two fetches. This can cause inconsistent data and potential errors as cmnd is used in the following codes. Thus we should check opcode between the two fetches to prevent this. Signed-off-by: Gen Zhang --- diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index d3f1531..a2971b8 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -694,6 +694,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; + if (opcode != cmnd[0]) + return -EINVAL; Isn't it too early to check cmnd which is copied only here: if (__copy_from_user(cmnd, buf, cmd_size)) return -EFAULT; /* --- Hi, Yes, it is too early. It needs to be after that __copy_from_user(cmnd, buf, cmd_size) call. To put this in context, this is a very old interface; dating from 1992 and deprecated for almost 20 years. The fact that the first byte of the SCSI cdb needs to be read first to work out that size of the following SCSI command and optionally the offset of a data-out buffer that may follow the command; is one reason why that interface was replaced. Also the implementation did not handle SCSI variable length cdb_s. Then there is the question of whether this double-fetch is exploitable? I cannot think of an example, but there might be (e.g. turning a READ command into a WRITE). But the "double-fetch" issue may be more wide spread. The replacement interface passes the command and data-in/-out as pointers while their corresponding lengths are placed in the newer interface structure. 
This assumes that the cdb and data-out won't change in the user space between when the write(2) is called and before or while the driver, using those pointers, reads the data. All drivers that use pointers to pass data have this "feature". Also I'm looking at this particular double-fetch from the point of view of the driver rewrite I have done and is currently in the early stages of review [linux-scsi list: "[PATCH 00/19] sg: v4 interface, rq sharing + multiple rqs"] and this problem is more difficult to fix since the full cdb read is delayed to a common point further along the submit processing path. To detect a change in cbd[0] my current code would need to be altered to carry cdb[0] through to that common point. So is it worth it for such an old, deprecated and replaced interface?? What cdb/user_permissions checking that is done, is done _after_ the full cdb is read. So trying to get around a user exclusion of say WRITE(10) by first using the first byte of READ(10), won't succeed. Doug Gilbert
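The time-of-check/time-of-use shape under discussion, fetch one byte to size the command, then fetch the whole command again while another thread may have rewritten that byte, can be shown in a self-contained userspace model. As noted above, the re-check belongs after the second fetch. Names here are illustrative, not the sg.c code:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Stand-in for a user buffer that another thread could rewrite
 * between the driver's two fetches. */
static unsigned char user_buf[16];

/* Bounds-checked copy from the "user" buffer. */
static int fetch(void *dst, size_t off, size_t len)
{
    if (len > sizeof(user_buf) - off || off > sizeof(user_buf))
        return -EFAULT;
    memcpy(dst, user_buf + off, len);
    return 0;
}

/* First fetch reads only the opcode (needed to work out cmd_size);
 * second fetch reads the full command.  Comparing the opcode against
 * cmnd[0] *after* the second fetch detects a racing rewrite, which is
 * where the originally proposed check had to move to. */
static int submit_cmd(unsigned char *cmnd, size_t cmd_size)
{
    unsigned char opcode;

    if (fetch(&opcode, 0, 1))
        return -EFAULT;
    if (fetch(cmnd, 0, cmd_size))
        return -EFAULT;
    if (opcode != cmnd[0])
        return -EINVAL;     /* opcode changed between the fetches */
    return 0;
}
```

In a single-threaded test the race cannot fire, so the -EINVAL path is exercised only under a concurrent writer; what can be tested directly is that both fetches stay inside the buffer.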
Re: [PATCH] scsi: ses: Fix out-of-bounds memory access in ses_enclosure_data_process()
On 2019-05-20 12:05 p.m., Martin K. Petersen wrote: James, Please. What I'm interested in is whether this is simply a bug in the array firmware, in which case the fix is sufficient, or whether there's some problem with the parser, like mismatched expectations over added trailing nulls or something. Our support folks have been looking at this for a while. We have seen problems with devices from several vendors. To the extent that I gave up the idea of blacklisting all of them. I am collecting "bad" SES pages from these devices. I have added support for RECEIVE DIAGNOSTICS to scsi_debug and added a bunch of deliberately broken SES pages so we could debug this Patches ?? It appears to be very common for devices to return inconsistent or invalid data. So pretty much all of the ses.c parsing needs to have sanity checking heuristics added to prevent KASAN hiccups. And it is not just SES device implementations that were broken. The relationship between Additional Element Status diagnostic page (dpage) and the Enclosure Status dpage was under-specified in SES-2 and that led to the EIIOE field being introduced during the SES-3 revisions. And the meaning of EIIOE was tweaked several times *** before SES-3 was standardized. Anyone interested in the adventures of EIIOE can see the code of sg_ses.c in sg3_utils. The sg_ses utility is many times more complex than anything else in the sg3_utils package. And that complexity led me to suspect that the Linux SES driver was broken. It should be 3 or 4 times larger than it is! It simply doesn't do enough checking. So yes Martin, you are on the right track. Doug Gilbert BTW the NVME Management Interface folks have decided to use SES-3 for NVME enclosure management rather than invent their own can of worms :-) *** For example EIIOE started life as a 1 bit field, but two cases wasn't enough, so it became a 2 bit field and now uses all four possibilities.
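The kind of sanity checking being argued for can be sketched generically: treat every length field in a received diagnostic page as hostile and clamp it against the bytes actually transferred, so a short or lying page produces an error instead of an out-of-bounds read. The descriptor layout below is a hypothetical TLV, not the real SES page format:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Walk descriptors laid out as [type:1][len:1][payload:len]...
 * Every advance is validated against 'avail', the byte count the
 * device actually returned, which may be less than it claimed. */
static int count_descriptors(const uint8_t *page, size_t avail)
{
    size_t off = 0;
    int n = 0;

    while (off < avail) {
        if (avail - off < 2)
            return -EINVAL;              /* truncated header */
        size_t len = page[off + 1];
        if (len > avail - off - 2)
            return -EINVAL;              /* descriptor overruns page */
        off += 2 + len;
        n++;
    }
    return n;
}
```

The point is structural: parsers of device-supplied pages need a bound check before every dereference, which is why ses.c "should be 3 or 4 times larger than it is".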
Re: [PATCH 21/24] sg: switch to SPDX tags
On 2019-05-01 6:14 p.m., Christoph Hellwig wrote: Use the the GPLv2+ SPDX tag instead of verbose boilerplate text. IOWs replace 3.5 lines with 1. Signed-off-by: Christoph Hellwig Acked-by: Douglas Gilbert --- drivers/scsi/sg.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index d3f15319b9b3..bcdc28e5ede7 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0+ /* * History: * Started: Aug 9 by Lawrence Foard (entr...@world.std.com), @@ -8,12 +9,6 @@ *Copyright (C) 1992 Lawrence Foard * Version 2 and 3 extensions to driver: *Copyright (C) 1998 - 2014 Douglas Gilbert - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2, or (at your option) - * any later version. - * */ static int sg_version_num = 30536; /* 2 digits for each component */
Re: Recent removal of bsg read/write support
Updated reply, see below. On 2018-09-03 4:34 a.m., Dror Levin wrote: On Sun, Sep 2, 2018 at 8:55 PM Linus Torvalds wrote: On Sun, Sep 2, 2018 at 4:44 AM Richard Weinberger wrote: CC'ing relevant people. Otherwise your mail might get lost. Indeed. Sorry for that. On Sun, Sep 2, 2018 at 1:37 PM Dror Levin wrote: We have an internal tool that uses the bsg read/write interface to issue SCSI commands as part of a test suite for a storage device. After recently reading on LWN that this interface is to be removed we tried porting our code to use sg instead. However, that raises new issues - mainly getting ENOMEM over iSCSI for unknown reasons. Is there any chance that you can make more data available? Sure, I can try. We use writev() to send up to SG_MAX_QUEUE tasks at a time. Occasionally not all tasks are written at which point we wait for tasks to return before sending more, but then writev() fails with ENOMEM and we see this in the syslog: Sep 1 20:58:14 gdc-qa-io-017 kernel: sd 441:0:0:5: [sg73] sg_common_write: start_req err=-12 Failing tasks are reads of 128KiB. This is the block layer running out of resources. The sg driver is a relatively thin shim and when it gets a "no can do" from the layers below it, the driver has little option than to return said errno. I'd rather fix the sg interface (which while also broken garbage, we can't get rid of) than re-surrect the bsg interface. That said, the removed bsg code looks a hell of a lot prettier than the nasty sg interface code does, although it also lacks ansolutely _any_ kind of security checking. For us the bsg interface also has several advantages over sg: 1. The device name is its HCTL which is nicer than an arbitrary integer. Not much the sg driver can do about that. The minor number the sg driver uses and HCT are all arbitrary integers (with the L coming from the storage device), but I agree the HCTL is more widely used. The ioctl(, SG_GET_SCSI_ID) fills a structure which includes HCTL. 
In my sg v4 driver rewrite the L (LUN) has been tweaked to additionally send back the 8 byte T10 LUN representation. The lsscsi utility will show the relationship between HCTL and sg driver device name with 'lsscsi -g'. It uses sysfs datamining. 2. write() supports writing more than one sg_io_v4 struct so we don't have to resort to writev(). In my sg v4 rewrite the sg_io_v4 interface can only be sent through ioctl(SG_IO) [for sync usage] and ioctl(SG_IOSUBMIT) [for async usage]. So it can't be sent through write(2). SG_IOSUBMIT is new and uses the _IOWR macro which encodes the expected length into the SG_IOSUBMIT value and that is the size of sg_io_v4. So you can't send an arbitrary number of sg_io_v4 objects through that ioctl directly. If need be, that can be cured with another level of indirection (e.g. with a new flag the data-out can be interpreted as an array sg_io_v4 objects). 3. Queue size is the device's queue depth and not SG_MAX_QUEUE which is 16. That limit is gone in the sg v4 driver rewrite. Because of this we would like to continue using the bsg interface, even if some changes are required to meet security concerns. I wonder if we could at least try to unify the bsg/sg code - possibly by making sg use the prettier bsg code (but definitely have to add all the security measures). And dammit, the SCSI people need to get their heads out of their arses. This whole "stream random commands over read/write" needs to go the f*ck away. Could we perhaps extend the SG_IO interace to have an async mode? Instead of "read/write", have "SG_IOSUBMIT" and "SG_IORECEIVE" and have the SG_IO ioctl just be a shorthand of "both". Done. Just my two cents - having an interface other than read/write won't allow users to treat this fd as a regular file with epoll() and read(). This is a major bonus for this interface - an sg/bsg device can be used just like a socket or pipe in any reactor (we use boost asio for example). 
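The size-encoding point above rests on how Linux builds ioctl numbers: an _IOWR() value packs the expected argument size into the number itself, so the kernel can reject a mismatched payload. Below is a self-contained rendition of that encoding; the bit layout follows asm-generic/ioctl.h (a few architectures differ), and the struct and the 'S'/0x41 magic are stand-ins of my own, not the real sg_io_v4 or SG_IOSUBMIT definition:

```c
#include <assert.h>
#include <stdint.h>

/* Generic Linux ioctl number layout: nr(8) | type(8) | size(14) | dir(2). */
#define IOC_NRSHIFT   0
#define IOC_TYPESHIFT 8
#define IOC_SIZESHIFT 16
#define IOC_DIRSHIFT  30
#define IOC_WRITE     1u
#define IOC_READ      2u

/* _IOWR analogue: direction read+write, with sizeof(argtype) baked in. */
#define MY_IOWR(type, nr, argtype) \
    (((IOC_READ | IOC_WRITE) << IOC_DIRSHIFT) | \
     ((uint32_t)(type) << IOC_TYPESHIFT) | \
     ((uint32_t)(nr) << IOC_NRSHIFT) | \
     ((uint32_t)sizeof(argtype) << IOC_SIZESHIFT))

/* _IOC_SIZE analogue: recover the encoded struct size. */
#define IOC_SIZE(cmd) (((cmd) >> IOC_SIZESHIFT) & 0x3fffu)

/* Stand-in for struct sg_io_v4 (the real one is in <linux/bsg.h>). */
struct toy_io_v4 { uint64_t guard_and_proto; uint64_t ptrs[6]; };

#define TOY_IOSUBMIT MY_IOWR('S', 0x41, struct toy_io_v4)
```

This is also why "an arbitrary number of sg_io_v4 objects" cannot be pushed through such an ioctl directly: the number admits exactly one struct's worth of data, and anything more needs the extra level of indirection mentioned above.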
Well poll() certainly works (see sg3_utils beta rev 809 testing/sgs_dd.c and testing/sgh_dd.c) and I can't see why epoll() won't work. These calls work against the file descriptor and the sg driver keeps the same context around sg device file descriptors as it has always done. [And that is the major design flaw in the bsg driver: it doesn't keep proper file descriptor context.] It is the security folks who don't like the sg inspired (there in lk 1.0.0 from 1992) write(2)/read(2) asynchronous interface. Also, ideally we need two streams: one for metadata (e.g. commands and responses (status and sense data)) and another for user data. Protection information could be a third stream, between the other two. Jamming that all into one stream is a bit ugly. References: sg v3 driver rewrite, description and downloads: http://sg.danny.cz/sg/sg_v40.html sg3_utils version 1.45 beta, rev 809, link at the top of this page: http://sg.danny.cz/sg Doug Gilbert
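The readiness model Doug is pointing at is the ordinary poll()/epoll() contract: a reactor waits for POLLIN on the fd and only then collects a completion (with read(2) or, in the v4 driver, ioctl(SG_IORECEIVE)). A minimal demonstration on a pipe, since any pollable fd behaves the same way from the reactor's side (the helper name is my own):

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Block up to timeout_ms until fd is readable.
 * Returns 1 when ready, 0 on timeout, negative on error. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);

    if (r <= 0)
        return r;                       /* timeout or error */
    return (p.revents & POLLIN) ? 1 : 0;
}
```

Because the sg driver keeps its per-fd context, each open fd is an independent completion stream to the reactor, which is the design point being contrasted with bsg here.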
Re: [ANNOUNCE] v4 sg driver: ready for testing
There is an update to the SCSI Generic (sg) v4 driver adding synchronous and asynchronous bidi command support. Plus lots of fixes and some minor improvements. See: http://sg.danny.cz/sg/sg_v40.html The kernel code is split in two in the tarball below, one targeting lk 5.0 and the other targeting lk 4.20 and earlier ***. Each section contains the 3 files that represent the sg v4 driver plus a meandering 17 part patchset. Those patchsets reflect the driver's rewrite rather than a logical progression. http://sg.danny.cz/sg/p/sgv4_20190116.tgz Plus there are updated testing utilities in sg3_utils-1.45 (beta, revision 807) at the top of this page: http://sg.danny.cz/sg/index.html Doug Gilbert *** the reason for the split is the tree wide change to the access_ok() function. On 2018-12-25 2:39 a.m., Douglas Gilbert wrote: There is an update to the sg v4 driver with some error fixes, SIGIO and RT signals work plus single READ, multiple WRITE sharing support. See: http://sg.danny.cz/sg/sg_v40.html with testing utilities in sg3_utils-1.45 (beta, revision 802) on the main page: http://sg.danny.cz/sg/index.html Doug Gilbert On 2018-12-18 6:41 p.m., Douglas Gilbert wrote: After an underwhelming response to my intermediate level patchsets to modernize the sg driver in October this year (see "[PATCH 0/8] sg: major cleanup, remove max_queue limit" followed by v2 and v3 between 20181019 and 20181028), I decided to move ahead and add the functionality proposed for the version 4 sg driver. That means accepting interface objects of type 'struct sg_io_v4' (as found in include/uapi/linux/bsg) plus two new ioctls: SG_IOSUBMIT and SG_IORECEIVE as proposed by Linus Torvalds to replace the unloved write(2)/read(2) asynchronous interface . There is a new feature called "sharing" explained in the web page (see below). 
Yes, there is a patchset available (14 part and growing) but even without explanatory comments at the top of each patch, that patchset is 4 times larger than the v4 sg driver (i.e. the finished product) and over 6 times larger than the original v3 sg driver! Part of the reason for the patchset size is the multiple backtracks and rewrites associated with a real development process. The cleanest patchset would have 3 parts: 1) split the current include/scsi/sg.h into the end product headers: include/uapi/scsi/sg.h and include/scsi/sg.h 2) delete drivers/scsi/sg.c 3) add the v4 drivers/scsi/sg.c After part 2) you could build a kernel and I can guarantee that no-one will be able to find any sg driver bugs but some users might get upset (but not the Linux security folks). So there is a working v4 sg driver discussed here, with a download: http://sg.danny.cz/sg/sg_v40.html I will keep that page up to date while the driver is in this phase. There is a sg3_utils beta of 1.45 (revision 799) package in the News section at the top of the main page: http://sg.danny.cz/sg/index.html That sg3_utils beta package will use the v4 sg interface via sg devices if the v4 driver is detected. There are also three test utilities in the 'testing' directory designed to exercise the v4 extensions. The degree of backward compatibility with the v3 driver should be high but there are limits to backward compatibility. As an example, it is possible that there are user apps that depend on hitting the 16 outstanding command limit (per fd) in the v3 driver and go "wild" when v4 removes that ceiling. If so, a "high_v3_compat" driver option could be added to put that ceiling back. The only way to find out is for folks to try and if there is a failure, contact me, or send mail to this list. Code reviews welcome as well. Doug Gilbert I felt this was a better use of my time than trying to invent a new debug/trace mechanism for the whole SCSI subsystem. 
That is what _SCSI_ system maintainers are for, I'll stick to the sg driver (and scsi_debug). Add user space tools and there is more than enough work there ...
Re: [PATCH v2] rbtree: fix the red root
On 2019-01-14 12:58 p.m., Qian Cai wrote: Unfortunately, I could not trigger any of those here on both bare-metal and virtual machines. All I triggered were hung tasks and soft-lockup due to fork bomb. The only other thing I can think of is to set up kdump to capture a vmcore when either GPF or BUG() happens, and then share the vmcore somewhere, so I might poke around to see what the memory corruption looks like. Another question that I forgot to ask, what type of device is /dev/sg0 ? On a prior occasion (KASAN, throw spaghetti ...) it was a SATA device and the problem was in libata. Doug Gilbert
Re: [PATCH] scsi: wd719x Replace GFP_KERNEL with GFP_ATOMIC in wd719x_chip_init
On 2019-01-14 10:29 a.m., Christoph Hellwig wrote: On Mon, Jan 14, 2019 at 11:24:49PM +0800, wangbo wrote: wd719x_host_reset gets the spinlock first then calls wd719x_chip_init, so replace GFP_KERNEL with GFP_ATOMIC in wd719x_chip_init. Please move the allocation outside the lock instead. GFP_ATOMIC DMA allocations are generally a bad idea and should be avoided where we can. More importantly we should never actually trigger the allocation under the lock, as fw_virt will always be set already in that case. So I think you can safely move the request firmware + allocation + memcpy from wd719x_chip_init to wd719x_board_found, but I'd rather have Ondrej review that plan. Further to this, the result of holding a lock (probably with _irqsave() tacked onto it) during a GFP_KERNEL allocation is a message like this in the log: hrtimer: interrupt took 1084 ns It is not always easy to find since it is a "_once" message. The sg v3 driver (the one in production) produces these. I have been able to stamp them out by taking care in the sg v4 driver (in testing) around allocations. It also meant adding a new state in my state machine to fend off "bad things" happening to that object while it is unlocked. So there may be a cost to dropping the lock. Doug Gilbert
Re: [PATCH v2] rbtree: fix the red root
On 2019-01-13 10:59 p.m., Esme wrote: ‐‐‐ Original Message ‐‐‐ On Sunday, January 13, 2019 10:52 PM, Douglas Gilbert wrote: On 2019-01-13 10:07 p.m., Esme wrote: ‐‐‐ Original Message ‐‐‐ On Sunday, January 13, 2019 9:33 PM, Qian Cai c...@lca.pw wrote: On 1/13/19 9:20 PM, David Lechner wrote: On 1/11/19 8:58 PM, Michel Lespinasse wrote: On Fri, Jan 11, 2019 at 3:47 PM David Lechner da...@lechnology.com wrote: On 1/11/19 2:58 PM, Qian Cai wrote: A GPF was reported, kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] SMP KASAN kasan_die_handler.cold.22+0x11/0x31 notifier_call_chain+0x17b/0x390 atomic_notifier_call_chain+0xa7/0x1b0 notify_die+0x1be/0x2e0 do_general_protection+0x13e/0x330 general_protection+0x1e/0x30 rb_insert_color+0x189/0x1480 create_object+0x785/0xca0 kmemleak_alloc+0x2f/0x50 kmem_cache_alloc+0x1b9/0x3c0 getname_flags+0xdb/0x5d0 getname+0x1e/0x20 do_sys_open+0x3a1/0x7d0 __x64_sys_open+0x7e/0xc0 do_syscall_64+0x1b3/0x820 entry_SYSCALL_64_after_hwframe+0x49/0xbe It turned out, gparent = rb_red_parent(parent); tmp = gparent->rb_right; <-- GPF was triggered here. Apparently, "gparent" is NULL which indicates "parent" is rbtree's root which is red. Otherwise, it will be treated properly a few lines above. /* * If there is a black parent, we are done. * Otherwise, take some corrective action as, * per 4), we don't want a red root or two * consecutive red nodes. */ if(rb_is_black(parent)) break; Hence, it violates the rule #1 (the root can't be red) and need a fix up, and also add a regression test for it. This looks like was introduced by 6d58452dc06 where it no longer always paint the root as black. 
Fixes: 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only when necessary) Reported-by: Esme espl...@protonmail.ch Tested-by: Joey Pabalinas joeypabali...@gmail.com Signed-off-by: Qian Cai c...@lca.pw Tested-by: David Lechner da...@lechnology.com FWIW, this fixed the following crash for me: Unable to handle kernel NULL pointer dereference at virtual address 0004 Just to clarify, do you have a way to reproduce this crash without the fix ? I am starting to suspect that my crash was caused by some new code in the drm-misc-next tree that might be causing a memory corruption. It threw me off that the stack trace didn't contain anything related to drm. See: https://patchwork.freedesktop.org/patch/276719/ It may be useful for those who could reproduce this issue to turn on those memory corruption debug options to narrow down a bit. CONFIG_DEBUG_PAGEALLOC=y CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y CONFIG_KASAN=y CONFIG_KASAN_GENERIC=y CONFIG_SLUB_DEBUG_ON=y I have been on SLAB, I configured SLAB DEBUG with a fresh pull from github. Linux syzkaller 5.0.0-rc2 #9 SMP Sun Jan 13 21:57:40 EST 2019 x86_64 ... In an effort to get a different stack into the kernel, I felt that nothing works better than fork bomb? :) Let me know if that helps. root@syzkaller:~# gcc -o test3 test3.c root@syzkaller:~# while : ; do ./test3 & done And is test3 the same multi-threaded program that enters the kernel via /dev/sg0 and then calls SCSI_IOCTL_SEND_COMMAND which goes to the SCSI mid-level and thence to the block layer? And please remind me, does it also fail on lk 4.20.2 ? Doug Gilbert Yes, the same C repro from the earlier thread. It was a 4.20.0 kernel where it was first detected. I can move to 4.20.2 and see if that changes anything. Hi, I don't think there is any need to check lk 4.20.2 (as it would be very surprising if it didn't also have this "feature"). More interesting might be: has "test3" been run on lk 4.19 or any earlier kernel? Doug Gilbert
Re: [PATCH] scsi: associate bio write hint with WRITE CDB
On 2019-01-03 4:47 a.m., Randall Huang wrote: On Wed, Jan 02, 2019 at 11:51:33PM -0800, Christoph Hellwig wrote: On Wed, Dec 26, 2018 at 12:15:04PM +0800, Randall Huang wrote: In SPC-3, WRITE(10)/(16) support the grouping function. Let's associate bio write hint with group number for enabling StreamID or Turbo Write feature. Signed-off-by: Randall Huang --- drivers/scsi/sd.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 4b49cb67617e..28bfa9ed2b54 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1201,7 +1201,12 @@ static int sd_setup_read_write_cmnd(struct scsi_cmnd *SCpnt) SCpnt->cmnd[11] = (unsigned char) (this_count >> 16) & 0xff; SCpnt->cmnd[12] = (unsigned char) (this_count >> 8) & 0xff; SCpnt->cmnd[13] = (unsigned char) this_count & 0xff; - SCpnt->cmnd[14] = SCpnt->cmnd[15] = 0; + if (rq_data_dir(rq) == WRITE) { + SCpnt->cmnd[14] = rq->bio->bi_write_hint & 0x3f; + } else { + SCpnt->cmnd[14] = 0; + } No need for braces here. Already sent a new version But what I'm more worried about is devices not recognizing the feature throwing up in the field. Can you check what SBC version first references these or come up with some other decently smart conditional? My reference is SCSI Block Commands – 3 (SBC-3) Revision 25. Section 5.32 WRITE (10) and 5.34 WRITE (16) Maybe Martin has a good idea, too. That is the GROUP NUMBER field. Also found in READ(16) at the same location within its cdb. The proposed code deserves at least an explanatory comment. Since it is relatively recent, perhaps the above should only be done iff: - the REPORT SUPPORTED OPERATION CODES (RSOC) command is supported, and - in the RSOC entry for WRITE(16), the CDB USAGE DATA field (a bit mask) indicates the GROUP NUMBER field is supported That check can be done once, at disk attachment time where there is already code to fetch RSOC. Is there a bi_read_hint? If not then the bi_write_hint should also be applied to READ(16). 
Makes that variable naming look pretty silly though. Doug Gilbert
Re: [PATCH] scsi: avoid a double-fetch and a redundant copy
On 2018-12-25 3:15 p.m., Kangjie Lu wrote: What we need is only "pack_id", so do not create a heap object or copy the whole object in. The fix efficiently copies "pack_id" only. Now this looks like a worthwhile optimization, in some pretty tricky code. I can't see a security angle in it. Did you test it? Well the code as presented doesn't compile and the management takes a dim view of that. Signed-off-by: Kangjie Lu --- drivers/scsi/sg.c | 12 ++-- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index c6ad00703c5b..4dacbfffd113 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -446,16 +446,8 @@ sg_read(struct file *filp, char __user *buf, size_t count, loff_t * ppos) } if (old_hdr->reply_len < 0) { if (count >= SZ_SG_IO_HDR) { - sg_io_hdr_t *new_hdr; - new_hdr = kmalloc(SZ_SG_IO_HDR, GFP_KERNEL); - if (!new_hdr) { - retval = -ENOMEM; - goto free_old_hdr; - } - retval =__copy_from_user - (new_hdr, buf, SZ_SG_IO_HDR); - req_pack_id = new_hdr->pack_id; - kfree(new_hdr); + retval = get_user(req_pack_id, + &((sg_io_hdr_t *)buf->pack_id)); The '->' binds more tightly than the cast and since buf is a 'char *' it doesn't have a member called pack_id. Hopefully your drive to remove redundancy went a little too far and removed the required (but missing) parentheses binding the cast to 'buf'. if (retval) { retval = -EFAULT; goto free_old_hdr; Good work, silly mistake, but it's got me thinking, the heap allocation can be replaced by stack since it's short. The code in this area is more tricky in the v4 driver because I want to specifically exclude the sg_io_v4 (aka v4) interface being sent through write(2)/read(2). The way to do that is to read the first 32 bit integer which should be 'S' for v3, 'Q' for v4. Hmm, just looking further along my mailer I see the kbuild test robot has picked up the error and you have presented another patch which also won't compile. 
Please stop doing that; apply your patch to kernel source and compile it _before_ sending it to this list. Doug Gilbert
Re: [PATCH] scsi: fix a double-fetch bug in sg_write
On 2018-12-25 3:24 p.m., Kangjie Lu wrote: "opcode" has been copied in from user space and checked. We should not copy it in again, as it may have been modified by malicious multi-threaded user programs through race conditions. The fix uses the opcode fetched in the first copy. Signed-off-by: Kangjie Lu Acked-by: Douglas Gilbert Also applied to my sg v4 driver code. The v1 and v2 interfaces (based on struct sg_header) did not provide a command length field. The sg driver needed to read the first byte of the command (the "opcode") to determine the full command's length prior to actually reading it in full. Hard to think of an example of an exploit based on this double read. --- drivers/scsi/sg.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 4dacbfffd113..41774e4f9508 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -686,7 +686,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos) hp->flags = input_size; /* structure abuse ... */ hp->pack_id = old_hdr.pack_id; hp->usr_ptr = NULL; - if (__copy_from_user(cmnd, buf, cmd_size)) + cmnd[0] = opcode; + if (__copy_from_user(cmnd + 1, buf + 1, cmd_size - 1)) return -EFAULT; /* * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV,
Re: remove exofs, the T10 OSD code and block/scsi bidi support V3
On 2018-12-19 9:43 a.m., Christoph Hellwig wrote: On Mon, Nov 26, 2018 at 07:11:10PM +0200, Boaz Harrosh wrote: On 11/11/18 15:32, Christoph Hellwig wrote: The only real user of the T10 OSD protocol, the pNFS object layout driver never went to the point of having shipping products, and we removed it 1.5 years ago. Exofs is just a simple example without real life users. You have failed to say what is your motivation for this patchset? What is it you are trying to fix/improve. Drop basically unused support, which allows us to 1) reduce the size of every kernel with block layer support, and even more for every kernel with scsi support By proposing the removal of bidi support from the block layer, it isn't just the SCSI subsystem that will be impacted. In those NVMe documents that you referred me to earlier in the year, have you noticed the 2 bit direction field in the command tables in 1.3c and earlier, and what 11b means? Even if there aren't any bidi NVMe commands *** yet, the fact remains that NVMe's 64 byte command format has provision for 4 (not 2) independent data transfers (data + meta, for each direction). Surely NVMe will sooner or later take advantage of those ... a command like READ GATHERED comes to mind. 2) reduce the size of the critical struct request structure by 128 bits, thus reducing the memory used by every blk-mq driver significantly, never mind the cache effects Hmm, one pointer (that is null in the non-bidi case) should be enough, that's 64 or 32 bits. 3) stop having the maintainance overhead for this code in the block layer, which has been rather painful at times You won't get any sympathy from me :-) The sg driver is trying to inject _SCSI_ commands into the SCSI mid-level for onward processing by SCSI LLDs. So WTF does it have to deal with the block layer. 
While on the subject of bidi, the order of transfers: is the data-out (to the target) always before the data-in or is it the target device that decides (depending on the semantics of the command) who is first? Doug Gilbert *** there could already be vendor specific bidi NVMe commands out there (ditto for SCSI)
Re: Recent removal of bsg read/write support
On 2018-09-03 10:34 AM, Dror Levin wrote: On Sun, Sep 2, 2018 at 8:55 PM Linus Torvalds wrote: On Sun, Sep 2, 2018 at 4:44 AM Richard Weinberger wrote: CC'ing relevant people. Otherwise your mail might get lost. Indeed. Sorry for that. On Sun, Sep 2, 2018 at 1:37 PM Dror Levin wrote: We have an internal tool that uses the bsg read/write interface to issue SCSI commands as part of a test suite for a storage device. After recently reading on LWN that this interface is to be removed we tried porting our code to use sg instead. However, that raises new issues - mainly getting ENOMEM over iSCSI for unknown reasons. Is there any chance that you can make more data available? Sure, I can try. We use writev() to send up to SG_MAX_QUEUE tasks at a time. Occasionally not all tasks are written at which point we wait for tasks to return before sending more, but then writev() fails with ENOMEM and we see this in the syslog: Sep 1 20:58:14 gdc-qa-io-017 kernel: sd 441:0:0:5: [sg73] sg_common_write: start_req err=-12 Failing tasks are reads of 128KiB. I'd rather fix the sg interface (which while also broken garbage, we can't get rid of) than resurrect the bsg interface. That said, the removed bsg code looks a hell of a lot prettier than the nasty sg interface code does, although it also lacks absolutely _any_ kind of security checking. For us the bsg interface also has several advantages over sg: 1. The device name is its HCTL which is nicer than an arbitrary integer. 2. write() supports writing more than one sg_io_v4 struct so we don't have to resort to writev(). 3. Queue size is the device's queue depth and not SG_MAX_QUEUE which is 16. Because of this we would like to continue using the bsg interface, even if some changes are required to meet security concerns. I wonder if we could at least try to unify the bsg/sg code - possibly by making sg use the prettier bsg code (but definitely have to add all the security measures). 
And dammit, the SCSI people need to get their heads out of their arses. This whole "stream random commands over read/write" needs to go the f*ck away. Could we perhaps extend the SG_IO interface to have an async mode? Instead of "read/write", have "SG_IOSUBMIT" and "SG_IORECEIVE" and have the SG_IO ioctl just be a shorthand of "both". Just my two cents - having an interface other than read/write won't allow users to treat this fd as a regular file with epoll() and read(). This is a major bonus for this interface - an sg/bsg device can be used just like a socket or pipe in any reactor (we use boost asio for example). The advantage of having two ioctls is that they can both pass (meta-)data bidirectionally. That is hard to do with standard read() and write() calls. The command tag is the piece of meta-data that goes against the flow: returned from SG_IOSUBMIT, optionally given to SG_IORECEIVE (which might have a 'cancel command' flag). The sg v1, v2 and v3 interfaces could keep their write()/read() interfaces for backward compatibility (to Linux 1.0.0, March 1994 for sg v1). New, clean submit and receive paths could be added to the sg driver for the v3 and v4 twin ioctl interface. Previously the sg v4 interface was only supported by the bsg driver. One advantage of sg v4 over v3 is support for bidi commands. Not sure if epoll/poll works with an ioctl, if not we could add a "dummy" read() call that notionally returned SCSI status. The SG_IORECEIVE ioctl would still be needed to "clean up" the command, and optionally transfer the data-in buffer. Tony Battersby has also requested twin ioctls saying that it is extremely tedious ploughing through logs full of SG_IO calls and that clearly separating submits from receives would make things somewhat better. Doug Gilbert
Re: 4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Re: 4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
On 2018-08-30 02:03 PM, Ulf Hansson wrote:
On 28 August 2018 at 23:47, Douglas Gilbert wrote:

I usually boot my Lenovo X270 with a SD card in its:

# lspci
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
...

In lk 4.19.0-rc1 the boot locks up solid, almost immediately and nothing in the logs. If I remove the SD card my machine boots and works okay until I insert the SD card. Then:

Aug 28 23:30:38 xtwo70 kernel: mmc0: cannot verify signal voltage switch
Aug 28 23:30:38 xtwo70 kernel: mmc0: new ultra high speed SDR104 SDXC card at address
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: mmc0: ACLCE 59.5 GiB
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: p1 p2
Aug 28 23:30:38 xtwo70 kernel: rtsx_pci_sdmmc rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Aug 28 23:30:38 xtwo70 kernel: BUG: unable to handle kernel NULL pointer dereference at 0018
Aug 28 23:30:38 xtwo70 kernel: PGD 0 P4D 0
Aug 28 23:30:38 xtwo70 kernel: Oops: [#1] SMP
Aug 28 23:30:38 xtwo70 kernel: CPU: 3 PID: 1571 Comm: kworker/3:2 Not tainted 4.19.0-rc1 #78
Aug 28 23:30:38 xtwo70 kernel: Hardware name: LENOVO 20HNCTO1WW/20HNCTO1WW, BIOS R0IET53W (1.31 ) 05/22/2018
Aug 28 23:30:38 xtwo70 kernel: Workqueue: events sd_request [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: RIP: 0010:rtsx_pci_dma_transfer+0x6e/0x260 [rtsx_pci]
Aug 28 23:30:38 xtwo70 kernel: Code: 49 89 fe 45 89 c5 c7 87 90 00 00 00 00 00 00 00 8d 6a ff 81 c9 00 00 00 88 31 d2 41 89 cc 45 31 ff eb 07 41 8b 96 90 00 00 00 <8b> 78 18 48 63 ca 31 f6 44 39 fd 48 8b 50 10 40 0f 94 c6 41 83 c7
Aug 28 23:30:38 xtwo70 kernel: RSP: 0018:c9217d78 EFLAGS: 00010202
Aug 28 23:30:38 xtwo70 kernel: RAX: RBX: 0003 RCX:
Aug 28 23:30:38 xtwo70 kernel: RDX: 0001 RSI: 0021 RDI: 8801b6328000
Aug 28 23:30:38 xtwo70 kernel: RBP: 0002 R08: 880036000400 R09:
Aug 28 23:30:38 xtwo70 kernel: R10: R11: R12: a800
Aug 28 23:30:38 xtwo70 kernel: R13: 2710 R14: 88021fd35400 R15: 0001
Aug 28 23:30:38 xtwo70 kernel: FS: () GS:88022738() knlGS:
Aug 28 23:30:38 xtwo70 kernel: CS: 0010 DS: ES: CR0: 80050033
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018 CR3: 0400f006 CR4: 003606e0
Aug 28 23:30:38 xtwo70 kernel: Call Trace:
Aug 28 23:30:38 xtwo70 kernel: ? mark_held_locks+0x50/0x80
Aug 28 23:30:38 xtwo70 kernel: ? _raw_spin_unlock_irqrestore+0x2d/0x40
Aug 28 23:30:38 xtwo70 kernel: sd_request+0x385/0x81a [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: process_one_work+0x287/0x5e0
Aug 28 23:30:38 xtwo70 kernel: worker_thread+0x28/0x3d0
Aug 28 23:30:38 xtwo70 kernel: ? process_one_work+0x5e0/0x5e0
Aug 28 23:30:38 xtwo70 kernel: kthread+0x10e/0x130
Aug 28 23:30:38 xtwo70 kernel: ? kthread_park+0x80/0x80
Aug 28 23:30:38 xtwo70 kernel: ret_from_fork+0x3a/0x50
Aug 28 23:30:38 xtwo70 kernel: Modules linked in: mmc_block fuse msr bnep ccm btusb btrtl btbcm btintel bluetooth squashfs ecdh_generic binfmt_misc intel_rapl nls_iso8859_1 nls_cp437 x86_pkg_temp_thermal vfat intel_powerclamp fat coretemp kvm_intel arc4 snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iwlmvm aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mac80211 intel_cstate intel_uncore intel_rapl_perf snd_hda_intel joydev snd_hda_codec mousedev snd_hwdep iwlwifi snd_hda_core input_leds efi_pstore snd_pcm serio_raw efivars cfg80211 rtsx_pci_ms memstick mei_me idma64 virt_dma mei intel_lpss_pci intel_lpss intel_pch_thermal thinkpad_acpi nvram snd_seq_dummy tps6598x snd_seq_oss typec snd_seq_midi snd_rawmidi snd_seq_midi_event
Aug 28 23:30:38 xtwo70 kernel: snd_seq snd_seq_device snd_timer snd soundcore rfkill tpm_crb tpm_tis tpm_tis_core tpm evdev mac_hid pcc_cpufreq ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc mmc_core i915 nvme e1000e i2c_algo_bit nvme_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm xhci_hcd video drm_panel_orientation_quirks usbcore intel_gtt agpgart usb_common rtsx_pci
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018
Aug 28 23:30:38 xtwo70 kernel: ---[ end trace bb8ce18072d22d51 ]---
Aug 28 23:30:38 xtwo70 dbus-daemon[2110]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostn
4.19.0-rc1 rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
I usually boot my Lenovo X270 with a SD card in its:

# lspci
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader (rev 01)
...

In lk 4.19.0-rc1 the boot locks up solid, almost immediately and nothing in the logs. If I remove the SD card my machine boots and works okay until I insert the SD card. Then:

Aug 28 23:30:38 xtwo70 kernel: mmc0: cannot verify signal voltage switch
Aug 28 23:30:38 xtwo70 kernel: mmc0: new ultra high speed SDR104 SDXC card at address
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: mmc0: ACLCE 59.5 GiB
Aug 28 23:30:38 xtwo70 kernel: mmcblk0: p1 p2
Aug 28 23:30:38 xtwo70 kernel: rtsx_pci_sdmmc rtsx_pci_sdmmc.0: error: data->host_cookie = 62, host->cookie = 63
Aug 28 23:30:38 xtwo70 kernel: BUG: unable to handle kernel NULL pointer dereference at 0018
Aug 28 23:30:38 xtwo70 kernel: PGD 0 P4D 0
Aug 28 23:30:38 xtwo70 kernel: Oops: [#1] SMP
Aug 28 23:30:38 xtwo70 kernel: CPU: 3 PID: 1571 Comm: kworker/3:2 Not tainted 4.19.0-rc1 #78
Aug 28 23:30:38 xtwo70 kernel: Hardware name: LENOVO 20HNCTO1WW/20HNCTO1WW, BIOS R0IET53W (1.31 ) 05/22/2018
Aug 28 23:30:38 xtwo70 kernel: Workqueue: events sd_request [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: RIP: 0010:rtsx_pci_dma_transfer+0x6e/0x260 [rtsx_pci]
Aug 28 23:30:38 xtwo70 kernel: Code: 49 89 fe 45 89 c5 c7 87 90 00 00 00 00 00 00 00 8d 6a ff 81 c9 00 00 00 88 31 d2 41 89 cc 45 31 ff eb 07 41 8b 96 90 00 00 00 <8b> 78 18 48 63 ca 31 f6 44 39 fd 48 8b 50 10 40 0f 94 c6 41 83 c7
Aug 28 23:30:38 xtwo70 kernel: RSP: 0018:c9217d78 EFLAGS: 00010202
Aug 28 23:30:38 xtwo70 kernel: RAX: RBX: 0003 RCX:
Aug 28 23:30:38 xtwo70 kernel: RDX: 0001 RSI: 0021 RDI: 8801b6328000
Aug 28 23:30:38 xtwo70 kernel: RBP: 0002 R08: 880036000400 R09:
Aug 28 23:30:38 xtwo70 kernel: R10: R11: R12: a800
Aug 28 23:30:38 xtwo70 kernel: R13: 2710 R14: 88021fd35400 R15: 0001
Aug 28 23:30:38 xtwo70 kernel: FS: () GS:88022738() knlGS:
Aug 28 23:30:38 xtwo70 kernel: CS: 0010 DS: ES: CR0: 80050033
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018 CR3: 0400f006 CR4: 003606e0
Aug 28 23:30:38 xtwo70 kernel: Call Trace:
Aug 28 23:30:38 xtwo70 kernel: ? mark_held_locks+0x50/0x80
Aug 28 23:30:38 xtwo70 kernel: ? _raw_spin_unlock_irqrestore+0x2d/0x40
Aug 28 23:30:38 xtwo70 kernel: sd_request+0x385/0x81a [rtsx_pci_sdmmc]
Aug 28 23:30:38 xtwo70 kernel: process_one_work+0x287/0x5e0
Aug 28 23:30:38 xtwo70 kernel: worker_thread+0x28/0x3d0
Aug 28 23:30:38 xtwo70 kernel: ? process_one_work+0x5e0/0x5e0
Aug 28 23:30:38 xtwo70 kernel: kthread+0x10e/0x130
Aug 28 23:30:38 xtwo70 kernel: ? kthread_park+0x80/0x80
Aug 28 23:30:38 xtwo70 kernel: ret_from_fork+0x3a/0x50
Aug 28 23:30:38 xtwo70 kernel: Modules linked in: mmc_block fuse msr bnep ccm btusb btrtl btbcm btintel bluetooth squashfs ecdh_generic binfmt_misc intel_rapl nls_iso8859_1 nls_cp437 x86_pkg_temp_thermal vfat intel_powerclamp fat coretemp kvm_intel arc4 snd_hda_codec_hdmi kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iwlmvm aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mac80211 intel_cstate intel_uncore intel_rapl_perf snd_hda_intel joydev snd_hda_codec mousedev snd_hwdep iwlwifi snd_hda_core input_leds efi_pstore snd_pcm serio_raw efivars cfg80211 rtsx_pci_ms memstick mei_me idma64 virt_dma mei intel_lpss_pci intel_lpss intel_pch_thermal thinkpad_acpi nvram snd_seq_dummy tps6598x snd_seq_oss typec snd_seq_midi snd_rawmidi snd_seq_midi_event
Aug 28 23:30:38 xtwo70 kernel: snd_seq snd_seq_device snd_timer snd soundcore rfkill tpm_crb tpm_tis tpm_tis_core tpm evdev mac_hid pcc_cpufreq ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid rtsx_pci_sdmmc mmc_core i915 nvme e1000e i2c_algo_bit nvme_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci drm xhci_hcd video drm_panel_orientation_quirks usbcore intel_gtt agpgart usb_common rtsx_pci
Aug 28 23:30:38 xtwo70 kernel: CR2: 0018
Aug 28 23:30:38 xtwo70 kernel: ---[ end trace bb8ce18072d22d51 ]---
Aug 28 23:30:38 xtwo70 dbus-daemon[2110]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.77' (uid=1000 pid=3518
Re: [PATCH] scsi: sg: fix a missing-check bug
On 2018-05-05 11:21 PM, Wenwen Wang wrote:

In sg_write(), the opcode of the command is first copied from the userspace pointer 'buf' and saved to the kernel variable 'opcode', using the __get_user() function. The size of the command, i.e. 'cmd_size', is then calculated based on the 'opcode'. After that, the whole command, including the opcode, is copied again from 'buf' using the __copy_from_user() function and saved to 'cmnd'. Finally, the function sg_common_write() is invoked to process 'cmnd'.

Given that the 'buf' pointer resides in userspace, a malicious userspace process can race to change the opcode of the command between the two copies. That means the opcode indicated by the variable 'opcode' could be different from the opcode in 'cmnd'. This can cause inconsistent data in 'cmnd' and potential logical errors in the function sg_common_write(), as it needs to work on 'cmnd'. This patch reuses the opcode obtained in the first copy and only copies the remaining part of the command from userspace.

Signed-off-by: Wenwen Wang
---
 drivers/scsi/sg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index c198b963..0ad8106 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -657,7 +657,8 @@ sg_write(struct file *filp, const char __user *buf, size_t count, loff_t * ppos)
 	hp->flags = input_size;	/* structure abuse ... */
 	hp->pack_id = old_hdr.pack_id;
 	hp->usr_ptr = NULL;
-	if (__copy_from_user(cmnd, buf, cmd_size))
+	cmnd[0] = opcode;
+	if (__copy_from_user(cmnd + 1, buf + 1, cmd_size - 1))
 		return -EFAULT;
 	/*
 	 * SG_DXFER_TO_FROM_DEV is functionally equivalent to SG_DXFER_FROM_DEV,

That is in the deprecated "v2" part of the sg driver (deprecated for around 15 years). There are lots more interesting races with that interface than the one described above.

My guess is that all system calls would be susceptible to a thread other than the one doing the system call playing around with a buffer being passed to or from the OS during that call. Surely no Unix-like OS gives any security guarantees to a thread being attacked by a malevolent thread in the same process! My question is: did this actually cause the program to fail, or is it something that a sanity checker flagged? Also, wouldn't it be better just to return an error such as EINVAL if opcode != cmnd[0]?

Doug Gilbert
Re: usercopy whitelist woe in scsi_sense_cache
On 2018-04-04 04:32 PM, Kees Cook wrote:

On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko wrote:

[ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)!

I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling smartctl in a loop and doing some usual background I/O. The warning is triggered within 3 minutes or so (not instantly).

Also: Can you send me your .config? What SCSI drivers are you using in the VM and on the real server? Are you able to see what ioctl()s smartctl is issuing? I'll try to reproduce this on my end...

smartctl -r scsiioctl,3
Re: usercopy whitelist woe in scsi_sense_cache
On 2018-04-04 04:21 PM, Kees Cook wrote:

On Wed, Apr 4, 2018 at 12:07 PM, Oleksandr Natalenko wrote:

With v4.16 I get the following dump while using smartctl:

[...]
[ 261.262135] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 94, size 22)!
[...]
[ 261.345976] Call Trace:
[ 261.350620] __check_object_size+0x130/0x1a0
[ 261.355775] sg_io+0x269/0x3f0
[ 261.360729] ? path_lookupat+0xaa/0x1f0
[ 261.364027] ? current_time+0x18/0x70
[ 261.366684] scsi_cmd_ioctl+0x257/0x410
[ 261.369871] ? xfs_bmapi_read+0x1c3/0x340 [xfs]
[ 261.372231] sd_ioctl+0xbf/0x1a0 [sd_mod]
[ 261.375456] blkdev_ioctl+0x8ca/0x990
[ 261.381156] ? read_null+0x10/0x10
[ 261.384984] block_ioctl+0x39/0x40
[ 261.388739] do_vfs_ioctl+0xa4/0x630
[ 261.392624] ? vfs_write+0x164/0x1a0
[ 261.396658] SyS_ioctl+0x74/0x80
[ 261.399563] do_syscall_64+0x74/0x190
[ 261.402685] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

This is:

sg_io+0x269/0x3f0:
blk_complete_sghdr_rq at block/scsi_ioctl.c:280
(inlined by) sg_io at block/scsi_ioctl.c:376

which is:

	if (req->sense_len && hdr->sbp) {
		int len = min((unsigned int) hdr->mx_sb_len, req->sense_len);

		if (!copy_to_user(hdr->sbp, req->sense, len))
			hdr->sb_len_wr = len;
		else
			ret = -EFAULT;
	}

[...]

I can easily reproduce it with a qemu VM and 2 virtual SCSI disks by calling smartctl in a loop and doing some usual background I/O. The warning is triggered within 3 minutes or so (not instantly). Initially, it was produced on my server after a kernel update (because disks are monitored with smartctl via Zabbix). Looks like the thing was introduced with 0afe76e88c57d91ef5697720aed380a339e3df70. Any idea how to deal with this please? If needed, I can provide any additional info, and also I'm happy/ready to test any proposed patches.

Interesting, and a little confusing.

So, what's strange here is that the scsi_sense_cache already has a full whitelist:

	kmem_cache_create_usercopy("scsi_sense_cache",
		SCSI_SENSE_BUFFERSIZE, 0, SLAB_HWCACHE_ALIGN,
		0, SCSI_SENSE_BUFFERSIZE, NULL);

Arg 2 is the buffer size, arg 5 is the whitelist offset (0), and the whitelist size (same as arg 2). In other words, the entire buffer should be whitelisted. include/scsi/scsi_cmnd.h says:

	#define SCSI_SENSE_BUFFERSIZE 96

That means scsi_sense_cache should be 96 bytes in size? But a 22 byte read starting at offset 94 happened? That seems like a 20 byte read beyond the end of the SLUB object? Though if it were reading past the actual end of the object, I'd expect the hardened usercopy BUG (rather than the WARN) to kick in. Ah, it looks like /sys/kernel/slab/scsi_sense_cache/slab_size shows this to be 128 bytes of actual allocation, so the 20 bytes doesn't strictly overlap another object (hence no BUG):

	/sys/kernel/slab/scsi_sense_cache# grep . object_size usersize slab_size
	object_size:96
	usersize:96
	slab_size:128

Ah, right, due to SLAB_HWCACHE_ALIGN, the allocation is rounded up to the next cache line size, so there's 32 bytes of padding to reach 128. James or Martin, is this over-read "expected" behavior? i.e. does the sense cache buffer usage ever pull the ugly trick of silently expanding its allocation into the space the slab allocator has given it? If not, this looks like a real bug. What I don't see is how req->sense is _not_ at offset 0 in the scsi_sense_cache object...

Looking at the smartctl SCSI code, it pulls 32 byte sense buffers. Can't see 22 anywhere relevant in its code. There are two types of sense: fixed and descriptor. With fixed you seldom need more than 18 bytes (but it can only represent 32 bit LBAs). The other type has a header and 0 or more variable length descriptors. If decoding of descriptor sense went wrong you might end up at offset 94. But not with smartctl.

Doug Gilbert
Re: [PATCH] scsi: resolve COMMAND_SIZE at compile time
On 2018-03-10 03:49 PM, James Bottomley wrote:

On Sat, 2018-03-10 at 14:29 +0100, Stephen Kitt wrote:

Hi Bart,

On Fri, 9 Mar 2018 22:47:12 +0000, Bart Van Assche wrote:

On Fri, 2018-03-09 at 23:33 +0100, Stephen Kitt wrote:

+/*
+ * SCSI command sizes are as follows, in bytes, for fixed size commands, per
+ * group: 6, 10, 10, 12, 16, 12, 10, 10. The top three bits of an opcode
+ * determine its group.
+ * The size table is encoded into a 32-bit value by subtracting each value
+ * from 16, resulting in a value of 1715488362
+ * (6 << 28 + 6 << 24 + 4 << 20 + 0 << 16 + 4 << 12 + 6 << 8 + 6 << 4 + 10).
+ * Command group 3 is reserved and should never be used.
+ */
+#define COMMAND_SIZE(opcode) \
+	(16 - (15 & (1715488362 >> (4 * (((opcode) >> 5) & 7)))))

To me this seems hard to read and hard to verify. Could this have been written as a combination of ternary expressions, e.g. using a gcc statement expression to ensure that opcode is evaluated once?

That’s what I’d tried initially, e.g.

#define COMMAND_SIZE(opcode) ({ \
	int index = ((opcode) >> 5) & 7; \
	index == 0 ? 6 : (index == 4 ? 16 : index == 3 || index == 5 ? 12 : 10); \
})

But gcc still reckons that results in a VLA, defeating the initial purpose of the exercise. Does it help if I make the magic value construction clearer?

#define SCSI_COMMAND_SIZE_TBL ( \
	(16 - 6) \
	+ ((16 - 10) << 4) \
	+ ((16 - 10) << 8) \
	+ ((16 - 12) << 12) \
	+ ((16 - 16) << 16) \
	+ ((16 - 12) << 20) \
	+ ((16 - 10) << 24) \
	+ ((16 - 10) << 28))

#define COMMAND_SIZE(opcode) \
	(16 - (15 & (SCSI_COMMAND_SIZE_TBL >> (4 * (((opcode) >> 5) & 7)))))

Couldn't we do the less clever thing of making the array a static const and moving it to a header? That way the compiler should be able to work it out at compile time.

And maybe add a comment that as of now (SPC-5 rev 19), COMMAND_SIZE is not valid for opcodes 0x7e and 0x7f plus everything above and including 0xc0. The latter ones are vendor specific and are loosely constrained, probably all even numbered lengths in the closed range [6,260].

If the SCSI command sets want to keep up with NVMe, they may want to think about how they can gainfully use cdb_s that are > 64 bytes long. WRITE SCATTERED got into SBC-4 but READ GATHERED didn't, due to lack of interest. The READ GATHERED proposed was a bidi command, but it could have been a simpler data-in command with a looong cdb (holding LBA, number_of_blocks pairs).

Doug Gilbert
Re: [PATCH] scsi: resolve COMMAND_SIZE at compile time
On 2018-03-10 03:49 PM, James Bottomley wrote: On Sat, 2018-03-10 at 14:29 +0100, Stephen Kitt wrote: Hi Bart, On Fri, 9 Mar 2018 22:47:12 +, Bart Van Assche wrote: On Fri, 2018-03-09 at 23:33 +0100, Stephen Kitt wrote: +/* + * SCSI command sizes are as follows, in bytes, for fixed size commands, per + * group: 6, 10, 10, 12, 16, 12, 10, 10. The top three bits of an opcode + * determine its group. + * The size table is encoded into a 32-bit value by subtracting each value + * from 16, resulting in a value of 1715488362 + * (6 << 28 + 6 << 24 + 4 << 20 + 0 << 16 + 4 << 12 + 6 << 8 + 6 << 4 + 10). + * Command group 3 is reserved and should never be used. + */ +#define COMMAND_SIZE(opcode) \ + (16 - (15 & (1715488362 >> (4 * (((opcode) >> 5) & 7) To me this seems hard to read and hard to verify. Could this have been written as a combination of ternary expressions, e.g. using a gcc statement expression to ensure that opcode is evaluated once? That’s what I’d tried initially, e.g. #define COMMAND_SIZE(opcode) ({ \ int index = ((opcode) >> 5) & 7; \ index == 0 ? 6 : (index == 4 ? 16 : index == 3 || index == 5 ? 12 : 10); \ }) But gcc still reckons that results in a VLA, defeating the initial purpose of the exercise. Does it help if I make the magic value construction clearer? #define SCSI_COMMAND_SIZE_TBL ( \ (16 - 6)\ + ((16 - 10) << 4) \ + ((16 - 10) << 8) \ + ((16 - 12) << 12) \ + ((16 - 16) << 16) \ + ((16 - 12) << 20) \ + ((16 - 10) << 24) \ + ((16 - 10) << 28)) #define COMMAND_SIZE(opcode) \ (16 - (15 & (SCSI_COMMAND_SIZE_TBL >> (4 * (((opcode) >> 5) & 7) Couldn't we do the less clever thing of making the array a static const and moving it to a header? That way the compiler should be able to work it out at compile time. And maybe add a comment that as of now (SPC-5 rev 19), COMMAND_SIZE is not valid for opcodes 0x7e and 0x7f plus everything above and including 0xc0. 
The latter ones are vendor specific and are loosely constrained, probably all even-numbered lengths in the closed range [6,260]. If the SCSI command sets want to keep up with NVMe, they may want to think about how they can gainfully use CDBs that are > 64 bytes long. WRITE SCATTERED got into SBC-4 but READ GATHERED didn't, due to lack of interest. The READ GATHERED proposed was a bidi command, but it could have been a simpler data-in command with a looong cdb (holding LBA, number_of_blocks pairs).

Doug Gilbert
Re: scsi: sg: assorted memory corruptions
On 2018-01-30 07:22 AM, Dmitry Vyukov wrote:
Uh, I've answered this a week ago, but did not notice that Doug dropped everybody from CC. Reporting to all.

On Mon, Jan 22, 2018 at 8:16 PM, Douglas Gilbert <dgilb...@interlog.com> wrote:
On 2018-01-22 02:06 PM, Dmitry Vyukov wrote:
On Mon, Jan 22, 2018 at 7:57 PM, Douglas Gilbert <dgilb...@interlog.com>

Please show me the output of 'lsscsi -g' on your test machine. /dev/sg0 is often associated with /dev/sda, which is often a SATA SSD (or a virtualized one) that holds the root file system. With the sg pass-through driver it is relatively easy to write random (user provided) data over the root file system, which will almost certainly "root" the system.

This is a pretty standard qemu vm started with:

qemu-system-x86_64 -hda wheezy.img -net user,host=10.0.2.10 -net nic -nographic -kernel arch/x86/boot/bzImage -append "console=ttyS0 root=/dev/sda earlyprintk=serial " -m 2G -smp 4

# lsscsi -g
[0:0:0:0]  disk  ATA  QEMU HARDDISK  0  /dev/sda  /dev/sg0

With lk 4.15.0-rc9 I can run your test program (with some additions, see attachment) for 30 minutes against a scsi_debug simulated disk. You can easily replicate this test: just run 'modprobe scsi_debug' and a third line should appear in your lsscsi output. The new device will most likely be /dev/sg2.

With lk 4.15.0 (release) running against a SAS SSD (SEAGATE ST200FM0073), the test has been running 20 minutes and counting without problems. That is using a LSI HBA with the mpt3sas driver.

[1:0:0:0]  cd/dvd  QEMU  QEMU DVD-ROM  2.0.  /dev/sr0  /dev/sg1
# readlink /sys/class/scsi_generic/sg0
../../devices/pci:00/:00:01.1/ata1/host0/target0:0:0/0:0:0:0/scsi_generic/sg0
# cat /sys/class/scsi_generic/sg0/device/vendor
ATA
^ That subsystem is the culprit IMO, most likely libata.

Until you can show this test failing on something other than an ATA disk, then I will treat this issue as closed.

Doug Gilbert

Perhaps it misbehaves when it gets a SCSI command in the T10 range (i.e.
not vendor specific) with a 9-byte cdb length. As far as I'm aware, T10 (and the ANSI committee before it) have never defined a cdb with an odd length.

For those that are not aware, the sg driver is a relatively thin shim over the block layer, the SCSI mid-level, and a low-level driver which may have another kernel driver stack underneath it (e.g. UAS (USB attached SCSI)). The previous report from syzkaller on the sg driver ("scsi: memory leak in sg_start_req") has resulted in one accepted patch on the block layer, with probably more to come in the same area.

Testing the patch Dmitry gave (with some added error checks which reported no problems) with the scsi_debug driver supplying /dev/sg0, I have not seen any problems running that test program. Again there might be a very slow memory leak, but if there is, I don't believe it is in the sg driver.

Did you run it in a loop? First runs pass just fine for me too.

Is thirty minutes long enough??

Yes, it certainly should be enough. Here is what I see:

# while ./a.out; do echo RUN; done
RUN
RUN
RUN
RUN
RUN
RUN
RUN
[  371.977266] ==
[  371.980158] BUG: KASAN: double-free or invalid-free in __put_task_struct+0x1e7/0x5c0

Here is the full execution trace of the write call if that will be of any help:
https://gist.githubusercontent.com/dvyukov/14ae64c3e753dedf9ab2608676ecf0b9/raw/9803d52bb1e317a9228e362236d042aaf0fa9d69/gistfile1.txt

This is on upstream commit 0d665e7b109d512b7cae3ccef6e8654714887844. Also attaching my config just in case.

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>

#define SG_NEXT_CMD_LEN 0x2283

static const char * usage = "sg_syzk_next_cdb <sg_device>  # (e.g. '/dev/sg3') ";

int main(int argc, const char * argv[])
{
    int res, err;
    int fd;
    long len = 9;
    char* p = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x47\x00\x00\x24\x00"
              "\x00\x00\x00\x00\x00\x1c\xbb\xac\x14\x00\xaa\xe0\x00\x00\x01"
              "\x00\x07\x07\x00\x00\x59\x08\x00\x00\x00\x80\xfe\x7f\x00\x00\x01";
    const char * dev_name;
    struct stat a_stat;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s\n", usage);
        return 1;
    }
    dev_name = argv[1];
    if (0 != stat(dev_name, &a_stat)) {
        err = errno;
        fprintf(stderr, "Unable to stat %s, err: %s\n", dev_name,
                strerror(err));
        return 1;
    }
    if ((a_stat.st_mode & S_IFMT) != S_IFCHR) {
        fprintf(stderr, "Expected %s, to be sg device\n", dev_name);
        return 1;
    }
    fd = open(dev_name, O_RDWR);
    if (fd < 0) {
        err = errno;
        fprintf(stderr, "open(%s) failed: %s [%d]\n", dev_name,
                strerror(err), err);
    }
    res = ioctl(fd, SG_NEXT_CMD