linux-next: BUG: KASAN: use-after-free in bt_iter+0x29b/0x310
Hi,

We run CRIU tests on linux-next. Today we found this bug in a kernel log:
https://travis-ci.org/avagin/linux/jobs/462912976

[2.516900] random: fast init done
[2.591491] sd 0:0:1:0: [sda] 146800640 512-byte logical blocks: (75.2 GB/70.0 GiB)
[2.591688] sd 0:0:1:0: Attached scsi generic sg0 type 0
[2.591703] sd 0:0:1:0: [sda] 4096-byte physical blocks
[2.592085] sd 0:0:1:0: [sda] Write Protect is off
[2.592245] sd 0:0:1:0: [sda] Mode Sense: 1f 00 00 08
[2.592390] sd 0:0:1:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[2.597534] ==
[2.597694] BUG: KASAN: use-after-free in bt_iter+0x29b/0x310
[2.597813] Read of size 8 at addr 8881d44a1780 by task kworker/u4:0/7
[2.597929]
[2.598042] CPU: 1 PID: 7 Comm: kworker/u4:0 Not tainted 4.20.0-rc5-next-20181203+ #1
[2.598170] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[2.598308] Workqueue: events_unbound async_run_entry_fn
[2.598424] Call Trace:
[2.598549]  dump_stack+0x5b/0x8b
[2.598666]  print_address_description+0x6a/0x270
[2.598796]  ? bt_iter+0x29b/0x310
[2.598910]  kasan_report+0x133/0x1ae
[2.599024]  ? bt_iter+0x29b/0x310
[2.599152]  ? bt_iter+0x29b/0x310
[2.599285]  bt_iter+0x29b/0x310
[2.599402]  blk_mq_queue_tag_busy_iter+0x481/0x8f0
[2.599525]  ? blk_mq_stop_hw_queues+0x100/0x100
[2.599644]  ? blk_mq_put_tag+0x150/0x150
[2.599760]  ? do_raw_spin_unlock+0x54/0x220
[2.599879]  ? blk_mq_stop_hw_queues+0x100/0x100
[2.58]  ? __sbitmap_get_word+0x2a/0x80
[2.600116]  blk_mq_in_flight+0xd2/0x130
[2.600232]  ? blk_mq_end_request+0x430/0x430
[2.600353]  ? blk_account_io_start+0x602/0x760
[2.600469]  ? find_held_lock+0x32/0x1c0
[2.600597]  part_round_stats+0x11c/0x690
[2.600715]  ? blk_get_request+0xa0/0xa0
[2.600831]  ? lock_acquire+0xfe/0x290
[2.600949]  blk_account_io_start+0x404/0x760
[2.601065]  ? kvm_clock_get_cycles+0xd/0x10
[2.601180]  ? ktime_get+0x9c/0x120
[2.601323]  ? blk_account_io_done+0x750/0x750
[2.601439]  ? blk_mq_get_request+0xd54/0x1720
[2.601562]  ? dd_request_merge+0x220/0x220
[2.601681]  blk_mq_make_request+0x825/0xf70
[2.601808]  ? blk_mq_try_issue_directly+0x130/0x130
[2.601925]  ? generic_make_request_checks+0xa89/0x18f0
[2.602042]  ? blk_cleanup_queue+0x1b0/0x1b0
[2.602158]  ? blk_dump_rq_flags+0x3b0/0x3b0
[2.602277]  ? kthread+0x2e9/0x3a0
[2.602392]  ? kasan_unpoison_shadow+0x35/0x40
[2.602512]  ? kasan_kmalloc+0xa5/0xd0
[2.602629]  generic_make_request+0x541/0xd60
[2.602746]  ? mempool_alloc+0xf7/0x2c0
[2.602862]  ? blk_queue_enter+0x840/0x840
[2.602981]  ? guard_bio_eod+0x151/0x4c0
[2.603096]  ? find_held_lock+0x32/0x1c0
[2.603234]  ? submit_bio+0x142/0x3f0
[2.603354]  submit_bio+0x142/0x3f0
[2.603469]  ? lock_downgrade+0x5d0/0x5d0
[2.603589]  ? lock_acquire+0xfe/0x290
[2.603704]  ? generic_make_request+0xd60/0xd60
[2.603821]  ? bvec_alloc+0x270/0x270
[2.603937]  ? guard_bio_eod+0x169/0x4c0
[2.604055]  submit_bh_wbc+0x4d0/0x710
[2.604172]  ? _raw_spin_unlock+0x24/0x30
[2.604291]  block_read_full_page+0x3e6/0x830
[2.604408]  ? I_BDEV+0x10/0x10
[2.604527]  ? __bread_gfp+0x1f0/0x1f0
[2.604653]  ? add_to_page_cache_lru+0x112/0x1c0
[2.604770]  ? add_to_page_cache_locked+0x10/0x10
[2.604892]  ? alloc_pages_current+0xb3/0x2b0
[2.605009]  do_read_cache_page+0x658/0x10f0
[2.605127]  ? blkdev_writepages+0x10/0x10
[2.605243]  ? pagecache_get_page+0x6a0/0x6a0
[2.605361]  ? __device_add_disk+0xc9e/0xf40
[2.605476]  ? sd_probe_async+0x42d/0x720
[2.605596]  ? async_run_entry_fn+0xc3/0x5d0
[2.605711]  ? process_one_work+0x96c/0x16c0
[2.605828]  ? worker_thread+0x87/0xe80
[2.605941]  ? kthread+0x2e9/0x3a0
[2.606054]  ? ret_from_fork+0x35/0x40
[2.606171]  ? __save_stack_trace+0x5e/0x100
[2.606291]  ? deref_stack_reg+0xad/0xe0
[2.606406]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[2.606533]  ? depot_save_stack+0x2d9/0x460
[2.606650]  ? fs_reclaim_release.part.90+0x5/0x20
[2.606766]  ? find_held_lock+0x32/0x1c0
[2.606885]  read_dev_sector+0xbb/0x380
[2.607002]  read_lba+0x34d/0x620
[2.607118]  ? ultrix_partition+0x7a0/0x7a0
[2.607233]  ? kasan_unpoison_shadow+0x35/0x40
[2.607354]  efi_partition+0x2f2/0x1690
[2.607468]  ? get_page_from_freelist+0x7dc/0x4120
[2.607595]  ? vzalloc+0x8c/0xb0
[2.607708]  ? check_partition+0xe6/0x680
[2.607826]  ? is_gpt_valid.part.5+0xd80/0xd80
[2.607941]  ? get_page_from_freelist+0x70e/0x4120
[2.608062]  ? string+0x14c/0x220
[2.608178]  ? string+0x14c/0x220
[2.608296]  ? format_decode+0x3be/0x760
[2.608417]  ? memcpy+0x39/0x50
[2.608536]  ? vsnprintf+0x204/0
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
On 04/12/2018 02:14, Dongli Zhang wrote: > Hi Boris, > > On 12/04/2018 12:07 AM, Boris Ostrovsky wrote: >> On 12/2/18 3:31 PM, Manjunath Patil wrote: >>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote: >>> On 11/30/18 4:49 PM, Manjunath Patil wrote: > Thank you Boris for your comments. I removed faulty email of mine. > > replies inline. > On 11/30/2018 12:42 PM, Boris Ostrovsky wrote: >> On 11/29/18 12:17 AM, Manjunath Patil wrote: >>> Hi, >>> Feel free to suggest/comment on this. >>> >>> I am trying to do the following at dst during the migration now. >>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it. >>> 2. Store the old rinfo and nr_rings into temp variables in >>> negotiate_mq() >>> 3. let nr_rings get re-calculated based on backend data >>> 4. try allocating new memory based on new nr_rings >> Since I suspect number of rings will likely be the same why not reuse >> the rings in the common case? > I thought attaching devices will be more often than migration. Hence > did not want add to an extra check for >- if I am inside migration code path and >- if new nr_rings is equal to old nr_rings or not > > Sure addition of such a thing would avoid the memory allocation > altogether in migration path, > but it would add a little overhead for normal device addition. > > Do you think its worth adding that change? IMO a couple of extra checks are not going to make much difference. >>> I will add this change I wonder though --- have you actually seen the case where you did fail allocation and changes provided in this patch made things work? I am asking because right after negotiate_mq() we will call setup_blkring() and it will want to allocate bunch of memory. A failure there is fatal (to ring setup). So it seems to me that you will survive negotiate_mq() but then will likely fail soon after. >>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I >>> included my patch, I manually triggered the ENOMEM using a debug flag. 
>>> The patch works for ENOMEM inside negotiate_mq(). >>> >>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we >>> might hit it in setup_blkring() as well. >>> We should add the similar change to blkif_sring struct as well. >> >> >> Won't you have a similar issue with other frontends, say, netfront? > > I think the kmalloc is failed not because of OOM. > > In fact, the size of "blkfront_ring_info" is large. When domU have 4 > queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB. > > There is chance that kmalloc() 300+ KB would fail. So kmalloc() might not be the best choice. Any reason why you don't change it to vmalloc()? This should address the problem in a much simpler way. Juergen
Re: DIF/DIX issue related to config CONFIG_SCSI_MQ_DEFAULT
Hi John,

> We have also noticed that if we just enable DIF in hisi_sas (with MQ),
> and not DIX, then no issue.

Enabling DIF doesn't really do anything on the kernel side other than setting PROTECT=1 in the READ/WRITE CDB and telling the driver which DIX protection operation the HBA should use.

Since protection information is invisible to the kernel and only sent on the wire between initiator and target, enabling DIF doesn't really have the ability to interfere with anything on the kernel side. We're basically just setting flags asking HBA and storage to enable protected transfers.

> I also noticed the mail "[PATCH v2 01/23] zfcp: make DIX experimental,
> disabled, and independent of DIF", where DIX is made experimental.

...for the zfcp driver on zSeries.

Just nitpicking on terminology here:

T10 Protection Information (formerly known as DIF) describes how to generate and verify 8 bytes of extra information that's sent trailing each logical block on the wire between an initiator and target. The T10 PI spec is focused on the target device implementation of this and largely ignores the initiator side.

DIX tries to remedy this deficiency. It is a spec that describes a set of logical operations an initiator must implement to facilitate sending and receiving the T10 protection information to/from host memory instead of terminating it at the HBA.

The DIX spec isn't experimental; it's about a decade old and hasn't changed in years. The Linux kernel support for data integrity passthrough in the block layer and SCSI isn't experimental either. It's also a decade old and used extensively in production.

So I object to the notion of "DIX being made experimental". An ASIC/firmware/driver implementation of DIX may be experimental. And of course I can't rule out regressions in the kernel block integrity implementation as a result of some of the recent MQ changes (will be happy to work with you guys to figure those out). But DIX isn't experimental, nor is the kernel support for passing protection information to an HBA.

> For now we may not support DIX. It seems to have issues. We wanted to
> try 3008 card on our system, but it does not seem to support DIX 0-3.

For some reason Broadcom have not upstreamed their DIX support. It's supposedly available in their outbox driver.

--
Martin K. Petersen	Oracle Linux Engineering
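For reference, the "8 bytes of extra information" Martin describes form a fixed tuple: a guard tag (a CRC16 of the block data, polynomial 0x8BB7), an application tag, and a reference tag. A minimal userspace sketch of the tuple layout and the bitwise form of the guard CRC — illustrative only, not code from any driver in this thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* T10 PI tuple appended to each logical block on the wire:
 * 2-byte guard tag (CRC16 of the data), 2-byte application tag,
 * 4-byte reference tag (typically derived from the LBA). */
struct t10_pi_tuple {
	uint16_t guard_tag;
	uint16_t app_tag;
	uint32_t ref_tag;
};

/* Bitwise CRC16 with the T10-DIF polynomial 0x8BB7 (no reflection,
 * zero initial value). Real implementations are table-driven or use
 * CPU instructions; this is just the readable form. */
static uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)((uint16_t)buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}
```

A target verifying the guard tag recomputes this CRC over the received data and compares it with the tuple, which is how single-bit corruption on the wire gets caught end to end.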
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
Hi Manjunath, On 12/04/2018 10:49 AM, Manjunath Patil wrote: > On 12/3/2018 6:16 PM, Boris Ostrovsky wrote: > >> On 12/3/18 8:14 PM, Dongli Zhang wrote: >>> Hi Boris, >>> >>> On 12/04/2018 12:07 AM, Boris Ostrovsky wrote: On 12/2/18 3:31 PM, Manjunath Patil wrote: > On 11/30/2018 2:33 PM, Boris Ostrovsky wrote: > >> On 11/30/18 4:49 PM, Manjunath Patil wrote: >>> Thank you Boris for your comments. I removed faulty email of mine. >>> >>> replies inline. >>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote: On 11/29/18 12:17 AM, Manjunath Patil wrote: > Hi, > Feel free to suggest/comment on this. > > I am trying to do the following at dst during the migration now. > 1. Dont clear the old rinfo in blkif_free(). Instead just clean it. > 2. Store the old rinfo and nr_rings into temp variables in > negotiate_mq() > 3. let nr_rings get re-calculated based on backend data > 4. try allocating new memory based on new nr_rings Since I suspect number of rings will likely be the same why not reuse the rings in the common case? >>> I thought attaching devices will be more often than migration. Hence >>> did not want add to an extra check for >>> - if I am inside migration code path and >>> - if new nr_rings is equal to old nr_rings or not >>> >>> Sure addition of such a thing would avoid the memory allocation >>> altogether in migration path, >>> but it would add a little overhead for normal device addition. >>> >>> Do you think its worth adding that change? >> IMO a couple of extra checks are not going to make much difference. > I will add this change >> I wonder though --- have you actually seen the case where you did fail >> allocation and changes provided in this patch made things work? I am >> asking because right after negotiate_mq() we will call setup_blkring() >> and it will want to allocate bunch of memory. A failure there is fatal >> (to ring setup). So it seems to me that you will survive negotiate_mq() >> but then will likely fail soon after. 
> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I > included my patch, I manually triggered the ENOMEM using a debug flag. > The patch works for ENOMEM inside negotiate_mq(). > > As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we > might hit it in setup_blkring() as well. > We should add the similar change to blkif_sring struct as well. Won't you have a similar issue with other frontends, say, netfront? >>> I think the kmalloc is failed not because of OOM. >>> >>> In fact, the size of "blkfront_ring_info" is large. When domU have 4 >>> queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB. >>> >>> There is chance that kmalloc() 300+ KB would fail. >>> >>> >>> About netfront, to kmalloc() 8 'struct netfront_queue' seems consumes <70 >>> KB? >> TBH these look like comparable sizes to me. I am not convinced that >> these changes will make a difference. If the number of rings on source >> and destination were the same I'd absolutely agree with this patch but >> since you are trying to handle different sizes the code becomes somewhat >> more complex, and I am not sure it's worth it. (Can you actually give me >> an example of when we can expect number of rings to change during >> migration?) >> >> But others may think differently. > Hi Boris, > I think allocation of 300KB chunk[order 7 allocation] is more likely to fail > than 70KB[order 5] especially under memory pressure. > If it comes to that, I think we should fix this too. > > The no.of rings in most cases remain 4 thanks to xen_blkif_max_queues module > parameter. > If the src host has allocated less than 4[may be vpcu given to this dom0 were > less than 4], then we can expect the dst to allocate more than src side and > vice > versa. xen_blkif_max_queues is tunable so the size to kmalloc() would be larger when both xen_blkif_max_queues and dom0 vcpu are large. Dongli Zhang
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
On 12/3/2018 6:16 PM, Boris Ostrovsky wrote:
On 12/3/18 8:14 PM, Dongli Zhang wrote:
Hi Boris,
On 12/04/2018 12:07 AM, Boris Ostrovsky wrote:
On 12/2/18 3:31 PM, Manjunath Patil wrote:
On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
On 11/30/18 4:49 PM, Manjunath Patil wrote:

Thank you Boris for your comments. I removed faulty email of mine.

Replies inline.

On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
On 11/29/18 12:17 AM, Manjunath Patil wrote:

Hi,
Feel free to suggest/comment on this.

I am trying to do the following at dst during the migration now.
1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
2. Store the old rinfo and nr_rings into temp variables in negotiate_mq()
3. let nr_rings get re-calculated based on backend data
4. try allocating new memory based on new nr_rings

Since I suspect number of rings will likely be the same why not reuse the rings in the common case?

I thought attaching devices will be more frequent than migration. Hence I did not want to add an extra check for
- if I am inside migration code path and
- if new nr_rings is equal to old nr_rings or not

Sure, addition of such a thing would avoid the memory allocation altogether in the migration path, but it would add a little overhead for normal device addition.

Do you think it's worth adding that change?

IMO a couple of extra checks are not going to make much difference.

I will add this change.

I wonder though --- have you actually seen the case where you did fail allocation and changes provided in this patch made things work? I am asking because right after negotiate_mq() we will call setup_blkring() and it will want to allocate a bunch of memory. A failure there is fatal (to ring setup). So it seems to me that you will survive negotiate_mq() but then will likely fail soon after.

I have noticed the ENOMEM inside negotiate_mq() on a ct machine. When I included my patch, I manually triggered the ENOMEM using a debug flag. The patch works for ENOMEM inside negotiate_mq().

As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we might hit it in setup_blkring() as well. We should add the similar change to the blkif_sring struct as well.

Won't you have a similar issue with other frontends, say, netfront?

I think the kmalloc failed not because of OOM.

In fact, the size of "blkfront_ring_info" is large. When domU has 4 queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB.

There is a chance that a kmalloc() of 300+ KB would fail.

About netfront: a kmalloc() of 8 'struct netfront_queue' seems to consume <70 KB?

TBH these look like comparable sizes to me. I am not convinced that these changes will make a difference. If the number of rings on source and destination were the same I'd absolutely agree with this patch, but since you are trying to handle different sizes the code becomes somewhat more complex, and I am not sure it's worth it. (Can you actually give me an example of when we can expect the number of rings to change during migration?)

But others may think differently.

Hi Boris,
I think allocation of a 300 KB chunk [order-7 allocation] is more likely to fail than 70 KB [order-5], especially under memory pressure. If it comes to that, I think we should fix this too.

The number of rings in most cases remains 4 thanks to the xen_blkif_max_queues module parameter. If the src host has allocated less than 4 [maybe the vcpus given to this dom0 were less than 4], then we can expect the dst to allocate more than the src side and vice versa.

-Thanks,
Manjunath

-boris

___
Xen-devel mailing list
xen-de...@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
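The order-7 vs order-5 framing in the thread follows from the buddy allocator rounding a large physically-contiguous allocation up to a power-of-two number of pages, and is also why a vmalloc-style fallback (which needs no physical contiguity) sidesteps the failure. A quick back-of-the-envelope check, assuming 4 KB pages:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Smallest order such that (PAGE_SIZE << order) >= size, i.e. the
 * contiguous-page order a large physically-contiguous allocation
 * ends up needing from the buddy allocator. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

300 KB lands in order 7 (512 KB of contiguous pages) while 70 KB lands in order 5 (128 KB), which is exactly why the larger request is the one that fails first under fragmentation.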
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
On 12/3/18 8:14 PM, Dongli Zhang wrote: > Hi Boris, > > On 12/04/2018 12:07 AM, Boris Ostrovsky wrote: >> On 12/2/18 3:31 PM, Manjunath Patil wrote: >>> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote: >>> On 11/30/18 4:49 PM, Manjunath Patil wrote: > Thank you Boris for your comments. I removed faulty email of mine. > > replies inline. > On 11/30/2018 12:42 PM, Boris Ostrovsky wrote: >> On 11/29/18 12:17 AM, Manjunath Patil wrote: >>> Hi, >>> Feel free to suggest/comment on this. >>> >>> I am trying to do the following at dst during the migration now. >>> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it. >>> 2. Store the old rinfo and nr_rings into temp variables in >>> negotiate_mq() >>> 3. let nr_rings get re-calculated based on backend data >>> 4. try allocating new memory based on new nr_rings >> Since I suspect number of rings will likely be the same why not reuse >> the rings in the common case? > I thought attaching devices will be more often than migration. Hence > did not want add to an extra check for >- if I am inside migration code path and >- if new nr_rings is equal to old nr_rings or not > > Sure addition of such a thing would avoid the memory allocation > altogether in migration path, > but it would add a little overhead for normal device addition. > > Do you think its worth adding that change? IMO a couple of extra checks are not going to make much difference. >>> I will add this change I wonder though --- have you actually seen the case where you did fail allocation and changes provided in this patch made things work? I am asking because right after negotiate_mq() we will call setup_blkring() and it will want to allocate bunch of memory. A failure there is fatal (to ring setup). So it seems to me that you will survive negotiate_mq() but then will likely fail soon after. >>> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I >>> included my patch, I manually triggered the ENOMEM using a debug flag. 
>>> The patch works for ENOMEM inside negotiate_mq(). >>> >>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we >>> might hit it in setup_blkring() as well. >>> We should add the similar change to blkif_sring struct as well. >> >> Won't you have a similar issue with other frontends, say, netfront? > I think the kmalloc is failed not because of OOM. > > In fact, the size of "blkfront_ring_info" is large. When domU have 4 > queues/rings, the size of 4 blkfront_ring_info can be about 300+ KB. > > There is chance that kmalloc() 300+ KB would fail. > > > About netfront, to kmalloc() 8 'struct netfront_queue' seems consumes <70 KB? TBH these look like comparable sizes to me. I am not convinced that these changes will make a difference. If the number of rings on source and destination were the same I'd absolutely agree with this patch but since you are trying to handle different sizes the code becomes somewhat more complex, and I am not sure it's worth it. (Can you actually give me an example of when we can expect number of rings to change during migration?) But others may think differently. -boris
[PATCH v5 04/13] datagram: consolidate datagram copy to iter helpers
From: Sagi Grimberg skb_copy_datagram_iter and skb_copy_and_csum_datagram are essentialy the same but with a couple of differences: The first is the copy operation used which either a simple copy or a csum_and_copy, and the second are the behavior on the "short copy" path where simply copy needs to return the number of bytes successfully copied while csum_and_copy needs to fault immediately as the checksum is partial. Introduce __skb_datagram_iter that additionally accepts: 1. copy operation function pointer 2. private data that goes with the copy operation 3. fault_short flag to indicate the action on short copy Suggested-by: David S. Miller Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- net/core/datagram.c | 136 ++-- 1 file changed, 42 insertions(+), 94 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index abe642181b64..382543302ae5 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -408,27 +408,20 @@ int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags) } EXPORT_SYMBOL(skb_kill_datagram); -/** - * skb_copy_datagram_iter - Copy a datagram to an iovec iterator. - * @skb: buffer to copy - * @offset: offset in the buffer to start copying from - * @to: iovec iterator to copy to - * @len: amount of data to copy from buffer to iovec - */ -int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, - struct iov_iter *to, int len) +int __skb_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len, bool fault_short, + size_t (*cb)(const void *, size_t, void *, struct iov_iter *), + void *data) { int start = skb_headlen(skb); int i, copy = start - offset, start_off = offset, n; struct sk_buff *frag_iter; - trace_skb_copy_datagram_iovec(skb, len); - /* Copy header. 
*/ if (copy > 0) { if (copy > len) copy = len; - n = copy_to_iter(skb->data + offset, copy, to); + n = cb(skb->data + offset, copy, data, to); offset += n; if (n != copy) goto short_copy; @@ -450,8 +443,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, if (copy > len) copy = len; - n = copy_to_iter(vaddr + frag->page_offset + -offset - start, copy, to); + n = cb(vaddr + frag->page_offset + + offset - start, copy, data, to); kunmap(page); offset += n; if (n != copy) @@ -471,8 +464,8 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, if ((copy = end - offset) > 0) { if (copy > len) copy = len; - if (skb_copy_datagram_iter(frag_iter, offset - start, - to, copy)) + if (__skb_datagram_iter(frag_iter, offset - start, + to, copy, short_copy, cb, data)) goto fault; if ((len -= copy) == 0) return 0; @@ -493,11 +486,32 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, return -EFAULT; short_copy: - if (iov_iter_count(to)) + if (fault_short || iov_iter_count(to)) goto fault; return 0; } + +static size_t simple_copy_to_iter(const void *addr, size_t bytes, + void *data __always_unused, struct iov_iter *i) +{ + return copy_to_iter(addr, bytes, i); +} + +/** + * skb_copy_datagram_iter - Copy a datagram to an iovec iterator. + * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: iovec iterator to copy to + * @len: amount of data to copy from buffer to iovec + */ +int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len) +{ + trace_skb_copy_datagram_iovec(skb, len); + return __skb_datagram_iter(skb, offset, to, len, false, + simple_copy_to_iter, NULL); +} EXPORT_SYMBOL(skb_copy_datagram_iter); /** @@ -648,87 +662,21 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from) } EXPORT_SYMBOL(zerocopy_sg_from_iter); +/** + * skb_copy_and_csum_datagram_iter - Copy datagram to an iovec iterator + * and update a checksum. 
+ * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: iovec iterator to copy to + * @len: amount of data to copy from buffer
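The shape of the consolidation above — one walker parameterized by a copy callback, an opaque private pointer, and a fault_short flag — can be sketched outside the kernel. This is an illustrative userspace analogue, not the kernel code itself; `simple_copy` and `sum_and_copy` are made-up stand-ins for copy_to_iter and csum_and_copy_to_iter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy callback: copies n bytes from src into dst and may thread private
 * state (data) through, mirroring __skb_datagram_iter's
 * cb(addr, bytes, data, to) parameter. */
typedef size_t (*copy_cb_t)(const void *src, size_t n, void *data, uint8_t *dst);

static size_t simple_copy(const void *src, size_t n, void *data, uint8_t *dst)
{
	(void)data;			/* the plain copy needs no private state */
	memcpy(dst, src, n);
	return n;
}

static size_t sum_and_copy(const void *src, size_t n, void *data, uint8_t *dst)
{
	uint32_t *sum = data;		/* private checksum state */
	const uint8_t *p = src;

	for (size_t i = 0; i < n; i++)
		*sum += p[i];
	memcpy(dst, src, n);
	return n;
}

/* The shared walker: on a short copy, fault_short decides whether this is
 * an error (checksum path: a partial checksum is useless) or a partial
 * success (plain path: report the bytes actually copied). */
static int copy_iter(const uint8_t *src, size_t len, uint8_t *dst,
		     size_t dst_room, int fault_short, copy_cb_t cb, void *data)
{
	size_t n = len < dst_room ? len : dst_room;

	cb(src, n, data, dst);
	if (n != len)
		return fault_short ? -1 : (int)n;
	return (int)len;
}
```

The point of the patch is the same as here: both callers share one traversal, and only the callback plus the short-copy policy differ.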
[PATCH v5 06/13] datagram: introduce skb_copy_and_hash_datagram_iter helper
From: Sagi Grimberg Introduce a helper to copy datagram into an iovec iterator but also update a predefined hash. This is useful for consumers of skb_copy_datagram_iter to also support inflight data digest without having to finish to copy and only then traverse the iovec and calculate the digest hash. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- include/linux/skbuff.h | 3 +++ net/core/datagram.c| 20 +++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 0ba687454267..b0b8d5653f0d 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3309,6 +3309,9 @@ static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset, } int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen, struct msghdr *msg); +int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len, + struct ahash_request *hash); int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset, struct iov_iter *from, int len); int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); diff --git a/net/core/datagram.c b/net/core/datagram.c index 382543302ae5..ef262282c8be 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -465,7 +465,7 @@ int __skb_datagram_iter(const struct sk_buff *skb, int offset, if (copy > len) copy = len; if (__skb_datagram_iter(frag_iter, offset - start, - to, copy, short_copy, cb, data)) + to, copy, fault_short, cb, data)) goto fault; if ((len -= copy) == 0) return 0; @@ -492,6 +492,24 @@ int __skb_datagram_iter(const struct sk_buff *skb, int offset, return 0; } +/** + * skb_copy_and_hash_datagram_iter - Copy datagram to an iovec iterator + * and update a hash. 
+ * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: iovec iterator to copy to + * @len: amount of data to copy from buffer to iovec + * @hash: hash request to update + */ +int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len, + struct ahash_request *hash) +{ + return __skb_datagram_iter(skb, offset, to, len, true, + hash_and_copy_to_iter, hash); +} +EXPORT_SYMBOL(skb_copy_and_hash_datagram_iter); + static size_t simple_copy_to_iter(const void *addr, size_t bytes, void *data __always_unused, struct iov_iter *i) { -- 2.17.1
[PATCH v5 05/13] iov_iter: introduce hash_and_copy_to_iter helper
From: Sagi Grimberg Allow consumers that want to use iov iterator helpers and also update a predefined hash calculation online when copying data. This is useful when copying incoming network buffers to a local iterator and calculate a digest on the incoming stream. nvme-tcp host driver that will be introduced in following patches is the first consumer via skb_copy_and_hash_datagram_iter. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- include/linux/uio.h | 3 +++ lib/iov_iter.c | 16 2 files changed, 19 insertions(+) diff --git a/include/linux/uio.h b/include/linux/uio.h index 41d1f8d3313d..ecf584f6b82d 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -11,6 +11,7 @@ #include #include +#include #include struct page; @@ -269,6 +270,8 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count) size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i); size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); +size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp, + struct iov_iter *i); int import_iovec(int type, const struct iovec __user * uvector, unsigned nr_segs, unsigned fast_segs, diff --git a/lib/iov_iter.c b/lib/iov_iter.c index db93531ca3e3..8a5f7b2ae346 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -6,6 +6,7 @@ #include #include #include +#include #define PIPE_PARANOIA /* for now */ @@ -1475,6 +1476,21 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, } EXPORT_SYMBOL(csum_and_copy_to_iter); +size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp, + struct iov_iter *i) +{ + struct ahash_request *hash = hashp; + struct scatterlist sg; + size_t copied; + + copied = copy_to_iter(addr, bytes, i); + sg_init_one(&sg, addr, copied); + ahash_request_set_crypt(hash, &sg, NULL, copied); + 
crypto_ahash_update(hash); + return copied; +} +EXPORT_SYMBOL(hash_and_copy_to_iter); + int iov_iter_npages(const struct iov_iter *i, int maxpages) { size_t size = i->count; -- 2.17.1
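The win of hash_and_copy_to_iter is that the digest is folded in during the same pass as the copy rather than in a second traversal of the destination. A userspace sketch of that idea, using 32-bit FNV-1a as a stand-in for the crypto ahash (the FNV constants are the standard ones; nothing here is nvme-tcp code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Incremental FNV-1a update, standing in for crypto_ahash_update(). */
static uint32_t fnv1a_update(uint32_t hash, const uint8_t *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		hash ^= p[i];
		hash *= 16777619u;	/* FNV-1a 32-bit prime */
	}
	return hash;
}

/* Copy bytes to the destination and fold the copied bytes into the
 * running digest in the same pass -- the hash_and_copy_to_iter idea. */
static size_t hash_and_copy(uint8_t *dst, const uint8_t *src, size_t n,
			    uint32_t *hash)
{
	memcpy(dst, src, n);
	*hash = fnv1a_update(*hash, dst, n);	/* hash what actually landed */
	return n;
}
```

For a streaming consumer like a TCP data plane this matters: the payload is touched once while it is still cache-hot, instead of copying everything and then walking the iovec again just for the digest.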
[PATCH v5 00/13] TCP transport binding for NVMe over Fabrics
This patch set implements the NVMe over Fabrics TCP host and the target drivers. Now NVMe over Fabrics can run on every Ethernet port in the world. The implementation conforms to NVMe over Fabrics 1.1 specification (which will include already publicly available NVMe/TCP transport binding, TP 8000). The host driver hooks into the NVMe host stack and implements the TCP transport binding for NVMe over Fabrics. The NVMe over Fabrics TCP host driver is responsible for establishing a NVMe/TCP connection, TCP event and error handling and data-plane messaging and stream processing. The target driver hooks into the NVMe target core stack and implements the TCP transport binding. The NVMe over Fabrics target driver is responsible for accepting and establishing NVMe/TCP connections, TCP event and error handling, and data-plane messaging and stream processing. The implementation of both the host and target are fairly simple and straight-forward. Every NVMe queue is backed by a TCP socket that provides us reliable, in-order delivery of fabrics capsules and/or data. All NVMe queues are sharded over a private bound workqueue such that we always have a single context handling the byte stream and we don't need to worry about any locking/serialization. In addition, close attention was paid to a completely non-blocking data plane to minimize context switching and/or unforced scheduling. Also, @netdev mailing list is cc'd as this patch set contains generic helpers for online digest calculation (patches 1-3). The patchset structure: - patches 1-6 are prep to add a helper for digest calculation online with data placement - patches 7-9 are preparatory patches for NVMe/TCP - patches 10-13 implements NVMe/TCP Thanks to the members of the Fabrics Linux Driver team that helped development, testing and benchmarking this work. 
Gitweb code is available at: git://git.infradead.org/nvme.git nvme-tcp Changes from v4: - Added acks from Dave Miller for relevant patches - Fixed possible memory leak in nvmet-tcp error flow Changes from v3: - various changes based on comments from christoph - removed unused variables - united send/recv iter initialization - removed unneeded void * casting - fixed long lines - removed unneeded wrappers (nvme_tcp_free_tagset and friends) - remove null sgl setting - fixed socket callbacks naming - reworked nvmet-tcp send_list processing - omitted nvme-cli patches as no changes were made to them and no negative feedback was accepted since v3 Changes from v2: - fixed stupid missing symbol export for skb_copy_and_hash_datagram_iter - dropped patch that moved err_work and connect_work to nvme_ctrl - fixed maxr2t icreq validation - got rid of host and target send/recv context structures by moving the members directly to their parent structure along with some struct documentation - removed bh disable when locking the queue lock - moved definition in nvme-tcp.h to appropriate patch - added patch to rework nvme-cli trtype handling for discovery log entries a bit - rebased on top of nvme-4.21 branch - cleaned up some checkpatch warnings - collected review tags Changes from v1: - unified skb_copy_datagram_iter and skb_copy_and_csum_datagram (and the new skb_hash_and_copy_datagram_iter) to a single code path - removed nvmet modparam budgets (made them a define set to their default values) - fixed nvme-tcp host chained r2t transfers reported off-list - made .install_queue callout return nvme status code - Added some review tags - rebased on top of nvme-4.21 branch (nvme tree) + sqflow disable patches Sagi Grimberg (13): ath6kl: add ath6kl_ prefix to crypto_type datagram: open-code copy_page_to_iter iov_iter: pass void csum pointer to csum_and_copy_to_iter datagram: consolidate datagram copy to iter helpers iov_iter: introduce hash_and_copy_to_iter helper datagram: introduce 
skb_copy_and_hash_datagram_iter helper nvmet: Add install_queue callout nvme-fabrics: allow user passing header digest nvme-fabrics: allow user passing data digest nvme-tcp: Add protocol header nvmet-tcp: add NVMe over TCP target driver nvmet: allow configfs tcp trtype configuration nvme-tcp: add NVMe over TCP host driver drivers/net/wireless/ath/ath6kl/cfg80211.c |2 +- drivers/net/wireless/ath/ath6kl/common.h |2 +- drivers/net/wireless/ath/ath6kl/wmi.c |6 +- drivers/net/wireless/ath/ath6kl/wmi.h |6 +- drivers/nvme/host/Kconfig | 15 + drivers/nvme/host/Makefile |3 + drivers/nvme/host/fabrics.c| 10 + drivers/nvme/host/fabrics.h|4 + drivers/nvme/host/tcp.c| 2242 drivers/nvme/target/Kconfig| 10 + drivers/nvme/target/Makefile |2 + drivers/nvme/target/configfs.c |1 + drivers/nvme/target/fabrics-cmd.c | 10 + drivers/nvme/target/nvmet.h|1 + dri
[PATCH v5 03/13] iov_iter: pass void csum pointer to csum_and_copy_to_iter
From: Sagi Grimberg The single caller to csum_and_copy_to_iter is skb_copy_and_csum_datagram and we are trying to unite its logic with skb_copy_datagram_iter by passing a callback to the copy function that we want to apply. Thus, we need to make the checksum pointer private to the function. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- include/linux/uio.h | 2 +- lib/iov_iter.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/uio.h b/include/linux/uio.h index 55ce99ddb912..41d1f8d3313d 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -266,7 +266,7 @@ static inline void iov_iter_reexpand(struct iov_iter *i, size_t count) { i->count = count; } -size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); +size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i); size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 7ebccb5c1637..db93531ca3e3 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1432,10 +1432,11 @@ bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, } EXPORT_SYMBOL(csum_and_copy_from_iter_full); -size_t csum_and_copy_to_iter(const void *addr, size_t bytes, __wsum *csum, +size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csump, struct iov_iter *i) { const char *from = addr; + __wsum *csum = csump; __wsum sum, next; size_t off = 0; sum = *csum; -- 2.17.1
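The point of passing a `void *` is that one datagram-copy path can then take a generic callback, with the per-variant state (checksum, hash, or nothing) threaded through an opaque pointer. A hedged userspace sketch of that pattern, with illustrative names rather than the kernel's actual signatures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One shared copy path, parameterized by a callback; the per-variant state
 * travels through an opaque void pointer, just as csum_and_copy_to_iter
 * now receives its checksum as void *csump. Names are illustrative. */
typedef size_t (*copy_cb)(void *dst, const void *src, size_t len, void *state);

static size_t plain_copy(void *dst, const void *src, size_t len, void *state)
{
    (void)state;                  /* a plain copy carries no extra state */
    memcpy(dst, src, len);
    return len;
}

static size_t csum_copy(void *dst, const void *src, size_t len, void *state)
{
    uint32_t *sum = state;        /* cast back from the opaque pointer */
    const uint8_t *s = src;

    for (size_t i = 0; i < len; i++)
        *sum += s[i];             /* toy byte-sum standing in for a csum */
    memcpy(dst, src, len);
    return len;
}

/* the single shared code path both variants go through */
static size_t copy_datagram(void *dst, const void *src, size_t len,
                            copy_cb cb, void *state)
{
    return cb(dst, src, len, state);
}

static uint32_t demo_csum(void)
{
    uint8_t out[4];
    uint32_t sum = 0;
    const uint8_t in[4] = { 1, 2, 3, 4 };

    copy_datagram(out, in, sizeof(in), csum_copy, &sum);
    return sum;                   /* 1 + 2 + 3 + 4 */
}

static int demo_plain(void)
{
    uint8_t out[3] = { 0 };

    copy_datagram(out, "ab", 3, plain_copy, NULL);
    return memcmp(out, "ab", 3);  /* 0 on a faithful copy */
}
```

The callback that doesn't need state simply ignores the pointer, which is why the shared prototype must be `void *` rather than `__wsum *`.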
[PATCH v5 01/13] ath6kl: add ath6kl_ prefix to crypto_type
From: Sagi Grimberg Prevent a namespace conflict as in following patches as skbuff.h will include the crypto API. Acked-by: David S. Miller Cc: Kalle Valo Signed-off-by: Sagi Grimberg --- drivers/net/wireless/ath/ath6kl/cfg80211.c | 2 +- drivers/net/wireless/ath/ath6kl/common.h | 2 +- drivers/net/wireless/ath/ath6kl/wmi.c | 6 +++--- drivers/net/wireless/ath/ath6kl/wmi.h | 6 +++--- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c index e121187f371f..fa049c4ae315 100644 --- a/drivers/net/wireless/ath/ath6kl/cfg80211.c +++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c @@ -1322,7 +1322,7 @@ static int ath6kl_cfg80211_set_default_key(struct wiphy *wiphy, struct ath6kl_vif *vif = netdev_priv(ndev); struct ath6kl_key *key = NULL; u8 key_usage; - enum crypto_type key_type = NONE_CRYPT; + enum ath6kl_crypto_type key_type = NONE_CRYPT; ath6kl_dbg(ATH6KL_DBG_WLAN_CFG, "%s: index %d\n", __func__, key_index); diff --git a/drivers/net/wireless/ath/ath6kl/common.h b/drivers/net/wireless/ath/ath6kl/common.h index 4f82e8632d37..d6e5234f67a1 100644 --- a/drivers/net/wireless/ath/ath6kl/common.h +++ b/drivers/net/wireless/ath/ath6kl/common.h @@ -67,7 +67,7 @@ struct ath6kl_llc_snap_hdr { __be16 eth_type; } __packed; -enum crypto_type { +enum ath6kl_crypto_type { NONE_CRYPT = 0x01, WEP_CRYPT = 0x02, TKIP_CRYPT = 0x04, diff --git a/drivers/net/wireless/ath/ath6kl/wmi.c b/drivers/net/wireless/ath/ath6kl/wmi.c index 777acc564ac9..9d7ac1ab2d02 100644 --- a/drivers/net/wireless/ath/ath6kl/wmi.c +++ b/drivers/net/wireless/ath/ath6kl/wmi.c @@ -1849,9 +1849,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx, enum network_type nw_type, enum dot11_auth_mode dot11_auth_mode, enum auth_mode auth_mode, - enum crypto_type pairwise_crypto, + enum ath6kl_crypto_type pairwise_crypto, u8 pairwise_crypto_len, - enum crypto_type group_crypto, + enum ath6kl_crypto_type group_crypto, u8 
group_crypto_len, int ssid_len, u8 *ssid, u8 *bssid, u16 channel, u32 ctrl_flags, u8 nw_subtype) @@ -2301,7 +2301,7 @@ int ath6kl_wmi_disctimeout_cmd(struct wmi *wmi, u8 if_idx, u8 timeout) } int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index, - enum crypto_type key_type, + enum ath6kl_crypto_type key_type, u8 key_usage, u8 key_len, u8 *key_rsc, unsigned int key_rsc_len, u8 *key_material, diff --git a/drivers/net/wireless/ath/ath6kl/wmi.h b/drivers/net/wireless/ath/ath6kl/wmi.h index a60bb49fe920..784940ba4c90 100644 --- a/drivers/net/wireless/ath/ath6kl/wmi.h +++ b/drivers/net/wireless/ath/ath6kl/wmi.h @@ -2556,9 +2556,9 @@ int ath6kl_wmi_connect_cmd(struct wmi *wmi, u8 if_idx, enum network_type nw_type, enum dot11_auth_mode dot11_auth_mode, enum auth_mode auth_mode, - enum crypto_type pairwise_crypto, + enum ath6kl_crypto_type pairwise_crypto, u8 pairwise_crypto_len, - enum crypto_type group_crypto, + enum ath6kl_crypto_type group_crypto, u8 group_crypto_len, int ssid_len, u8 *ssid, u8 *bssid, u16 channel, u32 ctrl_flags, u8 nw_subtype); @@ -2610,7 +2610,7 @@ int ath6kl_wmi_config_debug_module_cmd(struct wmi *wmi, u32 valid, u32 config); int ath6kl_wmi_get_stats_cmd(struct wmi *wmi, u8 if_idx); int ath6kl_wmi_addkey_cmd(struct wmi *wmi, u8 if_idx, u8 key_index, - enum crypto_type key_type, + enum ath6kl_crypto_type key_type, u8 key_usage, u8 key_len, u8 *key_rsc, unsigned int key_rsc_len, u8 *key_material, -- 2.17.1
[PATCH v5 07/13] nvmet: Add install_queue callout
From: Sagi Grimberg nvmet-tcp will implement it to allocate queue commands, which are only known at nvmf connect time (sq size). Reviewed-by: Christoph Hellwig Signed-off-by: Sagi Grimberg --- drivers/nvme/target/fabrics-cmd.c | 10 ++ drivers/nvme/target/nvmet.h | 1 + 2 files changed, 11 insertions(+) diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c index 328ae46d8344..ee7d84621d65 100644 --- a/drivers/nvme/target/fabrics-cmd.c +++ b/drivers/nvme/target/fabrics-cmd.c @@ -121,6 +121,16 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req) req->rsp->sq_head = cpu_to_le16(0x); } + if (ctrl->ops->install_queue) { + u16 ret = ctrl->ops->install_queue(req->sq); + + if (ret) { + pr_err("failed to install queue %d cntlid %d ret %x\n", + qid, ctrl->cntlid, ret); + return ret; + } + } + return 0; } diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 7d8b7a7d572a..89df51ee5bdf 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -279,6 +279,7 @@ struct nvmet_fabrics_ops { void (*delete_ctrl)(struct nvmet_ctrl *ctrl); void (*disc_traddr)(struct nvmet_req *req, struct nvmet_port *port, char *traddr); + u16 (*install_queue)(struct nvmet_sq *nvme_sq); }; #define NVMET_MAX_INLINE_BIOVEC8 -- 2.17.1
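The callout is optional: the fabrics code only invokes `install_queue` when the transport provides one, so existing transports need no change. A minimal userspace model of that optional-ops-callback pattern (the `queue`/`transport_ops` types here are illustrative, not nvmet's):

```c
#include <assert.h>
#include <stddef.h>

/* Model of an optional per-transport callout: invoked only if non-NULL. */
struct queue { int sqsize; int installed; };

struct transport_ops {
    /* returns 0 on success, a non-zero status code otherwise; may be NULL */
    unsigned short (*install_queue)(struct queue *q);
};

static unsigned short tcp_install_queue(struct queue *q)
{
    q->installed = 1;             /* e.g. allocate sqsize commands here */
    return 0;
}

static unsigned short connect_queue(const struct transport_ops *ops,
                                    struct queue *q)
{
    if (ops->install_queue) {     /* skipped for transports without it */
        unsigned short ret = ops->install_queue(q);

        if (ret)
            return ret;
    }
    return 0;
}

static int demo_installed(void)
{
    struct transport_ops ops = { .install_queue = tcp_install_queue };
    struct queue q = { 128, 0 };

    connect_queue(&ops, &q);
    return q.installed;
}

static unsigned short demo_no_callout(void)
{
    struct transport_ops ops = { .install_queue = NULL };
    struct queue q = { 128, 0 };

    return connect_queue(&ops, &q);  /* no callout: trivially succeeds */
}
```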
[PATCH v5 08/13] nvme-fabrics: allow user passing header digest
From: Sagi Grimberg Header digest is a nvme-tcp specific feature, but nothing prevents other transports reusing the concept so do not associate with tcp transport solely. Reviewed-by: Christoph Hellwig Signed-off-by: Sagi Grimberg --- drivers/nvme/host/fabrics.c | 5 + drivers/nvme/host/fabrics.h | 2 ++ 2 files changed, 7 insertions(+) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 10074ac7731b..4272f8a95db3 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -614,6 +614,7 @@ static const match_table_t opt_tokens = { { NVMF_OPT_HOST_ID, "hostid=%s" }, { NVMF_OPT_DUP_CONNECT, "duplicate_connect" }, { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow"}, + { NVMF_OPT_HDR_DIGEST, "hdr_digest"}, { NVMF_OPT_ERR, NULL} }; @@ -633,6 +634,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY; opts->kato = NVME_DEFAULT_KATO; opts->duplicate_connect = false; + opts->hdr_digest = false; options = o = kstrdup(buf, GFP_KERNEL); if (!options) @@ -827,6 +829,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, case NVMF_OPT_DISABLE_SQFLOW: opts->disable_sqflow = true; break; + case NVMF_OPT_HDR_DIGEST: + opts->hdr_digest = true; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index ecd9a006a091..a6127f1a9e8e 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -59,6 +59,7 @@ enum { NVMF_OPT_HOST_ID= 1 << 12, NVMF_OPT_DUP_CONNECT= 1 << 13, NVMF_OPT_DISABLE_SQFLOW = 1 << 14, + NVMF_OPT_HDR_DIGEST = 1 << 15, }; /** @@ -103,6 +104,7 @@ struct nvmf_ctrl_options { struct nvmf_host*host; int max_reconnects; booldisable_sqflow; + boolhdr_digest; }; /* -- 2.17.1
[PATCH v5 10/13] nvme-tcp: Add protocol header
From: Sagi Grimberg Signed-off-by: Sagi Grimberg --- include/linux/nvme-tcp.h | 189 +++ include/linux/nvme.h | 1 + 2 files changed, 190 insertions(+) create mode 100644 include/linux/nvme-tcp.h diff --git a/include/linux/nvme-tcp.h b/include/linux/nvme-tcp.h new file mode 100644 index ..03d87c0550a9 --- /dev/null +++ b/include/linux/nvme-tcp.h @@ -0,0 +1,189 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe over Fabrics TCP protocol header. + * Copyright (c) 2018 Lightbits Labs. All rights reserved. + */ + +#ifndef _LINUX_NVME_TCP_H +#define _LINUX_NVME_TCP_H + +#include + +#define NVME_TCP_DISC_PORT 8009 +#define NVME_TCP_ADMIN_CCSZSZ_8K +#define NVME_TCP_DIGEST_LENGTH 4 + +enum nvme_tcp_pfv { + NVME_TCP_PFV_1_0 = 0x0, +}; + +enum nvme_tcp_fatal_error_status { + NVME_TCP_FES_INVALID_PDU_HDR= 0x01, + NVME_TCP_FES_PDU_SEQ_ERR= 0x02, + NVME_TCP_FES_HDR_DIGEST_ERR = 0x03, + NVME_TCP_FES_DATA_OUT_OF_RANGE = 0x04, + NVME_TCP_FES_R2T_LIMIT_EXCEEDED = 0x05, + NVME_TCP_FES_DATA_LIMIT_EXCEEDED= 0x05, + NVME_TCP_FES_UNSUPPORTED_PARAM = 0x06, +}; + +enum nvme_tcp_digest_option { + NVME_TCP_HDR_DIGEST_ENABLE = (1 << 0), + NVME_TCP_DATA_DIGEST_ENABLE = (1 << 1), +}; + +enum nvme_tcp_pdu_type { + nvme_tcp_icreq = 0x0, + nvme_tcp_icresp = 0x1, + nvme_tcp_h2c_term = 0x2, + nvme_tcp_c2h_term = 0x3, + nvme_tcp_cmd= 0x4, + nvme_tcp_rsp= 0x5, + nvme_tcp_h2c_data = 0x6, + nvme_tcp_c2h_data = 0x7, + nvme_tcp_r2t= 0x9, +}; + +enum nvme_tcp_pdu_flags { + NVME_TCP_F_HDGST= (1 << 0), + NVME_TCP_F_DDGST= (1 << 1), + NVME_TCP_F_DATA_LAST= (1 << 2), + NVME_TCP_F_DATA_SUCCESS = (1 << 3), +}; + +/** + * struct nvme_tcp_hdr - nvme tcp pdu common header + * + * @type: pdu type + * @flags: pdu specific flags + * @hlen: pdu header length + * @pdo: pdu data offset + * @plen: pdu wire byte length + */ +struct nvme_tcp_hdr { + __u8type; + __u8flags; + __u8hlen; + __u8pdo; + __le32 plen; +}; + +/** + * struct nvme_tcp_icreq_pdu - nvme tcp initialize connection request pdu + * + * @hdr: pdu generic 
header + * @pfv: pdu version format + * @hpda: host pdu data alignment (dwords, 0's based) + * @digest:digest types enabled + * @maxr2t:maximum r2ts per request supported + */ +struct nvme_tcp_icreq_pdu { + struct nvme_tcp_hdr hdr; + __le16 pfv; + __u8hpda; + __u8digest; + __le32 maxr2t; + __u8rsvd2[112]; +}; + +/** + * struct nvme_tcp_icresp_pdu - nvme tcp initialize connection response pdu + * + * @hdr: pdu common header + * @pfv: pdu version format + * @cpda: controller pdu data alignment (dowrds, 0's based) + * @digest:digest types enabled + * @maxdata: maximum data capsules per r2t supported + */ +struct nvme_tcp_icresp_pdu { + struct nvme_tcp_hdr hdr; + __le16 pfv; + __u8cpda; + __u8digest; + __le32 maxdata; + __u8rsvd[112]; +}; + +/** + * struct nvme_tcp_term_pdu - nvme tcp terminate connection pdu + * + * @hdr: pdu common header + * @fes: fatal error status + * @fei: fatal error information + */ +struct nvme_tcp_term_pdu { + struct nvme_tcp_hdr hdr; + __le16 fes; + __le32 fei; + __u8rsvd[8]; +}; + +/** + * struct nvme_tcp_cmd_pdu - nvme tcp command capsule pdu + * + * @hdr: pdu common header + * @cmd: nvme command + */ +struct nvme_tcp_cmd_pdu { + struct nvme_tcp_hdr hdr; + struct nvme_command cmd; +}; + +/** + * struct nvme_tcp_rsp_pdu - nvme tcp response capsule pdu + * + * @hdr: pdu common header + * @hdr: nvme-tcp generic header + * @cqe: nvme completion queue entry + */ +struct nvme_tcp_rsp_pdu { + struct nvme_tcp_hdr hdr; + struct nvme_completion cqe; +}; + +/** + * struct nvme_tcp_r2t_pdu - nvme tcp ready-to-transfer pdu + * + * @hdr: pdu common header + * @command_id:nvme command identifier which this relates to + * @ttag: transfer tag (controller generated) + * @r2t_offset:offset from the start of the command data + * @r2t_length:length the host is allowed to send + */ +struct nvme_tcp_r2t_pdu { + struct nvme_tcp_hdr hdr; + __u16
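Every PDU on the wire starts with the common header above, and `plen` must account for the header, any enabled digests, and the data. A hedged sketch of that accounting; the struct mirrors the patch's field names, but the `pdu_wire_len()` helper and its exact padding/digest rules are illustrative, not the driver's code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the common PDU header and how the wire length is derived.
 * Field names follow struct nvme_tcp_hdr in the patch; on the wire plen
 * is little-endian. The helper below is a simplified illustration that
 * ignores data padding/alignment and the data digest. */
struct pdu_hdr {
    uint8_t  type;   /* nvme_tcp_cmd, nvme_tcp_r2t, ... */
    uint8_t  flags;  /* NVME_TCP_F_* bits */
    uint8_t  hlen;   /* pdu header length */
    uint8_t  pdo;    /* pdu data offset */
    uint32_t plen;   /* total pdu bytes on the wire */
};

#define F_HDGST    (1 << 0)   /* NVME_TCP_F_HDGST */
#define DIGEST_LEN 4          /* NVME_TCP_DIGEST_LENGTH */

static uint32_t pdu_wire_len(uint8_t hlen, uint8_t flags, uint32_t data_len)
{
    uint32_t plen = hlen + data_len;

    if (flags & F_HDGST)      /* header digest adds 4 bytes after the header */
        plen += DIGEST_LEN;
    return plen;
}
```

For example, a 72-byte command capsule header with header digest enabled and 512 bytes of in-capsule data would advertise plen = 72 + 4 + 512 under these simplified rules.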
[PATCH v5 13/13] nvme-tcp: add NVMe over TCP host driver
From: Sagi Grimberg This patch implements the NVMe over TCP host driver. It can be used to connect to remote NVMe over Fabrics subsystems over good old TCP/IP. The driver implements TP 8000, which defines how NVMe over Fabrics capsules and data are encapsulated in NVMe/TCP PDUs and exchanged on top of a TCP byte stream. NVMe/TCP header and data digests are supported as well. To connect to all NVMe over Fabrics controllers reachable on a given target port over TCP, use the following command: nvme connect-all -t tcp -a $IPADDR This requires the latest version of nvme-cli with TCP support. Signed-off-by: Sagi Grimberg Signed-off-by: Roy Shterman Signed-off-by: Solganik Alexander --- drivers/nvme/host/Kconfig | 15 + drivers/nvme/host/Makefile |3 + drivers/nvme/host/tcp.c| 2242 3 files changed, 2260 insertions(+) create mode 100644 drivers/nvme/host/tcp.c diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index 88a8b5916624..0f345e207675 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -57,3 +57,18 @@ config NVME_FC from https://github.com/linux-nvme/nvme-cli. If unsure, say N. + +config NVME_TCP + tristate "NVM Express over Fabrics TCP host driver" + depends on INET + depends on BLK_DEV_NVME + select NVME_FABRICS + help + This provides support for the NVMe over Fabrics protocol using + the TCP transport. This allows you to use remote block devices + exported using the NVMe protocol set. + + To configure a NVMe over Fabrics controller use the nvme-cli tool + from https://github.com/linux-nvme/nvme-cli. + + If unsure, say N. 
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile index aea459c65ae1..8a4b671c5f0c 100644 --- a/drivers/nvme/host/Makefile +++ b/drivers/nvme/host/Makefile @@ -7,6 +7,7 @@ obj-$(CONFIG_BLK_DEV_NVME) += nvme.o obj-$(CONFIG_NVME_FABRICS) += nvme-fabrics.o obj-$(CONFIG_NVME_RDMA)+= nvme-rdma.o obj-$(CONFIG_NVME_FC) += nvme-fc.o +obj-$(CONFIG_NVME_TCP) += nvme-tcp.o nvme-core-y:= core.o nvme-core-$(CONFIG_TRACING)+= trace.o @@ -21,3 +22,5 @@ nvme-fabrics-y+= fabrics.o nvme-rdma-y+= rdma.o nvme-fc-y += fc.o + +nvme-tcp-y += tcp.o diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c new file mode 100644 index ..15543358e245 --- /dev/null +++ b/drivers/nvme/host/tcp.c @@ -0,0 +1,2242 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe over Fabrics TCP host. + * Copyright (c) 2018 Lightbits Labs. All rights reserved. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "nvme.h" +#include "fabrics.h" + +struct nvme_tcp_queue; + +enum nvme_tcp_send_state { + NVME_TCP_SEND_CMD_PDU = 0, + NVME_TCP_SEND_H2C_PDU, + NVME_TCP_SEND_DATA, + NVME_TCP_SEND_DDGST, +}; + +struct nvme_tcp_request { + struct nvme_request req; + void*pdu; + struct nvme_tcp_queue *queue; + u32 data_len; + u32 pdu_len; + u32 pdu_sent; + u16 ttag; + struct list_headentry; + u32 ddgst; + + struct bio *curr_bio; + struct iov_iter iter; + + /* send state */ + size_t offset; + size_t data_sent; + enum nvme_tcp_send_state state; +}; + +enum nvme_tcp_queue_flags { + NVME_TCP_Q_ALLOCATED= 0, + NVME_TCP_Q_LIVE = 1, +}; + +enum nvme_tcp_recv_state { + NVME_TCP_RECV_PDU = 0, + NVME_TCP_RECV_DATA, + NVME_TCP_RECV_DDGST, +}; + +struct nvme_tcp_ctrl; +struct nvme_tcp_queue { + struct socket *sock; + struct work_struct io_work; + int io_cpu; + + spinlock_t lock; + struct list_headsend_list; + + /* recv state */ + void*pdu; + int pdu_remaining; + int pdu_offset; + size_t data_remaining; + size_t 
ddgst_remaining; + + /* send state */ + struct nvme_tcp_request *request; + + int queue_size; + size_t cmnd_capsule_len; + struct nvme_tcp_ctrl*ctrl; + unsigned long flags; + boolrd_enabled; + + boolhdr_digest; + booldata_digest; + struct ahash_reques
[PATCH v5 09/13] nvme-fabrics: allow user passing data digest
From: Sagi Grimberg Data digest is a nvme-tcp specific feature, but nothing prevents other transports reusing the concept so do not associate with tcp transport solely. Reviewed-by: Christoph Hellwig Signed-off-by: Sagi Grimberg --- drivers/nvme/host/fabrics.c | 5 + drivers/nvme/host/fabrics.h | 2 ++ 2 files changed, 7 insertions(+) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 4272f8a95db3..9c62c6838b76 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -615,6 +615,7 @@ static const match_table_t opt_tokens = { { NVMF_OPT_DUP_CONNECT, "duplicate_connect" }, { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow"}, { NVMF_OPT_HDR_DIGEST, "hdr_digest"}, + { NVMF_OPT_DATA_DIGEST, "data_digest" }, { NVMF_OPT_ERR, NULL} }; @@ -635,6 +636,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, opts->kato = NVME_DEFAULT_KATO; opts->duplicate_connect = false; opts->hdr_digest = false; + opts->data_digest = false; options = o = kstrdup(buf, GFP_KERNEL); if (!options) @@ -832,6 +834,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, case NVMF_OPT_HDR_DIGEST: opts->hdr_digest = true; break; + case NVMF_OPT_DATA_DIGEST: + opts->data_digest = true; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index a6127f1a9e8e..524a02a67817 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -60,6 +60,7 @@ enum { NVMF_OPT_DUP_CONNECT= 1 << 13, NVMF_OPT_DISABLE_SQFLOW = 1 << 14, NVMF_OPT_HDR_DIGEST = 1 << 15, + NVMF_OPT_DATA_DIGEST= 1 << 16, }; /** @@ -105,6 +106,7 @@ struct nvmf_ctrl_options { int max_reconnects; booldisable_sqflow; boolhdr_digest; + booldata_digest; }; /* -- 2.17.1
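With both patches applied, `hdr_digest` and `data_digest` are just two more bits in the `NVMF_OPT_*` option mask, parsed from the comma-separated connect options string. A hedged userspace sketch of that parsing; the bit values match the patch, but the `parse_opts()` tokenizer is illustrative, not the kernel's `match_token()` machinery:

```c
#include <assert.h>
#include <string.h>

#define OPT_HDR_DIGEST  (1u << 15)   /* NVMF_OPT_HDR_DIGEST */
#define OPT_DATA_DIGEST (1u << 16)   /* NVMF_OPT_DATA_DIGEST */

/* Toy parser: each recognized word in the comma-separated options string
 * sets one bit in the mask, mirroring how nvmf_parse_options() flips the
 * opts->hdr_digest / opts->data_digest booleans. */
static unsigned parse_opts(const char *opts)
{
    unsigned mask = 0;
    char buf[256];
    char *p;

    strncpy(buf, opts, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (p = strtok(buf, ","); p; p = strtok(NULL, ","))
        if (!strcmp(p, "hdr_digest"))
            mask |= OPT_HDR_DIGEST;
        else if (!strcmp(p, "data_digest"))
            mask |= OPT_DATA_DIGEST;
    return mask;
}
```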
[PATCH v5 12/13] nvmet: allow configfs tcp trtype configuration
From: Sagi Grimberg Reviewed-by: Max Gurtovoy Reviewed-by: Christoph Hellwig Signed-off-by: Sagi Grimberg --- drivers/nvme/target/configfs.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index db2cb64be7ba..618bbd006544 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -34,6 +34,7 @@ static const struct nvmet_transport_name { } nvmet_transport_names[] = { { NVMF_TRTYPE_RDMA, "rdma" }, { NVMF_TRTYPE_FC, "fc" }, + { NVMF_TRTYPE_TCP, "tcp" }, { NVMF_TRTYPE_LOOP, "loop" }, }; -- 2.17.1
[PATCH v5 11/13] nvmet-tcp: add NVMe over TCP target driver
From: Sagi Grimberg This patch implements the TCP transport driver for the NVMe over Fabrics target stack. This allows exporting NVMe over Fabrics functionality over good old TCP/IP. The driver implements TP 8000, which defines how NVMe over Fabrics capsules and data are encapsulated in NVMe/TCP PDUs and exchanged on top of a TCP byte stream. NVMe/TCP header and data digests are supported as well. Signed-off-by: Sagi Grimberg Signed-off-by: Roy Shterman Signed-off-by: Solganik Alexander --- drivers/nvme/target/Kconfig | 10 + drivers/nvme/target/Makefile |2 + drivers/nvme/target/tcp.c| 1737 ++ 3 files changed, 1749 insertions(+) create mode 100644 drivers/nvme/target/tcp.c diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig index 3c7b61ddb0d1..d94f25cde019 100644 --- a/drivers/nvme/target/Kconfig +++ b/drivers/nvme/target/Kconfig @@ -60,3 +60,13 @@ config NVME_TARGET_FCLOOP to test NVMe-FC transport interfaces. If unsure, say N. + +config NVME_TARGET_TCP + tristate "NVMe over Fabrics TCP target support" + depends on INET + depends on NVME_TARGET + help + This enables the NVMe TCP target support, which allows exporting NVMe + devices over TCP. + + If unsure, say N. 
diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile index 8118c93391c6..8c3ad0fb6860 100644 --- a/drivers/nvme/target/Makefile +++ b/drivers/nvme/target/Makefile @@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_TARGET_LOOP) += nvme-loop.o obj-$(CONFIG_NVME_TARGET_RDMA) += nvmet-rdma.o obj-$(CONFIG_NVME_TARGET_FC) += nvmet-fc.o obj-$(CONFIG_NVME_TARGET_FCLOOP) += nvme-fcloop.o +obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o nvmet-y+= core.o configfs.o admin-cmd.o fabrics-cmd.o \ discovery.o io-cmd-file.o io-cmd-bdev.o @@ -12,3 +13,4 @@ nvme-loop-y += loop.o nvmet-rdma-y += rdma.o nvmet-fc-y += fc.o nvme-fcloop-y += fcloop.o +nvmet-tcp-y+= tcp.o diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c new file mode 100644 index ..d31bec260160 --- /dev/null +++ b/drivers/nvme/target/tcp.c @@ -0,0 +1,1737 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe over Fabrics TCP target. + * Copyright (c) 2018 Lightbits Labs. All rights reserved. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "nvmet.h" + +#define NVMET_TCP_DEF_INLINE_DATA_SIZE (4 * PAGE_SIZE) + +#define NVMET_TCP_RECV_BUDGET 8 +#define NVMET_TCP_SEND_BUDGET 8 +#define NVMET_TCP_IO_WORK_BUDGET 64 + +enum nvmet_tcp_send_state { + NVMET_TCP_SEND_DATA_PDU, + NVMET_TCP_SEND_DATA, + NVMET_TCP_SEND_R2T, + NVMET_TCP_SEND_DDGST, + NVMET_TCP_SEND_RESPONSE +}; + +enum nvmet_tcp_recv_state { + NVMET_TCP_RECV_PDU, + NVMET_TCP_RECV_DATA, + NVMET_TCP_RECV_DDGST, + NVMET_TCP_RECV_ERR, +}; + +enum { + NVMET_TCP_F_INIT_FAILED = (1 << 0), +}; + +struct nvmet_tcp_cmd { + struct nvmet_tcp_queue *queue; + struct nvmet_reqreq; + + struct nvme_tcp_cmd_pdu *cmd_pdu; + struct nvme_tcp_rsp_pdu *rsp_pdu; + struct nvme_tcp_data_pdu*data_pdu; + struct nvme_tcp_r2t_pdu *r2t_pdu; + + u32 rbytes_done; + u32 wbytes_done; + + u32 pdu_len; + u32 pdu_recv; + int sg_idx; + int nr_mapped; + struct msghdr recv_msg; + 
struct kvec *iov; + u32 flags; + + struct list_headentry; + struct llist_node lentry; + + /* send state */ + u32 offset; + struct scatterlist *cur_sg; + enum nvmet_tcp_send_state state; + + __le32 exp_ddgst; + __le32 recv_ddgst; +}; + +enum nvmet_tcp_queue_state { + NVMET_TCP_Q_CONNECTING, + NVMET_TCP_Q_LIVE, + NVMET_TCP_Q_DISCONNECTING, +}; + +struct nvmet_tcp_queue { + struct socket *sock; + struct nvmet_tcp_port *port; + struct work_struct io_work; + int cpu; + struct nvmet_cq nvme_cq; + struct nvmet_sq nvme_sq; + + /* send state */ + struct nvmet_tcp_cmd*cmds; + unsigned intnr_cmds; + struct list_headfree_list; + struct llist_head resp_list; + struct list_headresp_send_list; + int send_list_len; +
[PATCH v5 02/13] datagram: open-code copy_page_to_iter
From: Sagi Grimberg This will be useful to consolidate skb_copy_and_hash_datagram_iter and skb_copy_and_csum_datagram to a single code path. Acked-by: David S. Miller Signed-off-by: Sagi Grimberg --- net/core/datagram.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/net/core/datagram.c b/net/core/datagram.c index 57f3a6fcfc1e..abe642181b64 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -445,11 +445,14 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset, end = start + skb_frag_size(frag); if ((copy = end - offset) > 0) { + struct page *page = skb_frag_page(frag); + u8 *vaddr = kmap(page); + if (copy > len) copy = len; - n = copy_page_to_iter(skb_frag_page(frag), - frag->page_offset + offset - - start, copy, to); + n = copy_to_iter(vaddr + frag->page_offset + +offset - start, copy, to); + kunmap(page); offset += n; if (n != copy) goto short_copy; -- 2.17.1
[PATCH v6 2/2] arm64: crypto: add NEON accelerated XOR implementation
This is a NEON acceleration method that can improve performance by approximately 20%. I got the following data from CentOS 7.5 on Huawei's HISI1616 chip: [ 93.837726] xor: measuring software checksum speed [ 93.874039] 8regs : 7123.200 MB/sec [ 93.914038] 32regs : 7180.300 MB/sec [ 93.954043] arm64_neon: 9856.000 MB/sec [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec) I believe this code can bring some optimization for all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions. Signed-off-by: Jackie Liu Reviewed-by: Ard Biesheuvel --- arch/arm64/include/asm/Kbuild | 1 - arch/arm64/include/asm/xor.h | 73 + arch/arm64/lib/Makefile | 6 ++ arch/arm64/lib/xor-neon.c | 184 ++ 4 files changed, 263 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/include/asm/xor.h create mode 100644 arch/arm64/lib/xor-neon.c diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild index 6cd5d77..1877f29 100644 --- a/arch/arm64/include/asm/Kbuild +++ b/arch/arm64/include/asm/Kbuild @@ -27,4 +27,3 @@ generic-y += trace_clock.h generic-y += unaligned.h generic-y += user.h generic-y += vga.h -generic-y += xor.h diff --git a/arch/arm64/include/asm/xor.h b/arch/arm64/include/asm/xor.h new file mode 100644 index 000..856386a --- /dev/null +++ b/arch/arm64/include/asm/xor.h @@ -0,0 +1,73 @@ +/* + * arch/arm64/include/asm/xor.h + * + * Authors: Jackie Liu + * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include +#include +#include +#include + +#ifdef CONFIG_KERNEL_MODE_NEON + +extern struct xor_block_template const xor_block_inner_neon; + +static void +xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2) +{ + kernel_neon_begin(); + xor_block_inner_neon.do_2(bytes, p1, p2); + kernel_neon_end(); +} + +static void +xor_neon_3(unsigned long bytes, unsigned long *p1, unsigned long *p2, + unsigned long *p3) +{ + kernel_neon_begin(); + xor_block_inner_neon.do_3(bytes, p1, p2, p3); + kernel_neon_end(); +} + +static void +xor_neon_4(unsigned long bytes, unsigned long *p1, unsigned long *p2, + unsigned long *p3, unsigned long *p4) +{ + kernel_neon_begin(); + xor_block_inner_neon.do_4(bytes, p1, p2, p3, p4); + kernel_neon_end(); +} + +static void +xor_neon_5(unsigned long bytes, unsigned long *p1, unsigned long *p2, + unsigned long *p3, unsigned long *p4, unsigned long *p5) +{ + kernel_neon_begin(); + xor_block_inner_neon.do_5(bytes, p1, p2, p3, p4, p5); + kernel_neon_end(); +} + +static struct xor_block_template xor_block_arm64 = { + .name = "arm64_neon", + .do_2 = xor_neon_2, + .do_3 = xor_neon_3, + .do_4 = xor_neon_4, + .do_5 = xor_neon_5 +}; +#undef XOR_TRY_TEMPLATES +#define XOR_TRY_TEMPLATES \ + do {\ + xor_speed(&xor_block_8regs);\ + xor_speed(&xor_block_32regs);\ + if (cpu_has_neon()) { \ + xor_speed(&xor_block_arm64);\ + } \ + } while (0) + +#endif /* ! 
CONFIG_KERNEL_MODE_NEON */ diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile index 69ff988..5540a16 100644 --- a/arch/arm64/lib/Makefile +++ b/arch/arm64/lib/Makefile @@ -5,6 +5,12 @@ lib-y := clear_user.o delay.o copy_from_user.o \ memcmp.o strcmp.o strncmp.o strlen.o strnlen.o \ strchr.o strrchr.o tishift.o +ifeq ($(CONFIG_KERNEL_MODE_NEON), y) +obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o +CFLAGS_REMOVE_xor-neon.o += -mgeneral-regs-only +CFLAGS_xor-neon.o += -ffreestanding +endif + # Tell the compiler to treat all general purpose registers (with the # exception of the IP registers, which are already handled by the caller # in case of a PLT) as callee-saved, which allows for efficient runtime diff --git a/arch/arm64/lib/xor-neon.c b/arch/arm64/lib/xor-neon.c new file mode 100644 index 000..131c60c2 --- /dev/null +++ b/arch/arm64/lib/xor-neon.c @@ -0,0 +1,184 @@ +/* + * arch/arm64/lib/xor-neon.c + * + * Authors: Jackie Liu + * Copyright (C) 2018,Tianjin KYLIN Information Technology Co., Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include + +void xor_arm64_neon_2(unsigned long bytes, unsigned long *p1, + unsigned long *p2) +{ + uint64_t *dp1 = (uint64_t *)p1; + uint64_t *dp2 = (uint64_t *)p2; + + register uint64x2_t v0, v1, v2,
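Semantically, the NEON `do_2` routine just computes `p1 ^= p2` across the buffer; the speedup comes from doing it in 128-bit `veorq` registers between `kernel_neon_begin()`/`kernel_neon_end()`. A portable scalar sketch of the same operation (for illustration only, 8 bytes at a time instead of the NEON version's wide registers and unrolling):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar equivalent of the NEON xor_neon_2() path: p1[i] ^= p2[i] over
 * the whole buffer. bytes is assumed to be a multiple of 8, as the xor
 * block layer guarantees for its templates. */
static void xor_2(unsigned long bytes, uint64_t *p1, const uint64_t *p2)
{
    unsigned long i;

    for (i = 0; i < bytes / sizeof(uint64_t); i++)
        p1[i] ^= p2[i];
}

static uint64_t demo_xor(void)
{
    uint64_t a[2] = { 0xff00ff00ff00ff00ULL, 0x0123456789abcdefULL };
    uint64_t b[2] = { 0x00ff00ff00ff00ffULL, 0x0123456789abcdefULL };

    xor_2(sizeof(a), a, b);
    /* a[0] becomes all-ones; a[1] xors with itself to zero */
    return a[0] + a[1];
}
```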
[PATCH v6 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types
In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround for ambiguous C99 stdint.h types"), this patch redefines the macros that are used in stdint.h so its definitions of uint64_t and int64_t are compatible with those of the kernel. This patch comes from: https://patchwork.kernel.org/patch/3540001/ Written by: Ard Biesheuvel We mark this file as a private file and don't have to override asm/types.h. Signed-off-by: Jackie Liu Reviewed-by: Ard Biesheuvel --- arch/arm64/include/asm/neon-intrinsics.h | 39 1 file changed, 39 insertions(+) create mode 100644 arch/arm64/include/asm/neon-intrinsics.h diff --git a/arch/arm64/include/asm/neon-intrinsics.h b/arch/arm64/include/asm/neon-intrinsics.h new file mode 100644 index 000..2ba6c6b --- /dev/null +++ b/arch/arm64/include/asm/neon-intrinsics.h @@ -0,0 +1,39 @@ +/* + * Copyright (C) 2018 Linaro, Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#ifndef __ASM_NEON_INTRINSICS_H +#define __ASM_NEON_INTRINSICS_H + +#include + +/* + * In the kernel, u64/s64 are [un]signed long long, not [un]signed long. + * So by redefining these macros to the former, we can force gcc-stdint.h + * to define uint64_t / int64_t in a compatible manner. + */ + +#ifdef __INT64_TYPE__ +#undef __INT64_TYPE__ +#define __INT64_TYPE__ long long +#endif + +#ifdef __UINT64_TYPE__ +#undef __UINT64_TYPE__ +#define __UINT64_TYPE__unsigned long long +#endif + +/* + * genksyms chokes on the ARM NEON intrinsics system header, but we + * don't export anything it defines anyway, so just disregard when + * genksyms executes. + */ +#ifndef __GENKSYMS__ +#include +#endif + +#endif /* __ASM_NEON_INTRINSICS_H */ -- 2.7.4
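The ambiguity being worked around: the kernel's u64 is `unsigned long long`, while on LP64 targets the toolchain's stdint.h typically makes `uint64_t` plain `unsigned long`. Both are 64 bits wide, yet C treats them as distinct types (visible via pointer compatibility or `_Generic`), which is what trips up code mixing kernel types with the NEON intrinsics headers. A small illustration, assuming a typical LP64 Linux toolchain:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef unsigned long long u64;   /* kernel-style 64-bit type */

/* _Generic distinguishes the two types even when their widths agree */
#define TYPE_OF(x) _Generic((x),              \
    unsigned long:      "unsigned long",      \
    unsigned long long: "unsigned long long", \
    default:            "other")

static int same_width(void)
{
    /* the widths agree, so only the *type identity* differs */
    return sizeof(u64) == sizeof(uint64_t);
}
```

On an LP64 glibc target, `TYPE_OF((uint64_t)0)` yields "unsigned long" while `TYPE_OF((u64)0)` yields "unsigned long long", so a `u64 *` cannot be passed where a `uint64_t *` is expected without the redefinition trick above.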
[PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension
v6: neon-intrinsics.h: change _NEON_INTRINSICS_H to __ASM_NEON_INTRINSICS_H. neon-intrinsics.h: add header LICENSE declaration neon-intrinsics.h: explain part of the code with comments Jackie Liu (2): arm64/neon: add workaround for ambiguous C99 stdint.h types arm64: crypto: add NEON accelerated XOR implementation arch/arm64/include/asm/Kbuild| 1 - arch/arm64/include/asm/neon-intrinsics.h | 39 +++ arch/arm64/include/asm/xor.h | 73 arch/arm64/lib/Makefile | 6 + arch/arm64/lib/xor-neon.c| 184 +++ 5 files changed, 302 insertions(+), 1 deletion(-) create mode 100644 arch/arm64/include/asm/neon-intrinsics.h create mode 100644 arch/arm64/include/asm/xor.h create mode 100644 arch/arm64/lib/xor-neon.c -- 2.7.4
Re: [PATCH 7/7] blk-mq: use plug for devices that implement ->commits_rqs()
Reviewed-by: Sagi Grimberg
Re: [PATCH 6/7] blk-mq: use bd->last == true for list inserts
Reviewed-by: Sagi Grimberg
Re: [PATCH 4/7] virtio_blk: implement mq_ops->commit_rqs() hook
Reviewed-by: Sagi Grimberg
Re: [PATCH 5/7] ataflop: implement mq_ops->commit_rqs() hook
Reviewed-by: Sagi Grimberg
Re: [PATCH 2/7] blk-mq: add mq_ops->commit_rqs()
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/7] block: improve logic around when to sort a plug list
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
A driver may wish to iterate every tagged request, not just ones that satisfy blk_mq_request_started(). The intended use is so a driver may terminate entered requests on quiesced queues.

How about we just move the started check into the handler passed in for those that care about it? Much saner to make the interface iterate everything, and leave whatever state check to the callback.

So we used to do that, and I changed it back in May to test for MQ_RQ_IN_FLIGHT, and then Ming changed it to check blk_mq_request_started. So this is clearly a minefield of sorts...

Note that at least mtip32xx, nbd, skd and the various nvme transports want to use the function to terminate all requests in the error path, and it would be great to have one single understood, documented and debugged helper for that in the core, so this is a vote for moving more of the logic in your second helper into the core code. skd will actually need to use ->complete to release resources for that, though, and mtip plays some odd abort bits. If it weren't for the interesting abort behavior in nvme-fc, we could even unexport the low-level interface.

Yes, I'm very much in favour of this, too. We always have this IMO slightly weird notion of stopping the queue, setting some error flags in the driver, then _restarting_ the queue, just so that the driver then sees the error flag and terminates the requests. Which I always found quite counter-intuitive.

What about requests that come in after the iteration runs? How are those terminated?
Re: [PATCH 13/13] block: enable polling by default if a poll map is initalized
Reviewed-by: Sagi Grimberg
Re: [PATCH 12/13] block: only allow polling if a poll queue_map exists
Reviewed-by: Sagi Grimberg
Re: [PATCH 11/13] block: remove ->poll_fn
Reviewed-by: Sagi Grimberg
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
Hi Boris, On 12/04/2018 12:07 AM, Boris Ostrovsky wrote: > On 12/2/18 3:31 PM, Manjunath Patil wrote: >> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote: >> >>> On 11/30/18 4:49 PM, Manjunath Patil wrote: Thank you Boris for your comments. I removed faulty email of mine. replies inline. On 11/30/2018 12:42 PM, Boris Ostrovsky wrote: > On 11/29/18 12:17 AM, Manjunath Patil wrote: >> Hi, >> Feel free to suggest/comment on this. >> >> I am trying to do the following at dst during the migration now. >> 1. Dont clear the old rinfo in blkif_free(). Instead just clean it. >> 2. Store the old rinfo and nr_rings into temp variables in >> negotiate_mq() >> 3. let nr_rings get re-calculated based on backend data >> 4. try allocating new memory based on new nr_rings > Since I suspect number of rings will likely be the same why not reuse > the rings in the common case? I thought attaching devices will be more often than migration. Hence did not want add to an extra check for - if I am inside migration code path and - if new nr_rings is equal to old nr_rings or not Sure addition of such a thing would avoid the memory allocation altogether in migration path, but it would add a little overhead for normal device addition. Do you think its worth adding that change? >>> >>> IMO a couple of extra checks are not going to make much difference. >> I will add this change >>> >>> I wonder though --- have you actually seen the case where you did fail >>> allocation and changes provided in this patch made things work? I am >>> asking because right after negotiate_mq() we will call setup_blkring() >>> and it will want to allocate bunch of memory. A failure there is fatal >>> (to ring setup). So it seems to me that you will survive negotiate_mq() >>> but then will likely fail soon after. >> I have noticed the ENOMEM insise negotiate_mq() on ct machine. When I >> included my patch, I manually triggered the ENOMEM using a debug flag. >> The patch works for ENOMEM inside negotiate_mq(). 
>> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
>> might hit it in setup_blkring() as well.
>> We should add a similar change to the blkif_sring struct as well.
>
> Won't you have a similar issue with other frontends, say, netfront?

I think the kmalloc fails not because of OOM. In fact, the size of "blkfront_ring_info" is large: when a domU has 4 queues/rings, the size of 4 blkfront_ring_info structures can be about 300+ KB, and there is a chance that a 300+ KB kmalloc() would fail. For netfront, a kmalloc() of 8 'struct netfront_queue' seems to consume less than 70 KB.

Dongli Zhang
Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
If it really becomes an issue we should rework the nvme code to also skip the multipath code for any private namespace, even if that could mean some trouble when rescanning.

This requires some explanation: skip the multipath code how?

Other than that,

Reviewed-by: Sagi Grimberg
Re: [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues
Now that we can't poll regular, interrupt driven I/O queues there is almost nothing that can race with an interrupt. The only possible other contexts polling a CQ are the error handler and queue shutdown, and both are so far off in the slow path that we can simply use the big hammer of disabling interrupts.

With that we can stop taking the cq_lock for normal queues.

Nice,

Reviewed-by: Sagi Grimberg
Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
Reviewed-by: Sagi Grimberg
Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
@@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	nvme_stop_queues(&dev->ctrl);

 	if (!dead && dev->ctrl.queue_count > 0) {
-		nvme_disable_io_queues(dev);
+		if (nvme_disable_io_queues(dev, nvme_admin_delete_sq))
+

Would be nice if the opcode change would be kept inside but still split like:

static void nvme_disable_io_queues(struct nvme_dev *dev)
{
	if (__nvme_disable_io_queues(dev, nvme_admin_delete_sq))
		__nvme_disable_io_queues(dev, nvme_admin_delete_cq);
}
Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
+static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)

Do we still need to carry the tag around?

Other than that,

Reviewed-by: Sagi Grimberg
Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues
Reviewed-by: Sagi Grimberg
Re: [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
@@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	if (nr_io_queues == 0)
 		return 0;
+
+	clear_bit(NVMEQ_ENABLED, &adminq->flags);

This is a change of behavior; it looks correct though, as we can fail nvme_setup_irqs after we freed the admin vector. Needs documentation though.
Re: block: sbitmap related lockdep warning
On 12/3/18 5:31 PM, Bart Van Assche wrote: > On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote: >> On 12/3/18 3:02 AM, Ming Lei wrote: >>> Hi, >>> >>> Just found there is sbmitmap related lockdep warning, not take a close >>> look yet, maybe >>> it is caused by recent sbitmap change. >>> >>> [1] test >>> - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 >>> submit_queues=1 hw_queue_depth=1 >>> - then run fio on the 4 null_blk devices >> >> This is a false positive - lockdep thinks that ->swap_lock needs to be >> IRQ safe since it's called with IRQs disabled from the >> blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ >> context. I wonder how to teach lockdep about that... > > There is probably a better solution, but one possible solution is to disable > lockdep checking for swap_lock by using lockdep_set_novalidate_class(). That does seem like a sledge hammer, but I don't see anything that does what we need directly. Surely this isn't a unique situation? Maybe marking it novalidate is just the way to do it... -- Jens Axboe
Re: [PATCH 01/13] block: move queues types to the block layer
On 12/2/18 8:46 AM, Christoph Hellwig wrote:

Having another indirect call in the fast path doesn't really help in our post-spectre world. Also having too many queue types is just going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues.

Signed-off-by: Christoph Hellwig
---
 block/blk-mq-sysfs.c    |  9 +-
 block/blk-mq.h          | 21 +++--
 drivers/nvme/host/pci.c | 68 +++--
 include/linux/blk-mq.h  | 15 -
 4 files changed, 51 insertions(+), 62 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 6efef1f679f0..9c2df137256a 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -173,9 +173,16 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }

+static const char *const hctx_types[] = {
+	[HCTX_TYPE_DEFAULT]	= "default",
+	[HCTX_TYPE_READ]	= "read",
+	[HCTX_TYPE_POLL]	= "poll",
+};
+
 static ssize_t blk_mq_hw_sysfs_type_show(struct blk_mq_hw_ctx *hctx, char *page)
 {
-	return sprintf(page, "%u\n", hctx->type);
+	BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
+	return sprintf(page, "%s\n", hctx_types[hctx->type]);
 }

 static struct attribute *default_ctx_attrs[] = {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 7291e5379358..a664ea44ffd4 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -81,16 +81,14 @@ extern int blk_mq_hw_queue_to_node(struct blk_mq_queue_map *qmap, unsigned int);
 /*
  * blk_mq_map_queue_type() - map (hctx_type,cpu) to hardware queue
  * @q: request queue
- * @hctx_type: the hctx type index
+ * @type: the hctx type index
  * @cpu: CPU
  */
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue_type(struct request_queue *q,
-							  unsigned int hctx_type,
+							  enum hctx_type type,
							  unsigned int cpu)
 {
-	struct blk_mq_tag_set *set = q->tag_set;
-
-	return q->queue_hw_ctx[set->map[hctx_type].mq_map[cpu]];
+	return q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]];
 }

 /*
@@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
						     unsigned int flags,
						     unsigned int cpu)
 {
-	int hctx_type = 0;
+	enum hctx_type type = HCTX_TYPE_DEFAULT;
+
+	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
+	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+		type = HCTX_TYPE_POLL;

-	if (q->mq_ops->rq_flags_to_type)
-		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
+	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+		type = HCTX_TYPE_READ;

Nit, there seems to be an extra newline that can be omitted here before the else if statement (if I'm reading this correctly)...

Otherwise looks good,

Reviewed-by: Sagi Grimberg
Re: block: sbitmap related lockdep warning
On Mon, 2018-12-03 at 15:24 -0700, Jens Axboe wrote: > On 12/3/18 3:02 AM, Ming Lei wrote: > > Hi, > > > > Just found there is sbmitmap related lockdep warning, not take a close > > look yet, maybe > > it is caused by recent sbitmap change. > > > > [1] test > > - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 > > submit_queues=1 hw_queue_depth=1 > > - then run fio on the 4 null_blk devices > > This is a false positive - lockdep thinks that ->swap_lock needs to be > IRQ safe since it's called with IRQs disabled from the > blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ > context. I wonder how to teach lockdep about that... There is probably a better solution, but one possible solution is to disable lockdep checking for swap_lock by using lockdep_set_novalidate_class(). Bart.
Re: [PATCH] sbitmap: fix sbitmap_for_each_set()
On 12/3/18 3:45 PM, Omar Sandoval wrote: > From: Omar Sandoval > > We need to ignore bits in the cleared mask when iterating over all set > bits. Thanks Omar, applied. -- Jens Axboe
Re: [PATCH 05/13] blkcg: associate blkg when associating a device
On Fri, Nov 30, 2018 at 01:54:26AM -0800, Christoph Hellwig wrote: > > diff --git a/include/linux/bio.h b/include/linux/bio.h > > index 62715a5a4f32..8bc9d9b29fd3 100644 > > --- a/include/linux/bio.h > > +++ b/include/linux/bio.h > > @@ -486,6 +486,12 @@ extern unsigned int bvec_nr_vecs(unsigned short idx); > > extern const char *bio_devname(struct bio *bio, char *buffer); > > > > #define bio_set_dev(bio, bdev) \ > > +do { \ > > + bio_set_dev_only(bio, bdev);\ > > + bio_associate_blkg(bio);\ > > +} while (0) > > + > > +#define bio_set_dev_only(bio, bdev)\ > > This lacks any explanation on when you would use bio_set_dev_only or > bio_set_dev. Please document why we need both and why you'd choose or > the other. I realized after thinking about this more and checking more use cases that it isn't as simple as swapping macro uses because many of the callers share common bio allocation paths. I think the simplest way forward is to have writeback and swap do reassociation and split out bio init code in a future series. So in v5, there is only bio_set_dev(). Thanks, Dennis
[PATCH] sbitmap: fix sbitmap_for_each_set()
From: Omar Sandoval

We need to ignore bits in the cleared mask when iterating over all set bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Omar Sandoval
---
 include/linux/sbitmap.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 92806a2dbab7..03f50fcedc79 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 	nr = SB_NR_TO_BIT(sb, start);

 	while (scanned < sb->depth) {
-		struct sbitmap_word *word = &sb->map[index];
-		unsigned int depth = min_t(unsigned int, word->depth - nr,
+		unsigned long word;
+		unsigned int depth = min_t(unsigned int,
+					   sb->map[index].depth - nr,
					   sb->depth - scanned);

 		scanned += depth;
-		if (!word->word)
+		word = sb->map[index].word & ~sb->map[index].cleared;
+		if (!word)
 			goto next;

 		/*
@@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
		 */
 		depth += nr;
 		while (1) {
-			nr = find_next_bit(&word->word, depth, nr);
+			nr = find_next_bit(&word, depth, nr);
 			if (nr >= depth)
 				break;
 			if (!fn(sb, (index << sb->shift) + nr, data))
--
2.19.2
Re: block: sbitmap related lockdep warning
On 12/3/18 3:02 AM, Ming Lei wrote:
> Hi,
>
> Just found there is a sbitmap related lockdep warning; I have not taken
> a close look yet, but maybe it is caused by the recent sbitmap change.
>
> [1] test
> - modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
> - then run fio on the 4 null_blk devices

This is a false positive - lockdep thinks that ->swap_lock needs to be IRQ safe since it's called with IRQs disabled from the blk_mq_mark_tag_wait() path. But we never grab the lock from IRQ context. I wonder how to teach lockdep about that...

--
Jens Axboe
Re: sbitmap: check cleared bits when iterating busy bits
On 12/3/18 3:05 PM, Omar Sandoval wrote: > On Mon, Dec 03, 2018 at 02:56:17PM -0700, Jens Axboe wrote: >> When we are iterating the set bits in a word, we also need to factor in >> the cleared bits. Don't call fn() unless the bit is also not set in >> the cleared word. >> >> Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits") >> Signed-off-by: Jens Axboe >> >> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h >> index 92806a2dbab7..9f374fbcdba6 100644 >> --- a/include/linux/sbitmap.h >> +++ b/include/linux/sbitmap.h >> @@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct >> sbitmap *sb, >> nr = find_next_bit(&word->word, depth, nr); >> if (nr >= depth) >> break; >> +/* if set in cleared, it's actually free */ >> +if (test_bit(nr, &word->cleared)) { >> +nr++; >> +continue; >> +} >> if (!fn(sb, (index << sb->shift) + nr, data)) >> return; >> >> -- >> Jens Axboe >> > > How about something like this: > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h > index f0f49bbb2617..fe9122386255 100644 > --- a/include/linux/sbitmap.h > +++ b/include/linux/sbitmap.h > @@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct > sbitmap *sb, > nr = SB_NR_TO_BIT(sb, start); > > while (scanned < sb->depth) { > - struct sbitmap_word *word = &sb->map[index]; > - unsigned int depth = min_t(unsigned int, word->depth - nr, > + unsigned long word; > + unsigned int depth = min_t(unsigned int, > +sb->map[index].depth - nr, > sb->depth - scanned); > > scanned += depth; > - if (!word->word) > + word = sb->map[index].word & ~sb->map[index].cleared; > + if (!word) > goto next; > > /* > @@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap > *sb, >*/ > depth += nr; > while (1) { > - nr = find_next_bit(&word->word, depth, nr); > + nr = find_next_bit(&word, depth, nr); > if (nr >= depth) > break; > if (!fn(sb, (index << sb->shift) + nr, data)) > > Might be marginally faster. 
Yeah that looks fine as well, tests out good too. -- Jens Axboe
Re: sbitmap: check cleared bits when iterating busy bits
On Mon, Dec 03, 2018 at 02:56:17PM -0700, Jens Axboe wrote: > When we are iterating the set bits in a word, we also need to factor in > the cleared bits. Don't call fn() unless the bit is also not set in > the cleared word. > > Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits") > Signed-off-by: Jens Axboe > > diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h > index 92806a2dbab7..9f374fbcdba6 100644 > --- a/include/linux/sbitmap.h > +++ b/include/linux/sbitmap.h > @@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap > *sb, > nr = find_next_bit(&word->word, depth, nr); > if (nr >= depth) > break; > + /* if set in cleared, it's actually free */ > + if (test_bit(nr, &word->cleared)) { > + nr++; > + continue; > + } > if (!fn(sb, (index << sb->shift) + nr, data)) > return; > > -- > Jens Axboe > How about something like this: diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h index f0f49bbb2617..fe9122386255 100644 --- a/include/linux/sbitmap.h +++ b/include/linux/sbitmap.h @@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb, nr = SB_NR_TO_BIT(sb, start); while (scanned < sb->depth) { - struct sbitmap_word *word = &sb->map[index]; - unsigned int depth = min_t(unsigned int, word->depth - nr, + unsigned long word; + unsigned int depth = min_t(unsigned int, + sb->map[index].depth - nr, sb->depth - scanned); scanned += depth; - if (!word->word) + word = sb->map[index].word & ~sb->map[index].cleared; + if (!word) goto next; /* @@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb, */ depth += nr; while (1) { - nr = find_next_bit(&word->word, depth, nr); + nr = find_next_bit(&word, depth, nr); if (nr >= depth) break; if (!fn(sb, (index << sb->shift) + nr, data)) Might be marginally faster.
sbitmap: check cleared bits when iterating busy bits
When we are iterating the set bits in a word, we also need to factor in the cleared bits. Don't call fn() unless the bit is also not set in the cleared word.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Jens Axboe

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 92806a2dbab7..9f374fbcdba6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 			nr = find_next_bit(&word->word, depth, nr);
 			if (nr >= depth)
 				break;
+			/* if set in cleared, it's actually free */
+			if (test_bit(nr, &word->cleared)) {
+				nr++;
+				continue;
+			}
 			if (!fn(sb, (index << sb->shift) + nr, data))
 				return;

--
Jens Axboe
Re: [PATCH 04/13] blkcg: introduce common blkg association logic
Hi Christoph,

On Fri, Nov 30, 2018 at 01:52:09AM -0800, Christoph Hellwig wrote:
> > EXPORT_SYMBOL_GPL(bio_associate_blkcg);
> >
> > /**
> > - * bio_associate_blkg - associate a bio with the a blkg
> > + * bio_has_queue - required check for blkg association
> > + * @bio: target bio
> > + *
> > + * A blkg represents the relationship between a blkcg and a request_queue.
> > + * If there is no request_queue, there is no blkg and therefore nothing to
> > + * associate with.
> > + */
> > +static inline bool bio_has_queue(struct bio *bio)
> > +{
> > +	return bio->bi_disk && bio->bi_disk->queue;
> > +}
>
> How do you ever see a bio without a queue? We can't even do I/O in
> that case.

The case I found was with the flush bio in dm, which is statically allocated in dm_alloc(). The issue is that bio_set_dev() is called on a bdev that isn't opened, so the bdev wasn't pointing to a genhd.

I've fixed the issue with the patch below, which will be added in v5. I think I was being overly cautious with the change and have taken this out in v5. It seems that this should be a one-off case which should work with the patch below.

Thanks,
Dennis

---
From 3ee13402af369ee8618549b63593d68ffca574ca Mon Sep 17 00:00:00 2001
From: Dennis Zhou
Date: Mon, 3 Dec 2018 10:56:34 -0800
Subject: [PATCH 05/14] dm: set flush bio device on demand

The next patch changes the macro bio_set_dev() to associate a bio with a blkg based on the device set. However, dm creates a static bio to be used as the basis for cloning empty flush bios on creation. This association is with a not-opened bdev so bd_disk is %NULL. To easily get around this, we will set the device on the static bio every time and use that to copy to the other bios.
Signed-off-by: Dennis Zhou
---
 drivers/md/dm.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a733e4c920af..b5e996c5c709 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1417,10 +1417,14 @@ static int __send_empty_flush(struct clone_info *ci)
 	unsigned target_nr = 0;
 	struct dm_target *ti;

+	bio_set_dev(ci->bio, ci->io->md->bdev);
+
 	BUG_ON(bio_has_data(ci->bio));
 	while ((ti = dm_table_get_target(ci->map, target_nr++)))
 		__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);

+	bio_disassociate_blkg(ci->bio);
+
 	return 0;
 }

@@ -1939,7 +1943,6 @@ static struct mapped_device *alloc_dev(int minor)
 		goto bad;

 	bio_init(&md->flush_bio, NULL, 0);
-	bio_set_dev(&md->flush_bio, md->bdev);
 	md->flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;

 	dm_stats_init(&md->stats);
--
2.17.1
Re: [PATCH v2] blk-mq: don't call ktime_get_ns() if we don't need it
On Fri, Nov 30, 2018 at 02:13:54PM -0700, Jens Axboe wrote: > We only need the request fields and the end_io time if we have > stats enabled, or if we have a scheduler attached as those may > use it for completion time stats. Reviewed-by: Omar Sandoval > Signed-off-by: Jens Axboe > > --- > > v2: add helper, use it in both spots. also clear ->start_time_ns > so merging doesn't read garbage. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 7dcef565dc0f..e09d7f500077 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -281,6 +281,15 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) > } > EXPORT_SYMBOL(blk_mq_can_queue); > > +/* > + * Only need start/end time stamping if we have stats enabled, or using > + * an IO scheduler. > + */ > +static inline bool blk_mq_need_time_stamp(struct request *rq) > +{ > + return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator; > +} > + > static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, > unsigned int tag, unsigned int op) > { > @@ -316,7 +325,10 @@ static struct request *blk_mq_rq_ctx_init(struct > blk_mq_alloc_data *data, > RB_CLEAR_NODE(&rq->rb_node); > rq->rq_disk = NULL; > rq->part = NULL; > - rq->start_time_ns = ktime_get_ns(); > + if (blk_mq_need_time_stamp(rq)) > + rq->start_time_ns = ktime_get_ns(); > + else > + rq->start_time_ns = 0; > rq->io_start_time_ns = 0; > rq->nr_phys_segments = 0; > #if defined(CONFIG_BLK_DEV_INTEGRITY) > @@ -522,7 +534,10 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request); > > inline void __blk_mq_end_request(struct request *rq, blk_status_t error) > { > - u64 now = ktime_get_ns(); > + u64 now = 0; > + > + if (blk_mq_need_time_stamp(rq)) > + now = ktime_get_ns(); > > if (rq->rq_flags & RQF_STATS) { > blk_mq_poll_stats_start(rq->q); > -- > Jens Axboe >
Re: [PATCH v5 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types
On Mon, 3 Dec 2018 at 20:22, Will Deacon wrote: > > On Wed, Nov 28, 2018 at 09:09:00AM +0800, Jackie Liu wrote: > > In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround > > for ambiguous C99 stdint.h types"), this patch redefines the macros that > > are used in stdint.h so its definitions of uint64_t and int64_t are > > compatible with those of the kernel. > > > > This patch comes from: https://patchwork.kernel.org/patch/3540001/ > > Wrote by: Ard Biesheuvel > > > > We mark this file as a private file and don't have to override asm/types.h > > > > Reviewed-by: Ard Biesheuvel > > Signed-off-by: Jackie Liu > > --- > > arch/arm64/include/asm/neon-intrinsics.h | 34 > > > > 1 file changed, 34 insertions(+) > > create mode 100644 arch/arm64/include/asm/neon-intrinsics.h > > > > diff --git a/arch/arm64/include/asm/neon-intrinsics.h > > b/arch/arm64/include/asm/neon-intrinsics.h > > new file mode 100644 > > index 000..e378766 > > --- /dev/null > > +++ b/arch/arm64/include/asm/neon-intrinsics.h > > @@ -0,0 +1,34 @@ > > +#ifndef _NEON_INTRINSICS_H > > +#define _NEON_INTRINSICS_H > > We tend to name these with an __ASM_ prefix, so it should be: > > #ifndef __ASM_NEON_INTRINSICS_H > > That said, I notice that the commit you refer to for arch/arm/ actually > places this stuff under uapi/. Is that needed? > No, it doesn't. It creates asm/types.h which has been moved into uap/ at a later date (which I guess means we're stuck with it). In hindsight, it would have been better for ARM to create a neon instrinsics header file such as this one, since the override is only needed when you include . > > +#include > > + > > +/* > > + * For Aarch64, there is some ambiguity in the definition of the types > > below > > + * between the kernel and GCC itself. This is usually not a big deal, but > > it > > + * causes trouble when including GCC's version of 'stdint.h' (this is the > > file > > + * that gets included when you #include on a -ffreestanding > > build). 
> > + * As this file also gets included implicitly when including 'arm_neon.h' > > (the > > + * NEON intrinsics support header), we need the following to work around > > the > > + * issue if we want to use NEON intrinsics in the kernel. > > + */ > > Could you elaborate on what the ambiguities / conflicts in the types are > please? I think you can also remove the sentence about directly including > stdint on a freestanding build, since it doesn't seem relevant to the > kernel afaict (we only pull it in via arm_neon.h). > In the kernel, u64/s64 are [un]signed long long, not [un]signed long. So by redefining these macros to the former, we can force gcc-stdint.h to define uint64_t / in64_t in a compatible manner. > > + > > +#ifdef __INT64_TYPE__ > > +#undef __INT64_TYPE__ > > +#define __INT64_TYPE__ __signed__ long long > > Do we need this __signed__ part? > No that seems redundant to me.
Re: [PATCH v5 1/2] arm64/neon: add workaround for ambiguous C99 stdint.h types
On Wed, Nov 28, 2018 at 09:09:00AM +0800, Jackie Liu wrote: > In a way similar to ARM commit 09096f6a0ee2 ("ARM: 7822/1: add workaround > for ambiguous C99 stdint.h types"), this patch redefines the macros that > are used in stdint.h so its definitions of uint64_t and int64_t are > compatible with those of the kernel. > > This patch comes from: https://patchwork.kernel.org/patch/3540001/ > Wrote by: Ard Biesheuvel > > We mark this file as a private file and don't have to override asm/types.h > > Reviewed-by: Ard Biesheuvel > Signed-off-by: Jackie Liu > --- > arch/arm64/include/asm/neon-intrinsics.h | 34 > > 1 file changed, 34 insertions(+) > create mode 100644 arch/arm64/include/asm/neon-intrinsics.h > > diff --git a/arch/arm64/include/asm/neon-intrinsics.h > b/arch/arm64/include/asm/neon-intrinsics.h > new file mode 100644 > index 000..e378766 > --- /dev/null > +++ b/arch/arm64/include/asm/neon-intrinsics.h > @@ -0,0 +1,34 @@ > +#ifndef _NEON_INTRINSICS_H > +#define _NEON_INTRINSICS_H We tend to name these with an __ASM_ prefix, so it should be: #ifndef __ASM_NEON_INTRINSICS_H That said, I notice that the commit you refer to for arch/arm/ actually places this stuff under uapi/. Is that needed? > +#include > + > +/* > + * For Aarch64, there is some ambiguity in the definition of the types below > + * between the kernel and GCC itself. This is usually not a big deal, but it > + * causes trouble when including GCC's version of 'stdint.h' (this is the > file > + * that gets included when you #include on a -ffreestanding > build). > + * As this file also gets included implicitly when including 'arm_neon.h' > (the > + * NEON intrinsics support header), we need the following to work around the > + * issue if we want to use NEON intrinsics in the kernel. > + */ Could you elaborate on what the ambiguities / conflicts in the types are please? 
I think you can also remove the sentence about directly including stdint on a freestanding build, since it doesn't seem relevant to the kernel afaict (we only pull it in via arm_neon.h). > + > +#ifdef __INT64_TYPE__ > +#undef __INT64_TYPE__ > +#define __INT64_TYPE__ __signed__ long long Do we need this __signed__ part? Will
Re: [PATCH V2] mm: Replace all open encodings for NUMA_NO_NODE
On Mon, 2018-11-26 at 17:56 +0530, Anshuman Khandual wrote: > At present there are multiple places where invalid node number is encoded > as -1. Even though implicitly understood it is always better to have macros > in there. Replace these open encodings for an invalid node number with the > global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like > 'invalid node' from various places redirecting them to a common definition. > > Signed-off-by: Anshuman Khandual > --- > Changes in V2: > > - Added inclusion of 'numa.h' header at various places per Andrew > - Updated 'dev_to_node' to use NUMA_NO_NODE instead per Vinod > > Changes in V1: (https://lkml.org/lkml/2018/11/23/485) > > - Dropped OCFS2 changes per Joseph > - Dropped media/video drivers changes per Hans > > RFC - https://patchwork.kernel.org/patch/10678035/ > > Build tested this with multiple cross compiler options like alpha, sparc, > arm64, x86, powerpc, powerpc64le etc with their default config which might > not have compiled tested all driver related changes. I will appreciate > folks giving this a test in their respective build environment. > > All these places for replacement were found by running the following grep > patterns on the entire kernel code. Please let me know if this might have > missed some instances. This might also have replaced some false positives. > I will appreciate suggestions, inputs and review. > > 1. git grep "nid == -1" > 2. git grep "node == -1" > 3. git grep "nid = -1" > 4. git grep "node = -1" > > drivers/infiniband/hw/hfi1/affinity.c | 3 ++- > drivers/infiniband/hw/hfi1/init.c | 3 ++- For the drivers/infiniband changes, Acked-by: Doug Ledford -- Doug Ledford GPG KeyID: B826A3330E572FDD Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/1/2018 10:32 AM, Bart Van Assche wrote:

On 12/1/18 9:11 AM, Hannes Reinecke wrote:

Yes, I'm very much in favour of this, too. We always have this IMO slightly weird notion of stopping the queue, setting some error flags in the driver, then _restarting_ the queue, just so that the driver then sees the error flag and terminates the requests. Which I always found quite counter-intuitive. So having a common helper for terminating requests for queue errors would be very welcomed here. But when we have that we really should audit all drivers to ensure they do the right thing (tm).

Would calling blk_abort_request() for all outstanding requests be sufficient to avoid that the queue has to be stopped and restarted in the nvme-fc driver?

What nvme-fc does is the same as what is done in all the other transports - for the same reasons. If we're eliminating those synchronization reasons, and now that we've plugged the request_queue path into the transports to check state appropriately, I don't think there are reasons to block the queue. In some respects, it is nice to stop new I/O while the work to terminate everything else happens, but I don't know that it's required.

I would hope that the bounced work due to the controller state (returned BLK_STS_RESOURCE) is actually pausing for a short while. I've seen some circumstances where it didn't and was infinitely polling, which would be a change in behavior vs the queue stops.

-- james
Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues
On Sun, Dec 02, 2018 at 08:46:19AM -0800, Christoph Hellwig wrote:
> This will allow us to simplify both the regular NVMe interrupt handler
> and the upcoming aio poll code. In addition to that the separate
> queues are generally a good idea for performance reasons.
>
> Signed-off-by: Christoph Hellwig

Looks good.

Reviewed-by: Keith Busch
Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
On Sun, Dec 02, 2018 at 08:46:25AM -0800, Christoph Hellwig wrote:
> The ->poll_fn has been stale for a while, as a lot of places check for mq
> ops. But there is no real point in it anyway, as we don't even use
> the multipath code for subsystems without multiple ports, which is usually
> what we do high performance I/O to. If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
>
> Signed-off-by: Christoph Hellwig

This was a bit flawed anyway since the head's current path could change,
and you end up polling the wrong request_queue. Not really harmful other
than some wasted CPU cycles, but might be worth thinking about if we want
to bring mpath polling back.

Reviewed-by: Keith Busch
Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues
On Sun, Dec 02, 2018 at 08:46:22AM -0800, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
>
> Signed-off-by: Christoph Hellwig

Looks good.

Reviewed-by: Keith Busch
Re: [Xen-devel] [PATCH] xen-blkfront: use old rinfo after enomem during migration
On 12/2/18 3:31 PM, Manjunath Patil wrote:
> On 11/30/2018 2:33 PM, Boris Ostrovsky wrote:
>> On 11/30/18 4:49 PM, Manjunath Patil wrote:
>>> Thank you Boris for your comments. I removed faulty email of mine.
>>>
>>> replies inline.
>>> On 11/30/2018 12:42 PM, Boris Ostrovsky wrote:
>>>> On 11/29/18 12:17 AM, Manjunath Patil wrote:
>>>>> Hi,
>>>>> Feel free to suggest/comment on this.
>>>>>
>>>>> I am trying to do the following at dst during the migration now.
>>>>> 1. Don't clear the old rinfo in blkif_free(). Instead just clean it.
>>>>> 2. Store the old rinfo and nr_rings into temp variables in
>>>>>    negotiate_mq()
>>>>> 3. Let nr_rings get re-calculated based on backend data
>>>>> 4. Try allocating new memory based on new nr_rings
>>>>
>>>> Since I suspect the number of rings will likely be the same, why not
>>>> reuse the rings in the common case?
>>> I thought attaching devices will be more frequent than migration. Hence
>>> I did not want to add an extra check for
>>> - if I am inside the migration code path, and
>>> - if the new nr_rings is equal to the old nr_rings or not
>>>
>>> Sure, the addition of such a thing would avoid the memory allocation
>>> altogether in the migration path, but it would add a little overhead
>>> for normal device addition.
>>>
>>> Do you think it's worth adding that change?
>>
>> IMO a couple of extra checks are not going to make much difference.
> I will add this change
>>
>> I wonder though --- have you actually seen the case where you did fail
>> allocation and changes provided in this patch made things work? I am
>> asking because right after negotiate_mq() we will call setup_blkring()
>> and it will want to allocate a bunch of memory. A failure there is fatal
>> (to ring setup). So it seems to me that you will survive negotiate_mq()
>> but then will likely fail soon after.
> I have noticed the ENOMEM inside negotiate_mq() on ct machine. When I
> included my patch, I manually triggered the ENOMEM using a debug flag.
> The patch works for ENOMEM inside negotiate_mq().
> As you mentioned, if we really hit the ENOMEM in negotiate_mq(), we
> might hit it in setup_blkring() as well.
> We should add a similar change for the blkif_sring struct as well.

Won't you have a similar issue with other frontends, say, netfront?

-boris
block: sbitmap related lockdep warning
Hi,

Just found there is an sbitmap related lockdep warning; I haven't taken a
close look yet, maybe it is caused by a recent sbitmap change.

[1] test
- modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
- then run fio on the 4 null_blk devices

[2] lockdep warning

[ 100.967642] start test sanity/001
[ 101.238280] null: module loaded
[ 106.093735]
[ 106.094012] =
[ 106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[ 106.096551] -
[ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 106.098231] 4c43fa71 (&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[ 106.099431]
[ 106.099431] and this task is already holding:
[ 106.100229] 7eec8b2f (&(&hctx->dispatch_wait_lock)->rlock){}, at: blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.101630] which would create a new lock dependency:
[ 106.102326] (&(&hctx->dispatch_wait_lock)->rlock){} -> (&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.103553]
[ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 106.104580] (&sbq->ws[i].wait){..-.}
[ 106.104582]
[ 106.104582] ... which became SOFTIRQ-irq-safe at:
[ 106.105751] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.106284] __wake_up_common_lock+0x119/0x1b9
[ 106.106825] sbitmap_queue_wake_up+0x33f/0x383
[ 106.107456] sbitmap_queue_clear+0x4c/0x9a
[ 106.108046] __blk_mq_free_request+0x188/0x1d3
[ 106.108581] blk_mq_free_request+0x23b/0x26b
[ 106.109102] scsi_end_request+0x345/0x5d7
[ 106.109587] scsi_io_completion+0x4b5/0x8f0
[ 106.110099] scsi_finish_command+0x412/0x456
[ 106.110615] scsi_softirq_done+0x23f/0x29b
[ 106.15] blk_done_softirq+0x2a7/0x2e6
[ 106.111608] __do_softirq+0x360/0x6ad
[ 106.112062] run_ksoftirqd+0x2f/0x5b
[ 106.112499] smpboot_thread_fn+0x3a5/0x3db
[ 106.113000] kthread+0x1d4/0x1e4
[ 106.113457] ret_from_fork+0x3a/0x50
[ 106.113969]
[ 106.113969] to a SOFTIRQ-irq-unsafe lock:
[ 106.114672] (&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.114674]
[ 106.114674] ... which became SOFTIRQ-irq-unsafe at:
[ 106.116000] ...
[ 106.116003] _raw_spin_lock+0x33/0x64
[ 106.116676] sbitmap_get+0xd5/0x22c
[ 106.117134] __sbitmap_queue_get+0xe8/0x177
[ 106.117731] __blk_mq_get_tag+0x1e6/0x22d
[ 106.118286] blk_mq_get_tag+0x1db/0x6e4
[ 106.118756] blk_mq_get_driver_tag+0x161/0x258
[ 106.119383] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.120043] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.120607] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.121234] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.121781] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.122366] blk_mq_run_hw_queue+0x151/0x187
[ 106.122887] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.123492] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.124042] blk_flush_plug_list+0x392/0x3d7
[ 106.124557] blk_finish_plug+0x37/0x4f
[ 106.125019] read_pages+0x3ef/0x430
[ 106.125446] __do_page_cache_readahead+0x18e/0x2fc
[ 106.126027] force_page_cache_readahead+0x121/0x133
[ 106.126621] page_cache_sync_readahead+0x35f/0x3bb
[ 106.127229] generic_file_buffered_read+0x410/0x1860
[ 106.127932] __vfs_read+0x319/0x38f
[ 106.128415] vfs_read+0xd2/0x19a
[ 106.128817] ksys_read+0xb9/0x135
[ 106.129225] do_syscall_64+0x140/0x385
[ 106.129684] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.130292]
[ 106.130292] other info that might help us debug this:
[ 106.130292]
[ 106.131226] Chain exists of:
[ 106.131226]   &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
[ 106.131226]
[ 106.132865]  Possible interrupt unsafe locking scenario:
[ 106.132865]
[ 106.133659]        CPU0                    CPU1
[ 106.134194]        ----                    ----
[ 106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
[ 106.135318]                                local_irq_disable();
[ 106.136014]                                lock(&sbq->ws[i].wait);
[ 106.136747]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
[ 106.137742]   <Interrupt>
[ 106.138110]     lock(&sbq->ws[i].wait);
[ 106.138625]
[ 106.138625]  *** DEADLOCK ***
[ 106.138625]
[ 106.139430] 3 locks held by fio/1043:
[ 106.139947]  #0: 76ff0fd9 (rcu_read_lock){}, at: hctx_lock+0x29/0xe8
[ 106.140813]  #1: 2feb1016 (&sbq->ws[i].wait){..-.}, at: blk_mq_dispatch_rq_list+0x4ad/0xd7c
[ 106.141877]  #2: 7eec8b2f (&(&hctx->dispatch_wait_lock)->rlock){}, at: blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.143267]
[ 106.143267] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[ 106.144351] -> (&sbq->ws[i].wait){..-.} ops: 82 {
[ 106.144926]    IN-SOFTIRQ-W at:
[ 106.145314]      _raw
Re: [PATCH v5 0/5] lightnvm: Flexible metadata
Great! The tests (rocksdb, pblk recovery and the generic xfs suite) completed
successfully on one of our disks, so feel free to add:

Tested-by: Hans Holmberg

Thanks,
Hans

On Fri, Nov 30, 2018 at 2:03 PM Hans Holmberg wrote:
>
> I just started a regression test on this patch set that'll run over
> the weekend. I'll add a tested-by if everything checks out.
>
> All the best,
> Hans
> On Fri, Nov 30, 2018 at 12:49 PM Igor Konopko wrote:
> >
> > This series of patches extends the way how pblk can
> > store L2P sector metadata. After this set of changes
> > any size of NVMe metadata is supported in pblk.
> > There is also support for the case without NVMe metadata.
> >
> > Changes v4 --> v5:
> > - rebase on top of ocssd/for-4.21/core
> >
> > Changes v3 --> v4:
> > - rename nvm_alloc_dma_pool() to nvm_create_dma_pool()
> > - split pblk_get_meta() calls and lba setting into
> >   two operations for better code readability
> > - fix compilation with CONFIG_NVM disabled
> > - get rid of unnecessary memcpy for packed metadata
> >   on the write path
> > - support for drives with oob size >0 and <16B in packed
> >   metadata mode
> > - minor commit message updates
> >
> > Changes v2 --> v3:
> > - rebase on top of ocssd/for-4.21/core
> > - get/set_meta_lba helpers were removed
> > - dma reallocation was replaced with a single allocation
> > - oob metadata size was added to the pblk structure
> > - proper checks on pblk creation were added
> >
> > Changes v1 --> v2:
> > - revert sector meta size back to 16b for pblk
> > - dma pool for larger oob meta is handled in core instead of pblk
> > - pblk oob meta helpers use __le64 as input/output instead of u64
> > - other minor fixes based on v1 patch review
> >
> > Igor Konopko (5):
> >   lightnvm: pblk: Move lba list to partial read context
> >   lightnvm: pblk: Helpers for OOB metadata
> >   lightnvm: Flexible DMA pool entry size
> >   lightnvm: Disable interleaved metadata
> >   lightnvm: pblk: Support for packed metadata
> >
> >  drivers/lightnvm/core.c          |  9 --
> >  drivers/lightnvm/pblk-core.c     | 61 +++--
> >  drivers/lightnvm/pblk-init.c     | 44 +--
> >  drivers/lightnvm/pblk-map.c      | 20 +++-
> >  drivers/lightnvm/pblk-rb.c       |  3 ++
> >  drivers/lightnvm/pblk-read.c     | 66 +++-
> >  drivers/lightnvm/pblk-recovery.c | 25 +--
> >  drivers/lightnvm/pblk-sysfs.c    |  7 +
> >  drivers/lightnvm/pblk-write.c    |  9 +++---
> >  drivers/lightnvm/pblk.h          | 24 +--
> >  drivers/nvme/host/lightnvm.c     |  6 ++--
> >  include/linux/lightnvm.h         |  3 +-
> >  12 files changed, 209 insertions(+), 68 deletions(-)
> >
> > --
> > 2.14.5