[This email was generated by a script. Let me know if you have any suggestions
to make it better.]
Of the currently open syzbot reports against the upstream kernel, I've manually
marked 11 of them as possibly being bugs in the block subsystem. I've listed
these reports below, sorted by an algori…
Ming,
On Tue, 25 Jun 2019, Ming Lei wrote:
> On Mon, Jun 24, 2019 at 05:42:39PM +0200, Thomas Gleixner wrote:
> > On Mon, 24 Jun 2019, Weiping Zhang wrote:
> >
> > > The driver may implement multiple affinity sets, and some of
> > > them are empty; for this case we just skip them.
> >
> > Why? What's
A bfq_queue Q may happen to be synchronized with another
bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
receive new I/O. We call Q2 "waker queue".
If I/O plugging is being performed for Q, and Q is not receiving any
more I/O because of the above synchronization, then, thanks t…
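A minimal sketch of the injection decision this enables (field names are
illustrative, not necessarily those of bfq-iosched.c):

        /*
         * Sketch: while Q (bfqq) sits empty and plugged, waiting for I/O
         * that only its waker can unblock, dispatch the waker's pending
         * requests instead of idling the device.
         */
        if (bfq_bfqq_wait_request(bfqq) && bfqq->waker_bfqq &&
            !RB_EMPTY_ROOT(&bfqq->waker_bfqq->sort_list))
                return bfqq->waker_bfqq;        /* inject the waker's I/O */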
BFQ enqueues the I/O coming from each process into a separate
bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
served for at most timeout_sync milliseconds (default: 125 ms). This
service scheme is prone to the following inaccuracy.
While a bfq_queue Q1 is in service, some emp
Until the base value for request service times gets finally computed
for a bfq_queue, the inject limit for that queue does depend on the
think-time state (short|long) of the queue. A timely update of the
think time then guarantees a quicker activation or deactivation of the
injection. Fortunately,
One of the cases where the parameters for injection may be updated is
when there are no more in-flight I/O requests. The number of in-flight
requests is stored in the field bfqd->rq_in_driver of the descriptor
bfqd of the device. So, the controlled condition is
bfqd->rq_in_driver == 0.
Unfortunate…
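A simplified sketch of the completion-path hook where this condition is
evaluated (not the exact upstream code):

        /* On request completion: */
        bfqd->rq_in_driver--;
        if (bfqd->rq_in_driver == 0)
                bfq_update_inject_limit(bfqd, bfqq);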
I/O injection gets reduced if it increases the request service times
of the victim queue beyond a certain threshold. The threshold, in its
turn, is computed as a function of the base service time enjoyed by
the queue when it undergoes no injection.
As a consequence, for injection to work properly…
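A sketch of that threshold relation (the 3/2 factor is illustrative, not
necessarily BFQ's actual constant):

        /*
         * Back off injection when it inflates the victim queue's request
         * service times past a threshold derived from the base (i.e.,
         * injection-free) service time.
         */
        u64 threshold = (bfqq->last_serv_time_ns * 3) >> 1;

        if (tot_time_ns > threshold && bfqq->inject_limit > 0)
                bfqq->inject_limit--;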
Until the base value of the request service times gets finally
computed for a bfq_queue, the inject limit does depend on the
think-time state (short|long). The limit must be 0 or 1 if the think
time is deemed, respectively, as short or long. However, such a check
and possible limit update is perfor…
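In code form, the intended invariant is roughly this (a sketch;
bfq_bfqq_has_short_ttime() is BFQ's think-time predicate):

        /*
         * Before the base service times are known, gate injection on the
         * think-time state: none for short, a single request for long.
         */
        if (bfq_bfqq_has_short_ttime(bfqq))
                bfqq->inject_limit = 0;
        else
                bfqq->inject_limit = 1;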
Consider, on one side, a bfq_queue Q that remains empty while in
service, and, on the other side, the pending I/O of bfq_queues that,
according to their timestamps, have to be served after Q. If an
uncontrolled amount of I/O from the latter bfq_queues were dispatched
while Q is waiting for its new
[SAME AS V1, APART FROM SRIVATSA ADDED AS REPORTER]
Hi Jens,
this series, based against for-5.3/block, contains:
1) The improvements to recover the throughput loss reported by
Srivatsa [1] (first five patches)
2) A preemption improvement to reduce I/O latency
3) A fix of a subtle bug causing lo
> On 24 Jun 2019, at 22:15, Srivatsa S. Bhat
> wrote:
>
> On 6/24/19 12:40 PM, Paolo Valente wrote:
>> Hi Jens,
>> this series, based against for-5.3/block, contains:
>> 1) The improvements to recover the throughput loss reported by
>> Srivatsa [1] (first five patches)
>>
Hi Ming,
> -Original Message-
> From: Ming Lei
> Sent: Tuesday, June 25, 2019 10:27 AM
> To: wenbinzeng(曾文斌)
> Cc: Wenbin Zeng ; ax...@kernel.dk;
> keith.bu...@intel.com;
> h...@suse.com; osan...@fb.com; s...@grimberg.me; bvanass...@acm.org;
> linux-block@vger.kernel.org; linux-ker...@v
Looks good, with one nit that can be fixed at the time of applying the patch.
Reviewed-by: Chaitanya Kulkarni
On 6/24/19 7:46 PM, Damien Le Moal wrote:
> To allow the SCSI subsystem scsi_execute_req() function to issue
> requests using large buffers that are better allocated with vmalloc()
> rather than k
On 6/25/19 10:27 AM, Ming Lei wrote:
> On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote:
>> Hi Ming,
>>
>>> -Original Message-
>>> From: Ming Lei
>>> Sent: Tuesday, June 25, 2019 9:55 AM
>>> To: Wenbin Zeng
>>> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; o
Limit the size of the struct blk_zone array used in
blk_revalidate_disk_zones() to avoid memory allocation failures leading
to disk revalidation failure. Further reduce the likelihood of these
failures by using kvmalloc() instead of directly allocating contiguous
pages.
Fixes: 515ce6061312 ("scsi:…
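The fix presumably takes this shape (a sketch; the cap variable is an
assumption):

        /*
         * Cap the per-call zone count, and let kvmalloc-family
         * allocations fall back to vmalloc() when contiguous pages are
         * not available.
         */
        nr_zones = min(nr_zones, max_zones_per_call);
        zones = kvcalloc(nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
        if (!zones)
                return -ENOMEM;
        /* ... issue report_zones and check the results ... */
        kvfree(zones);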
This series addresses a recurring problem with zone revalidation
failures observed during extensive testing with memory-constrained
systems and device hot-plugging.
The source of the problem is the failure to allocate large memory areas
with alloc_pages() or kmalloc() in blk_revalidate_disk_zones() to store t…
During disk scan and revalidation done with sd_revalidate(), the zones
of a zoned disk are checked using the helper function
blk_revalidate_disk_zones() if a configuration change is detected
(change in the number of zones or zone size). The function
blk_revalidate_disk_zones() issues report_zones c
To allow the SCSI subsystem scsi_execute_req() function to issue
requests using large buffers that are better allocated with vmalloc()
rather than kmalloc(), modify bio_map_kern() to allow passing a buffer
allocated with the vmalloc() function. To do so, simply test the buffer
address using is_vmal…
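Inside bio_map_kern(), the per-chunk page lookup then plausibly becomes
(a sketch; the loop structure is assumed):

        bool is_vmalloc = is_vmalloc_addr(data);
        char *p = data;

        while (len) {
                unsigned int off = offset_in_page(p);
                unsigned int bytes = min_t(unsigned int, len, PAGE_SIZE - off);
                /* vmalloc pages are not physically contiguous: look
                 * each one up instead of using virt_to_page(). */
                struct page *page = is_vmalloc ? vmalloc_to_page(p)
                                               : virt_to_page(p);

                if (bio_add_pc_page(q, bio, page, bytes, off) < bytes)
                        break;
                p += bytes;
                len -= bytes;
        }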
Hi Dongli,
> -Original Message-
> From: Dongli Zhang
> Sent: Tuesday, June 25, 2019 9:30 AM
> To: Wenbin Zeng
> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com;
> ming@redhat.com;
> osan...@fb.com; s...@grimberg.me; bvanass...@acm.org;
> linux-block@vger.kernel.org; linux-
On Tue, Jun 25, 2019 at 02:14:46AM +, wenbinzeng(曾文斌) wrote:
> Hi Ming,
>
> > -Original Message-
> > From: Ming Lei
> > Sent: Tuesday, June 25, 2019 9:55 AM
> > To: Wenbin Zeng
> > Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com;
> > s...@grimberg.me; bvanas
Hi Ming,
> -Original Message-
> From: Ming Lei
> Sent: Tuesday, June 25, 2019 9:55 AM
> To: Wenbin Zeng
> Cc: ax...@kernel.dk; keith.bu...@intel.com; h...@suse.com; osan...@fb.com;
> s...@grimberg.me; bvanass...@acm.org; linux-block@vger.kernel.org;
> linux-ker...@vger.kernel.org; wenbin
Hi Thomas,
On Mon, Jun 24, 2019 at 05:42:39PM +0200, Thomas Gleixner wrote:
> On Mon, 24 Jun 2019, Weiping Zhang wrote:
>
> > The driver may implement multiple affinity sets, and some of
> > them are empty; for this case we just skip them.
>
> Why? What's the point of creating empty sets? Just because
On 2019/6/25 2:14 AM, Eric Wheeler wrote:
> On Mon, 24 Jun 2019, Coly Li wrote:
>
>> On 2019/6/23 7:16 AM, Eric Wheeler wrote:
>>> From: Eric Wheeler
>>>
>>> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
>>> currently no SCSI/RAID controller drivers that do. Previously s
On Mon, Jun 24, 2019 at 11:24:07PM +0800, Wenbin Zeng wrote:
> Currently hctx->cpumask is not updated when hot-plugging new cpus,
> as there are many chances kblockd_mod_delayed_work_on() getting
> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
There are only two cases in which
Hi Wenbin,
On 6/24/19 11:24 PM, Wenbin Zeng wrote:
> Currently hctx->cpumask is not updated when hot-plugging new cpus,
> as there are many chances kblockd_mod_delayed_work_on() getting
> called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
> on the newly-plugged cpus, consequently _
Eric,
> Perhaps they do not set stripe_width using io_opt? I did a grep to see
> if any of them did, but I didn't see them. How is stripe_width
> indicated by RAID controllers?
The values are reported in the Block Limits VPD page for each SCSI block
device and are thus set by the SCSI disk driv…
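For reference, a sketch of how sd plumbs that VPD field into the queue
limits (offsets per the Block Limits VPD page layout; variable names
illustrative):

        /*
         * OPTIMAL TRANSFER LENGTH lives in bytes 12..15 of the Block
         * Limits VPD page, in logical blocks; sd exposes it as
         * queue_limits.io_opt.
         */
        sdkp->opt_xfer_blocks = get_unaligned_be32(&vpd[12]);
        blk_queue_io_opt(sdkp->disk->queue,
                         sdkp->opt_xfer_blocks * sdkp->device->sector_size);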
> @@ -2627,7 +2752,30 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
>
> static void nvme_pci_get_ams(struct nvme_ctrl *ctrl, u32 *ams)
> {
> - *ams = NVME_CC_AMS_RR;
> + /* if device doesn't support WRR, force reset wrr queues to 0 */
> + if (!NV
On 6/24/19 12:40 PM, Paolo Valente wrote:
> Hi Jens,
> this series, based against for-5.3/block, contains:
> 1) The improvements to recover the throughput loss reported by
>Srivatsa [1] (first five patches)
> 2) A preemption improvement to reduce I/O latency
> 3) A fix of a subtle bug causing l
On 19-06-24 22:29:05, Weiping Zhang wrote:
> The get_ams() will return the AMS(Arbitration Mechanism Selected)
> from the driver.
>
> Signed-off-by: Weiping Zhang
Hello, Weiping.
Sorry, but I don't really get what your point is here. Could you please
elaborate this patch a little bit more? Th
On 19-06-24 22:29:19, Weiping Zhang wrote:
> Now nvme support three type hardware queues, read, poll and default,
> this patch rename write_queues to read_queues to set the number of
> read queues more explicitly. This patch alos is prepared for nvme
> support WRR(weighted round robin) that we can
Hi Jens,
this series, based against for-5.3/block, contains:
1) The improvements to recover the throughput loss reported by
Srivatsa [1] (first five patches)
2) A preemption improvement to reduce I/O latency
3) A fix of a subtle bug causing loss of control over I/O bandwidths
Thanks,
Paolo
[1]
On 2019-06-24 12:54 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:
>
>>> Sounded like this series does generate the dma_addr for the correct
>>> device..
>>
>> This series doesn't generate any DMA addresses with dma_map(). The
>> current p2pdma c
> On 24 Jun 2019, at 18:12, Jens Axboe wrote:
>
> On 6/22/19 2:44 PM, Paolo Valente wrote:
>> By mistake, there is a '&' instead of a '==' in the definition of the
>> macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
>> the correct one.
>
> A bit worr
On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:
> > Sounded like this series does generate the dma_addr for the correct
> > device..
>
> This series doesn't generate any DMA addresses with dma_map(). The
> current p2pdma code ensures everything is behind the same root port and
>
On 2019-06-24 12:16 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
>>> It is only a very narrow case where you can take shortcuts with
>>> dma_addr_t, and I don't think shortcuts like that are appropriate for
>>> the mainline kernel..
>>
>> I don't
On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
> > It is only a very narrow case where you can take shortcuts with
> > dma_addr_t, and I don't think shortcuts like that are appropriate for
> > the mainline kernel..
>
> I don't think it's that narrow and it opens up a lot of avenue
On Mon, 24 Jun 2019, Coly Li wrote:
> On 2019/6/23 7:16 AM, Eric Wheeler wrote:
> > From: Eric Wheeler
> >
> > While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> > currently no SCSI/RAID controller drivers that do. Previously stripe_size
> > and partial_stripes_expensive w
On Sat 15-06-19 11:24:48, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup. This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the act
On 2019-06-24 7:55 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
>> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
>>> BTW, it is not just offset right? It is possible that the IOMMU can
>>> generate unique dma_addr_t values f
On Sat 15-06-19 11:24:46, Tejun Heo wrote:
> When writeback IOs are bounced through async layers, the IOs should
> only be accounted against the wbc from the original bdi writeback to
> avoid confusing cgroup inode ownership arbitration. Add
> wbc->no_wbc_acct to allow disabling wbc accounting. T
On Sat 15-06-19 11:24:47, Tejun Heo wrote:
> Add a helper to determine the target blkcg from wbc.
>
> Signed-off-by: Tejun Heo
> Reviewed-by: Josef Bacik
Looks good to me. You can add:
Reviewed-by: Jan Kara
Honza
> ---
> inclu
On Sat 15-06-19 11:24:45, Tejun Heo wrote:
> btrfs is going to use css_put() and wbc helpers to improve cgroup
> writeback support. Add dummy css_get() definition and export wbc
> helpers to prepare for module and !CONFIG_CGROUP builds.
>
> Signed-off-by: Tejun Heo
> Reported-by: kbuild test rob
On Mon 24-06-19 05:58:56, Tejun Heo wrote:
> Hello, Jan.
>
> On Mon, Jun 24, 2019 at 10:21:30AM +0200, Jan Kara wrote:
> > OK, now I understand. Just one more question: So effectively, you are using
> > wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
> > it whether IO sh
On 6/22/19 2:44 PM, Paolo Valente wrote:
> By mistake, there is a '&' instead of a '==' in the definition of the
> macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
> the correct one.
A bit worrying that this wasn't caught in testing, as it would have
resulted in _any_ queue b
On 6/24/19 12:12 AM, Christoph Hellwig wrote:
> A large chunk of NVMe updates for 5.3. Highlights:
>
> - improved PCIe suspend support (Keith Busch)
> - error injection support for the admin queue (Akinobu Mita)
> - Fibre Channel discovery improvements (James Smart)
> - tracing improvemen
On 2019-06-24 7:46 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
>> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
>>> My primary concern with this is that ascribes a level of generality
>>> that just isn't there for peer-to
set_capacity expects the disk size in sectors of 512 bytes, and changing
the magic number 9 to SECTOR_SHIFT clarifies this intent.
Signed-off-by: Marcos Paulo de Souza
---
drivers/block/nbd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/nbd.c b/drivers/block/…
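The change itself presumably boils down to this one-liner (a
reconstructed sketch, not the verbatim hunk):

-	set_capacity(nbd->disk, config->bytesize >> 9);
+	set_capacity(nbd->disk, config->bytesize >> SECTOR_SHIFT);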
On 2019-06-24 1:27 a.m., Christoph Hellwig wrote:
> This is not going to fly.
>
> For one passing a dma_addr_t through the block layer is a layering
> violation, and one that I think will also bite us in practice.
> The host physical to PCIe bus address mapping can have offsets, and
> those off
On Mon, 24 Jun 2019, Weiping Zhang wrote:
> The driver may implement multiple affinity sets, and some of
> them are empty; for this case we just skip them.
Why? What's the point of creating empty sets? Just because is not a real
good justification.
Leaving the patch for Ming.
Thanks,
tglx
>
Currently hctx->cpumask is not updated when hot-plugging new CPUs.
Since there are many chances of kblockd_mod_delayed_work_on() being
called with WORK_CPU_UNBOUND, workqueue blk_mq_run_work_fn may run
on the newly-plugged CPUs, with __blk_mq_run_hw_queue() consequently
reporting excessive "run queue from w…
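One plausible shape for such an update is a CPU-hotplug online callback
(a sketch of the idea, not necessarily this patch's approach):

        static int blk_mq_hctx_notify_online(unsigned int cpu,
                                             struct hlist_node *node)
        {
                struct blk_mq_hw_ctx *hctx =
                        hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);

                /* Make a newly onlined CPU visible in the hctx cpumask. */
                if (!cpumask_test_cpu(cpu, hctx->cpumask))
                        cpumask_set_cpu(cpu, hctx->cpumask);
                return 0;
        }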
Hi,
This series tries to add Weighted Round Robin for the block cgroup and
nvme driver. When multiple containers share a single nvme device, we
want to protect the IO-critical container from being interfered with
by other containers. We add a blkio.wrr interface for users to control
their IO priority. The blkio.w…
The driver may implement multiple affinity sets, and some of
them are empty; for this case we just skip them.
Signed-off-by: Weiping Zhang
---
kernel/irq/affinity.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index f18cd5aa33e8..6d964fe0fbd8 100644
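Given the four added lines, the guard presumably sits in the
set-spreading loop (a sketch; fields follow the 5.2-era struct
irq_affinity):

        for (i = 0; i < affd->nr_sets; i++) {
                unsigned int this_vecs = affd->set_size[i];

                /* Skip empty affinity sets instead of spreading 0 vectors. */
                if (!this_vecs)
                        continue;
                /* ... spread this_vecs vectors across the set ... */
        }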
Each block cgroup can select a weighted round robin type to make
its io requests go to the specified hardware queue. Now we support
three round robin types, high, medium and low, like what the nvme
specification does.
Signed-off-by: Weiping Zhang
---
block/blk-cgroup.c | 89 +
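The interface would presumably be exposed as a blkcg cftype along these
lines (a sketch; the handler names are hypothetical):

        static struct cftype blkcg_wrr_files[] = {
                {
                        .name = "wrr",          /* accepts high|medium|low */
                        .seq_show = blkcg_wrr_show,
                        .write = blkcg_wrr_write,
                },
                { }     /* terminator */
        };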
This patch enables Weighted Round Robin if the nvme device supports it.
We add four module parameters wrr_urgent_queues, wrr_high_queues,
wrr_medium_queues, wrr_low_queues to control the number of queues for
each specified priority. If the device doesn't support WRR, all four
parameters will be forced reset…
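The parameters would presumably be declared along these lines (a sketch;
types and permissions are assumptions):

        static unsigned int wrr_urgent_queues;
        module_param(wrr_urgent_queues, uint, 0644);
        MODULE_PARM_DESC(wrr_urgent_queues, "Number of WRR urgent queues");

        static unsigned int wrr_high_queues;
        module_param(wrr_high_queues, uint, 0644);
        MODULE_PARM_DESC(wrr_high_queues, "Number of WRR high-priority queues");
        /* wrr_medium_queues and wrr_low_queues follow the same pattern. */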
Now nvme supports three types of hardware queues: read, poll and
default. This patch renames write_queues to read_queues to set the
number of read queues more explicitly. This patch also prepares for
nvme WRR (weighted round robin) support, so that we can get the number
of each queue type easily.
Signed-o
The get_ams() will return the AMS (Arbitration Mechanism Selected)
from the driver.
Signed-off-by: Weiping Zhang
---
drivers/nvme/host/core.c | 9 -
drivers/nvme/host/nvme.h | 1 +
drivers/nvme/host/pci.c | 6 ++
include/linux/nvme.h | 1 +
4 files changed, 16 insertions(+), 1 de
On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> > BTW, it is not just offset right? It is possible that the IOMMU can
> > generate unique dma_addr_t values for each device?? Simple offset is
> > just something w
On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> BTW, it is not just offset right? It is possible that the IOMMU can
> generate unique dma_addr_t values for each device?? Simple offset is
> just something we saw in certain embedded cases, IIRC.
Yes, it could. If we are trying to
On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not
On 24/06/2019 09:46, Ming Lei wrote:
> On Wed, Jun 05, 2019 at 03:10:51PM +0100, John Garry wrote:
>> On 31/05/2019 03:27, Ming Lei wrote:
>>> index 32b8ad3d341b..49d73d979cb3 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -2433,6 +2433,11 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set
Hello, Jan.
On Mon, Jun 24, 2019 at 10:21:30AM +0200, Jan Kara wrote:
> OK, now I understand. Just one more question: So effectively, you are using
> wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
> it whether IO should or should not be accounted with wbc_account_io().
On Wed, Jun 05, 2019 at 03:10:51PM +0100, John Garry wrote:
> On 31/05/2019 03:27, Ming Lei wrote:
> > index 32b8ad3d341b..49d73d979cb3 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2433,6 +2433,11 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx
On Fri, May 31, 2019 at 08:37:39AM -0700, Bart Van Assche wrote:
> On 5/30/19 7:27 PM, Ming Lei wrote:
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 6aea0ebc3a73..3d6780504dcb 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -237,6 +237,7
On Fri, May 31, 2019 at 08:39:04AM -0700, Bart Van Assche wrote:
> On 5/30/19 7:27 PM, Ming Lei wrote:
> > +static int g_host_tags = 0;
>
> Static variables should not be explicitly initialized to zero.
OK
>
> > +module_param_named(host_tags, g_host_tags, int, S_IRUGO);
> > +MODULE_PARM_DESC(ho
Hello Tejun!
On Thu 20-06-19 10:02:50, Tejun Heo wrote:
> On Thu, Jun 20, 2019 at 05:21:45PM +0200, Jan Kara wrote:
> > I'm completely ignorant of how btrfs compressed writeback works so don't
> > quite understand implications of this. So does this mean that writeback to
> > btrfs compressed files
Hello,
syzbot found the following crash on:
HEAD commit: bed3c0d8 Merge tag 'for-5.2-rc5-tag' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1418bf0aa0
kernel config: https://syzkaller.appspot.com/x/.config?x=28ec3437a5394ee0
da
On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > My primary concern with this is that ascribes a level of generality
> > that just isn't there for peer-to-peer dma operations. "Peer"
> > addresses are not "DMA" addresses, and the rules about what can and
> > can't do peer-DMA ar
This is not going to fly.
For one passing a dma_addr_t through the block layer is a layering
violation, and one that I think will also bite us in practice.
The host physical to PCIe bus address mapping can have offsets, and
those offsets absolutely can be different for different root ports.
So wit
On 2019/6/23 7:16 AM, Eric Wheeler wrote:
> From: Eric Wheeler
>
> While some drivers set queue_limits.io_opt (e.g., md raid5), there are
> currently no SCSI/RAID controller drivers that do. Previously stripe_size
> and partial_stripes_expensive were read-only values and could not be
> tuned by