Re: linux-next: Tree for Jul 25 (ahci_imx.c)

2013-07-26 Thread t...@kernel.org
On Fri, Jul 26, 2013 at 01:25:21PM +0800, Shawn Guo wrote:
> On Fri, Jul 26, 2013 at 04:58:14AM +, Zhu Richard-R65037 wrote:
> > Hi Shawn:
> > yes, it is.
> > 
> > BTW, Should I re-send the AHCI_IMX patch-set or send out one stand-alone 
> > patch to fix this issue?
> > 
> 
> I think an incremental fixing patch is good.

Yeap, it's already upstream.  There's no other way.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ahci_imx: depend on CONFIG_MFD_SYSCON

2013-07-26 Thread t...@kernel.org
From 867974fc09f93bdd7f98d46ac3733934486bbf4a Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Fri, 26 Jul 2013 08:57:56 -0400

ahci_imx makes use of regmap but the dependency wasn't specified in
Kconfig, leading to build failures if CONFIG_AHCI_IMX is enabled but
CONFIG_MFD_SYSCON is not.  Add the Kconfig dependency.

Signed-off-by: Tejun Heo 
Reported-by: Stephen Rothwell 
---
Applied to libata/for-3.11-fixes.

Thanks.

 drivers/ata/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index a4af0a2..4e73772 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -99,7 +99,7 @@ config SATA_AHCI_PLATFORM
 
 config AHCI_IMX
tristate "Freescale i.MX AHCI SATA support"
-   depends on SATA_AHCI_PLATFORM
+   depends on SATA_AHCI_PLATFORM && MFD_SYSCON
help
  This option enables support for the Freescale i.MX SoC's
  onboard AHCI SATA.
-- 
1.8.3.1



Re: [PATCH] AHCI: Clear GHC.IS to prevent unexpectly asserting INTx

2016-07-20 Thread t...@kernel.org
On Wed, Jul 20, 2016 at 12:13:46PM +, Pang Raymond wrote:
> 
> Due to PCI subsystem behaviour, unloading the AHCI driver disables
> MSI and enables INTx. When the HBA supports MSI-X or multiple MSI, the
> driver's irq handler doesn't clear the GHC.IS register. That works fine
> while reading or writing data, since GHC.IS is always non-zero. But when
> unloading the driver (or on any other operation that disables MSI-X and
> enables INTx), the PCI subsystem uses a config write (Rx04.bit10) to
> enable INTx. Because GHC.IS is non-zero, the HBA falsely assumes some
> port needs interrupt service and asserts INTx. To make things worse,
> when the AHCI controller shares its interrupt pin with another PCI
> device, that device's ISR gets called and nobody de-asserts the earlier
> INTx.
> This patch clears GHC.IS in ahci_port_stop() even when using MSI-X or
> multiple MSI to prevent this case, ensuring GHC.IS is zero before the
> PCI subsystem enables INTx.
> 
> Signed-off-by: Raymond Pang 

Applied to libata/for-4.8 with minor updates to the comment.

Thanks.

-- 
tejun


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 09:06:55PM +, Bart Van Assche wrote:
> On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> > +static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> > +{
> > +   unsigned long flags;
> > +
> > +   local_irq_save(flags);
> > +   u64_stats_update_begin(&rq->aborted_gstate_sync);
> > +   rq->aborted_gstate = gstate;
> > +   u64_stats_update_end(&rq->aborted_gstate_sync);
> > +   local_irq_restore(flags);
> > +}
> 
> Please add a comment that explains the purpose of local_irq_save() and
> local_irq_restore(). Please also explain why you chose to disable interrupts

Will do.

> instead of disabling preemption. I think that disabling preemption should be
> sufficient since this is the only code that updates rq->aborted_gstate and
> since this function is always called from thread context.

blk_mq_complete_request() can read it from the irq context.  If that
happens between update_begin and end, it'll end up looping infinitely.

> > @@ -801,6 +840,12 @@ void blk_mq_rq_timed_out(struct request *req, bool 
> > reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > +   /*
> > +* As nothing prevents from completion happening while
> > +* ->aborted_gstate is set, this may lead to ignored
> > +* completions and further spurious timeouts.
> > +*/
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > blk_add_timer(req);
> > blk_clear_rq_complete(req);
> > break;
> 
> Is the race that the comment refers to addressed by one of the later patches?

No, but it's not a new race.  It has always been there and I suppose
doesn't lead to practical problems - the race window is pretty small
and the effect isn't critical.  I'm just documenting the existing race
condition.  Will note that in the description.

Thanks.

-- 
tejun


Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 11:29:11PM +, Bart Van Assche wrote:
> Does "gstate" perhaps stand for "generation number and state"? If so, please
> mention this in one of the above comments.

Yeah, will do.

Thanks.

-- 
tejun


Re: [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path

2018-01-09 Thread t...@kernel.org
On Mon, Jan 08, 2018 at 10:10:01PM +, Bart Van Assche wrote:
> Other req->deadline writes are protected by preempt_disable(),
> write_seqcount_begin(&rq->gstate_seq), write_seqcount_end(&rq->gstate_seq)
> and preempt_enable(). I think it's fine that the above req->deadline store
> does not have that protection but I also think that that deserves a comment.

Will add.

Thanks.

-- 
tejun


Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread t...@kernel.org
Hello, Bart.

On Tue, Jan 09, 2018 at 04:12:40PM +, Bart Van Assche wrote:
> I'm concerned about the additional CPU cycles needed for the new 
> blk_mq_map_queue()
> call, although I know this call is cheap. Would the timeout code really get 
> that

So, if that is really a concern, let's cache that mapping instead of
changing synchronization rules for that.

> much more complicated if the hctx_lock() and hctx_unlock() calls would be 
> changed
> into rcu_read_lock() and rcu_read_unlock() calls? Would it be sufficient if
> "if (has_rcu) synchronize_rcu();" would be changed into "synchronize_rcu();" 
> in
> blk_mq_timeout_work()?

Code-wise, it won't be too much extra code but I think diverging the
sync methods between issue and completion paths is more fragile and
likely to invite confusions and mistakes in the future.  We have the
normal path (issue&completion) synchronizing against the exception
path (timeout).  I think it's best to keep the sync constructs aligned
with that conceptual picture.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> > 
> > spin_lock_irq(lock);
> > if (q->queue_lock != &q->__queue_lock) {
> > spin_lock(&q->__queue_lock);
> > q->queue_lock = &q->__queue_lock;
> > spin_unlock(&q->__queue_lock);
> > }
> > spin_unlock_irq(lock);
> > 
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
> 
> I think that's a bad idea. A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished. If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed. I don't think we need that kind
> of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for matching unlocking or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution tho.  Probably the right thing to do is adding queue
lock/unlock helpers which are safe to use beyond cleanup_queue.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns. This is why commit
> a063057d7c73 ("block: Fix a race between request queue removal and the block
> cgroup controller") moved the blkcg_exit_queue() call from 
> __blk_release_queue()
> into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +, Bart Van Assche wrote:
> Please explain what you wrote further. It's not clear to me why you think
> that anything is broken nor what a "sever model" is.

So, sever (or revoke) model is where you actively disconnect an object
and ensuring that there'll be no further references from others.  In
contrast, what we usually do is refcnting, where we don't actively
sever the dying object from its users but let the users drain
themselves over time and destroy the object when we know all the users
are gone (refcnt reaching zero).

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller. Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, new requests aren't the only source tho.  e.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you gotta hunt down all those and make
sure they can make no further derefs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> > But the root cause of it is in blkcg_init_queue(): blkg_create() returns
> > an ERR ptr because it tries to insert, at an already-populated index in
> > blkcg->blkg_tree, the entry that we failed to remove at
> > __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?

But queue init shouldn't fail here, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > But the root cause of it is in blkcg_init_queue(): blkg_create()
> > > > returns an ERR ptr because it tries to insert, at an
> > > > already-populated index in blkcg->blkg_tree, the entry that we
> > > > failed to remove at __blk_release_queue().
> > > 
> > > Hello Alex,
> > > 
> > > Had you considered something like the untested patch below?
> > 
> > But queue init shouldn't fail here, right?
> 
> Hello Tejun,
> 
> Your question is not entirely clear to me. Are you referring to the atomic
> allocations in blkg_create() or are you perhaps referring to something else?

Hmm.. maybe I'm confused but I thought that the fact that
blkcg_init_queue() fails itself is already a bug, which happens
because a previously destroyed queue left behind blkgs.

Thanks.

-- 
tejun


Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-06-01 Thread 't...@kernel.org'
Hello,

On Fri, Jun 01, 2018 at 09:25:32AM +, Hatayama, Daisuke wrote:
> kernfs_dir_pos() checks if a kernfs_node object given as one of its
> arguments is still active and, if so, returns it; otherwise it returns
> the kernfs_node object closest (possibly smaller or larger) to the
> given object.

Sometimes they're duplicate operations tho, which is exactly the bug
the posted patch is trying to fix.  What I'm suggesting is instead of
leaving both instances and skipping one conditionally, put them in one
place and trigger only when necessary.  The sequence of operations
would be exactly the same.  The only difference is how the code is
organized.

> kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> object given by kernfs_dir_pos().
> 
> The two functions do different things, and both need to skip inactive
> nodes. I don't think it's natural to remove the skip only from
> kernfs_dir_pos().
> 
> OTOH, throughout getdents(), the kernfs_node object given to
> kernfs_dir_pos() is never used afterwards in the processing. So, is it
> enough to provide kernfs_dir_next_pos() only? Then the skip code is no
> longer duplicated.
> 
> The patch below is my thought. How do you think?
> 
> But note that this patch has some bug so that system boot get hang
> without detecting root filesystem disk :) I'm debugging this now.

I haven't looked into the code that closely but given that we had
cases where both skippings were fine and not, the condition is likely
gonna be a bit trickier?

Thanks.

-- 
tejun


Re: [PATCH] hotplug: fix oops when adding cpu.

2015-06-17 Thread t...@kernel.org
Hello,

On Wed, Jun 17, 2015 at 06:04:06PM +0800, songxium...@inspur.com wrote:
> Subject: [PATCH] hotplug: fix oops when adding cpu
> From: SongXiumiao 
> 
> If memory is not present on node0 and a cpu is logically hot-added, the
> kernel oopses.
> 
> Analysing the call trace, we found that the create_worker function
> allocates memory from node0. Because node0 is offline, the allocation
> fails. We therefore add a condition to ensure the node is online so
> that the system allocates memory only from online nodes.
> 
> Here's the backtrace:
> [root@localhost ~]# echo 1 > /sys/devices/system/cpu/cpu90/online
...
> [  225.755126] BUG: unable to handle kernel paging request at 1b08
> [  225.762931] IP: [] __alloc_pages_nodemask+0xb7/0x940
> [  225.770247] PGD 449bb0067 PUD 46110e067 PMD 0
> [  225.775248] Oops:  [#1] SMP
> [  225.778875] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT 
> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntracd
> [  225.868198] CPU: 43 PID: 5400 Comm: bash Not tainted 
> 4.0.0-rc4-bug-fixed-remove #16
> [  225.876754] Hardware name: Insyde Brickland/Type2 - Board Product Name1, 
> BIOS Brickland.05.04.15.0024 02/28/2015
> [  225.888122] task: 88045a3d8da0 ti: 88044612 task.ti: 
> 88044612
> [  225.896484] RIP: 0010:[]  [] 
> __alloc_pages_nodemask+0xb7/0x940
...
> [  226.002904] Call Trace:
> [  226.019173]  [] new_slab+0xa7/0x460
> [  226.024826]  [] __slab_alloc+0x310/0x470
> [  226.043812]  [] kmem_cache_alloc_node_trace+0x91/0x250
> [  226.051299]  [] alloc_worker+0x21/0x50
> [  226.057236]  [] create_worker+0x53/0x1e0
> [  226.063357]  [] alloc_unbound_pwq+0x2a2/0x510
> [  226.069974]  [] wq_update_unbound_numa+0x1b4/0x220
> [  226.077076]  [] workqueue_cpu_up_callback+0x308/0x3d0
> [  226.084468]  [] notifier_call_chain+0x4e/0x80
> [  226.091084]  [] __raw_notifier_call_chain+0xe/0x10
> [  226.098189]  [] cpu_notify+0x23/0x50
...
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 586ad91..22d194c 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -3253,7 +3253,8 @@ static struct worker_pool *get_unbound_pool(const 
> struct workqueue_attrs *attrs)
> if (wq_numa_enabled) {
> for_each_node(node) {
> if (cpumask_subset(pool->attrs->cpumask,
> -wq_numa_possible_cpumask[node])) {
> + wq_numa_possible_cpumask[node])
> +&& node_online(node)) {
> pool->node = node;
> break;

Umm... can you please talk with the Fujitsu guys?  They're working on
the same problem and this isn't the full solution.  What happens if
the node goes down after the pool is initialized?

Thanks.

-- 
tejun


Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip

2018-05-29 Thread 't...@kernel.org'
Hello,

On Mon, May 28, 2018 at 12:54:03PM +, Hatayama, Daisuke wrote:
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, 
> struct file *filp)
>  static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>   struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>  {
> + struct kernfs_node *orig = pos;
> +
>   pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && kernfs_sd_compare(pos, orig) <= 0) {

Hmm... the code seems a bit unintuitive to me and I wonder whether
it's because there are two identical skipping loops in
kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
selectively disable one of them.  Wouldn't it make more sense to get
rid of it from kernfs_dir_pos() and skip explicitly only when
necessary?

Thanks.

-- 
tejun


Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.

2018-06-05 Thread 't...@kernel.org'
Hello,

On Tue, Jun 05, 2018 at 10:31:36AM -0500, Eric W. Biederman wrote:
> What I have above is not the clearest, and in fact the logic could be
> better.
> 
> The fundamental challenge is because hash collisions are possible a file
> offset does not hold complete position information in a directory.
> 
> So the kernfs node that is to be read/displayed next is saved in the
> struct file.  Then it is tested whether the saved kernfs node is usable
> for finding the location in the directory.  Several things may have
> gone wrong.
> 
> - Someone may have called seekdir.
> - The saved kernfs node may have been renamed.
> - The saved kernfs node may have been moved to a different directory in
>   kernfs.
> - the saved kernfs node may have been deleted.
> 
> If any of those are true the code needs to do the rbtree lookup.

So, given that the whole thing is protected by a mutex which protects
modifications, it could be an option to simply keep track of who's
iterating what and shift them on removals.  IOW, always keep cursor
pointing to the next thing to visit and if that gets removed shift the
cursor to the next one.

Thanks.

-- 
tejun


Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-25 Thread t...@kernel.org
Hello,

On Fri, Nov 20, 2020 at 10:23:44AM +1100, NeilBrown wrote:
> On Mon, Nov 09 2020, t...@kernel.org wrote:
> 
> > Given that nothing on
> > these types of workqueues can be latency sensitive
> 
> This caught my eye and it seems worth drilling in to.  There is no
> mention of "latency" in workqueue.rst or workqueue.h.  But you seem to
> be saying there is an undocumented assumption that latency-sensitive
> work items must not be scheduled on CM-workqueues.
> Is that correct?

Yeah, correct. Because they're all sharing execution concurrency, the
latency consistency is likely a lot worse.

> NFS writes are latency sensitive to a degree as increased latency per
> request will hurt overall throughput.  Does this mean that handling
> write-completion in a CM-wq is a poor choice?
> Would it be better to us WQ_HIGHPRI??  Is there any rule-of-thumb that
> can be used to determine when WQ_HIGHPRI is appropriate?

I don't think it'd need HIGHPRI but UNBOUND or CPU_INTENSIVE would make
sense. I think the rule of the thumb is along the line of if you're worried
about cpu consumption or latency, let the scheduler take care of it (ie. use
unbound workqueues).

Thanks.

-- 
tejun


Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-12-02 Thread t...@kernel.org
Hello, Neil.

(cc'ing Lai)

On Fri, Nov 20, 2020 at 10:07:56AM +1100, NeilBrown wrote:
> If the workqueue maintainers are unmovable in the position that a
> CM-workitem must not use excessive CPU ever, and so must not call
> cond_resched(), then I can take that back to the NFS maintainers and

Oh, that's not my position at all. I'm completely movable and am pretty sure
Lai would be too.

> negotiate different workqueue settings.  But as I've said, I think this
> is requiring the decision to be made in a place that is not well
> positioned to make it.

Very true. A lot of the current workqueue code is a result of gradual
evolution and has a lot of historical cruft. Things like CPU_INTENSIVE or
long_wq were introduced partly to accommodate existing users on CM
workqueues and definitely are silly interfaces, as you mentioned in another
reply.

I can see two directions:

1. Keeping things mostly as-is - leave the selection of the specific
   workqueue type to users. We can improve things by adding debug / sanity
   checks to surfaces issues more proactively, adding more documentation and
   cleaning up old cruft.

2. Make things properly dynamic. By default, let workqueue core decide what
   type of execution constraints are gonna be applied and dynamically adjust
   according to the behavior of specific workqueues or work items. e.g. it
   can start most work items CM per-cpu by default and then release them as
   unbound when their cpu consumption exceeds a certain threshold.

#1 is easier. #2 is definitely more attractive long term but would need a
lot of careful work. I was suggesting towards #1 mostly because it's a safer
/ easier direction but if you wanna explore #2, please be my guest.

Thank you.

-- 
tejun


Re: [PATCH] devres: Freeing the drs after all release() are called

2013-11-06 Thread t...@kernel.org
Hello, Liu.

On Thu, Nov 07, 2013 at 12:27:56AM +, Liu, Chuansheng wrote:
> The driver code is as below:
> _INIT() {
> 
> A = devm_kzalloc();
> B= devm_request_threaded_irq(isr_handler);
> C = devm_kzalloc();
> }
> 
> When driver _EXIT, the devres_release_all () will be called.
> The C will be kfreed before B, but when freeing irq B a pending
> isr_handler() may still access devm-allocated memory which has already
> been freed. Then memory corruption occurs.
> 
> This patch can solve this scenario.

Isn't the bug there the IRQ being requested before all its resources are
allocated?  The proposed change just masks the underlying issue of
incorrectly ordered operations.

Thanks.

-- 
tejun


Re: [PATCH] devres: Freeing the drs after all release() are called

2013-11-06 Thread t...@kernel.org
Hello,

On Thu, Nov 07, 2013 at 12:36:56AM +, Liu, Chuansheng wrote:
> Yes, I knew I can put the code always like below:
> A = devm_kzalloc();
> C = devm_kzalloc();
> ...
> B= devm_request_threaded_irq(isr_handler);
> 
> But the above is just one simple coding prototype; if there are long
> call chains like E -> F -> D -> ... that end in devm_kzalloc(), then,
> to be honest, it makes the code too hard to always follow the rule,
> and I'm trying to find out every potential devm_kzalloc() before the
> irq request.

It isn't a good idea to paper over existing bugs from upper layer.
You realize that the above code sequence is already buggy during init
unless there's something explicitly blocking generation of irqs until
init is complete, right?  The right thing to do would be either
reordering the operations or wrapping the operation which unblocks irq
at the end of init with devres so that irq gets blocked before the
rest of release proceeds.

What we must *NOT* do is working around existing bugs in a half-assed
way from midlayer.  What if some of the devres resources are more
complex and actually need to be released by the devres release
callback instead of being freed along with @dr?  What if devres blocks
are in use and releases are done in multiple steps and irq release
belongs to a later group?

-- 
tejun


Re: [PATCH] iwlwifi: dvm: convert create_singlethread_workqueue() to alloc_workqueue()

2016-03-19 Thread t...@kernel.org
Hello,

On Thu, Mar 17, 2016 at 12:43:35PM +, Grumbach, Emmanuel wrote:
> > 
> > Use alloc_workqueue() to allocate the workqueue instead of
> > create_singlethread_workqueue() since the latter is deprecated and is
> > scheduled
> > for removal.
> 
> I can't see any indication of that in the code of workqueue.
> Can you share details about that?

create*_workqueue()'s have been deprecated and replaced by
alloc*_workqueue()'s for some years now.  I probably should note that
in the header file.

Thanks.

-- 
tejun


Re: [PATCH 0/4] allow multiple kthreadd's

2020-05-01 Thread t...@kernel.org
Hello,

On Fri, May 01, 2020 at 07:05:46PM +, Trond Myklebust wrote:
> When running an instance of knfsd from inside a container, you want to
> be able to have the knfsd kthreads be parented to the container init
> process so that they get killed off when the container is killed.
> 
> Right now, we can easily leak those kernel threads simply by killing
> the container.

Hmm... I'm not sure that'd be a lot easier because they're in their own
thread group. It's not like you can queue a signal to the group leader to
cause the other kthreads to die automatically. Each would have to handle the
exit explicitly. As long as there is a way to iterate the member kthreads
(e.g. a list going through kthread_data or something else hanging off
there), just using the existing kthread interface might not be much
different, or maybe even easier given how hairy signal handling in kthreads
can get.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:31:34PM +, Bart Van Assche wrote:
> > > > +* As nothing prevents from completion happening while
> > > > +* ->aborted_gstate is set, this may lead to ignored completions
> > > > +* and further spurious timeouts.
> > > > +*/
> > > > +   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > > +   blk_mq_rq_update_aborted_gstate(rq, 0);
...
> I think it can happen that the above code sees that (rq->rq_flags &
> RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
> following code:
> 
>   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>   blk_add_timer(rq);
> 
> and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
> which will cause the next completion to be lost. Is fixing one occurrence
> of a race and reintroducing it in another code path really an improvement?

I'm not following at all.  How would blk_mq_start_request() get called
on the request while it's still owned by the timeout handler?  That
gstate clearing is what transfers the ownership back to the
non-timeout path.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:56:41PM +, Bart Van Assche wrote:
> This patch increases the time during which .aborted_gstate == .gstate if the
> timeout is reset. Does that increase the chance that a completion will be 
> missed
> if the request timeout is reset?

It sure does, but we're comparing an outright kernel bug vs. an
inherently opportunistic mechanism being a bit more lossy.  I think
the answer is pretty clear.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 10:09:18PM +, Bart Van Assche wrote:
> Please elaborate what your long-term goal is for the blk-mq timeout handler.

Hmm... I don't really have any plans beyond what's been posted.

> The legacy block layer suspends request state changes while a timeout is
> being processed by holding the request queue lock while requests are being
> processed, while processing a timeout and while calling 
> q->rq_timed_out_fn(rq).
> Do you think it is possible to make the blk-mq core suspend request processing
> while processing a timeout without introducing locking in
> blk_mq_complete_request()? If you do not plan to add locking in
> blk_mq_complete_request(), do you think it is possible to fix all the races we
> discussed in previous e-mails?

I don't know of multiple race conditions.  What am I missing?  AFAIK,
there's one non-critical race condition which has always been there.
We have a larger race window for that case but don't yet know whether
that's problematic or not.  If that actually is problematic, we can
figure out a way to solve that but such effort / added complexity
doesn't seem justified yet.  No?

Thanks.

-- 
tejun


Re: [PATCH 5/7] block: Remove superflous rcu_read_[un]lock_sched() in blk_queue_enter()

2018-03-14 Thread t...@kernel.org
Hello, Bart.

On Tue, Mar 06, 2018 at 05:52:50PM +, Bart Van Assche wrote:
> I think the rcu_read_lock() and rcu_read_unlock() are really necessary in this
> code. In the LWN article "The RCU-barrier menagerie" it is explained that RCU
> can be used to enforce write ordering globally if the code that reads the 
> writes

Yeah but, for it to be meaningful, just like barriers, it has to be
paired with something on the other side.

> that are ordered with an RCU read lock (https://lwn.net/Articles/573497/). See
> also the following comment in scsi_device_quiesce():
> 
>   /*
>* Ensure that the effect of blk_set_preempt_only() will be visible
>* for percpu_ref_tryget() callers that occur after the queue
>* unfreeze even if the queue was already frozen before this function
>* was called. See also https://lwn.net/Articles/573497/.
>*/
> 
> Since this patch introduces a subtle and hard to debug race condition, please
> drop this patch.

Hah, the pairing is between scsi_device_quiesce() and
blk_queue_enter()?  But that doesn't make sense either because
scsi_device_quiesce() is doing regular synchronize_rcu() and
blk_queue_enter() is doing rcu_read_lock_sched().  They don't
interlock with each other in any way.

So, the right thing to do here would be somehow moving the RCU
synchronization into blk_set_preempt_only() and switching to regular
RCU protection in blk_queue_enter().  The code as-is isn't really
doing anything.

I'll drop this patch from the series for now.  Let's revisit it later.

Thanks.

-- 
tejun


[PATCH] pata_octeon_cf: remove unused local variables from octeon_cf_set_piomode()

2017-01-23 Thread t...@kernel.org
From d786b91f422c6ad4c0d9bb9c1bef2dd5008e3d9d Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Mon, 23 Jan 2017 14:28:51 -0500

@t1 and @t2i are calculated along with @t2 but never used.  Drop them.

Signed-off-by: Tejun Heo 
Reported-by: David Binderman 
---
Applied to libata/for-4.11.  Thanks.

 drivers/ata/pata_octeon_cf.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/ata/pata_octeon_cf.c b/drivers/ata/pata_octeon_cf.c
index e94e7ca..f524a90 100644
--- a/drivers/ata/pata_octeon_cf.c
+++ b/drivers/ata/pata_octeon_cf.c
@@ -138,9 +138,7 @@ static void octeon_cf_set_piomode(struct ata_port *ap, 
struct ata_device *dev)
int trh;
int pause;
/* These names are timing parameters from the ATA spec */
-   int t1;
int t2;
-   int t2i;
 
/*
 * A divisor value of four will overflow the timing fields at
@@ -154,15 +152,9 @@ static void octeon_cf_set_piomode(struct ata_port *ap, 
struct ata_device *dev)
 
BUG_ON(ata_timing_compute(dev, dev->pio_mode, &timing, T, T));
 
-   t1 = timing.setup;
-   if (t1)
-   t1--;
t2 = timing.active;
if (t2)
t2--;
-   t2i = timing.act8b;
-   if (t2i)
-   t2i--;
 
trh = ns_to_tim_reg(div, 20);
if (trh)
-- 
2.9.3



Re: BUG: INTx is asserted unexpectedly when unloading AHCI driver with MSIx support.

2016-07-18 Thread t...@kernel.org
Hello, Pang.

On Fri, Jul 15, 2016 at 12:39:59PM +, Pang Raymond wrote:
> Hi Tejun,
> 
> 
> Yes! It only happens when the device is shutdown.
> 
> I think you're right. It's more wiser to add clearing operation to driver
> 
> clean up path.
> 
> So we can add it to ahci_port_stop()
> 
> 
> libahci.c is got from Kernel 4.6.3 stable
> 
> 
> 
> --- drivers/ata/libahci.c.old 2016-07-15 13:33:47.489620405 +0800
> +++ drivers/ata/libahci.c 2016-07-15 14:01:33.081574586 +0800
> @@ -2392,12 +2392,18 @@
> static void ahci_port_stop(struct ata_port *ap)
> {
>   const char *emsg = NULL;
> + struct ahci_host_priv *hpriv = ap->host->private_data;
> + void __iomem *host_mmio = hpriv->mmio;
>   int rc;
> 
>   /* de-initialize port */
>   rc = ahci_deinit_port(ap, &emsg);
>   if (rc)
>ata_port_warn(ap, "%s (%d)\n", emsg, rc);
> +
> + /* Clear GHC.IS in case of asserting INTx after disable MSIx and re-enable 
> INTx */
> + writel(1 << ap->port_no, host_mmio + HOST_IRQ_STAT);
> +
> }

Yeah, this looks good to me.  Can you please format the patch
properly, add description and Signed-off-by?

  https://www.kernel.org/doc/Documentation/SubmittingPatches

Thanks.

-- 
tejun


Re: BUG: INTx is asserted unexpectedly when unloading AHCI driver with MSIx support.

2016-07-06 Thread t...@kernel.org
Hello,

(Can you please flow the text a bit below 80 columns when you reply?)

On Wed, Jul 06, 2016 at 02:51:47AM +, Pang Raymond wrote:
> Why does the phenomenon appear?
>   Two factors cause this problem.
>   1. In MSIx's irq handler
>  (ahci_multi_irqs_intr_hard()), we do not clear
>  GHC.IS register which presents which ports need
>  interrupt service.  static irqreturn_t
>  ahci_multi_irqs_intr_hard(int irq, void
>  *dev_instance)
>   {
>   //  omitting unconcerned codes here
>   // ...  
>   // here we just clear PxIS register, not clear 
> GHC.IS
>   status = readl(port_mmio + PORT_IRQ_STAT);
>   writel(status, port_mmio + PORT_IRQ_STAT);
>   // ...
>   }
>   2. In PCI subsystem, after Disable MSIx, INTx is
>  enabled automatically.
>   void pci_msix_shutdown(struct pci_dev *dev)
>   {
>   // omitting unconcerned codes here
>   // ...  
>   // after Disable MSIx, enable INTx imediately
>   pci_msix_clear_and_set_ctrl(dev, 
> PCI_MSIX_FLAGS_ENABLE, 0);
>   pci_intx_for_msi(dev, 1);
>   dev->msix_enabled = 0;
>   pcibios_alloc_irq(dev);
>   }
>
> When Kernel enables INTx(I mean clearing PCI configure space
> Rx04.bit10), controller will find GHC.IS is not Zero (because driver
> haven't cleared GHC.IS after servicing MSIx interrupt), so it
> believes some SATA ports need interrupt servicing and asserts INTx
> interrupt.

Looks like you debugged the problem.  Care to write up and test a
patch?

Thanks a lot!

-- 
tejun


Re: Kernel Panics on Xen ARM64 for Domain0 and Guest

2016-11-28 Thread t...@kernel.org
Hello,

On Mon, Nov 28, 2016 at 11:59:15AM +, Julien Grall wrote:
> > commit 3ca45a46f8af8c4a92dd8a08eac57787242d5021
> > percpu: ensure the requested alignment is power of two
> 
> It would have been useful to specify the tree used. In this case,
> this commit comes from linux-next.

I'm surprised this actually triggered.

> diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
> index f193414..4986dc0 100644
> --- a/arch/arm/xen/enlighten.c
> +++ b/arch/arm/xen/enlighten.c
> @@ -372,8 +372,7 @@ static int __init xen_guest_init(void)
>  * for secondary CPUs as they are brought up.
>  * For uniformity we use VCPUOP_register_vcpu_info even on cpu0.
>  */
> -   xen_vcpu_info = __alloc_percpu(sizeof(struct vcpu_info),
> -  sizeof(struct vcpu_info));
> +   xen_vcpu_info = alloc_percpu(struct vcpu_info);
> if (xen_vcpu_info == NULL)
> return -ENOMEM;

Yes, this looks correct.  Can you please cc stable too?  The percpu
allocator has never supported alignments which aren't a power of two
and has always behaved incorrectly when given one.

Thanks.

-- 
tejun


Re: Re: BUG: INTx is asserted unexpectedly when unloading AHCI driver with MSIx support.

2016-07-12 Thread t...@kernel.org
Hello,

On Mon, Jul 11, 2016 at 05:16:00AM +, Pang Raymond wrote:
> static irqreturn_t  ahci_multi_irqs_intr_hard(int irq,
> void *dev_instance)
> {
>   //  omitting unconcerned codes here
>   //  ...
>status = readl(port_mmio + PORT_IRQ_STAT);
>writel(status, port_mmio + PORT_IRQ_STAT);
> 
>   // add patch code here.
> + writel(1 << ap->port_no, ap->host->iomap + HOST_IRQ_STAT);
> 
>   // ...

I think it'd be better to avoid adding stuff to the hot path.  This
only matters when the device is shut down, right?  Can't it just be
cleared in the driver cleanup path?

Thanks.

-- 
tejun


Re: [OOPS] BUG_ON in cgroups on 4.4.0-rc5-next

2015-12-23 Thread t...@kernel.org
Hello, Alex.

On Tue, Dec 22, 2015 at 07:06:41PM +, Alex Ng (LIS) wrote:
> > Can you please let me know the steps to reproduce the bug?
> 
> I tried this on a Hyper-V VM hosted in Windows Server 2012R2 and ran
> the attached script.  The script clones the linux-next tree in a
> random directory under /tmp in a tight loop.
>
> This panic is not always reproducible, and I have only hit it once
> after running the script about 10 times. A different kernel panic
> happens each time I run this script; and the panics always happen
> during the first iteration of the loop.

Heh, I don't get it.  The script doesn't do anything cgroup specific.
Can you please apply the attached patch, reproduce the issue and
report the kernel log?

Thanks.

-- 
tejun
---
 kernel/cgroup.c |   21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -779,6 +779,22 @@ static inline void get_css_set(struct cs
atomic_inc(&cset->refcount);
 }
 
+static void dump_cset(struct css_set *cset)
+{
+   struct cgrp_cset_link *link;
+
+   printk("XXX dumping cset %p\n", cset);
+   list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+   struct cgroup *cgrp = link->cgrp;
+   struct cgroup_root *root = cgrp->root;
+
+   printk("root %d:0x%04x:%s ",
+  root->hierarchy_id, root->subsys_mask, root->name);
+   pr_cont_cgroup_path(cgrp);
+   pr_cont("\n");
+   }
+}
+
 /**
  * compare_css_sets - helper function for find_existing_css_set().
  * @cset: candidate css_set being tested
@@ -831,7 +847,10 @@ static bool compare_css_sets(struct css_
cgrp1 = link1->cgrp;
cgrp2 = link2->cgrp;
/* Hierarchies should be linked in the same order. */
-   BUG_ON(cgrp1->root != cgrp2->root);
+   if (WARN_ON(cgrp1->root != cgrp2->root)) {
+   dump_cset(cset);
+   dump_cset(old_cset);
+   }
 
/*
 * If this hierarchy is the hierarchy of the cgroup


Re: [OOPS] BUG_ON in cgroups on 4.4.0-rc5-next

2015-12-21 Thread t...@kernel.org
Hello, Alex.

On Fri, Dec 18, 2015 at 08:08:03PM +, Alex Ng (LIS) wrote:
> Hi,
> 
> I was running a "git clone" of the linux-next source tree and hit the 
> following BUG_ON condition. My box is running kernel 
> 4.4.0-rc5-next-20151217-52.27. Any ideas on how to pin down the cause?
> 
> The trace indicates that the following condition in compare_css_sets() 
> triggered the oops:

Can you please let me know the steps to reproduce the bug?

Thanks.

-- 
tejun


Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-09 Thread t...@kernel.org
Hello,

On Mon, Nov 09, 2020 at 01:50:40PM +, Trond Myklebust wrote:
> > I'm thinking the real problem is that you're abusing workqueues. Just
> > don't stuff so much work into it that this becomes a problem. Or
> > rather,
> > if you do, don't lie to it about it.
> 
> If we can't use workqueues to call iput_final() on an inode, then what
> is the point of having them at all?
> 
> Neil's use case is simply a file that has managed to accumulate a
> seriously large page cache, and is therefore taking a long time to
> complete the call to truncate_inode_pages_final(). Are you saying we
> have to allocate a dedicated thread for every case where this happens?

I think the right thing to do here is setting CPU_INTENSIVE or using an
unbound workqueue. Concurrency controlled per-cpu workqueue is unlikely to
be a good fit if the work can run long enough to need cond_resched(). Better
to let the scheduler handle it. Making workqueue warn against long-running
concurrency managed per-cpu work items would be great. I'll put that on my
todo list but if anyone is interested please be my guest.

Thanks.

-- 
tejun


Re: [PATCH rfc] workqueue: honour cond_resched() more effectively.

2020-11-09 Thread t...@kernel.org
Hello,

On Mon, Nov 09, 2020 at 02:11:42PM +, Trond Myklebust wrote:
> That means changing all filesystem code to use cpu-intensive queues. As
> far as I can tell, they all use workqueues (most of them using the
> standard system queue) for fput(), dput() and/or iput() calls.

I suppose the assumption was that those operations couldn't possibly be
expensive enough to warrant other options, which doesn't seem to be the case
unfortunately.  Switching the users to system_unbound_wq, which should be
pretty trivial, seems to be the straightforward solution.
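The conversion would be on the order of the following fragment (illustrative, untested; free_work is a hypothetical field name, the workqueue APIs are the real ones):

```c
/* before: runs on a concurrency-managed per-cpu kworker and can clog
 * the local concurrency slot for tens of seconds */
schedule_work(&inode->free_work);

/* after: runs on an unbound worker that the scheduler treats like any
 * other thread, so a long truncate_inode_pages_final() just gets
 * preempted and migrated normally */
queue_work(system_unbound_wq, &inode->free_work);
```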

I can definitely see benefits in making workqueue smarter about
concurrency-managed work items taking a long time. Given that nothing on
these types of workqueues can be latency sensitive and the problem being
reported is on the scale of tens of seconds, I think a more palatable
approach could be through watchdog mechanism rather than hooking into
cond_resched(). Something like:

* Run watchdog timer more frequently - e.g. 1/4 of threshold.

* If a work item is occupying the local concurrency for too long, set
  WORKER_CPU_INTENSIVE for the worker and, probably, generate a warning.

I still think this should generate a warning and thus can't replace
switching to unbound wq. The reason is that the concurrency limit isn't the
only problem. A kthread needing to run on one particular CPU for tens of
seconds just isn't great.

Thanks.

-- 
tejun


Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread t...@kernel.org
On Wed, Dec 20, 2017 at 11:41:02PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > Currently, blk-mq timeout path synchronizes against the usual
> > issue/completion path using a complex scheme involving atomic
> > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > rules.  Unfortunatley, it contains quite a few holes.
> 
> Hello Tejun,
> 
> An attempt to run SCSI I/O with this patch series applied resulted in
> the following:

Can you please try the v3?  There were a couple bugs that I missed
while testing earlier versions.

 http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org

Thanks.

-- 
tejun


Re: [PATCH 1/6] blk-mq: protect completion path with RCU

2017-12-14 Thread t...@kernel.org
On Thu, Dec 14, 2017 at 05:01:06PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +   } else {
> > +   srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> > +   if (!blk_mark_rq_complete(rq))
> > +   __blk_mq_complete_request(rq);
> > +   srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> 
> Hello Tejun,
> 
> The name queue_rq_srcu was chosen to reflect the original use of that 
> structure,
> namely to protect .queue_rq() calls. Your patch series broadens the use of 
> that

Yeah, will add a patch to rename it.

> srcu structure so I would appreciate it if it would be renamed, e.g. into 
> "srcu".
> See also commit 6a83e74d214a ("blk-mq: Introduce blk_mq_quiesce_queue()").

Ah yeah, it'd be nice to have the [s]rcu synchronize calls factored
out.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:51:11PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > rules.  Unfortunatley, it contains quite a few holes.
>   ^
>   Unfortunately?
> 
> > While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
> ^^^
> synchronization?

lol, believe it or not, my English spelling is a lot better than my
Korean.  Will fix them.

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct 
> > request *rq)
> > rq->start_time = jiffies;
> > set_start_time_ns(rq);
> > rq->part = NULL;
> > +   seqcount_init(&rq->gstate_seq);
> > +   u64_stats_init(&rq->aborted_gstate_sync);
> >  }
> >  EXPORT_SYMBOL(blk_rq_init);
> 
> Sorry but the above change looks ugly to me. My understanding is that 
> blk_rq_init() is only used inside the block layer to initialize legacy block
> layer requests while gstate_seq and aborted_gstate_sync are only relevant
> for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> called for blk-mq requests such that the above change can be left out? The
> only callers outside the block layer core of blk_rq_init() I know of are
> ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> code if you want.

This is also used by the flush path.  We should probably clean that up,
but let's worry about that later because flush handling has enough of
its own complications.

> > +   write_seqcount_begin(&rq->gstate_seq);
> > +   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +   blk_add_timer(rq);
> > +   write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

A write memory barrier has no cost on x86 and is usually pretty cheap
elsewhere too, and the seqcount is likely in the same cacheline as the
other rq fields.

> > @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool 
> > reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > +   /*
> > +* As nothing prevents from completion happening while
> > +* ->aborted_gstate is set, this may lead to ignored
> > +* completions and further spurious timeouts.
> > +*/
> > +   u64_stats_update_begin(&req->aborted_gstate_sync);
> > +   req->aborted_gstate = 0;
> > +   u64_stats_update_end(&req->aborted_gstate_sync);
> 
> If a blk-mq request is resubmitted 2**62 times, can that result in the above
> code setting aborted_gstate to the same value as gstate? Isn't that a bug?
> If so, how about setting aborted_gstate in the above code to e.g. gstate ^ 
> (2**63)?

A request gets aborted only if the state is in-flight, and 0 isn't that.
Also, how many years would 2^62 resubmissions take?

> > +   struct u64_stats_sync aborted_gstate_sync;
> > +   u64 aborted_gstate;
> > +
> > unsigned long deadline;
> > struct list_head timeout_list;
> 
> Why are gstate and aborted_gstate 64-bit variables? What makes you think that
> 32 bits would not be enough?

Because 32 bits puts it in the rance where a false hit is still
theoretically possible in a reasonable amount of time.

Thanks.

-- 
tejun


Re: [PATCH 4/6] blk-mq: make blk_abort_request() trigger timeout path

2017-12-14 Thread t...@kernel.org
Hello,

On Thu, Dec 14, 2017 at 06:56:55PM +, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> >  void blk_abort_request(struct request *req)
> >  {
> > -   if (blk_mark_rq_complete(req))
> > -   return;
> >  
> > if (req->q->mq_ops) {
> > -   blk_mq_rq_timed_out(req, false);
> > +   req->deadline = jiffies;
> > +   mod_timer(&req->q->timeout, 0);
> > } else {
> > +   if (blk_mark_rq_complete(req))
> > +   return;
> > blk_delete_timer(req);
> > blk_rq_timed_out(req);
> > }
> 
> This patch makes blk_abort_request() asynchronous for blk-mq. Have all callers
> been audited to verify whether this change is safe?

I *think* so.  For all the ATA-related parts, I know they're safe.
For mtip32xx and dasd_ioctl it seems safe, but I can't tell for sure.
Will cc the respective maintainers on the next posting.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-15 Thread t...@kernel.org
Hello, Bart.

On Thu, Dec 14, 2017 at 09:13:32PM +, Bart Van Assche wrote:
...
> however is called before a every use of a request. Sorry but I'm not
> enthusiast about the code in blk_rq_init() that reinitializes state
> information that should survive request reuse.

If it wasn't clear, me neither.  I think what'd be better is making
those paths use the usual request allocation path instead of a custom
one, but for flush handling that's not gonna be trivial, so let's deal
with that later.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-15 Thread t...@kernel.org
Hello, Peter.

On Thu, Dec 14, 2017 at 09:20:42PM +0100, Peter Zijlstra wrote:
> But now that I look at this again, TJ, why can't the below happen?
> 
>   write_seqlock_begin();
>   blk_mq_rq_update_state(rq, IN_FLIGHT);
>   blk_add_timer(rq);
>   
>   read_seqcount_begin()
>   while (seq & 1)
>   cpurelax();
>   // life-lock
>   
>   write_seqlock_end();

Ah, you're right.  For both gstate_seq and aborted_gstate_sync, we can
push all synchronization to the timeout side - i.e. the gstate_seq read
can yield, pause or synchronize_rcu, and the aborted_gstate_sync side
can disable irqs around the update.

Thanks.

-- 
tejun


Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-12 Thread t...@kernel.org
Hello, Bart.

On Tue, Dec 12, 2017 at 09:37:11PM +, Bart Van Assche wrote:
> Have you considered the following instead of introducing MQ_RQ_IDLE and
> MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
> operations introduced in the hot path by this patch series.

But nothing in the hot paths is atomic.

> static inline bool blk_mq_rq_in_flight(struct request *rq)
> {
>   return list_empty(&rq->queuelist);
> }

And the fact that we encode the generation number and state into a
single variable contributes to not needing atomic operations.
Breaking up the state and generation like the above would need more
synchronization, not less.

Thanks.

-- 
tejun


Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED

2017-12-12 Thread t...@kernel.org
On Tue, Dec 12, 2017 at 10:20:39PM +, Bart Van Assche wrote:
> The above code should show all requests owned by the block driver. Patch
> "blk-mq-debugfs: Also show requests that have not yet been started" (not yet
> in Jens' tree) changes the REQ_ATOM_STARTED test into 
> list_empty(&rq->queuelist).
> Can that change be supported with the existing MQ_RQ_* states or will a new
> state have to be introduced to support this? See also
> https://marc.info/?l=linux-block&m=151252188411991.

If the list_empty() test was correct before, it'd be correct now too.

Thanks.

-- 
tejun


Re: net/sunrpc: v4.14-rc4 lockdep warning

2017-10-10 Thread t...@kernel.org
Hello, Trond.

On Mon, Oct 09, 2017 at 06:32:13PM +, Trond Myklebust wrote:
> On Mon, 2017-10-09 at 19:17 +0100, Lorenzo Pieralisi wrote:
> > I have run into the lockdep warning below while running v4.14-rc3/rc4
> > on an ARM64 defconfig Juno dev board - reporting it to check whether
> > it is a known/genuine issue.
> > 
> > Please let me know if you need further debug data or need some
> > specific tests.
> > 
> > [6.209384] ==
> > [6.215569] WARNING: possible circular locking dependency detected
> > [6.221755] 4.14.0-rc4 #54 Not tainted
> > [6.225503] --
> > [6.231689] kworker/4:0H/32 is trying to acquire lock:
> > [6.236830]  ((&task->u.tk_work)){+.+.}, at: []
> > process_one_work+0x1cc/0x3f0
> > [6.245472] 
> >but task is already holding lock:
> > [6.251309]  ("xprtiod"){+.+.}, at: []
> > process_one_work+0x1cc/0x3f0
> > [6.259158] 
> >which lock already depends on the new lock.
> > 
> > [6.267345] 
> >the existing dependency chain (in reverse order) is:
..
> Adding Tejun and Lai, since this looks like a workqueue locking issue.

It looks a bit cryptic but it's warning against the following case.

1. Memory pressure is high and rescuer kicks in for the xprtiod
   workqueue.  There are no other kworkers serving the workqueue.

2. The rescuer runs the xprt_destroy path and ends up calling
   cancel_work_sync() on a work item which is queued on xprtiod.

3. The work item is pending on the same workqueue and assuming that
   memory pressure doesn't let off (let's say reclaim is trying to
   kick off nfs pages), the only way it can get executed is by the
   rescuer which is waiting for the work item - an A-B-A deadlock.
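In code shape, the cycle is roughly the following (an illustrative fragment, not the actual sunrpc code; only the workqueue API names are real):

```c
/* xprtiod is created with WQ_MEM_RECLAIM, so it has a rescuer thread */
struct workqueue_struct *xprtiod_workqueue =
	alloc_workqueue("xprtiod", WQ_MEM_RECLAIM, 0);

/* (1) a transport work item sits pending on xprtiod */
queue_work(xprtiod_workqueue, &xprt->task_cleanup);

/* (2) another xprtiod work item - now executing *in the rescuer*
 * because memory pressure starved the pool of kworkers - tears the
 * transport down and flushes the pending item: */
cancel_work_sync(&xprt->task_cleanup);	/* (3) waits for the only thread
					 * that could run it: the rescuer
					 * itself - deadlock */
```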

Thanks.

-- 
tejun


Re: net/sunrpc: v4.14-rc4 lockdep warning

2017-10-10 Thread t...@kernel.org
Hello,

On Tue, Oct 10, 2017 at 04:48:57PM +, Trond Myklebust wrote:
> Thanks for the explanation. What I'm not really understanding here
> though, is how the work item could be queued at all. We have a
> wait_on_bit_lock() in xprt_destroy() that should mean the xprt-
> >task_cleanup work item has completed running, and that it cannot be
> requeued.
> 
> Is there a possibility that the flush_queue() might be triggered
> despite the work item not being queued?

Yeah, for sure.  The lockdep annotations don't distinguish those
cases and assume the worst case.

Thanks.

-- 
tejun


Re: ATA failure regression

2016-09-29 Thread t...@kernel.org
Hello,

On Wed, Sep 28, 2016 at 05:45:08AM +, Shah, Nehal-bakulchandra wrote:
> Can someone please help me to debug this issue?

The only thing I can do from the libata side is disabling msi on the
affected platform, but the problem doesn't seem confined to ahci, so
probably the right thing to do for now is disabling msi on the
platform?

Thanks.

-- 
tejun


Re: ATA failure regression

2016-09-29 Thread t...@kernel.org
Hello,

(cc'ing Dan, hi!)

On Thu, Sep 29, 2016 at 08:25:58AM -0300, Henrique de Moraes Holschuh wrote:
> Actually, according to Shah's post from the 26th to this thread,
> everything works just fine even with the IOMMU enabled, as long as the
> system is configured to NOT auto-shutdown unused SATA ports.
> 
> That seems to give some extra avenues at fixing/working around the
> issue, maybe?

Hmm... it could be that the multiple msix routing gets screwed when
the bios disables unoccupied ports.  Dan, can you please look into
this?  Shah is seeing ahci failing during boot on an AMD platform with
msi enabled.  If disabling of unoccupied ports is turned off in bios,
the problem goes away.  Maybe we're mapping the irqs incorrectly in
such cases?

Thanks.

-- 
tejun


Re: Question about mv_print_info in sata_mv.c in sata_mv.c

2014-12-26 Thread t...@kernel.org
Hello, Nick.

On Fri, Dec 26, 2014 at 04:06:11PM -0500, nick wrote:
> I am assuming after reading this function's code, that this function is 
> completed and no longer
> needs a fix me comment above it to be completed.

I do appreciate that you're studying the FIXME comments but at this
point I'm not sure whether blindly chasing them and asking people
whether they're still necessary is a productive thing to do.  If
they're actively misleading, sure, let's remove them, but FIXME in a
sata_mv function which prints some controller identification
information just doesn't matter.  If you can assert that the comment
is no longer necessary and misleading, please submit a patch with
backing rationale; otherwise, obsessing with each instance of FIXME
comment doesn't seem to be a particularly productive way of
participating in kernel development.

Thanks.

-- 
tejun