On Mon, Oct 09, 2017 at 09:16:53PM +0000, Bart Van Assche wrote:
> On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote:
> > On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> > > It is essential during suspend and resume that neither the filesystem
> > > state nor the filesystem metadata in RAM changes. This is why while
> > > the hibernation image is being written or restored that SCSI devices
> > 
> > quiesce isn't used only for suspend and resume, And the issue isn't
> > suspend/resume specific too. So please change the title/commit log
> > as sort of 'make SCSI quiesce more reliable/safe'. 
> 
> OK, I will modify the patch title.
> 
> > > -         if (percpu_ref_tryget_live(&q->q_usage_counter))
> > > -                 return 0;
> > > +         if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > > +                 /*
> > > +                  * The code that sets the PREEMPT_ONLY flag is
> > > +                  * responsible for ensuring that that flag is globally
> > > +                  * visible before the queue is unfrozen.
> > > +                  */
> > > +                 if (preempt || !blk_queue_preempt_only(q)) {
> > 
> > PREEMPT_ONLY flag is checked without RCU read lock held, so the
> > synchronize_rcu() may just wait for completion of pre-exit
> > percpu_ref_tryget_live(), which can be reordered with the
> > reading on blk_queue_preempt_only().
> 
> OK, I will add rcu_read_lock_sched/unlock_sched() calls around this code.
> 
> > >  int
> > >  scsi_device_quiesce(struct scsi_device *sdev)
> > >  {
> > > + struct request_queue *q = sdev->request_queue;
> > >   int err;
> > >  
> > > + blk_mq_freeze_queue(q);
> > > + if (blk_set_preempt_only(q)) {
> > > +         blk_mq_unfreeze_queue(q);
> > > +         return -EINVAL;
> > > + }
> > 
> > This way is wrong, if blk_set_preempt_only() returns true
> > it means the queue has been in PREEMPT_ONLY already,
> > and failing scsi_device_quiesce() can break suspend/resume or
> > sending SCSI domain validation command.
> 
> That's an interesting comment but I propose that what you suggest is 
> implemented
> through a separate patch. The above code preserves the existing behavior, 
> namely
> that scsi_device_quiesce() returns an error code if called when a SCSI device 
> has
> already been quiesced. See also scsi_device_set_state(). BTW, I don't think 
> that

Please see scsi_device_set_state():

        if (state == oldstate)
                return 0;

And believe it or not, there is one real nested calling of scsi_device_quiesce()
in transport_spi.

> it is likely that this will occur. The only code other than the power 
> management
> code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store
> method and the SCSI parallel code. I don't know any software that sets the 
> SCSI
> QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a
> significant number of users.

As I know, one famous VM uses that, and the original report is that
this VM may hang after running I/O for days by our customer.
The revalidate path can trigger QUIESCE, and udev should cause that.

> 
> > > + /*
> > > +  * Ensure that the effect of blk_set_preempt_only() will be visible
> > > +  * for percpu_ref_tryget() callers that occur after the queue
> > > +  * unfreeze. See also https://lwn.net/Articles/573497/.
> > > +  */
> > > + synchronize_rcu();
> > 
> > This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag
> > before freezing queue since blk_mq_freeze_queue() may implicate
> > one synchronize_rcu().
> 
> That's an interesting comment. I will look further into this.
> 
> > > @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
> > >    */
> > >   mutex_lock(&sdev->state_mutex);
> > >   if (sdev->sdev_state == SDEV_QUIESCE &&
> > > -     scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
> > > +     scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
> > > +         blk_clear_preempt_only(sdev->request_queue);
> > >           scsi_run_queue(sdev->request_queue);
> > > + }
> > >   mutex_unlock(&sdev->state_mutex);
> > 
> > scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't
> > to be run with holding sdev->state_mutex, just like in quiesce path.
> 
> I will look into removing the scsi_run_queue() call. But I prefer to keep
> the blk_clear_preempt_only() inside the critical section because that call
> doesn't sleep and because it changes state information that is related to
> the SCSI state.

The thing is that the same state change in blk_set_preempt_only() can't
be protected by the lock, and this way may confuse people wrt. inconsistent
lock usage on same state protection.

-- 
Ming

Reply via email to