Re: [PATCH] Avoid that a kernel warning appears during system resume

2019-03-17 Thread Martin Steigerwald
Hi Bart.

Bart Van Assche - 16.03.19, 22:28:
> On 3/16/19 3:43 AM, Martin Steigerwald wrote:
> > Bart Van Assche - 16.03.19, 00:27:
> >> Since scsi_device_quiesce() skips SCSI devices that are in another
> >> state than RUNNING, OFFLINE or TRANSPORT_OFFLINE,
> >> scsi_device_resume() should not complain about SCSI devices that
> >> have been skipped. Hence this patch, which prevents the following
> >> warning from appearing during resume:
> > Am I on CC because one of those warnings appeared in bug reports of
> > mine from quite some time ago?
[…]
> That's correct. I hope that you don't mind that I cc'ed you?

No, I don't mind.

Would you like me to apply the patch in order to test it? As it is just 
about suppressing a kernel warning and not changing any major 
functionality, I wondered whether you would like me to test it or whether 
the CC is more JFYI.
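
If it helps, my plan would simply be: apply the patch on top of 5.0, do a 
few suspend/resume cycles, and check the kernel log for the warning from 
the patch description, roughly along these lines:

% systemctl suspend    # then wake the machine up again
% dmesg | grep -A3 blk_clear_pm_only

With the patch applied the warning should no longer show up after resume.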

> >> WARNING: CPU: 3 PID: 1039 at blk_clear_pm_only+0x2a/0x30
> >> CPU: 3 PID: 1039 Comm: kworker/u8:49 Not tainted 5.0.0+ #1
> >> Hardware name: LENOVO 4180F42/4180F42, BIOS 83ET75WW (1.45 )
> > 
> > This at least does not appear to be from this ThinkPad T520, as I
> > already have BIOS version 1.49.
> 
> The call trace in the patch description is only an example. I think
> the problem description and the patch apply to all systems that
> have one or more SCSI disks.

Best,
-- 
Martin




Re: [PATCH] Avoid that a kernel warning appears during system resume

2019-03-16 Thread Martin Steigerwald
Hi Bart.

Bart Van Assche - 16.03.19, 00:27:
> Since scsi_device_quiesce() skips SCSI devices that are in another state
> than RUNNING, OFFLINE or TRANSPORT_OFFLINE, scsi_device_resume()
> should not complain about SCSI devices that have been skipped. Hence
> this patch, which prevents the following warning from appearing
> during resume:

Am I on CC because one of those warnings appeared in bug reports of mine 
from quite some time ago?

> WARNING: CPU: 3 PID: 1039 at blk_clear_pm_only+0x2a/0x30
> CPU: 3 PID: 1039 Comm: kworker/u8:49 Not tainted 5.0.0+ #1
> Hardware name: LENOVO 4180F42/4180F42, BIOS 83ET75WW (1.45 ) 05/10/2013

This at least does not appear to be from this ThinkPad T520, as I already 
have BIOS version 1.49.

Thanks and have a good weekend,
Martin

> Workqueue: events_unbound async_run_entry_fn
> RIP: 0010:blk_clear_pm_only+0x2a/0x30
> Call Trace:
>  ? scsi_device_resume+0x28/0x50
>  ? scsi_dev_type_resume+0x2b/0x80
>  ? async_run_entry_fn+0x2c/0xd0
>  ? process_one_work+0x1f0/0x3f0
>  ? worker_thread+0x28/0x3c0
>  ? process_one_work+0x3f0/0x3f0
>  ? kthread+0x10c/0x130
>  ? __kthread_create_on_node+0x150/0x150
>  ? ret_from_fork+0x1f/0x30
> 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Ming Lei 
> Cc: Johannes Thumshirn 
> Cc: Oleksandr Natalenko 
> Cc: Martin Steigerwald 
> Cc: 
> Reported-by: Jisheng Zhang 
> Tested-by: Jisheng Zhang 
> Fixes: 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work reliably") # v4.15
> Signed-off-by: Bart Van Assche 
> ---
>  drivers/scsi/scsi_lib.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 324f830ee9fa..54ad751b42b8 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -2541,8 +2541,10 @@ void scsi_device_resume(struct scsi_device *sdev)
>        * device deleted during suspend)
>        */
>   mutex_lock(&sdev->state_mutex);
> - sdev->quiesced_by = NULL;
> - blk_clear_pm_only(sdev->request_queue);
> + if (sdev->quiesced_by) {
> + sdev->quiesced_by = NULL;
> + blk_clear_pm_only(sdev->request_queue);
> + }
>   if (sdev->sdev_state == SDEV_QUIESCE)
>   scsi_device_set_state(sdev, SDEV_RUNNING);
>   mutex_unlock(&sdev->state_mutex);


-- 
Martin




Re: [PATCH] Change synchronize_rcu() in scsi_device_quiesce() into synchronize_sched()

2018-03-19 Thread Martin Steigerwald
Hi Bart.

Bart Van Assche - 16.03.18, 22:51:
> On 03/16/18 14:42, Martin Steigerwald wrote:
> > What is this one about?
> > 
> > Fix for the regression I (and others?) reported?¹
> > 
> > [1] [Bug 199077] [Possible REGRESSION, 4.16-rc4] Error updating SMART data
> > during runtime and boot failures with blk_mq_terminate_expired in
> > backtrace
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=199077
[…]
> This patch is a fix for the fix for the bug that you and others had
> reported. See also "I/O hangs after resuming from suspend-to-ram"
> (https://marc.info/?l=linux-block&m=150340235201348).
> 
> This patch fixes an API violation. For those users for whom suspend and
> resume work fine with commit 3a0a529971ec applied, it should still work
> fine with this patch applied; the only difference is that this patch may
> cause scsi_device_quiesce() to wait longer.

Okay, so if I understand you correctly, this is not related to the regression 
I mentioned above.

Testing anyway.

Thanks,
-- 
Martin


Re: [PATCH] Change synchronize_rcu() in scsi_device_quiesce() into synchronize_sched()

2018-03-16 Thread Martin Steigerwald
Hello Bart.

What is this one about?

Fix for the regression I (and others?) reported?¹

[1] [Bug 199077] [Possible REGRESSION, 4.16-rc4] Error updating SMART data 
during runtime and boot failures with blk_mq_terminate_expired in backtrace

https://bugzilla.kernel.org/show_bug.cgi?id=199077

Thanks,
Martin

Bart Van Assche - 16.03.18, 18:35:
> Since blk_queue_enter() uses rcu_read_lock_sched(), scsi_device_quiesce()
> must use synchronize_sched().
> 
> Reported-by: Tejun Heo 
> Fixes: 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work reliably")
> Signed-off-by: Bart Van Assche 
> Cc: Hannes Reinecke 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Johannes Thumshirn 
> Cc: Tejun Heo 
> Cc: Oleksandr Natalenko 
> Cc: Martin Steigerwald 
> Cc: sta...@vger.kernel.org # v4.15
> ---
>  drivers/scsi/scsi_lib.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 1d83f29aee74..0b99ee2fbbb5 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -3014,7 +3014,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
>* unfreeze even if the queue was already frozen before this function
>* was called. See also https://lwn.net/Articles/573497/.
>*/
> - synchronize_rcu();
> + synchronize_sched();
>   blk_mq_unfreeze_queue(q);
> 
>   mutex_lock(&sdev->state_mutex);


-- 
Martin


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-30 Thread Martin Steigerwald
Ming Lei - 30.01.18, 02:24:
> > > SCSI_MQ was first introduced in v3.17, but disabled by default. In
> > > v4.13-rc1 it was enabled by default, but that patch was reverted in
> > > v4.13-rc7, so it is disabled by default again.
> > > 
> > > Now both the originally reported PM issue (actually SCSI quiesce) and
> > > the sequential I/O performance issue have been addressed.
> > 
> > Is the blocker bug just not closed because no one thought to do it:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=178381
> > 
> > (have we confirmed with the original reporter that this issue is now
> > fixed?)
> 
> From a developer's view, this issue is fixed by commit
> 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work reliably"),
> and that fix has been verified by the reporter on the kernel mailing list.

I have never seen any suspend / hibernate related issues with blk-mq + BFQ 
since then, on a heavily utilized BTRFS dual SSD RAID 1.

% egrep "MQ|BFQ" /boot/config-4.15.0-tp520-btrfstrim+
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BLK_WBT_MQ=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_MQ_IOSCHED_DEADLINE=m
CONFIG_MQ_IOSCHED_KYBER=m
CONFIG_IOSCHED_BFQ=m
CONFIG_BFQ_GROUP_IOSCHED=y
CONFIG_NET_SCH_MQPRIO=m
# CONFIG_SCSI_MQ_DEFAULT is not set
# CONFIG_DM_MQ_DEFAULT is not set
CONFIG_DM_CACHE_SMQ=m

% cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.15.0-tp520-btrfstrim+ root=UUID=[…] ro 
rootflags=subvol=debian resume=/dev/mapper/sata-swap init=/bin/systemd 
thinkpad_acpi.fan_control=1 systemd.restore_state=0 scsi_mod.use_blk_mq=1

% cat /sys/block/sda/queue/scheduler 
[bfq] none
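
Another quick check that SCSI really runs with blk-mq here (assuming the 
usual sysfs path for the module parameter):

% cat /sys/module/scsi_mod/parameters/use_blk_mq
Y

which matches scsi_mod.use_blk_mq=1 on the kernel command line above.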

Thanks,
-- 
Martin


Re: [PATCH v7 9/9] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-12 Thread Martin Steigerwald
Bart Van Assche - 10.10.17, 15:27:
> On Tue, 2017-10-10 at 09:57 +0200, Martin Steigerwald wrote:
> 
> > Bart Van Assche - 09.10.17, 16:14:
> > 
> > > The contexts from which a SCSI device can be quiesced or resumed are:
> > > [ ... ]
> > 
> > 
> > Does this fix the issue as reliably as the patches from Ming? I mean in
> > *real world* scenarios? Or is it just about the same approach as Ming
> > has taken?
> > 
> > I ask because I don't see any Tested-by:s here. I know I tested Ming's
> > patch series and I know it fixes the hang after resume from suspend
> > with blk-mq + BFQ issue for me. I have an uptime of 7 days and I haven't
> > seen any uptime even remotely like that in a long time (before that issue
> > Intel gfx drivers caused hangs, but thankfully that seems fixed
> > meanwhile).
> > 
> > I'd be willing to test. Do you have a 4.14.x tree available with these
> > patches applied that I can just add as a remote and fetch from?

> A previous version of this patch series passed Oleksandr's tests. This
> series is close to that previous version, so I think it is safe to assume
> that this version will also pass Oleksandr's tests. Anyway, more testing is
> definitely welcome, so if you want to verify this series please start from
> the following tree (Jens' for-next branch + this series):
> https://github.com/bvanassche/linux/tree/blk-mq-pm-v7 

Several suspend-to-RAM and resume, and suspend-to-disk and resume cycles 
without issues. So I think this is good.
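
(Roughly what I ran, from memory, with the usual desktop load on the BTRFS 
RAID 1 in the background:

% systemctl suspend      # wake up manually, wait a bit
% systemctl hibernate    # wake up again

repeated a few times.)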

Tested-By: Martin Steigerwald 

Thanks,
-- 
Martin


Re: [PATCH v7 9/9] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-10 Thread Martin Steigerwald
Bart Van Assche - 09.10.17, 16:14:
> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.
> 
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why SCSI
> devices are quiesced while the hibernation image is being written or
> restored. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid a full queue preventing power management requests from
> being submitted by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
> 
> for d in /sys/class/block/sd*[a-z]; do
>   hcil=$(readlink "$d/device")
>   hcil=${hcil#../../../}
>   echo 4 > "$d/queue/nr_requests"
>   echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 --rw=randread \
>   --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>   --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
> 
> Reported-by: Oleksandr Natalenko 
> References: "I/O hangs after resuming from suspend-to-ram"
> (https://marc.info/?l=linux-block&m=150340235201348)
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 

Does this fix the issue as reliably as the patches from Ming? I mean in *real 
world* scenarios? Or is it just about the same approach as Ming has taken?

I ask because I don't see any Tested-by:s here. I know I tested Ming's patch 
series and I know it fixes the hang after resume from suspend with blk-mq + BFQ 
issue for me. I have an uptime of 7 days and I haven't seen any uptime even 
remotely like that in a long time (before that issue Intel gfx drivers caused 
hangs, but thankfully that seems fixed meanwhile).

I'd be willing to test. Do you have a 4.14.x tree available with these patches 
applied that I can just add as a remote and fetch from?

Thanks,
Martin

> ---
>  block/blk-core.c| 42 +++---
>  block/blk-mq.c  |  4 ++--
>  block/blk-timeout.c |  2 +-
>  drivers/scsi/scsi_lib.c | 28 ++--
>  fs/block_dev.c  |  4 ++--
>  include/linux/blkdev.h  |  2 +-
>  6 files changed, 59 insertions(+), 23 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ed992cbd107f..3847ea42e341 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
> 
>   spin_lock_irqsave(q->queue_lock, flags);
>   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> + wake_up_all(&q->mq_freeze_wq);
>   spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
> @@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
> 
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + */
> +int blk_queue_enter(struct request_queue *q, unsigned int flags)
>  {
> + const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> +
>   while (true) {
> + bool success = false;
>   int ret;
> 
> - if (percpu_ref_tryget_live(&q->q_usage_counter))
> + rcu_read_lock_sched();
> + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> + /*
> +  * The code that sets the PREEMPT_ONLY flag is
> +  * responsible for ensuring that that flag is globally
> +  * visible before the queue is unfrozen.
> +  */
> + if (preempt || !blk_queue_preempt_only(q)) {
> + success = true;
> + } else {
> + percpu_ref_put(&q->q_usage_counter);
> + WARN_ONCE("%s: Attempt to allocate non-preempt request in preempt-only mode.\n",
> +   kobject_name(q->kobj.parent));
> + }
> + rcu_read_unlock_sched();
> +
> +   

Re: [PATCH V7 0/6] block/scsi: safe SCSI quiescing

2017-09-30 Thread Martin Steigerwald
Hi Ming.

Ming Lei - 30.09.17, 14:12:
> Please consider this patchset for v4.15; it fixes one
> kind of long-standing I/O hang issue in both the legacy block path
> and blk-mq.
>
> The current SCSI quiesce isn't safe and can easily trigger I/O deadlocks.

Isn't that material for -stable as well?

I'd love to see this go into 4.14, especially as it's an LTS release.

Thanks,
Martin

> Once a SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() simply waits for completion of the I/Os already
> dispatched to the SCSI stack. That isn't enough at all.
> 
> Because new requests can still be coming in while none of the allocated
> requests can be dispatched successfully, the request pool can easily be
> consumed up.
> 
> Then a request with RQF_PREEMPT can't be allocated and waits forever,
> so the system hangs forever, for example during system suspend or
> while sending SCSI domain validation in case of transport_spi.
> 
> Both I/O hangs inside system suspend[1] and during SCSI domain
> validation were reported before.
> 
> This patchset introduces a preempt-only mode and solves the issue
> by allowing only RQF_PREEMPT requests during SCSI quiesce.
> 
> Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
> them both.
> 
> V7:
>   - add Reviewed-by & Tested-by
>   - one line change in patch 5 for checking preempt request
> 
> V6:
>   - borrow Bart's idea of preempt only, with clean
> implementation(patch 5/patch 6)
>   - needn't any external driver's dependency, such as MD's
>   change
> 
> V5:
>   - fix one tiny race by introducing blk_queue_enter_preempt_freeze()
>   given this change is small enough compared with V4, I added
>   tested-by directly
> 
> V4:
>   - reorganize patch order to make it more reasonable
>   - support nested preempt freeze, as required by SCSI transport spi
>   - check preempt freezing in slow path of blk_queue_enter()
>   - add "SCSI: transport_spi: resume a quiesced device"
>   - wake up freeze queue in setting dying for both blk-mq and legacy
>   - rename blk_mq_[freeze|unfreeze]_queue() in one patch
>   - rename .mq_freeze_wq and .mq_freeze_depth
>   - improve comment
> 
> V3:
>   - introduce q->preempt_unfreezing to fix one bug of preempt freeze
>   - call blk_queue_enter_live() only when queue is preempt frozen
>   - cleanup a bit on the implementation of preempt freeze
>   - only patch 6 and 7 are changed
> 
> V2:
>   - drop the 1st patch in V1 because percpu_ref_is_dying() is
>   enough as pointed by Tejun
>   - introduce preempt version of blk_[freeze|unfreeze]_queue
>   - sync between preempt freeze and normal freeze
>   - fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013&r=3&w=2
> 
> 
> Thanks,
> Ming
> 
> Ming Lei (6):
>   blk-mq: only run hw queues for blk-mq
>   block: tracking request allocation with q_usage_counter
>   block: pass flags to blk_queue_enter()
>   block: prepare for passing RQF_PREEMPT to request allocation
>   block: support PREEMPT_ONLY
>   SCSI: set block queue at preempt only when SCSI device is put into
> quiesce
> 
>  block/blk-core.c        | 63 +++--
>  block/blk-mq.c          | 14 ---
>  block/blk-timeout.c     |  2 +-
>  drivers/scsi/scsi_lib.c | 25 +---
>  fs/block_dev.c          |  4 ++--
>  include/linux/blk-mq.h  |  7 +++---
>  include/linux/blkdev.h  | 27 ++---
>  7 files changed, 107 insertions(+), 35 deletions(-)


-- 
Martin


Re: [PATCH V6 0/6] block/scsi: safe SCSI quiescing

2017-09-29 Thread Martin Steigerwald
Ming Lei - 27.09.17, 16:27:
> On Wed, Sep 27, 2017 at 09:57:37AM +0200, Martin Steigerwald wrote:
> > Hi Ming.
> > 
> > Ming Lei - 27.09.17, 13:48:
> > > Hi,
> > > 
> > > The current SCSI quiesce isn't safe and can easily trigger I/O deadlocks.
> > > 
> > > Once a SCSI device is put into QUIESCE, no new request except for
> > > RQF_PREEMPT can be dispatched to SCSI successfully, and
> > > scsi_device_quiesce() simply waits for completion of the I/Os already
> > > dispatched to the SCSI stack. That isn't enough at all.
> > > 
> > > Because new requests can still be coming in while none of the allocated
> > > requests can be dispatched successfully, the request pool can easily be
> > > consumed up.
> > > 
> > > Then a request with RQF_PREEMPT can't be allocated and waits forever,
> > > meanwhile scsi_device_resume() waits for completion of the RQF_PREEMPT
> > > request, so the system hangs forever, for example during system suspend
> > > or while sending SCSI domain validation.
> > > 
> > > Both I/O hangs inside system suspend[1] and during SCSI domain
> > > validation were reported before.
> > > 
> > > This patchset introduces a preempt-only mode and solves the issue
> > > by allowing only RQF_PREEMPT requests during SCSI quiesce.
> > > 
> > > Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
> > > them both.
> > > 
> > > V6:
> > >   - borrow Bart's idea of preempt only, with clean
> > >   
> > > implementation(patch 5/patch 6)
> > >   
> > >   - needn't any external driver's dependency, such as MD's
> > >   change
> > 
> > Do you want me to test with v6 of the patch set? If so, it would be nice
> > if you'd make a v6 branch in your git repo.
> 
> Hi Martin,
> 
> I appreciate much if you may run V6 and provide your test result,
> follows the branch:
> 
> https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V6
> 
> https://github.com/ming1/linux.git #blk_safe_scsi_quiesce_V6
> 
> > After an uptime of almost 6 days I am pretty confident that the V5 one
> > fixes the issue for me. So
> > 
> > Tested-by: Martin Steigerwald 
> > 
> > for V5.
> 
> Thanks for your test!

Two days and almost 6 hours, no hang yet. I bet the whole thing works. 
(3e45474d7df3bfdabe4801b5638d197df9810a79)

Tested-By: Martin Steigerwald 

(It could still hang after three days, but usually I got the first hang within 
the first two days.)

Thanks,
-- 
Martin


Re: [PATCH V6 0/6] block/scsi: safe SCSI quiescing

2017-09-27 Thread Martin Steigerwald
Hi Ming.

Ming Lei - 27.09.17, 13:48:
> Hi,
> 
> The current SCSI quiesce isn't safe and can easily trigger I/O deadlocks.
> 
> Once a SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() simply waits for completion of the I/Os already
> dispatched to the SCSI stack. That isn't enough at all.
> 
> Because new requests can still be coming in while none of the allocated
> requests can be dispatched successfully, the request pool can easily be
> consumed up.
> 
> Then a request with RQF_PREEMPT can't be allocated and waits forever,
> meanwhile scsi_device_resume() waits for completion of the RQF_PREEMPT
> request, so the system hangs forever, for example during system suspend
> or while sending SCSI domain validation.
> 
> Both I/O hangs inside system suspend[1] and during SCSI domain
> validation were reported before.
> 
> This patchset introduces a preempt-only mode and solves the issue
> by allowing only RQF_PREEMPT requests during SCSI quiesce.
> 
> Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
> them both.
> 
> V6:
>   - borrow Bart's idea of preempt only, with clean
> implementation(patch 5/patch 6)
>   - needn't any external driver's dependency, such as MD's
>   change

Do you want me to test with v6 of the patch set? If so, it would be nice if 
you'd make a v6 branch in your git repo.

After an uptime of almost 6 days I am pretty confident that the V5 one fixes 
the issue for me. So

Tested-by: Martin Steigerwald 

for V5.

Thanks,
Martin

> V5:
>   - fix one tiny race by introducing blk_queue_enter_preempt_freeze()
>   given this change is small enough compared with V4, I added
>   tested-by directly
> 
> V4:
>   - reorganize patch order to make it more reasonable
>   - support nested preempt freeze, as required by SCSI transport spi
>   - check preempt freezing in slow path of blk_queue_enter()
>   - add "SCSI: transport_spi: resume a quiesced device"
>   - wake up freeze queue in setting dying for both blk-mq and legacy
>   - rename blk_mq_[freeze|unfreeze]_queue() in one patch
>   - rename .mq_freeze_wq and .mq_freeze_depth
>   - improve comment
> 
> V3:
>   - introduce q->preempt_unfreezing to fix one bug of preempt freeze
>   - call blk_queue_enter_live() only when queue is preempt frozen
>   - cleanup a bit on the implementation of preempt freeze
>   - only patch 6 and 7 are changed
> 
> V2:
>   - drop the 1st patch in V1 because percpu_ref_is_dying() is
>   enough as pointed by Tejun
>   - introduce preempt version of blk_[freeze|unfreeze]_queue
>   - sync between preempt freeze and normal freeze
>   - fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013&r=3&w=2
> 
> 
> Thanks,
> Ming
> 
> Ming Lei (6):
>   blk-mq: only run hw queues for blk-mq
>   block: tracking request allocation with q_usage_counter
>   block: pass flags to blk_queue_enter()
>   block: prepare for passing RQF_PREEMPT to request allocation
>   block: support PREEMPT_ONLY
>   SCSI: set block queue at preempt only when SCSI device is put into
> quiesce
> 
>  block/blk-core.c        | 62 ++---
>  block/blk-mq.c          | 14 ---
>  block/blk-timeout.c     |  2 +-
>  drivers/scsi/scsi_lib.c | 25 +---
>  fs/block_dev.c          |  4 ++--
>  include/linux/blk-mq.h  |  7 +++---
>  include/linux/blkdev.h  | 27 ++---
>  7 files changed, 106 insertions(+), 35 deletions(-)


-- 
Martin


Re: Race to power off harming SATA SSDs

2017-04-12 Thread Martin Steigerwald
On Tuesday, 11 April 2017, 11:31:29 CEST, Henrique de Moraes 
Holschuh wrote:
> On Tue, 11 Apr 2017, Martin Steigerwald wrote:
> > I do have a Crucial M500 and I do have an increase of that counter:
> > 
> > martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
> > smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   1
> > smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   67
> > smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   105
> > smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   148
> > smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   201
> > smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   272
> > 
> > 
> > I mostly didn't notice anything, except for one time where I indeed had a
> > BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD
> > (which also has an attribute for unclean shutdowns that increases).
> 
> The Crucial M500 has something called "RAIN" which it got unmodified
> from its Micron datacenter siblings of the time, along with a large
> amount of flash overprovisioning.  Too bad it lost the overprovisioned
> supercapacitor bank present on the Microns.

I think I read about this some time ago. I decided on a Crucial M500 because 
in tests it wasn't the fastest, but there were hints that it might be one of 
the most reliable mSATA SSDs of that time.

[… RAIN explanation …]

> > Henrique's write-up gave me the idea that maybe it wasn't a user-triggered
> > unclean shutdown that caused the issue, but an unclean shutdown triggered
> > by the Linux kernel's SSD shutdown procedure implementation.
> 
> Maybe.  But that corruption could easily have been caused by something
> else.  There is no shortage of possible culprits.

Yes.

> I expect most damage caused by unclean SSD power-offs to be hidden from
> the user/operating system/filesystem by the extensive recovery
> facilities present on most SSDs.
> 
> Note that the fact that data was transparently (and successfully)
> recovered doesn't mean damage did not happen, or that the unit was not
> harmed by it: it likely got some extra flash wear at the very least.

Okay, I understand.

Well, my guess back then (I didn't fully elaborate on it in the initial mail, 
but did so in the blog post) was that I didn't see any capacitor on the mSATA 
SSD board. But I know the Intel SSD 320 has capacitors. So I thought, okay, 
maybe there really was a sudden power loss because I tried to exchange the 
battery during suspend to RAM / standby without remembering the event. And I 
thought, okay, without a capacitor the SSD then didn't get a chance to write 
some of the data. But again, this is just a guess.

I can provide the SMART data files to you in case you want to have a look at them.

> BTW, for the record, Windows 7 also appears to have had (and maybe still
> has) this issue as far as I can tell.  Almost every user report of
> excessive unclean power-off alerts (and also of SSD bricking) to be
> found on SSD vendor forums comes from Windows users.

Interesting.

Thanks,
-- 
Martin


Re: Race to power off harming SATA SSDs

2017-04-11 Thread Martin Steigerwald
On Tuesday, 11 April 2017, 08:52:06 CEST, Tejun Heo wrote:
> > Evidently, how often the SSD will lose the race depends on a platform
> > and SSD combination, and also on how often the system is powered off.
> > A sluggish firmware that takes its time to cut power can save the day...
> > 
> > 
> > Observing the effects:
> > 
> > An unclean SSD power-off will be signaled by the SSD device through an
> > increase on a specific S.M.A.R.T attribute.  These SMART attributes can
> > be read using the smartmontools package from www.smartmontools.org,
> > which should be available in just about every Linux distro.
> > 
> > smartctl -A /dev/sd#
> > 
> > The SMART attribute related to unclean power-off is vendor-specific, so
> > one might have to track down the SSD datasheet to know which attribute a
> > particular SSD uses.  The naming of the attribute also varies.
> > 
> > For a Crucial M500 SSD with up-to-date firmware, this would be attribute
> > 174 "Unexpect_Power_Loss_Ct", for example.
> > 
> > NOTE: unclean SSD power-offs are dangerous and may brick the device in
> > the worst case, or otherwise harm it (reduce longevity, damage flash
> > blocks).  It is also not impossible to get data corruption.
> 
> I get that the incrementing counters might not be pretty but I'm a bit
> skeptical about this being an actual issue.  Because if that were
> true, the device would be bricking itself from any sort of power
> loss, be that an actual power loss, battery rundown or hard power off
> after a crash.

The write-up by Henrique has been a very informative and interesting read for 
me. I wondered about the same question though.

I do have a Crucial M500 and I do have an increase of that counter:

martin@merkaba:~[…]/Crucial-M500> grep "^174" smartctl-a-201*
smartctl-a-2014-03-05.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   1
smartctl-a-2014-10-11-nach-prüfsummenfehlern.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   67
smartctl-a-2015-05-01.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   105
smartctl-a-2016-02-06.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   148
smartctl-a-2016-07-08-unreadable-sector.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   201
smartctl-a-2017-04-11.txt:174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always   -   272
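
To pull out just the raw value of that attribute over time, something like 
this works here (assuming the M500 is sda and attribute 174 as above):

merkaba:~> smartctl -A /dev/sda | awk '$1 == 174 { print $NF }'
272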


I mostly didn't notice anything, except for one time where I indeed had a 
BTRFS checksum error, luckily within a BTRFS RAID 1 with an Intel SSD (which 
also has an attribute for unclean shutdowns that increases).

I blogged about this in German quite some time ago:

https://blog.teamix.de/2015/01/19/btrfs-raid-1-selbstheilung-in-aktion/

(I think it's easy enough to get the point of the blog post even without 
understanding German.)
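
The scrub itself was started with the usual command (from memory; -B just 
keeps it in the foreground):

merkaba:~> btrfs scrub start -B /home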

Result of scrub:

   scrub started at Thu Oct  9 15:52:00 2014 and finished after 564 seconds
total bytes scrubbed: 268.36GiB with 60 errors
error details: csum=60
corrected errors: 60, uncorrectable errors: 0, unverified errors: 0

Device errors were on:

merkaba:~> btrfs device stats /home
[/dev/mapper/msata-home].write_io_errs   0
[/dev/mapper/msata-home].read_io_errs0
[/dev/mapper/msata-home].flush_io_errs   0
[/dev/mapper/msata-home].corruption_errs 60
[/dev/mapper/msata-home].generation_errs 0
[…]

(thats the Crucial m500)


I didn't have any explanation for this, but I suspected some unclean shutdown, 
even though I remembered no unclean shutdown. I take good care to always have 
a battery in this ThinkPad T520, due to unclean shutdown issues with the Intel 
SSD 320 (bricked device that reports 8 MiB as capacity, probably fixed by the 
firmware update I applied back then).

Henrique's write-up gave me the idea that maybe it wasn't a user-triggered 
unclean shutdown that caused the issue, but an unclean shutdown triggered by 
the Linux kernel's SSD shutdown procedure implementation.

Of course, I don't know whether this is the case and I think there is no way 
to prove or falsify it years after it happened. I never had this happen 
again.

Thanks,
-- 
Martin