On 2019/08/06 8:56, Dmitry Fomichev wrote:
> This patch fixes a problem in dm-kcopyd that may leave jobs in the
> complete queue indefinitely in the event of backing storage failure.
> 
> This behavior has been observed while running a 100% write file fio
> workload against an XFS volume created on top of a dm-zoned target
> device. If the underlying storage of dm-zoned goes offline under I/O,
> kcopyd sometimes never issues the end copy callback and the dm-zoned
> reclaim work hangs indefinitely waiting for that completion.
> 
> This behavior was traced down to the error handling code in the
> process_jobs() function, which places the failed job on the
> complete_jobs queue but doesn't wake up the job handler. If the backing
> device fails, all outstanding jobs may end up on the complete_jobs
> queue via this code path and then stay there forever because there are
> no more successful I/O jobs to wake up the job handler.
> 
> This patch adds a wake() call to always wake up the kcopyd job wait
> queue for all I/O jobs that fail before dm_io() gets called for that job.
> 
> The patch also sets the write error status in all sub jobs that fail
> because their master job has failed.
> 
> Fixes: b73c67c2cbb00 ("dm kcopyd: add sequential write feature")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitry Fomichev <dmitry.fomic...@wdc.com>
> ---
>  drivers/md/dm-kcopyd.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index df2011de7be2..1bbe4a34ef4c 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -566,8 +566,10 @@ static int run_io_job(struct kcopyd_job *job)
>        * no point in continuing.
>        */
>       if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) &&
> -         job->master_job->write_err)
> +         job->master_job->write_err) {
> +             job->write_err = job->master_job->write_err;
>               return -EIO;
> +     }
>  
>       io_job_start(job->kc->throttle);
>  
> @@ -619,6 +621,7 @@ static int process_jobs(struct list_head *jobs, struct dm_kcopyd_client *kc,
>                       else
>                               job->read_err = 1;
>                       push(&kc->complete_jobs, job);
> +                     wake(kc);
>                       break;
>               }
>  
> 

Reviewed-by: Damien Le Moal <damien.lem...@wdc.com>

-- 
Damien Le Moal
Western Digital Research
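
For readers following the thread, the lost wakeup described in the
commit message can be illustrated with a small user-space model. This is
only a sketch and not kernel code: every name in it (model_client,
model_wake, model_do_work, model_stuck_jobs) is hypothetical, the queues
are reduced to counters, and it assumes that every I/O job fails before
any I/O is issued and that the worker drains the complete queue before
the I/O queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a kcopyd client; not kernel code. */
struct model_client {
	int io_jobs;        /* jobs waiting to have their I/O issued */
	int complete_jobs;  /* jobs waiting for their completion callback */
	int callbacks_run;  /* completion callbacks delivered so far */
	bool work_pending;  /* set by model_wake(), cleared when the worker runs */
};

/* Stand-in for wake(kc): request another pass of the worker. */
static void model_wake(struct model_client *kc)
{
	kc->work_pending = true;
}

/*
 * Models the error path: each I/O job fails before any I/O is issued
 * (assumption: the backing device is gone) and is moved to the
 * complete queue. wake_on_error selects the patched behavior.
 */
static void model_process_io_jobs(struct model_client *kc, bool wake_on_error)
{
	while (kc->io_jobs > 0) {
		kc->io_jobs--;
		kc->complete_jobs++;
		if (wake_on_error)
			model_wake(kc);
	}
}

/* One worker pass: run completions first, then try to issue I/O. */
static void model_do_work(struct model_client *kc, bool wake_on_error)
{
	kc->work_pending = false;
	kc->callbacks_run += kc->complete_jobs;
	kc->complete_jobs = 0;
	model_process_io_jobs(kc, wake_on_error);
}

/* Returns how many jobs are left on the complete queue with no wakeup pending. */
int model_stuck_jobs(int njobs, bool wake_on_error)
{
	struct model_client kc = { .io_jobs = njobs };

	assert(njobs >= 0);
	model_wake(&kc);	/* job submission wakes the worker once */
	while (kc.work_pending)
		model_do_work(&kc, wake_on_error);
	return kc.complete_jobs;
}
```

Under these assumptions, model_stuck_jobs(3, false) returns 3: the jobs
reach the complete queue but nothing schedules another worker pass, so
their callbacks never run. model_stuck_jobs(3, true) returns 0, because
the wakeup on the error path triggers a second pass that delivers the
completions.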
