Re: [PATCH] dax: Remove an unused field in struct dax_operations

2024-09-02 Thread Jan Kara
On Sun 01-09-24 17:17:09, Christophe JAILLET wrote:
> .dax_supported() was apparently removed by commit 7b0800d00dae ("dax:
> remove dax_capable") on 2021-11.
> 
> Remove the now unused function pointer from the struct dax_operations.
> 
> Signed-off-by: Christophe JAILLET 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> Slightly compile tested only, but "git grep dax_supported" now returns
> nothing.
> ---
>  include/linux/dax.h | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9d3e3327af4c..df41a0017b31 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -27,12 +27,6 @@ struct dax_operations {
>*/
>   long (*direct_access)(struct dax_device *, pgoff_t, long,
>   enum dax_access_mode, void **, pfn_t *);
> - /*
> -  * Validate whether this device is usable as an fsdax backing
> -  * device.
> -  */
> - bool (*dax_supported)(struct dax_device *, struct block_device *, int,
> - sector_t, sector_t);
>   /* zero_page_range: required operation. Zero page range   */
>   int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>   /*
> -- 
> 2.46.0
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [RFC PATCH v2 3/8] ext2: Use dax_is_supported()

2024-01-30 Thread Jan Kara
On Tue 30-01-24 11:52:50, Mathieu Desnoyers wrote:
> Use dax_is_supported() to validate whether the architecture has
> virtually aliased data caches at mount time. Print an error and disable
> DAX if dax=always is requested as a mount option on an architecture
> which does not support DAX.
> 
> This is relevant for architectures which require a dynamic check
> to validate whether they have virtually aliased data caches.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Jan Kara 
> Cc: linux-e...@vger.kernel.org
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org

OK, yeah, this is better than v1. Feel free to add:

Acked-by: Jan Kara 

Honza

> ---
>  fs/ext2/super.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 01f9addc8b1f..30ff57d47ed4 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void 
> *data, int silent)
>   blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
>  
>   if (test_opt(sb, DAX)) {
> - if (!sbi->s_daxdev) {
> + if (!dax_is_supported()) {
> + ext2_msg(sb, KERN_ERR,
> + "DAX unsupported by architecture. Turning off 
> DAX.");
> + clear_opt(sbi->s_mount_opt, DAX);
> + } else if (!sbi->s_daxdev) {
>   ext2_msg(sb, KERN_ERR,
>   "DAX unsupported by block device. Turning off 
> DAX.");
>   clear_opt(sbi->s_mount_opt, DAX);
> -- 
> 2.39.2
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()

2024-01-30 Thread Jan Kara
On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote:
> Use dax_is_supported() to validate whether the architecture has
> virtually aliased caches at mount time.
> 
> This is relevant for architectures which require a dynamic check
> to validate whether they have virtually aliased data caches
> (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Jan Kara 
> Cc: linux-e...@vger.kernel.org
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org

Looks good to me (although I share Dave's opinion it would be nice to CC
the whole series to fsdevel - luckily we have lore these days so it is not
that tedious to find the whole series :)). Feel free to add:

Acked-by: Jan Kara 

Honza

> ---
>  fs/ext2/super.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 01f9addc8b1f..0398e7a90eb6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -585,13 +585,13 @@ static int parse_options(char *options, struct 
> super_block *sb,
>   set_opt(opts->s_mount_opt, XIP);
>   fallthrough;
>   case Opt_dax:
> -#ifdef CONFIG_FS_DAX
> - ext2_msg(sb, KERN_WARNING,
> - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
> - set_opt(opts->s_mount_opt, DAX);
> -#else
> - ext2_msg(sb, KERN_INFO, "dax option not supported");
> -#endif
> + if (dax_is_supported()) {
> + ext2_msg(sb, KERN_WARNING,
> +  "DAX enabled. Warning: EXPERIMENTAL, 
> use at your own risk");
> + set_opt(opts->s_mount_opt, DAX);
> + } else {
> + ext2_msg(sb, KERN_INFO, "dax option not 
> supported");
> + }
>   break;
>  
>  #if defined(CONFIG_QUOTA)
> -- 
> 2.39.2
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH] fs : Fix warning using plain integer as NULL

2023-11-08 Thread Jan Kara
On Wed 08-11-23 10:15:50, Abhinav Singh wrote:
> Sparse static analysis tools generate a warning with this message
> "Using plain integer as NULL pointer". In this case this warning is
> being shown because we are trying to initialize  pointer to NULL using
> integer value 0.
> 
> Signed-off-by: Abhinav Singh 

Nice cleanup. Feel free to add:

Reviewed-by: Jan Kara 
Honza

> ---
>  fs/dax.c   | 2 +-
>  fs/direct-io.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 3380b43cb6bb..423fc1607dfa 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1128,7 +1128,7 @@ static int dax_iomap_copy_around(loff_t pos, uint64_t 
> length, size_t align_size,
>   /* zero the edges if srcmap is a HOLE or IOMAP_UNWRITTEN */
>   bool zero_edge = srcmap->flags & IOMAP_F_SHARED ||
>srcmap->type == IOMAP_UNWRITTEN;
> - void *saddr = 0;
> + void *saddr = NULL;
>   int ret = 0;
>  
>   if (!zero_edge) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 20533266ade6..60456263a338 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1114,7 +1114,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct 
> inode *inode,
>   loff_t offset = iocb->ki_pos;
>   const loff_t end = offset + count;
>   struct dio *dio;
> - struct dio_submit sdio = { 0, };
> + struct dio_submit sdio = { NULL, };
>   struct buffer_head map_bh = { 0, };
>   struct blk_plug plug;
>   unsigned long align = offset | iov_iter_alignment(iter);
> -- 
> 2.39.2
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH][next] fsdax: remove redundant variable 'error'

2023-06-21 Thread Jan Kara
On Wed 21-06-23 14:02:56, Colin Ian King wrote:
> The variable 'error' is being assigned a value that is never read,
> the assignment and the variable and redundant and can be removed.
> Cleans up clang scan build warning:
> 
> fs/dax.c:1880:10: warning: Although the value stored to 'error' is
> used in the enclosing expression, the value is never actually read
> from 'error' [deadcode.DeadStores]
> 
> Signed-off-by: Colin Ian King 

Yeah, good spotting. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2ababb89918d..cb36c6746fc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1830,7 +1830,6 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   vm_fault_t ret = VM_FAULT_FALLBACK;
>   pgoff_t max_pgoff;
>   void *entry;
> - int error;
>  
>   if (vmf->flags & FAULT_FLAG_WRITE)
>   iter.flags |= IOMAP_WRITE;
> @@ -1877,7 +1876,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   }
>  
>   iter.pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> - while ((error = iomap_iter(&iter, ops)) > 0) {
> + while (iomap_iter(&iter, ops) > 0) {
>   if (iomap_length(&iter) < PMD_SIZE)
>   continue; /* actually breaks out of the loop */
>  
> -- 
> 2.39.2
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH v2 2/6] dax: fix cache flush on PMD-mapped pages

2022-02-03 Thread Jan Kara
On Wed 02-02-22 22:33:03, Muchun Song wrote:
> The flush_cache_page() only remove a PAGE_SIZE sized range from the cache.
> However, it does not cover the full pages in a THP except a head page.
> Replace it with flush_cache_range() to fix this issue.
> 
> Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
> Signed-off-by: Muchun Song 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 88be1c02a151..e031e4b6c13c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -857,7 +857,8 @@ static void dax_entry_mkclean(struct address_space 
> *mapping, pgoff_t index,
>   if (!pmd_dirty(*pmdp) && !pmd_write(*pmdp))
>   goto unlock_pmd;
>  
> - flush_cache_page(vma, address, pfn);
> + flush_cache_range(vma, address,
> +   address + HPAGE_PMD_SIZE);
>   pmd = pmdp_invalidate(vma, address, pmdp);
>   pmd = pmd_wrprotect(pmd);
>   pmd = pmd_mkclean(pmd);
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH v2 1/6] mm: rmap: fix cache flush on THP pages

2022-02-03 Thread Jan Kara
On Wed 02-02-22 22:33:02, Muchun Song wrote:
> The flush_cache_page() only remove a PAGE_SIZE sized range from the cache.
> However, it does not cover the full pages in a THP except a head page.
> Replace it with flush_cache_range() to fix this issue. At least, no
> problems were found due to this. Maybe because the architectures that
> have virtual indexed caches is less.
> 
> Fixes: f27176cfc363 ("mm: convert page_mkclean_one() to use 
> page_vma_mapped_walk()")
> Signed-off-by: Muchun Song 
> Reviewed-by: Yang Shi 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/rmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fd9dc19eba..0ba12dc9fae3 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -974,7 +974,8 @@ static bool page_mkclean_one(struct page *page, struct 
> vm_area_struct *vma,
>   if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
>   continue;
>  
> - flush_cache_page(vma, address, page_to_pfn(page));
> + flush_cache_range(vma, address,
> +   address + HPAGE_PMD_SIZE);
>   entry = pmdp_invalidate(vma, address, pmd);
>   entry = pmd_wrprotect(entry);
>   entry = pmd_mkclean(entry);
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR



Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

2021-04-19 Thread Jan Kara
-   dax_wake_entry(xas, entry, false);
> > > +   if (dax_is_conflict(entry))
> > > +   return;
> > > +
> > > +   dax_wake_entry(xas, entry, false);
> > 
> 
> Hi Dan,
> 
> > How does this work if entry is NULL? dax_entry_waitqueue() will not
> > know if it needs to adjust the index.
> 
> Wake waiters both at current index as well PMD adjusted index. It feels
> little ugly though.
> 
> > I think the fix might be to
> > specify that put_unlocked_entry() in the invalidate path needs to do a
> > wake_up_all().
> 
> Doing a wake_up_all() when we invalidate an entry, sounds good. I will give
> it a try.

Yeah, that's what I'd suggest as well. After invalidating entry, there's no
point to let other waiters sleep. Trying to optimize for thundering herd
problems in face of entry invalidation is really fragile as you noticed.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] bfq: silence lockdep for bfqd/ioc lock inversion

2021-04-15 Thread Jan Kara
On Wed 14-04-21 11:33:14, Khazhy Kumykov wrote:
> On Wed, Apr 14, 2021 at 2:54 AM Jan Kara  wrote:
> >
> > On Thu 18-03-21 23:00:15, Khazhismel Kumykov wrote:
> > > lockdep warns of circular locking due to inversion between
> > > bfq_insert_requests and bfq_exit_icq. If we end freeing a request when
> > > merging, we *may* grab an ioc->lock if that request is the last refcount
> > > to that ioc. bfq_bio_merge also potentially could have this ordering.
> > > bfq_exit_icq, conversely, grabs bfqd but is always called with ioc->lock
> > > held.
> > >
> > > bfq_exit_icq may either be called from put_io_context_active with ioc
> > > refcount raised, ioc_release_fn after the last refcount was already
> > > dropped, or ioc_clear_queue, which is only called while queue is
> > > quiesced or exiting, so the inverted orderings should never conflict.
> > >
> > > Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as
> > > an extra scheduler")
> > >
> > > Signed-off-by: Khazhismel Kumykov 
> >
> > I've just hit the same lockdep complaint. When looking at this another
> > option to solve this complaint seemed to be to modify bfq_bio_merge() like:
> >
> > ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
> >
> > +   spin_unlock_irq(&bfqd->lock);
> > if (free)
> > blk_mq_free_request(free);
> > -   spin_unlock_irq(&bfqd->lock);
> >
> > return ret;
> >
> > to release request outside of bfqd->lock. Because AFAICT there's no good
> > reason why we are actually freeing the request under bfqd->lock. And it
> > would seem a bit safer than annotating-away the lockdep complaint (as much
> > as I don't see a problem with your analysis). Paolo?
> 
> If we can re-order the locking so we don't need the annotation, that
> seems better ("inversion is OK so long as either we're frozen or we
> have ioc refcount, and we only grab ioc->lock normally if we drop the
> last refcount" is a tad "clever"). Though we still need to deal with
> blk_mq_sched_try_insert_merge which can potentially free a request.

I see, right.

> (See the first stacktrace). Something simple that I wasn't sure of is:
> could we delay bfq_exit_icq work, then avoid the inversion? Simpler to
> analyze then.

That's problematic because ICQ (referencing BFQQs etc.) is going to be
freed after RCU grace period expires. So we cannot really postpone the
teardown of bfq_io_cq. What we could do is to modify
blk_mq_sched_try_insert_merge() so that it returns request to free
similarly to blk_mq_sched_try_merge().  Then we can free the request after
dropping bfqd->lock.

Honza

> > > ---
> > >  block/bfq-iosched.c | 9 -
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > Noticed this lockdep running xfstests (generic/464) on top of a bfq
> > > block device. I was also able to tease it out w/ binary trying to issue
> > > requests that would end up merging while rapidly swapping the active
> > > scheduler. As far as I could see, the deadlock would not actually occur,
> > > so this patch opts to change lock class for the inverted case.
> > >
> > > bfqd -> ioc :
> > > [ 2995.524557] __lock_acquire+0x18f5/0x2660
> > > [ 2995.524562] lock_acquire+0xb4/0x3a0
> > > [ 2995.524565] _raw_spin_lock_irqsave+0x3f/0x60
> > > [ 2995.524569] put_io_context+0x33/0x90.  -> ioc->lock grabbed
> > > [ 2995.524573] blk_mq_free_request+0x51/0x140
> > > [ 2995.524577] blk_put_request+0xe/0x10
> > > [ 2995.524580] blk_attempt_req_merge+0x1d/0x30
> > > [ 2995.524585] elv_attempt_insert_merge+0x56/0xa0
> > > [ 2995.524590] blk_mq_sched_try_insert_merge+0x4b/0x60
> > > [ 2995.524595] bfq_insert_requests+0x9e/0x18c0.-> bfqd->lock grabbed
> > > [ 2995.524598] blk_mq_sched_insert_requests+0xd6/0x2b0
> > > [ 2995.524602] blk_mq_flush_plug_list+0x154/0x280
> > > [ 2995.524606] blk_finish_plug+0x40/0x60
> > > [ 2995.524609] ext4_writepages+0x696/0x1320
> > > [ 2995.524614] do_writepages+0x1c/0x80
> > > [ 2995.524621] __filemap_fdatawrite_range+0xd7/0x120
> > > [ 2995.524625] sync_file_range+0xac/0xf0
> > > [ 2995.524642] __x64_sys_sync_file_range+0x44/0x70
> > > [ 2995.524646] do_syscall_64+0x31/0x40
> > > [ 2995.524649] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >

Re: [PATCH] direct-io: use read lock for DIO_LOCKING flag

2021-04-15 Thread Jan Kara
On Thu 15-04-21 17:43:32, Chao Yu wrote:
> 9902af79c01a ("parallel lookups: actual switch to rwsem") changes inode
> lock from mutex to rwsem, however, we forgot to adjust lock for
> DIO_LOCKING flag in do_blockdev_direct_IO(), so let's change to hold read
> lock to mitigate performance regression in the case of read DIO vs read DIO,
> meanwhile it still keeps original functionality of avoiding buffered access
> vs direct access.
> 
> Signed-off-by: Chao Yu 

Thanks for the patch but this is not safe. Originally we had exclusive lock
(with i_mutex), switching to rwsem doesn't change that requirement. It may
be OK for some filesystems to actually use shared acquisition of rwsem for
DIO reads but it is not clear that is fine for all filesystems (and I
suspect those filesystems that actually do care already don't use
DIO_LOCKING flag or were already converted to iomap_dio_rw()). So unless
you do audit of all filesystems using do_blockdev_direct_IO() with
DIO_LOCKING flag and make sure they are all fine with inode lock in shared
mode, this is a no-go.

Honza

> ---
>  fs/direct-io.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index b2e86e739d7a..93ff912f2749 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1166,7 +1166,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
> *inode,
>   dio->flags = flags;
>   if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
>   /* will be released by direct_io_worker */
> - inode_lock(inode);
> + inode_lock_shared(inode);
>   }
>  
>   /* Once we sampled i_size check for reads beyond EOF */
> @@ -1316,7 +1316,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
> *inode,
>* of protecting us from looking up uninitialized blocks.
>*/
>   if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
> - inode_unlock(dio->inode);
> + inode_unlock_shared(dio->inode);
>  
>   /*
>* The only time we want to leave bios in flight is when a successful
> @@ -1341,7 +1341,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode 
> *inode,
>  
>  fail_dio:
>   if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ)
> - inode_unlock(inode);
> + inode_unlock_shared(inode);
>  
>   kmem_cache_free(dio_cache, dio);
>   return retval;
> -- 
> 2.29.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] bfq: silence lockdep for bfqd/ioc lock inversion

2021-04-14 Thread Jan Kara
 * under bfqd->lock if we merge requests and drop the last ioc
> +  * refcount. Since exit_icq is either called with a refcount,
> +  * or with queue quiesced, use a differnet lock class to
> +  * silence lockdep
> +  */
> + spin_lock_irqsave_nested(&bfqd->lock, flags, 1);
>   bfqq->bic = NULL;
>   bfq_exit_bfqq(bfqd, bfqq);
>   bic_set_bfqq(bic, NULL, is_sync);
> -- 
> 2.31.0.rc2.261.g7f71774620-goog
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: More KCSAN data-race Reports

2021-04-12 Thread Jan Kara
On Mon 12-04-21 18:42:58, Hao Sun wrote:
> Jan Kara  于2021年4月12日周一 下午5:02写道:
> >
> > Hello,
> >
> > On Sun 11-04-21 11:42:05, Hao Sun wrote:
> > > Since the last KCSAN report[1], I found two more KCSAN reports that
> > > Syzbot had not reported.
> > > Not sure if they are valid bugs, I hope the stack information in
> > > reports can help you locate the problem.
> > > Kernel config can be found in the attachment.
> >
> > Do we have symbolic decoding of the traces below? Because involved
> > functions are really big so it's difficult to guess what KCSAN is
> > complaining about... At least I wasn't able to guess it after looking into
> > the stacktraces for a while.
> >
> Sorry, the log processing module of Fuzzer still has some logic bugs,
> only some of the symbolized reports are stored in the disk.
> Interestingly, however, the read-write end that causes data racing in
> both reports are in the same location (fs/jbd2/commit.c:443), and this
> information should help locate the problem.
> 
> Partial symbolized report 1:
> ==
> BUG: KCSAN: data-race in ext4_mark_iloc_dirty / 
> jbd2_journal_commit_transaction
> read-write to 0x88804451d800 of 8 bytes by task 4821 on cpu 1:
>  jbd2_journal_commit_transaction+0x222/0x3200 fs/jbd2/commit.c:443
>  kjournald2+0x253/0x470 fs/jbd2/journal.c:213
>  kthread+0x1f0/0x220 kernel/kthread.c:292
>  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

OK, that is:
journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;

So likely this is a complaint about j_flags update vs j_flags check race
(we check for JBD2_ABORT flag) all around the code.

So again this is harmless unless the compiler plays some devilish tricks
and doesn't store bogus intermediate values in j_flags during RMW
operations. Not sure how to deal with this one. Just putting data_race()
here doesn't seem right - if the compiler does something unexpected, we are
indeed in trouble. Maybe using bitops for j_flags would be beneficial for
other reasons as well as silencing KCSAN. But it needs more thought.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: More KCSAN data-race Reports

2021-04-12 Thread Jan Kara
Hello,

On Sun 11-04-21 11:42:05, Hao Sun wrote:
> Since the last KCSAN report[1], I found two more KCSAN reports that
> Syzbot had not reported.
> Not sure if they are valid bugs, I hope the stack information in
> reports can help you locate the problem.
> Kernel config can be found in the attachment.

Do we have symbolic decoding of the traces below? Because involved
functions are really big so it's difficult to guess what KCSAN is
complaining about... At least I wasn't able to guess it after looking into
the stacktraces for a while.

Honza

> 
> Here is the detailed information:
> commit:   3b9cdafb5358eb9f3790de2f728f765fef100731
> version:   linux 5.11
> git tree:upstream
> 
> Report-1
> ==
> BUG: KCSAN: data-race in ext4_mark_iloc_dirty / 
> jbd2_journal_commit_transaction
> 
> read-write to 0x88804451d800 of 8 bytes by task 4821 on cpu 1:
>  jbd2_journal_commit_transaction+0x222/0x3200
>  kjournald2+0x253/0x470
>  kthread+0x1f0/0x220
>  ret_from_fork+0x1f/0x30
> 
> read to 0x88804451d800 of 8 bytes by task 8418 on cpu 0:
>  ext4_mark_iloc_dirty+0x14ec/0x16e0
>  __ext4_mark_inode_dirty+0x4d2/0x5d0
>  ext4_evict_inode+0xb9f/0xed0
>  evict+0x1a6/0x410
>  iput+0x3fc/0x510
>  do_unlinkat+0x2c9/0x4d0
>  __x64_sys_unlink+0x2c/0x30
>  do_syscall_64+0x39/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Report-2
> ==
> BUG: KCSAN: data-race in __ext4_handle_dirty_metadata /
> jbd2_journal_commit_transaction
> 
> read-write to 0x88800e142800 of 8 bytes by task 4823 on cpu 0:
>  jbd2_journal_commit_transaction+0x222/0x3200
>  kjournald2+0x253/0x470
>  kthread+0x1f0/0x220
>  ret_from_fork+0x1f/0x30
> 
> read to 0x88800e142800 of 8 bytes by task 7925 on cpu 1:
>  __ext4_handle_dirty_metadata+0x11a/0x590
>  ext4_mark_iloc_dirty+0x12dd/0x16e0
>  __ext4_mark_inode_dirty+0x4d2/0x5d0
>  ext4_dirty_inode+0x86/0xa0
>  __mark_inode_dirty+0x70/0x6b0
>  file_update_time+0x3ab/0x3f0
>  file_modified+0x62/0x80
>  ext4_buffered_write_iter+0x1f9/0x3d0
>  ext4_file_write_iter+0x45e/0x10d0
>  vfs_write+0x6db/0x7c0
>  ksys_write+0xce/0x180
>  __x64_sys_write+0x3e/0x50
>  do_syscall_64+0x39/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> 
> [1] 
> https://lore.kernel.org/lkml/cackbjszw5sp4jb51+c5mrmssgq73x8ieko_ev6ctxvvtva7...@mail.gmail.com/


-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-09 Thread Jan Kara
On Fri 09-04-21 11:17:33, riteshh wrote:
> On 21/04/09 02:50AM, Wen Yang wrote:
> > > On Apr 7, 2021, at 5:16 AM, riteshh  wrote:
> > >>
> > >> On 21/04/07 03:01PM, Wen Yang wrote:
> > >>> From: Wen Yang 
> > >>>
> > >>> The kworker has occupied 100% of the CPU for several days:
> > >>> PID USER  PR  NI VIRT RES SHR S  %CPU  %MEM TIME+  COMMAND
> > >>> 68086 root 20 0  00   0   R  100.0 0.0  9718:18 kworker/u64:11
> > >>>
> > >>> And the stack obtained through sysrq is as follows:
> > >>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
> > >>> [20613144.850438] Call Trace:
> > >>> [20613144.850439] []ext4_mb_new_blocks+0x429/0x550
> > [ext4]
> > >>> [20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30
> > [ext4]
> > >>> [20613144.850441]  [] ext4_map_blocks+0x172/0x620
> > [ext4]
> > >>> [20613144.850442]  [] ext4_writepages+0x7e5/0xf00
> > [ext4]
> > >>> [20613144.850443]  [] do_writepages+0x1e/0x30
> > >>> [20613144.850444]  []
> > __writeback_single_inode+0x45/0x320
> > >>> [20613144.850444]  [] writeback_sb_inodes+0x272/0x600
> > >>> [20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
> > >>> [20613144.850445]  [] wb_writeback+0x268/0x300
> > >>> [20613144.850446]  [] wb_workfn+0xb4/0x380
> > >>> [20613144.850447]  [] process_one_work+0x189/0x420
> > >>> [20613144.850447]  [] worker_thread+0x4e/0x4b0
> > >>>
> > >>> The cpu resources of the cloud server are precious, and the server
> > >>> cannot be restarted after running for a long time, so a configuration
> > >>> parameter is added to prevent this endless loop.
> > >>
> > >> Strange, if there is a endless loop here. Then I would definitely see
> > >> if there is any accounting problem in pa->pa_count. Otherwise busy=1
> > >> should not be set everytime. ext4_mb_show_pa() function may help debug
> > this.
> > >>
> > >> If yes, then that means there always exists either a file preallocation
> > >> or a group preallocation. Maybe it is possible, in some use case.
> > >> Others may know of such use case, if any.
> >
> > > If this code is broken, then it doesn't make sense to me that we would
> > > leave it in the "run forever" state after the patch, and require a sysfs
> > > tunable to be set to have a properly working system?
> >
> > > Is there anything particularly strange about the workload/system that
> > > might cause this?  Filesystem is very full, memory is very low, etc?
> >
> > Hi Ritesh and Andreas,
> >
> > Thank you for your reply. Since there is still a faulty machine, we have
> > analyzed it again and found it is indeed a very special case:
> >
> >
> > crash> struct ext4_group_info 8813bb5f72d0
> > struct ext4_group_info {
> >   bb_state = 0,
> >   bb_free_root = {
> > rb_node = 0x0
> >   },
> >   bb_first_free = 1681,
> >   bb_free = 0,
> 
> Not related to this issue, but above two variables values doesn't looks
> consistent.
> 
> >   bb_fragments = 0,
> >   bb_largest_free_order = -1,
> >   bb_prealloc_list = {
> > next = 0x880268291d78,
> > prev = 0x880268291d78 ---> *** The list is empty
> >   },
> 
> Ok. So when you collected the dump this list was empty.

No, it is not empty. It has a single element. Note that the structure is at
8813bb5f72d0 so the pointers would have to be like 8813bb5f7xxx.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] inotify: fix minmax.cocci warnings

2021-04-07 Thread Jan Kara
On Tue 06-04-21 22:49:26, Julia Lawall wrote:
> From: kernel test robot 
> 
> Opportunity for min().
> 
> Generated by: scripts/coccinelle/misc/minmax.cocci
> 
> Fixes: 8636e3295ce3 ("coccinelle: misc: add minmax script")
> CC: Denis Efremov 
> Reported-by: kernel test robot 
> Signed-off-by: kernel test robot 
> Signed-off-by: Julia Lawall 
...
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -382,7 +382,7 @@ static int inotify_add_to_idr(struct idr
> 
>   spin_unlock(idr_lock);
>   idr_preload_end();
> - return ret < 0 ? ret : 0;
> + return min(ret, 0);
>  }

Honestly, while previous expression is a standard idiom for "if 'ret' holds
an error, return it", the new expression is harder to understand for me. So
I prefer to keep things as they are in this particular case...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: KCSAN: data-race in __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata

2021-04-06 Thread Jan Kara
On Tue 06-04-21 11:01:39, Theodore Ts'o wrote:
> On Tue, Apr 06, 2021 at 02:32:33PM +0200, Jan Kara wrote:
> > And the comment explains, why we do this unreliable check. Again, if we
> > wanted to silence KCSAN, we could use data_race() macro but AFAIU Ted isn't
> > very fond of that annotation.
> 
> I'm not fond of the data_race macro, but I like bogus KCSAN reports
> even less.  My main complaint is if we're going to have to put the
> data_race() macro in place, we're going to need to annotate each
> location with an explanation of why it's there (suppress a KCSAN false
> positive), and why's it's safe.  If it's only one or two places, it'll
> probably be fine.  If it's dozens, then I would say that KCSAN is
> becoming a net negative in terms of making the Linux kernel code
> maintainable.

OK, I'll send you patches for the two I've seen in the last month. We'll
see how many accumulate over time.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: KCSAN: data-race in __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata

2021-04-06 Thread Jan Kara
On Tue 06-04-21 21:27:48, Hao Sun wrote:
> > Thanks for report but I'm not sure what KCSAN is complaining about - isn't 
> > the report truncated?
> 
> Yes, the full KCSAN report is available in the attached log file.

Sorry, I missed that in your original email.

> Here is the report :
>  ==
> BUG: KCSAN: data-race in __jbd2_journal_file_buffer /
> jbd2_journal_dirty_metadata
> 
> write to 0x88800af6da38 of 8 bytes by task 4822 on cpu 1:
>  __jbd2_journal_file_buffer+0x18d/0x370
>  __jbd2_journal_refile_buffer+0x155/0x230
>  jbd2_journal_commit_transaction+0x24c6/0x3200
>  kjournald2+0x253/0x470
>  kthread+0x1f0/0x220
>  ret_from_fork+0x1f/0x30
> 
> read to 0x88800af6da38 of 8 bytes by task 1955 on cpu 0:
>  jbd2_journal_dirty_metadata+0x17f/0x670
>  __ext4_handle_dirty_metadata+0xc6/0x590
>  ext4_mark_iloc_dirty+0x12dd/0x16e0
>  __ext4_mark_inode_dirty+0x4d2/0x5d0
>  ext4_writepages+0x1262/0x1e50
>  do_writepages+0x7b/0x150
>  __writeback_single_inode+0x84/0x4e0
>  writeback_sb_inodes+0x69f/0x1020
>  __writeback_inodes_wb+0xb0/0x2a0
>  wb_writeback+0x290/0x650
>  wb_do_writeback+0x582/0x5d0
>  wb_workfn+0xb8/0x410
>  process_one_work+0x3e1/0x940
>  worker_thread+0x64a/0xaa0
>  kthread+0x1f0/0x220
>  ret_from_fork+0x1f/0x30
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 0 PID: 1955 Comm: kworker/u5:2 Not tainted 5.11.0+ #5
> 
> 
> Sorry, I couldn't symbolize it because the original Linux binary was lost.
> Do you think this is an actual bug?

So it is difficult to be 100% sure without knowing which particular access
caused the KCSAN warning but I'm quite confident it was caused by one of
unlocked accesses to jh->b_transaction in jbd2_journal_dirty_metadata().
And as the comments explain, these are only assertions which we redo under
proper lock if they look like they would fail. So the code is in fact
correct.

Honza

> Jan Kara  于2021年4月6日周二 下午8:32写道:
> >
> > Hello!
> >
> > On Sun 04-04-21 17:40:44, Hao Sun wrote:
> > > When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> > > the Linux kernel, I found a data-race vulnerability in
> > > __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata.
> > > Sorry, data-race is usually difficult to reproduce. I cannot provide
> > > you with a reproducing program.
> > > I hope that the call stack information in the crash log can help you
> > > locate the problem.
> > > Kernel config and full log can be found in the attachment.
> > >
> > > Here is the detailed information:
> > > commit:   3b9cdafb5358eb9f3790de2f728f765fef100731
> > > version:   linux 5.11
> > > git tree:upstream
> > > report:
> > > ==
> > > BUG: KCSAN: data-race in __jbd2_journal_file_buffer /
> > > jbd2_journal_dirty_metadata
> > > write to 0x88800af6da38 of 8 bytes by task 4822 on cpu 1:
> > >  __jbd2_journal_file_buffer+0x18d/0x370 linux/fs/jbd2/transaction.c:2518
> > >  __jbd2_journal_refile_buffer+0x155/0x230 linux/fs/jbd2/transaction.c:2612
> > >  jbd2_journal_commit_transaction+0x24c6/0x3200 linux/fs/jbd2/commit.c:1084
> > >  kjournald2+0x253/0x470 linux/fs/jbd2/journal.c:213
> > >  kthread+0x1f0/0x220 linux/kernel/kthread.c:292
> > >  ret_from_fork+0x1f/0x30 linux/arch/x86/entry/entry_64.S:294
> >
> > Thanks for report but I'm not sure what KCSAN is complaining about - isn't
> > the report truncated? I'm missing 'read' part of the report... The complaint
> > is on line:
> >
> > jh->b_transaction = transaction;
> >
> > I would guess the complaint is because of the check:
> >
> > /*
> >  * This and the following assertions are unreliable since we may 
> > see jh
> >  * in inconsistent state unless we grab bh_state lock. But this is
> >  * crucial to catch bugs so let's do a reliable check until the
> >  * lockless handling is fully proven.
> >  */
> > if (jh->b_transaction != transaction &&
> > jh->b_next_transaction != transaction) {
> >
> > And the comment explains, why we do this unreliable check. Again, if we
> > wanted to silence KCSAN, we could use data_race() macro but AFAIU Ted isn't
> > very fond of that annotation.
> >
> > Honza
> >
> > --
> > Jan Kara 
> > SUSE Labs, CR
-- 
Jan Kara 
SUSE Labs, CR


Re: KCSAN: data-race in __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata

2021-04-06 Thread Jan Kara
Hello!

On Sun 04-04-21 17:40:44, Hao Sun wrote:
> When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> the Linux kernel, I found a data-race vulnerability in
> __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata.
> Sorry, data-race is usually difficult to reproduce. I cannot provide
> you with a reproducing program.
> I hope that the call stack information in the crash log can help you
> locate the problem.
> Kernel config and full log can be found in the attachment.
> 
> Here is the detailed information:
> commit:   3b9cdafb5358eb9f3790de2f728f765fef100731
> version:   linux 5.11
> git tree:upstream
> report:
> ==
> BUG: KCSAN: data-race in __jbd2_journal_file_buffer /
> jbd2_journal_dirty_metadata
> write to 0x88800af6da38 of 8 bytes by task 4822 on cpu 1:
>  __jbd2_journal_file_buffer+0x18d/0x370 linux/fs/jbd2/transaction.c:2518
>  __jbd2_journal_refile_buffer+0x155/0x230 linux/fs/jbd2/transaction.c:2612
>  jbd2_journal_commit_transaction+0x24c6/0x3200 linux/fs/jbd2/commit.c:1084
>  kjournald2+0x253/0x470 linux/fs/jbd2/journal.c:213
>  kthread+0x1f0/0x220 linux/kernel/kthread.c:292
>  ret_from_fork+0x1f/0x30 linux/arch/x86/entry/entry_64.S:294

Thanks for report but I'm not sure what KCSAN is complaining about - isn't
the report truncated? I'm missing 'read' part of the report... The complaint
is on line:

jh->b_transaction = transaction;

I would guess the complaint is because of the check:

/*
 * This and the following assertions are unreliable since we may see jh
 * in inconsistent state unless we grab bh_state lock. But this is
 * crucial to catch bugs so let's do a reliable check until the
 * lockless handling is fully proven.
 */
if (jh->b_transaction != transaction &&
jh->b_next_transaction != transaction) {

And the comment explains, why we do this unreliable check. Again, if we
wanted to silence KCSAN, we could use data_race() macro but AFAIU Ted isn't
very fond of that annotation.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/2] ext2: Convert kmap to kmap_local_page

2021-03-31 Thread Jan Kara
Hi Ira!

On Sun 28-03-21 23:54:00, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> kmap is inefficient and can be abused so it is being phased out in favor of
> kmap_local_page where possible.
> 
> ext2 uses kmap in ext2_[get|put]_page().  All of the calls to
> ext2_[get|put]_page() occur in single threads so it is perfectly safe and
> preferable to use kmap_local_page().
> 
> This series has a clean up which matches ext2_put_page() with ext2_dotdot() 
> and
> ext2_find_entry().  Those calls use ext2_get_page() to map the page prior to
> returning it to their callers.  And they document that ext2_put_page() should
> be matched up with them.  This was the case but the ext2_put_page() calls were
> hidden in other functions.  We lift the ext2_put_page() calls to match up to
> the functions where ext2_dotdot() and ext2_find_entry() are called.
> 
> After that clean up convert ext2_[get|put]_page() to kmap and adjust for
> kunmap_local() requiring the page address.
> 
> Nesting of kmap_local_page() calls is maintained with minor changes.

Pulled into my tree. Thanks!

Honza

> 
> Ira Weiny (2):
>   ext2: Match up ext2_put_page() with ext2_dotdot() and
> ext2_find_entry()
>   fs/ext2: Replace kmap() with kmap_local_page()
> 
>  fs/ext2/dir.c   | 94 +++--
>  fs/ext2/ext2.h  | 12 ---
>  fs/ext2/namei.c | 34 +++---
>  3 files changed, 89 insertions(+), 51 deletions(-)
> 
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] jbd2: avoid -Wempty-body warnings

2021-03-30 Thread Jan Kara
On Mon 22-03-21 11:21:38, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Building with 'make W=1' shows a harmless -Wempty-body warning:
> 
> fs/jbd2/recovery.c: In function 'fc_do_one_pass':
> fs/jbd2/recovery.c:267:75: error: suggest braces around empty body in an 'if' 
> statement [-Werror=empty-body]
>   267 | jbd_debug(3, "Fast commit replay failed, err = %d\n", 
> err);
>   |   
> ^
> 
> Change the empty dprintk() macros to no_printk(), which avoids this
> warning and adds format string checking.
> 
> Signed-off-by: Arnd Bergmann 

Sure. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/jbd2.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 99d3cd051ac3..232e6285536a 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -61,7 +61,7 @@ void __jbd2_debug(int level, const char *file, const char 
> *func,
>  #define jbd_debug(n, fmt, a...) \
>   __jbd2_debug((n), __FILE__, __func__, __LINE__, (fmt), ##a)
>  #else
> -#define jbd_debug(n, fmt, a...)/**/
> +#define jbd_debug(n, fmt, a...)  no_printk(fmt, ##a)
>  #endif
>  
>  extern void *jbd2_alloc(size_t size, gfp_t flags);
> -- 
> 2.29.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [BUG] fs/notify/mark: A potential use after free in fsnotify_put_mark_wake

2021-03-29 Thread Jan Kara
Hello!


On Sun 28-03-21 17:11:43, lyl2...@mail.ustc.edu.cn wrote:
> My static analyzer tool reported a use after free in 
> fsnotify_put_mark_wake
> of the file: fs/notify/mark.c.
> 
> In fsnotify_put_mark_wake, it calls fsnotify_put_mark(mark). Inside the 
> function
> fsnotify_put_mark(), if conn is NULL, it will call 
> fsnotify_final_mark_destroy(mark)
> to free mark->group by fsnotify_put_group(group) and return. I also had 
> inspected
> the implementation of fsnotify_put_group() and found that there is no cleanup 
> operation
> about group->user_waits.
> 
> But after fsnotify_put_mark_wake() returned, mark->group is still used by 
> if (atomic_dec_and_test(&group->user_waits) && group->shutdown) and later.
> 
> Is this an issue?

I don't think this scenario is possible. fsnotify_put_mark_wake() can be
called only for marks attached to objects and these have mark->conn !=
NULL and we are sure that fsnotify_destroy_group() will wait for all such
marks to be torn down and freed before dropping last group reference and
freeing the group.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] Updated locking documentation for transaction_t

2021-03-29 Thread Jan Kara
On Fri 26-03-21 09:18:45, Alexander Lochmann wrote:
> On 11.02.21 10:30, Jan Kara wrote:
> >> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> >> index 99d3cd051ac3..18f77d9b1745 100644
> >> --- a/include/linux/jbd2.h
> >> +++ b/include/linux/jbd2.h
> >> @@ -594,18 +594,18 @@ struct transaction_s
> >> */
> >>unsigned long   t_log_start;
> >>  
> >> -  /* Number of buffers on the t_buffers list [j_list_lock] */
> >> +  /* Number of buffers on the t_buffers list [j_list_lock, no lock for 
> >> quick racy checks] */
> >>int t_nr_buffers;
> > 
> > So this case is actually somewhat different now that I audited the uses.
> > There are two types of users - commit code (fs/jbd2/commit.c) and others.
> > Other users properly use j_list_lock to access t_nr_buffers. Commit code
> > does not use any locks because committing transaction is fully in
> > ownership of the jbd2 thread and all other users need to check & wait for
> > commit to be finished before doing anything with the transaction's buffers.
> 
> I'm still trying understand how thinks work:
> Accesses to transaction_t might occur from different contexts. Thus,
> locks are necessary. If it comes to the commit phase, every other
> context has to wait until jbd2 thread has done its work. Therefore, jbd2
> thread does not need any locks to access a transaction_t (or just parts
> of it?) during commit phase.
> Is that correct?

Yes, that is correct.

> If so: I was thinking whether it make sense to ignore all memory
> accesses to a transaction_t (or parts of it) that happen in the commit
> phase. They deliberately ignore the locking policy, and would confuse
> our approach.
> 
> Is the commit phase performed by jbd2_journal_commit_transaction()?
> We would add this function to our blacklist for transaction_t.

Yes, commit phase is implemented by jbd2_journal_commit_transaction() and
the functions it calls.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: ext2_set_link()->ext2_put_page() question

2021-03-25 Thread Jan Kara
On Mon 22-03-21 17:49:48, Ira Weiny wrote:
> Jan,
> 
> Why does ext2_set_link() need to call ext2_put_page()?
> 
> I don't see any reason that we could not match up the ext2_put_page() calls
> with the ext2_find_entry().
> 
> Similarly am I missing something by moving the ext2_put_page() out of
> ext2_delete_entry()?
> 
> See below patch.

I agree that your patch improves readability. But please fixup comments
about releasing a page as well. Thanks!

> I'm in the process of changing the kmap() calls in ext2_[get|put]_page() into
> kmap_local_page() and I noticed this imbalance.  It does not really save me
> anything because I need to pass the kaddr into these calls but IMO it makes 
> the
> code a bit easier to follow.
> 
> If you agree I will get a patch together to submit with the kmap_local_page()
> patch.

OK :).

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fanotify_user: use upper_32_bits() to verify mask

2021-03-25 Thread Jan Kara
On Thu 25-03-21 09:37:43, Christian Brauner wrote:
> From: Christian Brauner 
> 
> I don't see an obvious reason why the upper 32 bit check needs to be
> open-coded this way. Switch to upper_32_bits() which is more idiomatic and
> should conceptually be the same check.
> 
> Cc: Amir Goldstein 
> Cc: Jan Kara 
> Signed-off-by: Christian Brauner 

Thanks for the cleanup. I've added it to my tree.

Honza

> ---
>  fs/notify/fanotify/fanotify_user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/notify/fanotify/fanotify_user.c 
> b/fs/notify/fanotify/fanotify_user.c
> index 9e0c1afac8bd..d5683fa9d495 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1126,7 +1126,7 @@ static int do_fanotify_mark(int fanotify_fd, unsigned 
> int flags, __u64 mask,
>__func__, fanotify_fd, flags, dfd, pathname, mask);
>  
>   /* we only use the lower 32 bits as of right now. */
> - if (mask & ((__u64)0x << 32))
> + if (upper_32_bits(mask))
>   return -EINVAL;
>  
>   if (flags & ~FANOTIFY_MARK_FLAGS)
> 
> base-commit: 0d02ec6b3136c73c09e7859f0d0e4e2c4c07b49b
> -- 
> 2.27.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH AUTOSEL 5.11 03/44] ext4: add reclaim checks to xattr code

2021-03-25 Thread Jan Kara
Sasha, just be aware that this commit was added to help tracking down a
particular syzbot report. As such there's no point in carrying it in
-stable but there's no big harm either... Just one patch more.

Honza

On Thu 25-03-21 07:24:18, Sasha Levin wrote:
> From: Jan Kara 
> 
> [ Upstream commit 163f0ec1df33cf468509ff38cbcbb5eb0d7fac60 ]
> 
> Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while
> the transaction is started like:
> 
>   fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
>   might_alloc include/linux/sched/mm.h:193 [inline]
>   slab_pre_alloc_hook mm/slab.h:493 [inline]
>   slab_alloc_node mm/slub.c:2817 [inline]
>   __kmalloc_node+0x5f/0x430 mm/slub.c:4015
>   kmalloc_node include/linux/slab.h:575 [inline]
>   kvmalloc_node+0x61/0xf0 mm/util.c:587
>   kvmalloc include/linux/mm.h:781 [inline]
>   ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
>   ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
>   ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
>   ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
>   ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
>   ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
> 
> This should be impossible since transaction start sets PF_MEMALLOC_NOFS.
> Add some assertions to the code to catch if something isn't working as
> expected early.
> 
> Link: 
> https://lore.kernel.org/linux-ext4/563a0205bafb7...@google.com/
> Signed-off-by: Jan Kara 
> Link: https://lore.kernel.org/r/20210222171626.21884-1-j...@suse.cz
> Signed-off-by: Theodore Ts'o 
> Signed-off-by: Sasha Levin 
> ---
>  fs/ext4/xattr.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 372208500f4e..083c95126781 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1462,6 +1462,9 @@ ext4_xattr_inode_cache_find(struct inode *inode, const 
> void *value,
>   if (!ce)
>   return NULL;
>  
> + WARN_ON_ONCE(ext4_handle_valid(journal_current_handle()) &&
> +  !(current->flags & PF_MEMALLOC_NOFS));
> +
>   ea_data = kvmalloc(value_len, GFP_KERNEL);
>   if (!ea_data) {
>   mb_cache_entry_put(ea_inode_cache, ce);
> @@ -2327,6 +2330,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode 
> *inode, int name_index,
>   error = -ENOSPC;
>   goto cleanup;
>   }
> + WARN_ON_ONCE(!(current->flags & PF_MEMALLOC_NOFS));
>   }
>  
>   error = ext4_reserve_inode_write(handle, inode, &is.iloc);
> -- 
> 2.30.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Updated locking documentation for journal_head

2021-03-19 Thread Jan Kara
On Fri 19-03-21 10:18:17, Alexander Lochmann wrote:
> LockDoc's results show that t_list_lock has been
> replaced by j_list_lock for b_next_transaction,
> b_tnext, and b_tprev.
> We, therefore, updated the documentation
> accordingly.
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 

Yeah, I think that was a typo since the beginning. Thanks for the fix. Feel
free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/journal-head.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
> index 75bc56109031..d68ae72784eb 100644
> --- a/include/linux/journal-head.h
> +++ b/include/linux/journal-head.h
> @@ -80,13 +80,13 @@ struct journal_head {
>* Pointer to the running compound transaction which is currently
>* modifying the buffer's metadata, if there was already a transaction
>* committing it when the new transaction touched it.
> -  * [t_list_lock] [b_state_lock]
> +  * [j_list_lock] [b_state_lock]
>*/
>   transaction_t *b_next_transaction;
>  
>   /*
>* Doubly-linked list of buffers on a transaction's data, metadata or
> -  * forget queue. [t_list_lock] [b_state_lock]
> +  * forget queue. [j_list_lock] [b_state_lock]
>*/
>   struct journal_head *b_tnext, *b_tprev;
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/ext2/: fix misspellings using codespell tool

2021-03-19 Thread Jan Kara
On Thu 18-03-21 17:31:31, menglong8.d...@gmail.com wrote:
> From: Liu xuzhi 
> 
> A typo is found out by codespell tool in 1107th lines of super.c:
> 
> $ codespell ./fs/ext2/
> ./super.c:1107: fileystem  ==> filesystem
> 
> Fix a typo found by codespell.
> 
> Signed-off-by: Liu xuzhi 

Thanks. I've merged the patch to my tree.

Honza

> ---
>  fs/ext2/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 6c4753277916..d2fd9707e953 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -1104,7 +1104,7 @@ static int ext2_fill_super(struct super_block *sb, void 
> *data, int silent)
>   get_random_bytes(&sbi->s_next_generation, sizeof(u32));
>   spin_lock_init(&sbi->s_next_gen_lock);
>  
> - /* per fileystem reservation list head & lock */
> + /* per filesystem reservation list head & lock */
>   spin_lock_init(&sbi->s_rsv_window_lock);
>   sbi->s_rsv_window_root = RB_ROOT;
>   /*
> -- 
> 2.25.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: linux-next: Tree for Mar 17

2021-03-17 Thread Jan Kara
On Wed 17-03-21 13:00:45, Heiko Carstens wrote:
> On Wed, Mar 17, 2021 at 07:42:41PM +1100, Stephen Rothwell wrote:
> > Hi all,
> > 
> > News: there will be no linux-next release on Friday this week.
> > 
> > Warning: Some of the branches in linux-next are still based on v5.12-rc1,
> > so please be careful if you are trying to bisect a bug.
> > 
> > News: if your -next included tree is based on Linus' tree tag
> > v5.12-rc1{,-dontuse} (or somewhere between v5.11 and that tag), please
> > consider rebasing it onto v5.12-rc2. Also, please check any branches
> > merged into your branch.
> > 
> > Changes since 20210316:
> > 
> > New tree: cifsd
> > 
> > The cifsd tree gained a build failure for which I applied a patch.
> > 
> > The drm-intel tree gained a conflict against the drm tree.
> > 
> > The tip tree gained a build failure so I used the version from
> > next-20210316.
> > 
> > The rcu tree gained a build failure so I used the version from
> > next-20210316.
> > 
> > Non-merge commits (relative to Linus' tree): 4404
> >  4125 files changed, 288074 insertions(+), 79674 deletions(-)
> > 
> > 
> > 
> > I have created today's linux-next tree at
> > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> > (patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
> 
> This does unfortunately not compile on s390 due to commit 72dd1ce7ebd3
> ("quota: wire up quotactl_path").

Thanks for the fix and sorry for inconvenience. Fixed up patch pushed out
so tomorrow's next tree should be fine.

Honza

> 
> The patch below fixes it:
> 
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl 
> b/arch/s390/kernel/syscalls/syscall.tbl
> index 4aeaa89fa774..a421905c36e8 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -445,4 +445,4 @@
>  440  common  process_madvise sys_process_madvise 
> sys_process_madvise
>  441  common  epoll_pwait2sys_epoll_pwait2        
> compat_sys_epoll_pwait2
>  442  common  mount_setattr   sys_mount_setattr   
> sys_mount_setattr
> -443  common  quotactl_path   sys_quotactl_path
> +443  common  quotactl_path   sys_quotactl_path   
> sys_quotactl_path
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC] inode.i_opflags - Usage of two different locking schemes

2021-03-16 Thread Jan Kara
On Mon 08-03-21 15:05:33, Alexander Lochmann wrote:
> On 05.03.21 17:04, Theodore Ts'o wrote:
> > On Fri, Mar 05, 2021 at 04:35:47PM +0100, Alexander Lochmann wrote:
> > > 
> > > 
> > > On 05.03.21 16:18, Theodore Ts'o wrote:
> > > > 1)  I don't see where i_opflags is being read in ipc/mqueue.c at all,
> > > > either with or without i_rwsem.
> > > > 
> > > It is read in fs/dcache.c
> > 
> > So why is this unique to the mqueue inode then?  It might be helpful
> > to have explicit call stacks in the e-mail, in text form, when you
> > resend to LKML.
> It is unique to mqeue inode, because the control flow goes through
> ipc/mqueue.c where almost always the i_rwsem is taken.
> Hence, we see more memory accesses to an mqueue inode with the i_rwsem.
> The i_lock is less often hold compared to the i_rwsem.
> We conclude the i_rwsem is needed. So it might not be a contradiction at
> all. It rather could be a flaw in our approach. :-/
> 
> Besides from our current discussion:
> Does the i_lock protect i_opflags for both reading and writing?

So i_lock is supposed to protect i_opflags for writing AFAICT. For reading
we don't seem to bother in some cases and I agree that is potentially
problematic. It is *mostly* OK because we initialize i_opflags when loading
inode into memory / adding it to dcache. But sometimes we also update them
while the inode is alive. Now this is fine for the particular flag we
update but in theory, if the compiler wants to screw us and stores
temporarily some nonsensical value in i_opflags we'd have a problem. This
is mostly a theoretical issue but eventually we probably want to fix this.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 0/2] quota: Add mountpath based quota support

2021-03-16 Thread Jan Kara
On Thu 04-03-21 13:35:38, Sascha Hauer wrote:
> Current quotactl syscall uses a path to a block device to specify the
> filesystem to work on which makes it unsuitable for filesystems that
> do not have a block device. This series adds a new syscall quotactl_path()
> which replaces the path to the block device with a mountpath, but otherwise
> behaves like original quotactl.
> 
> This is done to add quota support to UBIFS. UBIFS quota support has been
> posted several times with different approaches to put the mountpath into
> the existing quotactl() syscall until it has been suggested to make it a
> new syscall instead, so here it is.
> 
> I'm not posting the full UBIFS quota series here as it remains unchanged
> and I'd like to get feedback to the new syscall first. For those interested
> the most recent series can be found here: https://lwn.net/Articles/810463/

Thanks. I've merged the two patches into my tree and will push them to
Linus for the next merge window.

Honza

> 
> Changes since v2:
> - Rebase on v5.12-rc1
> - replace mountpath.dentry->d_inode->i_sb with mountpath.mnt->mnt_sb
> - fix wrong macro usage in arch/x86/entry/syscalls/syscall_32.tbl
> - +Cc linux-...@vger.kernel.org
> 
> Changes since (implicit) v1:
> - Ignore second path argument to Q_QUOTAON. With this quotactl_path() can
>   only do the Q_QUOTAON operation on filesystems which use hidden inodes
>   for quota metadata storage
> - Drop unnecessary quotactl_cmd_onoff() check
> 
> Sascha Hauer (2):
>   quota: Add mountpath based quota support
>   quota: wire up quotactl_path
> 
>  arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
>  arch/arm/tools/syscall.tbl  |  1 +
>  arch/arm64/include/asm/unistd.h |  2 +-
>  arch/arm64/include/asm/unistd32.h   |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
>  arch/s390/kernel/syscalls/syscall.tbl   |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
>  fs/quota/quota.c| 49 +++--
>  include/linux/syscalls.h|  2 +
>  include/uapi/asm-generic/unistd.h   |  4 +-
>  kernel/sys_ni.c |  1 +
>  22 files changed, 71 insertions(+), 5 deletions(-)
> 
> -- 
> 2.29.2
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [syzbot] WARNING: ODEBUG bug in ext4_fill_super (3)

2021-03-15 Thread Jan Kara
On Sat 13-03-21 20:08:14, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:28806e4d Merge tag 'media/v5.12-2' of git://git.kernel.org..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=136d1bbcd0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6bcf96204c1b8e77
> dashboard link: https://syzkaller.appspot.com/bug?extid=628472a2aac693ab0fcd
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1133abfad0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1191aab2d0
> 
> The issue was bisected to:
> 
> commit 2d01ddc86606564fb08c56e3bc93a0693895f710
> Author: Jan Kara 
> Date:   Wed Dec 16 10:18:40 2020 +
> 
> ext4: save error info to sb through journal if available

After some head scratching I think [1] should fix the problem.

[1] https://lore.kernel.org/linux-ext4/20210315165906.2175-1-j...@suse.cz

Honza

> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=152b9d56d0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=172b9d56d0
> console output: https://syzkaller.appspot.com/x/log.txt?x=132b9d56d0
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+628472a2aac693ab0...@syzkaller.appspotmail.com
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if 
> available")
> 
> ODEBUG: free active (active state 0) object type: timer_list hint: 
> print_daily_error_info+0x0/0x1f0 fs/ext4/super.c:1334
> WARNING: CPU: 1 PID: 12723 at lib/debugobjects.c:505 
> debug_print_object+0x16e/0x250 lib/debugobjects.c:505
> Modules linked in:
> CPU: 0 PID: 12723 Comm: syz-executor932 Not tainted 5.12.0-rc2-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
> Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd a0 
> 06 c2 89 4c 89 ee 48 c7 c7 a0 fa c1 89 e8 fc 41 f8 04 <0f> 0b 83 05 05 7e fb 
> 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
> RSP: 0018:c9000e6ef980 EFLAGS: 00010286
> 
> RAX:  RBX: 0003 RCX: 
> RDX: 88801d5e9bc0 RSI: 815c0d25 RDI: f52001cddf22
> RBP: 0001 R08:  R09: 
> R10: 815b9abe R11:  R12: 896d7da0
> R13: 89c200e0 R14: 81629d00 R15: dc00
> FS:  00f93300() GS:8880b9c0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fcda3ec CR3: 155a1000 CR4: 00350ef0
> Call Trace:
>  __debug_check_no_obj_freed lib/debugobjects.c:987 [inline]
>  debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1018
>  slab_free_hook mm/slub.c:1554 [inline]
>  slab_free_freelist_hook+0x147/0x210 mm/slub.c:1600
>  slab_free mm/slub.c:3161 [inline]
>  kfree+0xe5/0x7f0 mm/slub.c:4213
>  ext4_fill_super+0x84f/0xded0 fs/ext4/super.c:5182
>  mount_bdev+0x34d/0x410 fs/super.c:1367
>  legacy_get_tree+0x105/0x220 fs/fs_context.c:592
>  vfs_get_tree+0x89/0x2f0 fs/super.c:1497
>  do_new_mount fs/namespace.c:2903 [inline]
>  path_mount+0x132a/0x1f90 fs/namespace.c:3233
>  do_mount fs/namespace.c:3246 [inline]
>  __do_sys_mount fs/namespace.c:3454 [inline]
>  __se_sys_mount fs/namespace.c:3431 [inline]
>  __x64_sys_mount+0x27f/0x300 fs/namespace.c:3431
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x445c0a
> Code: 48 c7 c2 c0 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff eb d2 e8 a8 00 00 00 
> 0f 1f 84 00 00 00 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7ffe8bf4c3b8 EFLAGS: 0202
>  ORIG_RAX: 00a5
> RAX: ffda RBX: 7ffe8bf4c410 RCX: 00445c0a
> RDX: 2040 RSI: 2100 RDI: 7ffe8bf4c3d0
> RBP: 7ffe8bf4c3d0 R08: 7ffe8bf4c410 R09: 
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches
-- 
Jan Kara 
SUSE Labs, CR


Re: [syzbot] KCSAN: data-race in start_this_handle / start_this_handle

2021-03-11 Thread Jan Kara
On Thu 11-03-21 02:59:14, syzbot wrote:
> HEAD commit:a74e6a01 Merge tag 's390-5.12-3' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=159f69ecd0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=de394bbaade74fb7
> dashboard link: https://syzkaller.appspot.com/bug?extid=30774a6acf6a2cf6d535
> compiler:   Debian clang version 11.0.1-2
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+30774a6acf6a2cf6d...@syzkaller.appspotmail.com
> 
> ==
> BUG: KCSAN: data-race in start_this_handle / start_this_handle
> 
> write to 0x888103880870 of 8 bytes by task 29956 on cpu 1:
>  jbd2_get_transaction fs/jbd2/transaction.c:125 [inline]
>  start_this_handle+0xceb/0x1010 fs/jbd2/transaction.c:400
>  jbd2__journal_start+0x1fc/0x3f0 fs/jbd2/transaction.c:503
>  __ext4_journal_start_sb+0x159/0x310 fs/ext4/ext4_jbd2.c:105
>  __ext4_journal_start fs/ext4/ext4_jbd2.h:320 [inline]
>  ext4_da_write_begin+0x460/0xaf0 fs/ext4/inode.c:2998
>  generic_perform_write+0x196/0x3a0 mm/filemap.c:3575
>  ext4_buffered_write_iter+0x2e5/0x3e0 fs/ext4/file.c:269
>  ext4_file_write_iter+0x48a/0x10b0 fs/ext4/file.c:502
>  call_write_iter include/linux/fs.h:1977 [inline]
>  do_iter_readv_writev+0x2cb/0x360 fs/read_write.c:740
>  do_iter_write+0x112/0x4c0 fs/read_write.c:866
>  vfs_iter_write+0x4c/0x70 fs/read_write.c:907
>  iter_file_splice_write+0x40a/0x750 fs/splice.c:689
>  do_splice_from fs/splice.c:767 [inline]
>  direct_splice_actor+0x80/0xa0 fs/splice.c:936
>  splice_direct_to_actor+0x345/0x650 fs/splice.c:891
>  do_splice_direct+0xf5/0x170 fs/splice.c:979
>  do_sendfile+0x7a6/0xe20 fs/read_write.c:1260
>  __do_sys_sendfile64 fs/read_write.c:1319 [inline]
>  __se_sys_sendfile64 fs/read_write.c:1311 [inline]
>  __x64_sys_sendfile64+0xa9/0x130 fs/read_write.c:1311
>  do_syscall_64+0x39/0x80 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> read to 0x888103880870 of 8 bytes by task 29936 on cpu 0:
>  start_this_handle+0x1c1/0x1010 fs/jbd2/transaction.c:352
>  jbd2__journal_start+0x1fc/0x3f0 fs/jbd2/transaction.c:503
>  __ext4_journal_start_sb+0x159/0x310 fs/ext4/ext4_jbd2.c:105
>  __ext4_journal_start fs/ext4/ext4_jbd2.h:320 [inline]
>  ext4_da_write_begin+0x460/0xaf0 fs/ext4/inode.c:2998
>  generic_perform_write+0x196/0x3a0 mm/filemap.c:3575
>  ext4_buffered_write_iter+0x2e5/0x3e0 fs/ext4/file.c:269
>  ext4_file_write_iter+0x48a/0x10b0 fs/ext4/file.c:502
>  call_write_iter include/linux/fs.h:1977 [inline]
>  do_iter_readv_writev+0x2cb/0x360 fs/read_write.c:740
>  do_iter_write+0x112/0x4c0 fs/read_write.c:866
>  vfs_iter_write+0x4c/0x70 fs/read_write.c:907
>  iter_file_splice_write+0x40a/0x750 fs/splice.c:689
>  do_splice_from fs/splice.c:767 [inline]
>  direct_splice_actor+0x80/0xa0 fs/splice.c:936
>  splice_direct_to_actor+0x345/0x650 fs/splice.c:891
>  do_splice_direct+0xf5/0x170 fs/splice.c:979
>  do_sendfile+0x7a6/0xe20 fs/read_write.c:1260
>  __do_sys_sendfile64 fs/read_write.c:1319 [inline]
>  __se_sys_sendfile64 fs/read_write.c:1311 [inline]
>  __x64_sys_sendfile64+0xa9/0x130 fs/read_write.c:1311
>  do_syscall_64+0x39/0x80 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 0 PID: 29936 Comm: syz-executor.5 Not tainted 5.12.0-rc2-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> ==

So this case is harmless. start_this_handle() does indeed check
journal->j_running_transaction without any protection and this is only a
racy check to opportunistically preallocate a transaction if we are likely
to need it. There was some macro to instruct KCSAN that the read is
actually fine, wasn't there?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC] Better page cache error handling

2021-02-24 Thread Jan Kara
On Wed 24-02-21 13:41:15, Matthew Wilcox wrote:
> On Wed, Feb 24, 2021 at 01:38:48PM +0100, Jan Kara wrote:
> > > We allocate a page and try to read it.  29 threads pile up waiting
> > > for the page lock in filemap_update_page().  The error returned by the
> > > original I/O is shared between all 29 waiters as well as being returned
> > > to the requesting thread.  The next request for index.html will send
> > > another I/O, and more waiters will pile up trying to get the page lock,
> > > but at no time will more than 30 threads be waiting for the I/O to fail.
> > 
> > Interesting idea. It certainly improves current behavior. I just wonder
> > whether this isn't a partial solution to a problem and a full solution of
> > it would have to go in a different direction? I mean it just seems
> > wrong that each reader (let's assume they just won't overlap) has to retry
> > the failed IO and wait for the HW to figure out it's not going to work.
> > Shouldn't we cache the error state with the page? And I understand that we
> > then also have to deal with the problem how to invalidate the error state
> > when the block might eventually become readable (for stuff like temporary
> > IO failures). That would need some signalling from the driver to the page
> > cache, maybe in a form of some error recovery sequence counter or something
> > like that. For stuff like iSCSI, multipath, or NBD it could be doable I
> > believe...
> 
> That felt like a larger change than I wanted to make.  I already have
> a few big projects on my plate!

I can understand that ;)

> Also, it's not clear to me that the host can necessarily figure out when
> a device has fixed an error -- certainly for the three cases you list
> it can be done.  I think we'd want a timer to indicate that it's worth
> retrying instead of returning the error.
> 
> Anyway, that seems like a lot of data to cram into a struct page.  So I
> think my proposal is still worth pursuing while waiting for someone to
> come up with a perfect solution.

Yes, timer could be a fallback. Or we could just schedule work to discard
all 'error' pages in the fs in an hour or so. Not perfect but more or less
workable I'd say. Also I don't think we need to cram this directly into
struct page - I think it is perfectly fine to kmalloc() structure we need
for caching if we hit error and just don't cache if the allocation fails.
Then we might just reference it from appropriate place... didn't put too
much thought to this...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in evict

2021-02-24 Thread Jan Kara
Hi!

On Sat 13-02-21 02:38:18, syzbot wrote:
> syzbot found the following issue on:
> 
> HEAD commit:c6d8570e Merge tag 'io_uring-5.11-2021-02-12' of git://git..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=123a4be2d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=bec717fd4ac4bf03
> dashboard link: https://syzkaller.appspot.com/bug?extid=1b2c6989ec12e467d65c
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+1b2c6989ec12e467d...@syzkaller.appspotmail.com
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.11.0-rc7-syzkaller #0 Not tainted
> --
> kswapd0/2232 is trying to acquire lock:
> 88801f552650 (sb_internal){.+.+}-{0:0}, at: evict+0x2ed/0x6b0 
> fs/inode.c:577
> 
> but task is already holding lock:
> 8be89240 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 
> mm/page_alloc.c:5195
> 
> which lock already depends on the new lock.

So this is an interesting problem. The stacktrace below shows that we can
end up with inode reclaim deleting inode. It likely happens like:

CPU1CPU2
prune_icache_sb()
  ...
  inode_lru_isolate()
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
  __iget(inode);
  ...
unlink(inode);
d_delete(dentry);
  ...
  iput(inode)
...going to delete inode...
  
And this introduces interesting lock dependency with filesystem freezing -
fs reclaim can block on filesystem being frozen. That inherently means that
anything between freeze_super() and thaw_super() that enters fs reclaim is
a potential deadlock. But among lot of kernel code in there, there's also
userspace running inbetween those so we have no sane way to avoid entering
fs reclaim there.

So I belive the only sane way of avoiding this deadlock is to really
avoiding deleting inode from fs reclaim. And the best idea how to achieve
that I have is to simply avoid the 'inode_has_buffers(inode) ||
inode->i_data.nrpages' branch if we are running in direct reclaim. Any
better idea?

> stack backtrace:
> CPU: 3 PID: 2232 Comm: kswapd0 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  check_noncircular+0x25f/0x2e0 kernel/locking/lockdep.c:2117
>  check_prev_add kernel/locking/lockdep.c:2868 [inline]
>  check_prevs_add kernel/locking/lockdep.c:2993 [inline]
>  validate_chain kernel/locking/lockdep.c:3608 [inline]
>  __lock_acquire+0x2b26/0x54f0 kernel/locking/lockdep.c:4832
>  lock_acquire kernel/locking/lockdep.c:5442 [inline]
>  lock_acquire+0x1a8/0x720 kernel/locking/lockdep.c:5407
>  percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
>  __sb_start_write include/linux/fs.h:1592 [inline]
>  sb_start_intwrite include/linux/fs.h:1709 [inline]
>  ext4_evict_inode+0xe6f/0x1940 fs/ext4/inode.c:241
>  evict+0x2ed/0x6b0 fs/inode.c:577
>  iput_final fs/inode.c:1653 [inline]
>  iput.part.0+0x57e/0x810 fs/inode.c:1679
>  iput fs/inode.c:1669 [inline]
>  inode_lru_isolate+0x301/0x4f0 fs/inode.c:778
>  __list_lru_walk_one+0x178/0x5c0 mm/list_lru.c:222
>  list_lru_walk_one+0x99/0xd0 mm/list_lru.c:266
>  list_lru_shrink_walk include/linux/list_lru.h:195 [inline]
>  prune_icache_sb+0xdc/0x140 fs/inode.c:803
>  super_cache_scan+0x38d/0x590 fs/super.c:107
>  do_shrink_slab+0x3e4/0x9f0 mm/vmscan.c:511
>  shrink_slab+0x16f/0x5d0 mm/vmscan.c:672
>  shrink_node_memcgs mm/vmscan.c:2665 [inline]
>  shrink_node+0x8cc/0x1de0 mm/vmscan.c:2780
>  kswapd_shrink_node mm/vmscan.c:3523 [inline]
>  balance_pgdat+0x745/0x1270 mm/vmscan.c:3681
>  kswapd+0x5b1/0xdb0 mm/vmscan.c:3938
>  kthread+0x3b1/0x4a0 kernel/kthread.c:292
>  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC] Better page cache error handling

2021-02-24 Thread Jan Kara
err;
>   smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN);
>   wake_up_state(wait->private, mode);
>  
> @@ -1094,7 +1095,7 @@ static int wake_page_function(wait_queue_entry_t *wait, 
> unsigned mode, int sync,
>   return (flags & WQ_FLAG_EXCLUSIVE) != 0;
>  }
>  
> -static void wake_up_page_bit(struct page *page, int bit_nr)
> +static void wake_up_page_bit(struct page *page, int bit_nr, int err)
>  {
>   wait_queue_head_t *q = page_waitqueue(page);
>   struct wait_page_key key;
> @@ -1103,6 +1104,7 @@ static void wake_up_page_bit(struct page *page, int 
> bit_nr)
>  
>   key.page = page;
>   key.bit_nr = bit_nr;
> + key.err = err;
>   key.page_match = 0;
>  
>   bookmark.flags = 0;
> @@ -1152,7 +1154,7 @@ static void wake_up_page(struct page *page, int bit)
>  {
>   if (!PageWaiters(page))
>   return;
> - wake_up_page_bit(page, bit);
> + wake_up_page_bit(page, bit, 0);
>  }
>  
>  /*
> @@ -1214,6 +1216,7 @@ static inline int 
> wait_on_page_bit_common(wait_queue_head_t *q,
>   wait->func = wake_page_function;
>   wait_page.page = page;
>   wait_page.bit_nr = bit_nr;
> + wait_page.err = 0;
>  
>  repeat:
>   wait->flags = 0;
> @@ -1325,8 +1328,10 @@ static inline int 
> wait_on_page_bit_common(wait_queue_head_t *q,
>*/
>   if (behavior == EXCLUSIVE)
>   return wait->flags & WQ_FLAG_DONE ? 0 : -EINTR;
> + if (behavior != DROP)
> + wait_page.err = 0;
>  
> - return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
> + return wait->flags & WQ_FLAG_WOKEN ? wait_page.err : -EINTR;
>  }
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
> @@ -1408,8 +1413,9 @@ static inline bool 
> clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
>  #endif
>  
>  /**
> - * unlock_page - unlock a locked page
> + * unlock_page_err - unlock a locked page
>   * @page: the page
> + * @err: errno to be communicated to waiters
>   *
>   * Unlocks the page and wakes up sleepers in wait_on_page_locked().
>   * Also wakes sleepers in wait_on_page_writeback() because the wakeup
> @@ -1422,13 +1428,19 @@ static inline bool 
> clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
>   * portably (architectures that do LL/SC can test any bit, while x86 can
>   * test the sign bit).
>   */
> -void unlock_page(struct page *page)
> +void unlock_page_err(struct page *page, int err)
>  {
>   BUILD_BUG_ON(PG_waiters != 7);
>   page = compound_head(page);
>   VM_BUG_ON_PAGE(!PageLocked(page), page);
>   if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
> - wake_up_page_bit(page, PG_locked);
> + wake_up_page_bit(page, PG_locked, err);
> +}
> +EXPORT_SYMBOL(unlock_page_err);
> +
> +void unlock_page(struct page *page)
> +{
> + unlock_page_err(page, 0);
>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> @@ -1446,7 +1458,7 @@ void unlock_page_fscache(struct page *page)
>   page = compound_head(page);
>   VM_BUG_ON_PAGE(!PagePrivate2(page), page);
>   clear_bit_unlock(PG_fscache, &page->flags);
> - wake_up_page_bit(page, PG_fscache);
> + wake_up_page_bit(page, PG_fscache, 0);
>  }
>  EXPORT_SYMBOL(unlock_page_fscache);
>  
> @@ -2298,8 +2310,11 @@ static int filemap_update_page(struct kiocb *iocb,
>   if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
>   return -EAGAIN;
>   if (!(iocb->ki_flags & IOCB_WAITQ)) {
> - put_and_wait_on_page_locked(page, TASK_KILLABLE);
> - return AOP_TRUNCATED_PAGE;
> + error = put_and_wait_on_page_locked(page,
> + TASK_KILLABLE);
> + if (!error)
> + return AOP_TRUNCATED_PAGE;
> + return error;
>   }
>   error = __lock_page_async(page, iocb->ki_waitq);
>   if (error)
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] fs/ext4: fix integer overflow in s_log_groups_per_flex

2021-02-24 Thread Jan Kara
On Wed 24-02-21 15:58:00, Sabyrzhan Tasbolatov wrote:
> syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
> 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
> where sbi->s_mb_prefetch is unsigned integer type.
> 
> 32 is the maximum allowed power of s_log_groups_per_flex. Following if
> check will also trigger UBSAN shift-out-of-bound:
> 
> if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {
> 
> So I'm checking it against the raw number, perhaps there is another way
> to calculate UINT_MAX max power. Also use min_t as to make sure it's
> uint type.
> 
> [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
> shift exponent 60 is too large for 32-bit type 'int'
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x137/0x1be lib/dump_stack.c:120
>  ubsan_epilogue lib/ubsan.c:148 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
>  ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
>  ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
>  ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983
> 
> Reported-by: syzbot+a8b4b0c60155e87e9...@syzkaller.appspotmail.com
> Signed-off-by: Sabyrzhan Tasbolatov 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> v2: updated > 32 condition to >= 32
> 
> > > + if (sbi->s_es->s_log_groups_per_flex > 32) {
> > ^^ >= 32?
> > 
> > Otherwise the patch looks good.
> > 
> 
> Thanks! Updated to >= 32 condition.
> ---
>  fs/ext4/mballoc.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 99bf091fee10..a02fadf4fc84 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2709,8 +2709,15 @@ static int ext4_mb_init_backend(struct super_block *sb)
>   }
>  
>   if (ext4_has_feature_flex_bg(sb)) {
> - /* a single flex group is supposed to be read by a single IO */
> - sbi->s_mb_prefetch = min(1 << sbi->s_es->s_log_groups_per_flex,
> + /* a single flex group is supposed to be read by a single IO.
> +  * 2 ^ s_log_groups_per_flex != UINT_MAX as s_mb_prefetch is
> +  * unsigned integer, so the maximum shift is 32.
> +  */
> + if (sbi->s_es->s_log_groups_per_flex >= 32) {
> + ext4_msg(sb, KERN_ERR, "too many log groups per 
> flexible block group");
> + goto err_freesgi;
> +         }
> + sbi->s_mb_prefetch = min_t(uint, 1 << 
> sbi->s_es->s_log_groups_per_flex,
>   BLK_MAX_SEGMENT_SIZE >> (sb->s_blocksize_bits - 9));
>   sbi->s_mb_prefetch *= 8; /* 8 prefetch IOs in flight at most */
>   } else {
> -- 
> 2.25.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/ext4: fix integer overflow in s_log_groups_per_flex

2021-02-23 Thread Jan Kara
On Wed 03-02-21 19:43:51, Sabyrzhan Tasbolatov wrote:
> syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
> 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
> where sbi->s_mb_prefetch is unsigned integer type.
> 
> 32 is the maximum allowed power of s_log_groups_per_flex. Following if
> check will also trigger UBSAN shift-out-of-bound:
> 
> if (1 << sbi->s_es->s_log_groups_per_flex > UINT_MAX) {
> 
> So I'm checking it against the raw number, perhaps there is another way
> to calculate UINT_MAX max power. Also use min_t as to make sure it's
> uint type.
> 
> [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
> shift exponent 60 is too large for 32-bit type 'int'
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x137/0x1be lib/dump_stack.c:120
>  ubsan_epilogue lib/ubsan.c:148 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
>  ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
>  ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
>  ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983
> 
> Reported-by: syzbot+a8b4b0c60155e87e9...@syzkaller.appspotmail.com
> Signed-off-by: Sabyrzhan Tasbolatov 
> ---
>  fs/ext4/mballoc.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 99bf091fee10..e1e7ffbba1a6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2709,8 +2709,15 @@ static int ext4_mb_init_backend(struct super_block *sb)
>   }
>  
>   if (ext4_has_feature_flex_bg(sb)) {
> - /* a single flex group is supposed to be read by a single IO */
> - sbi->s_mb_prefetch = min(1 << sbi->s_es->s_log_groups_per_flex,
> + /* a single flex group is supposed to be read by a single IO.
> +  * 2 ^ s_log_groups_per_flex != UINT_MAX as s_mb_prefetch is
> +  * unsigned integer, so the maximum shift is 32.
> +  */
> + if (sbi->s_es->s_log_groups_per_flex > 32) {
^^ >= 32?

Otherwise the patch looks good.

Honza


> + ext4_msg(sb, KERN_ERR, "too many log groups per 
> flexible block group");
> + goto err_freesgi;
> + }
> + sbi->s_mb_prefetch = min_t(uint, 1 << 
> sbi->s_es->s_log_groups_per_flex,
>   BLK_MAX_SEGMENT_SIZE >> (sb->s_blocksize_bits - 9));
>   sbi->s_mb_prefetch *= 8; /* 8 prefetch IOs in flight at most */
>   } else {
> -- 
> 2.25.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in start_this_handle (2)

2021-02-15 Thread Jan Kara
On Mon 15-02-21 23:06:15, Tetsuo Handa wrote:
> On 2021/02/15 21:45, Jan Kara wrote:
> > On Sat 13-02-21 23:26:37, Tetsuo Handa wrote:
> >> Excuse me, but it seems to me that nothing prevents
> >> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create()
> >> without memalloc_nofs_save() when hitting ext4_get_nojournal() path.
> >> Will you explain when ext4_get_nojournal() path is executed?
> > 
> > That's a good question but sadly I don't think that's it.
> > ext4_get_nojournal() is called when the filesystem is created without a
> > journal. In that case we also don't acquire jbd2_handle lockdep map. In the
> > syzbot report we can see:
> 
> Since syzbot can test filesystem images, syzbot might have tested a filesystem
> image created both with and without journal within this boot.

a) I think that syzbot reboots the VM between executing different tests to
get reproducible conditions. But in theory I agree the test may have
contained one image with and one image without a journal.

*but*

b) as I wrote in the email you are replying to, the jbd2_handle key is
private per filesystem. Thus for lockdep to complain about
jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
maps must come from the same filesystem.

*and*

c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
for such filesystems lockdep creates no dependency for jbd2_handle map.

Honza

> 
> > 
> > kswapd0/2246 is trying to acquire lock:
> > 888041a988e0 (jbd2_handle){}-{0:0}, at: 
> > start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444
> > 
> > but task is already holding lock:
> > 8be892c0 (fs_reclaim){+.+.}-{0:0}, at: 
> > __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195
> > 
> > So this filesystem has very clearly been created with a journal. Also the
> > journal lockdep tracking machinery uses:
> 
> While locks held by kswapd0/2246 are fs_reclaim, shrinker_rwsem, 
> &type->s_umount_key#38
> and jbd2_handle, isn't the dependency lockdep considers problematic is
> 
>   Chain exists of:
> jbd2_handle --> &ei->xattr_sem --> fs_reclaim
> 
>Possible unsafe locking scenario:
> 
>  CPU0CPU1
>  
> lock(fs_reclaim);
>  lock(&ei->xattr_sem);
>  lock(fs_reclaim);
> lock(jbd2_handle);
> 
> where CPU0 is kswapd/2246 and CPU1 is the case of ext4_get_nojournal() path?
> If someone has taken jbd2_handle and &ei->xattr_sem in this order, isn't this
> dependency true?
> 
> > 
> > rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
> > 
> > so a lockdep key is per-filesystem. Thus it is not possible that lockdep
> > would combine lock dependencies from two different filesystems.
> > 
> > But I guess we could narrow the search for this problem by adding WARN_ONs
> > to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like:
> > 
> > WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS));
> > 
> > It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set
> > properly... At least that seems like the most plausible way forward to me.
> 
> You can use CONFIG_DEBUG_AID_FOR_SYZBOT for adding such WARN_ONs on 
> linux-next.
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in start_this_handle (2)

2021-02-15 Thread Jan Kara
On Sat 13-02-21 23:26:37, Tetsuo Handa wrote:
> On 2021/02/11 19:49, Jan Kara wrote:
> > This stacktrace should never happen. ext4_xattr_set() starts a transaction.
> > That internally goes through start_this_handle() which calls:
> > 
> > handle->saved_alloc_context = memalloc_nofs_save();
> > 
> > and we restore the allocation context only in stop_this_handle() when
> > stopping the handle. And with this fs_reclaim_acquire() should remove
> > __GFP_FS from the mask and not call __fs_reclaim_acquire().
> 
> Excuse me, but it seems to me that nothing prevents
> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create()
> without memalloc_nofs_save() when hitting ext4_get_nojournal() path.
> Will you explain when ext4_get_nojournal() path is executed?

That's a good question but sadly I don't think that's it.
ext4_get_nojournal() is called when the filesystem is created without a
journal. In that case we also don't acquire jbd2_handle lockdep map. In the
syzbot report we can see:

kswapd0/2246 is trying to acquire lock:
888041a988e0 (jbd2_handle){}-{0:0}, at: start_this_handle+0xf81/0x1380 
fs/jbd2/transaction.c:444

but task is already holding lock:
8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 
mm/page_alloc.c:5195

So this filesystem has very clearly been created with a journal. Also the
journal lockdep tracking machinery uses:

rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);

so a lockdep key is per-filesystem. Thus it is not possible that lockdep
would combine lock dependencies from two different filesystems.

But I guess we could narrow the search for this problem by adding WARN_ONs
to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like:

WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS));

It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set
properly... At least that seems like the most plausible way forward to me.

Honza

> 
> ext4_xattr_set() {
>   handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits) == 
> __ext4_journal_start() {
>   return __ext4_journal_start_sb() {
> journal = EXT4_SB(sb)->s_journal;
> if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
>   return ext4_get_nojournal(); // Never calls memalloc_nofs_save() 
> despite returning !IS_ERR() value.
> return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, 
> GFP_NOFS, type, line); // Calls memalloc_nofs_save() when start_this_handle() 
> returns 0.
>   }
> }
>   }
>   error = ext4_xattr_set_handle(handle, inode, name_index, name, value, 
> value_len, flags); {
> ext4_write_lock_xattr(inode, &no_expand); // Grabs &ei->xattr_sem
> error = ext4_xattr_ibody_set(handle, inode, &i, &is) {
>   error = ext4_xattr_set_entry(i, s, handle, inode, false /* is_block */) 
> {
> ret = ext4_xattr_inode_lookup_create(handle, inode, i->value, 
> i->value_len, &new_ea_inode); // Using GFP_KERNEL based on assumption that 
> ext4_journal_start() called memalloc_nofs_save().
>   }
> }
>   }
> }
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] quota: Add mountpath based quota support

2021-02-12 Thread Jan Kara
On Fri 12-02-21 11:29:00, Sascha Hauer wrote:
> On Fri, Feb 12, 2021 at 11:05:05AM +0100, Jan Kara wrote:
> > On Fri 12-02-21 09:38:35, Sascha Hauer wrote:
> > > On Thu, Feb 11, 2021 at 03:38:13PM +, Christoph Hellwig wrote:
> > > > > + if (!mountpoint)
> > > > > + return -ENODEV;
> > > > > +
> > > > > + ret = user_path_at(AT_FDCWD, mountpoint,
> > > > > +  LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT, 
> > > > > &mountpath);
> > > > 
> > > > user_path_at handles an empty path, although you'll get EFAULT instead.
> > > > Do we care about the -ENODEV here?
> > > 
> > > The quotactl manpage documents EFAULT as error code for invalid addr or
> > > special argument, so we really should return -EFAULT here.
> > > 
> > > Existing quotactl gets this wrong as well:
> > > 
> > >   if (!special) {
> > >   if (cmds == Q_SYNC)
> > >   return quota_sync_all(type);
> > >   return -ENODEV;
> > >   }
> > > 
> > > Should we fix this or is there userspace code that is confused by a 
> > > changed
> > > return value?
> > 
> > I'd leave the original quotactl(2) as is. There's no strong reason to risk
> > breaking some userspace. For the new syscall, I agree we can just
> > standardize the return value, there ENODEV makes even less sense since
> > there's no device in that call.
> 
> Ok, will do. Who can pick this series up? Anyone else I need to Cc next
> round?

I guess I can pick up both kernel patches (the manpage patch needs to be
submitted to the manpage list) but please CC linux-api@vger as well so that
interested people are aware of the new syscall.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 0/2] quota: Add mountpath based quota support

2021-02-12 Thread Jan Kara
On Thu 11-02-21 16:30:21, Sascha Hauer wrote:
> Current quotactl syscall uses a path to a block device to specify the
> filesystem to work on which makes it unsuitable for filesystems that
> do not have a block device. This series adds a new syscall quotactl_path()
> which replaces the path to the block device with a mountpath, but otherwise
> behaves like original quotactl.
> 
> This is done to add quota support to UBIFS. UBIFS quota support has been
> posted several times with different approaches to put the mountpath into
> the existing quotactl() syscall until it has been suggested to make it a
> new syscall instead, so here it is.
> 
> I'm not posting the full UBIFS quota series here as it remains unchanged
> and I'd like to get feedback to the new syscall first. For those interested
> the most recent series can be found here: https://lwn.net/Articles/810463/
> 
> Changes since (implicit) v1:
> - Ignore second path argument to Q_QUOTAON. With this quotactl_path() can
>   only do the Q_QUOTAON operation on filesystems which use hidden inodes
>   for quota metadata storage
> - Drop unnecessary quotactl_cmd_onoff() check

Thanks modulo the 0-day complains and the small nit discussed the patches
and manpage update look good to me.

Honza

> 
> Sascha Hauer (2):
>   quota: Add mountpath based quota support
>   quota: wire up quotactl_path
> 
>  arch/alpha/kernel/syscalls/syscall.tbl  |  1 +
>  arch/arm/tools/syscall.tbl  |  1 +
>  arch/arm64/include/asm/unistd.h |  2 +-
>  arch/arm64/include/asm/unistd32.h   |  2 +
>  arch/ia64/kernel/syscalls/syscall.tbl   |  1 +
>  arch/m68k/kernel/syscalls/syscall.tbl   |  1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
>  arch/parisc/kernel/syscalls/syscall.tbl |  1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl|  1 +
>  arch/s390/kernel/syscalls/syscall.tbl   |  1 +
>  arch/sh/kernel/syscalls/syscall.tbl |  1 +
>  arch/sparc/kernel/syscalls/syscall.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_32.tbl  |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl  |  1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl |  1 +
>  fs/quota/quota.c| 49 +
>  include/linux/syscalls.h|  2 +
>  include/uapi/asm-generic/unistd.h       |  4 +-
>  kernel/sys_ni.c |  1 +
>  22 files changed, 74 insertions(+), 2 deletions(-)
> 
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] quota: Add mountpath based quota support

2021-02-12 Thread Jan Kara
On Fri 12-02-21 09:38:35, Sascha Hauer wrote:
> On Thu, Feb 11, 2021 at 03:38:13PM +, Christoph Hellwig wrote:
> > > + if (!mountpoint)
> > > + return -ENODEV;
> > > +
> > > + ret = user_path_at(AT_FDCWD, mountpoint,
> > > +  LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT, &mountpath);
> > 
> > user_path_at handles an empty path, although you'll get EFAULT instead.
> > Do we care about the -ENODEV here?
> 
> The quotactl manpage documents EFAULT as error code for invalid addr or
> special argument, so we really should return -EFAULT here.
> 
> Existing quotactl gets this wrong as well:
> 
>   if (!special) {
>   if (cmds == Q_SYNC)
>   return quota_sync_all(type);
>   return -ENODEV;
>   }
> 
> Should we fix this or is there userspace code that is confused by a changed
> return value?

I'd leave the original quotactl(2) as is. There's no strong reason to risk
breaking some userspace. For the new syscall, I agree we can just
standardize the return value, there ENODEV makes even less sense since
there's no device in that call.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] Updated locking documentation for transaction_t

2021-02-11 Thread Jan Kara
On Thu 11-02-21 14:54:23, Alexander Lochmann wrote:
> Some members of transaction_t are allowed to be read without
> any lock being held if accessed from the correct context.
> We used LockDoc's findings to determine those members.
> Each member of them is marked with a short comment:
> "no lock needed for jbd2 thread".
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 

Thanks. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/jbd2.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 99d3cd051ac3..1f19d19f6435 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -594,18 +594,18 @@ struct transaction_s
>*/
>   unsigned long   t_log_start;
>  
> - /* Number of buffers on the t_buffers list [j_list_lock] */
> + /* Number of buffers on the t_buffers list [j_list_lock, no locks 
> needed for jbd2 thread] */
>   int t_nr_buffers;
>  
>   /*
>* Doubly-linked circular list of all buffers reserved but not yet
> -  * modified by this transaction [j_list_lock]
> +  * modified by this transaction [j_list_lock, no locks needed for jbd2 
> thread]
>*/
>   struct journal_head *t_reserved_list;
>  
>   /*
>* Doubly-linked circular list of all metadata buffers owned by this
> -  * transaction [j_list_lock]
> +  * transaction [j_list_lock, no locks needed for jbd2 thread]
>*/
>   struct journal_head *t_buffers;
>  
> @@ -631,7 +631,7 @@ struct transaction_s
>   /*
>* Doubly-linked circular list of metadata buffers being shadowed by log
>* IO.  The IO buffers on the iobuf list and the shadow buffers on this
> -  * list match each other one for one at all times. [j_list_lock]
> +  * list match each other one for one at all times. [j_list_lock, no 
> locks needed for jbd2 thread]
>*/
>   struct journal_head *t_shadow_list;
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in dquot_commit

2021-02-11 Thread Jan Kara
On Thu 11-02-21 12:47:18, Dmitry Vyukov wrote:
> On Thu, Feb 11, 2021 at 12:37 PM Jan Kara  wrote:
> >
> > On Wed 10-02-21 03:25:22, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:1e0d27fc Merge branch 'akpm' (patches from Andrew)
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=101cf2f8d0
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=e83e68d0a6aba5f6
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=3b6f9218b1301ddda3e2
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+3b6f9218b1301ddda...@syzkaller.appspotmail.com
> > >
> > > loop1: detected capacity change from 4096 to 0
> > > EXT4-fs (loop1): mounted filesystem without journal. Opts: 
> > > ,errors=continue. Quota mode: writeback.
> > > ==
> > > WARNING: possible circular locking dependency detected
> > > 5.11.0-rc6-syzkaller #0 Not tainted
> > > --
> > > syz-executor.1/16170 is trying to acquire lock:
> > > 8880795f5b28 (&dquot->dq_lock){+.+.}-{3:3}, at: 
> > > dquot_commit+0x4d/0x420 fs/quota/dquot.c:476
> > >
> > > but task is already holding lock:
> > > 88807960b438 (&ei->i_data_sem/2){}-{3:3}, at: 
> > > ext4_map_blocks+0x5e1/0x17d0 fs/ext4/inode.c:630
> > >
> > > which lock already depends on the new lock.
> >
> > 
> >
> > All snipped stacktraces look perfectly fine and the lock dependencies are as
> > expected.
> >
> > > 5 locks held by syz-executor.1/16170:
> > >  #0: 88802ad18b70 (&f->f_pos_lock){+.+.}-{3:3}, at: 
> > > __fdget_pos+0xe9/0x100 fs/file.c:947
> > >  #1: 88802fbec460 (sb_writers#5){.+.+}-{0:0}, at: 
> > > ksys_write+0x12d/0x250 fs/read_write.c:658
> > >  #2: 88807960b648 (&sb->s_type->i_mutex_key#9){}-{3:3}, at: 
> > > inode_lock include/linux/fs.h:773 [inline]
> > >  #2: 88807960b648 (&sb->s_type->i_mutex_key#9){}-{3:3}, at: 
> > > ext4_buffered_write_iter+0xb6/0x4d0 fs/ext4/file.c:264
> > >  #3: 88807960b438 (&ei->i_data_sem/2){}-{3:3}, at: 
> > > ext4_map_blocks+0x5e1/0x17d0 fs/ext4/inode.c:630
> > >  #4: 8bf1be58 (dquot_srcu){}-{0:0}, at: i_dquot 
> > > fs/quota/dquot.c:926 [inline]
> > >  #4: 8bf1be58 (dquot_srcu){}-{0:0}, at: 
> > > __dquot_alloc_space+0x1b4/0xb60 fs/quota/dquot.c:1671
> >
> > This actually looks problematic: We acquired &ei->i_data_sem/2 (i.e.,
> > I_DATA_SEM_QUOTA subclass) in ext4_map_blocks() called from
> > ext4_block_write_begin(). This suggests that the write has been happening
> > directly to the quota file (or that lockdep annotation of the inode went
> > wrong somewhere). Now we normally protect quota files with IMMUTABLE flag
> > so writing it should not be possible. We also don't allow clearing this
> > flag on used quota file. Finally I'd checked lockdep annotation and
> > everything looks correct. So at this point the best theory I have is that a
> > filesystem has been suitably corrupted and quota file supposed to be
> > inaccessible from userspace got exposed but I'd expect other problems to
> > hit first in that case. Anyway without a reproducer I have no more ideas...
> 
> There is a reproducer for 4.19 available on the dashboard. Maybe it will help.
> I don't why it did not pop up on upstream yet, there lots of potential
> reasons for this.

OK, so I've checked the fs images generated by the syzkaller reproducer and
they indeed have QUOTA feature enabled. Also inodes used by quota files are
not marked as allocated so there is some potential for surprises. But all
the possible paths I could think of seem to be covered and return
EFSCORRUPTED. Also note that the reproducer didn't trigger the
lockdep splat for me so the problem still isn't clear to me.

Honza

> > >
> > > stack backtrace:
> > > CPU: 0 PID: 16170 Comm: syz-executor.1 Not tainted 5.11.0-rc6-syzkaller #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 01/01/2011
&g

Re: [PATCH 1/2] Updated locking documentation for transaction_t

2021-02-11 Thread Jan Kara
On Thu 11-02-21 10:53:51, Alexander Lochmann wrote:
> 
> 
> On 11.02.21 10:30, Jan Kara wrote:
> > >*/
> > >   unsigned long   t_log_start;
> > > - /* Number of buffers on the t_buffers list [j_list_lock] */
> > > + /* Number of buffers on the t_buffers list [j_list_lock, no lock for 
> > > quick racy checks] */
> > >   int t_nr_buffers;
> > 
> > So this case is actually somewhat different now that I audited the uses.
> > There are two types of users - commit code (fs/jbd2/commit.c) and others.
> > Other users properly use j_list_lock to access t_nr_buffers. Commit code
> > does not use any locks because committing transaction is fully in
> > ownership of the jbd2 thread and all other users need to check & wait for
> > commit to be finished before doing anything with the transaction's buffers.
> Mhm I see.
> What about '[..., no locks needed for jbd2 thread]'?

Sounds good to me.

> How do the others wait for the commit to be finished?

Well, usually they just don't touch buffers belonging to the committing
transation, they just store in b_next_transaction that after commit is
done, buffer should be added to the currently running transaction. There
are some exceptions though - e.g. jbd2_journal_invalidatepage() (called
from truncate code) which returns EBUSY in some rare cases and we use
jbd2_log_wait_commit() in ext4_wait_for_tail_page_commit() to wait for
commit to be done before we know it is safe to destroy the buffer.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in fs_reclaim_acquire (2)

2021-02-11 Thread Jan Kara
On Thu 11-02-21 20:04:14, Hillf Danton wrote:
> On Thu 11-02-21 12:07:29, Jan Kara wrote:
> >> Fix 71b565ceff37 ("ext4: drop ext4_kvmalloc()") by restoring the
> >> GFP_NOFS introduced in dec214d00e0d ("ext4: xattr inode deduplication").
> >> 
> >> Note this may be the fix also to possible deadlock
> >>  Reported-by: syzbot+bfdded10ab7dcd750...@syzkaller.appspotmail.com
> >>  
> >> https://lore.kernel.org/linux-ext4/563a0205bafb7...@google.com/
> >
> >Please no. Ext4 is using scoping API to limit allocations to GFP_NOFS
> >inside transactions. In this case something didn't work which seems like a
> >lockdep bug at the first sight but I'll talk to mm guys about it.
> >Definitely to problem doesn't seem to be in ext4.
> 
> Feel free to elaborate why we can find ext4  in the report?
> Why is ext4 special in this case?

Please read my reply to the syzbot report [1]. It has all the details.

[1] https://lore.kernel.org/lkml/20210211104947.gl19...@quack2.suse.cz

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in start_this_handle (2)

2021-02-11 Thread Jan Kara
On Thu 11-02-21 12:28:48, Dmitry Vyukov wrote:
> On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov  wrote:
> >
> > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara  wrote:
> > >
> > > Hello,
> > >
> > > added mm guys to CC.
> > >
> > > On Wed 10-02-21 05:35:18, syzbot wrote:
> > > > HEAD commit:1e0d27fc Merge branch 'akpm' (patches from Andrew)
> > > > git tree:   upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d0
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae
> > > > userspace arch: i386
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+bfdded10ab7dcd750...@syzkaller.appspotmail.com
> > > >
> > > > ==
> > > > WARNING: possible circular locking dependency detected
> > > > 5.11.0-rc6-syzkaller #0 Not tainted
> > > > --
> > > > kswapd0/2246 is trying to acquire lock:
> > > > 888041a988e0 (jbd2_handle){}-{0:0}, at: 
> > > > start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444
> > > >
> > > > but task is already holding lock:
> > > > 8be892c0 (fs_reclaim){+.+.}-{0:0}, at: 
> > > > __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195
> > > >
> > > > which lock already depends on the new lock.
> > > >
> > > > the existing dependency chain (in reverse order) is:
> > > >
> > > > -> #2 (fs_reclaim){+.+.}-{0:0}:
> > > >__fs_reclaim_acquire mm/page_alloc.c:4326 [inline]
> > > >fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
> > > >might_alloc include/linux/sched/mm.h:193 [inline]
> > > >slab_pre_alloc_hook mm/slab.h:493 [inline]
> > > >slab_alloc_node mm/slub.c:2817 [inline]
> > > >__kmalloc_node+0x5f/0x430 mm/slub.c:4015
> > > >kmalloc_node include/linux/slab.h:575 [inline]
> > > >kvmalloc_node+0x61/0xf0 mm/util.c:587
> > > >kvmalloc include/linux/mm.h:781 [inline]
> > > >ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
> > > >ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
> > > >ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
> > > >ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
> > > >ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
> > > >ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
> > > >ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40
> > > >__vfs_setxattr+0x10e/0x170 fs/xattr.c:177
> > > >__vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208
> > > >__vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266
> > > >vfs_setxattr+0x135/0x320 fs/xattr.c:291
> > > >setxattr+0x1ff/0x290 fs/xattr.c:553
> > > >path_setxattr+0x170/0x190 fs/xattr.c:572
> > > >__do_sys_setxattr fs/xattr.c:587 [inline]
> > > >__se_sys_setxattr fs/xattr.c:583 [inline]
> > > >__ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583
> > > >do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline]
> > > >__do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139
> > > >do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164
> > > >entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> > >
> > > This stacktrace should never happen. ext4_xattr_set() starts a 
> > > transaction.
> > > That internally goes through start_this_handle() which calls:
> > >
> > > handle->saved_alloc_context = memalloc_nofs_save();
> > >
> > > and we restore the allocation context only in stop_this_handle() when
> > > stopping the handle. And with this fs_reclaim_acquire() should remove
> > > __GFP_FS from the mask and not call __fs_reclaim_acquire().
> > >
> > > Now I have no idea why something here didn't work out. Given we don't

Re: possible deadlock in start_this_handle (2)

2021-02-11 Thread Jan Kara
On Thu 11-02-21 12:22:39, Dmitry Vyukov wrote:
> On Thu, Feb 11, 2021 at 11:49 AM Jan Kara  wrote:
> >
> > Hello,
> >
> > added mm guys to CC.
> >
> > On Wed 10-02-21 05:35:18, syzbot wrote:
> > > HEAD commit:1e0d27fc Merge branch 'akpm' (patches from Andrew)
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d0
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae
> > > userspace arch: i386
> > >
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+bfdded10ab7dcd750...@syzkaller.appspotmail.com
> > >
> > > ==
> > > WARNING: possible circular locking dependency detected
> > > 5.11.0-rc6-syzkaller #0 Not tainted
> > > --
> > > kswapd0/2246 is trying to acquire lock:
> > > 888041a988e0 (jbd2_handle){}-{0:0}, at: 
> > > start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444
> > >
> > > but task is already holding lock:
> > > 8be892c0 (fs_reclaim){+.+.}-{0:0}, at: 
> > > __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195
> > >
> > > which lock already depends on the new lock.
> > >
> > > the existing dependency chain (in reverse order) is:
> > >
> > > -> #2 (fs_reclaim){+.+.}-{0:0}:
> > >__fs_reclaim_acquire mm/page_alloc.c:4326 [inline]
> > >fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
> > >might_alloc include/linux/sched/mm.h:193 [inline]
> > >slab_pre_alloc_hook mm/slab.h:493 [inline]
> > >slab_alloc_node mm/slub.c:2817 [inline]
> > >__kmalloc_node+0x5f/0x430 mm/slub.c:4015
> > >kmalloc_node include/linux/slab.h:575 [inline]
> > >kvmalloc_node+0x61/0xf0 mm/util.c:587
> > >kvmalloc include/linux/mm.h:781 [inline]
> > >ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
> > >ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
> > >ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
> > >ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
> > >ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
> > >ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
> > >ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40
> > >__vfs_setxattr+0x10e/0x170 fs/xattr.c:177
> > >__vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208
> > >__vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266
> > >vfs_setxattr+0x135/0x320 fs/xattr.c:291
> > >setxattr+0x1ff/0x290 fs/xattr.c:553
> > >path_setxattr+0x170/0x190 fs/xattr.c:572
> > >__do_sys_setxattr fs/xattr.c:587 [inline]
> > >__se_sys_setxattr fs/xattr.c:583 [inline]
> > >__ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583
> > >do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline]
> > >__do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139
> > >do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164
> > >entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> >
> > This stacktrace should never happen. ext4_xattr_set() starts a transaction.
> > That internally goes through start_this_handle() which calls:
> >
> > handle->saved_alloc_context = memalloc_nofs_save();
> >
> > and we restore the allocation context only in stop_this_handle() when
> > stopping the handle. And with this fs_reclaim_acquire() should remove
> > __GFP_FS from the mask and not call __fs_reclaim_acquire().
> >
> > Now I have no idea why something here didn't work out. Given we don't have
> > a reproducer it will be probably difficult to debug this. I'd note that
> > about year and half ago similar report happened (got autoclosed) so it may
> > be something real somewhere but it may also be just some HW glitch or
> > something like that.
> 
> HW glitch is theoretically possible. But if we are considering such
> causes, I would say a kernel memory corruption is way more likely, we
> have hundreds of known memory-c

Re: possible deadlock in dquot_commit

2021-02-11 Thread Jan Kara
0x241/0x590 fs/ext4/inode.c:793
>  ext4_block_write_begin+0x4f8/0x1190 fs/ext4/inode.c:1077
>  ext4_write_begin+0x4b5/0x14b0 fs/ext4/inode.c:1202
>  ext4_da_write_begin+0x672/0x1150 fs/ext4/inode.c:2961
>  generic_perform_write+0x20a/0x4f0 mm/filemap.c:3412
>  ext4_buffered_write_iter+0x244/0x4d0 fs/ext4/file.c:270
>  ext4_file_write_iter+0x423/0x14d0 fs/ext4/file.c:664
>  call_write_iter include/linux/fs.h:1901 [inline]
>  new_sync_write+0x426/0x650 fs/read_write.c:518
>  vfs_write+0x791/0xa30 fs/read_write.c:605
>  ksys_write+0x12d/0x250 fs/read_write.c:658
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x465b09
> Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7f8097ffc188 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0056bf60 RCX: 00465b09
> RDX: 0d4ba0ff RSI: 29c0 RDI: 0003
> RBP: 004b069f R08:  R09: 
> R10:  R11: 0246 R12: 0056bf60
> R13: 00007ffefc77f01f R14: 7f8097ffc300 R15: 00022000
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in start_this_handle (2)

2021-02-11 Thread Jan Kara
Hello,

added mm guys to CC.

On Wed 10-02-21 05:35:18, syzbot wrote:
> HEAD commit:1e0d27fc Merge branch 'akpm' (patches from Andrew)
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb
> dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae
> userspace arch: i386
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+bfdded10ab7dcd750...@syzkaller.appspotmail.com
> 
> ==
> WARNING: possible circular locking dependency detected
> 5.11.0-rc6-syzkaller #0 Not tainted
> --
> kswapd0/2246 is trying to acquire lock:
> 888041a988e0 (jbd2_handle){}-{0:0}, at: 
> start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444
> 
> but task is already holding lock:
> 8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 
> mm/page_alloc.c:5195
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (fs_reclaim){+.+.}-{0:0}:
>__fs_reclaim_acquire mm/page_alloc.c:4326 [inline]
>fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340
>might_alloc include/linux/sched/mm.h:193 [inline]
>slab_pre_alloc_hook mm/slab.h:493 [inline]
>slab_alloc_node mm/slub.c:2817 [inline]
>__kmalloc_node+0x5f/0x430 mm/slub.c:4015
>kmalloc_node include/linux/slab.h:575 [inline]
>kvmalloc_node+0x61/0xf0 mm/util.c:587
>kvmalloc include/linux/mm.h:781 [inline]
>ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
>ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
>ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
>ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
>ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
>ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
>ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40
>__vfs_setxattr+0x10e/0x170 fs/xattr.c:177
>__vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208
>__vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266
>vfs_setxattr+0x135/0x320 fs/xattr.c:291
>setxattr+0x1ff/0x290 fs/xattr.c:553
>path_setxattr+0x170/0x190 fs/xattr.c:572
>__do_sys_setxattr fs/xattr.c:587 [inline]
>__se_sys_setxattr fs/xattr.c:583 [inline]
>__ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583
>do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline]
>__do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139
>do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164
>entry_SYSENTER_compat_after_hwframe+0x4d/0x5c

This stacktrace should never happen. ext4_xattr_set() starts a transaction.
That internally goes through start_this_handle() which calls:

handle->saved_alloc_context = memalloc_nofs_save();

and we restore the allocation context only in stop_this_handle() when
stopping the handle. And with this fs_reclaim_acquire() should remove
__GFP_FS from the mask and not call __fs_reclaim_acquire().

Now I have no idea why something here didn't work out. Given we don't have
a reproducer it will be probably difficult to debug this. I'd note that
about year and half ago similar report happened (got autoclosed) so it may
be something real somewhere but it may also be just some HW glitch or
something like that.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: possible deadlock in fs_reclaim_acquire (2)

2021-02-11 Thread Jan Kara
015
> >  kmalloc_node include/linux/slab.h:575 [inline]
> >  kvmalloc_node+0x61/0xf0 mm/util.c:587
> >  kvmalloc include/linux/mm.h:781 [inline]
> >  ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline]
> >  ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline]
> >  ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649
> >  ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224
> >  ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380
> >  ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493
> >  ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40
> >  __vfs_setxattr+0x10e/0x170 fs/xattr.c:177
> >  __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208
> >  __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266
> >  vfs_setxattr+0x135/0x320 fs/xattr.c:291
> >  setxattr+0x1ff/0x290 fs/xattr.c:553
> >  path_setxattr+0x170/0x190 fs/xattr.c:572
> >  __do_sys_setxattr fs/xattr.c:587 [inline]
> >  __se_sys_setxattr fs/xattr.c:583 [inline]
> >  __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583
> >  do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline]
> >  __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139
> >  do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164
> >  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> 
> Fix 71b565ceff37 ("ext4: drop ext4_kvmalloc()") by restoring the
> GFP_NOFS introduced in dec214d00e0d ("ext4: xattr inode deduplication").
> 
> Note this may be the fix also to possible deadlock
>  Reported-by: syzbot+bfdded10ab7dcd750...@syzkaller.appspotmail.com
>  https://lore.kernel.org/linux-ext4/563a0205bafb7...@google.com/

Please no. Ext4 is using scoping API to limit allocations to GFP_NOFS
inside transactions. In this case something didn't work which seems like a
lockdep bug at the first sight but I'll talk to mm guys about it.
Definitely to problem doesn't seem to be in ext4.

Honza

> 
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1459,7 +1459,7 @@ ext4_xattr_inode_cache_find(struct inode
>   if (!ce)
>   return NULL;
>  
> - ea_data = kvmalloc(value_len, GFP_KERNEL);
> + ea_data = kvmalloc(value_len, GFP_NOFS);
>   if (!ea_data) {
>   mb_cache_entry_put(ea_inode_cache, ce);
>   return NULL;
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] Updated locking documentation for journal_t

2021-02-11 Thread Jan Kara
On Wed 10-02-21 10:57:38, Alexander Lochmann wrote:
> Some members of transaction_t are allowed to be read without
> any lock being held if consistency doesn't matter.
> Based on LockDoc's findings, we extended the locking
> documentation of those members.
> Each one of them is marked with a short comment:
> "no lock for quick racy checks".
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 

This patch looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/jbd2.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 18f77d9b1745..4dca33a063dd 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -768,7 +768,7 @@ enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
>  struct journal_s
>  {
>   /**
> -  * @j_flags: General journaling state flags [j_state_lock]
> +  * @j_flags: General journaling state flags [j_state_lock, no lock for 
> quick racy checks]
>*/
>   unsigned long   j_flags;
>  
> @@ -808,7 +808,7 @@ struct journal_s
>   /**
>* @j_barrier_count:
>*
> -  * Number of processes waiting to create a barrier lock [j_state_lock]
> +  * Number of processes waiting to create a barrier lock [j_state_lock, 
> no lock for quick racy checks]
>*/
>   int j_barrier_count;
>  
> @@ -821,7 +821,7 @@ struct journal_s
>* @j_running_transaction:
>*
>* Transactions: The current running transaction...
> -  * [j_state_lock] [caller holding open handle]
> +  * [j_state_lock, no lock for quick racy checks] [caller holding open 
> handle]
>*/
>   transaction_t   *j_running_transaction;
>  
> @@ -1033,7 +1033,7 @@ struct journal_s
>* @j_commit_sequence:
>*
>* Sequence number of the most recently committed transaction
> -  * [j_state_lock].
> +  * [j_state_lock, no lock for quick racy checks].
>*/
>   tid_t   j_commit_sequence;
>  
> @@ -1041,7 +1041,7 @@ struct journal_s
>* @j_commit_request:
>*
>* Sequence number of the most recent transaction wanting commit
> -  * [j_state_lock]
> +  * [j_state_lock, no lock for quick racy checks]
>*/
>   tid_t   j_commit_request;
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] Updated locking documentation for transaction_t

2021-02-11 Thread Jan Kara
On Wed 10-02-21 10:57:39, Alexander Lochmann wrote:
> Some members of transaction_t are allowed to be read without
> any lock being held if consistency doesn't matter.
> Based on LockDoc's findings, we extended the locking
> documentation of those members.
> Each one of them is marked with a short comment:
> "no lock for quick racy checks".
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 

Thanks for the patch! Some comments below...

> ---
>  include/linux/jbd2.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 99d3cd051ac3..18f77d9b1745 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -594,18 +594,18 @@ struct transaction_s
>*/
>   unsigned long   t_log_start;
>  
> - /* Number of buffers on the t_buffers list [j_list_lock] */
> + /* Number of buffers on the t_buffers list [j_list_lock, no lock for 
> quick racy checks] */
>   int t_nr_buffers;

So this case is actually somewhat different now that I audited the uses.
There are two types of users - commit code (fs/jbd2/commit.c) and others.
Other users properly use j_list_lock to access t_nr_buffers. Commit code
does not use any locks because committing transaction is fully in
ownership of the jbd2 thread and all other users need to check & wait for
commit to be finished before doing anything with the transaction's buffers.

>   /*
>* Doubly-linked circular list of all buffers reserved but not yet
> -  * modified by this transaction [j_list_lock]
> +  * modified by this transaction [j_list_lock, no lock for quick racy 
> checks]
>*/
>   struct journal_head *t_reserved_list;
>
>   /*
>* Doubly-linked circular list of all metadata buffers owned by this
> -  * transaction [j_list_lock]
> +  * transaction [j_list_lock, no lock for quick racy checks]
>*/
>   struct journal_head *t_buffers;
> 
> @@ -631,7 +631,7 @@ struct transaction_s
>   /*
>* Doubly-linked circular list of metadata buffers being shadowed by log
>* IO.  The IO buffers on the iobuf list and the shadow buffers on this
> -  * list match each other one for one at all times. [j_list_lock]
> +  * list match each other one for one at all times. [j_list_lock, no 
> lock for quick racy checks]
>*/
>       struct journal_head *t_shadow_list;

The above three cases are the same as t_reserved_list.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] quota: Add mountpath based quota support

2021-02-09 Thread Jan Kara
On Tue 09-02-21 08:51:01, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 01:53:50PM +0100, Jan Kara wrote:
> > Now quota data stored in a normal file is a setup we try to deprecate
> > anyway so another option is to just leave quotactl_path() only for those
> > setups where quota metadata is managed by the filesystem so we don't need
> > to pass quota files to Q_QUOTAON?
> 
> I'd be perfectly fine with that.

OK, then this looks like the best way forward to me. We can always extend
the quotactl_path() syscall later if we find this is problematic for some
real usecases.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/7] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-02-08 Thread Jan Kara
On Mon 08-02-21 01:09:17, Shiyang Ruan wrote:
> This patchset is attempt to add CoW support for fsdax, and take XFS,
> which has both reflink and fsdax feature, as an example.
> 
> One of the key mechanism need to be implemented in fsdax is CoW.  Copy
> the data from srcmap before we actually write data to the destance
> iomap.  And we just copy range in which data won't be changed.
> 
> Another mechanism is range comparison .  In page cache case, readpage()
> is used to load data on disk to page cache in order to be able to
> compare data.  In fsdax case, readpage() does not work.  So, we need
> another compare data with direct access support.
> 
> With the two mechanism implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
> 
> Some of the patches are picked up from Goldwyn's patchset.  I made some
> changes to adapt to this patchset.

How do you deal with HWPoison code trying to reverse-map struct page back
to inode-offset pair? This also needs to be fixed for reflink to work with
DAX.

Honza

> 
> (Rebased on v5.10)
> ==
> 
> Shiyang Ruan (7):
>   fsdax: Output address in dax_iomap_pfn() and rename it
>   fsdax: Introduce dax_copy_edges() for CoW
>   fsdax: Copy data before write
>   fsdax: Replace mmap entry in case of CoW
>   fsdax: Dedup file range to use a compare function
>   fs/xfs: Handle CoW for fsdax write() path
>   fs/xfs: Add dedupe support for fsdax
> 
>  fs/btrfs/reflink.c |   3 +-
>  fs/dax.c   | 188 ++---
>  fs/ocfs2/file.c|   2 +-
>  fs/remap_range.c   |  14 +--
>  fs/xfs/xfs_bmap_util.c |   6 +-
>  fs/xfs/xfs_file.c  |  30 ++-
>  fs/xfs/xfs_inode.c |   8 +-
>  fs/xfs/xfs_inode.h |   1 +
>  fs/xfs/xfs_iomap.c |   3 +-
>  fs/xfs/xfs_iops.c  |  11 ++-
>  fs/xfs/xfs_reflink.c   |  23 -
>  include/linux/dax.h    |   5 ++
>  include/linux/fs.h |   9 +-
>  13 files changed, 270 insertions(+), 33 deletions(-)
> 
> -- 
> 2.30.0
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] quota: Add mountpath based quota support

2021-02-04 Thread Jan Kara
On Thu 04-02-21 07:34:14, Christoph Hellwig wrote:
> On Tue, Feb 02, 2021 at 07:02:41PM +0100, Jan Kara wrote:
> > Hum, let me think out loud. The path we pass to Q_QUOTAON is a path to
> > quota file - unless the filesystem stores quota in hidden files in which
> > case this argument is just ignored. You're right we could require that
> > specifically for Q_QUOTAON, the mountpoint path would actually need to
> > point to the quota file if it is relevant, otherwise anywhere in the
> > appropriate filesystem. We don't allow quota file to reside on a different
> > filesystem (for a past decade or so) so it should work fine.
> > 
> > So the only problem I have is whether requiring the mountpoint argument to
> > point quota file for Q_QUOTAON isn't going to be somewhat confusing to
> > users. At the very least it would require some careful explanation in the
> > manpage to explain the difference between quotactl_path() and quotactl()
> > in this regard. But is saving the second path for Q_QUOTAON really worth the
> > bother?
> 
> I find the doubled path argument a really horrible API, so I'd pretty
> strongly prefer to avoid that.

Honestly, I don't understand why is it so horrible. The paths point to
different things... The first path identifies the filesystem to operate on,
the second path identifies the file which contains quota accounting data.
In the ancient times, the file with quota accounting data could even be
stored on a different filesystem (these were still times when filesystem
metadata journalling was a new thing - like late 90's).  But later I just
disallowed that because it was not very useful (and luckily even used) and
just complicated matters.
 
Anyway, back to 2021 :). What I find somewhat confusing about a single path
for Q_QUOTAON is that for any other quotactl, any path on the filesystem is
fine. Similarly if quota data is stored in the hidden file, any path on the
filesystem is fine. It is only for Q_QUOTAON on a filesystem where quota
data is stored in a normal file, where we suddently require that the path
has to point to it.

Now quota data stored in a normal file is a setup we try to deprecate
anyway so another option is to just leave quotactl_path() only for those
setups where quota metadata is managed by the filesystem so we don't need
to pass quota files to Q_QUOTAON?

> > > Given how cheap quotactl_cmd_onoff and quotactl_cmd_write are we
> > > could probably simplify this down do:
> > > 
> > >   if (quotactl_cmd_write(cmd)) {
> > 
> > This needs to be (quotactl_cmd_write(cmd) || quotactl_cmd_onoff(cmd)).
> > Otherwise I agree what you suggest is somewhat more readable given how
> > small the function is.
> 
> The way I read quotactl_cmd_write, it only special cases a few commands
> and returns 0 there, so we should not need the extra quotactl_cmd_onoff
> call, as all those commands are not explicitly listed.

Right, sorry, I was mistaken.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] parser: Fix kernel-doc markups

2021-02-03 Thread Jan Kara
On Thu 28-01-21 22:39:46, Randy Dunlap wrote:
> On 1/28/21 9:00 PM, bingjingc wrote:
> > From: BingJing Chang 
> > 
> > Fix existing issues at the kernel-doc markups
> > 
> > Signed-off-by: BingJing Chang 
> > ---
> >  lib/parser.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> > 
> 
> Reviewed-by: Randy Dunlap 
> 
> thanks for the kernel-doc patch.

Thanks for the patch and review. Since this is somewhat related to the
match_uint() fixes, I've just picked up this patch to my tree as well since
I don't think we have a designated lib/parser.c maintainer...

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 0/3] handle large user and group ID for isofs and udf

2021-02-03 Thread Jan Kara
On Fri 29-01-21 12:51:48, bingjingc wrote:
> From: BingJing Chang 
> 
> The uid/gid (unsigned int) of a domain user may be larger than INT_MAX.
> The parse_options of isofs and udf will return 0, and mount will fail
> with -EINVAL. These patches try to handle large user and group ID.
> 
> BingJing Chang (3):
>   parser: add unsigned int parser
>   isofs: handle large user and group ID
>   udf: handle large user and group ID

Thanks for the patches. I've added these three patches to my tree.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 01/18] vfs: add miscattr ops

2021-02-03 Thread Jan Kara
(dentry, ma);
> +}
> +EXPORT_SYMBOL(vfs_miscattr_get);

...

> +/*
> + * Generic function to check FS_IOC_FSSETXATTR/FS_IOC_SETFLAGS values and 
> reject
> + * any invalid configurations.
> + *
> + * Note: must be called with inode lock held.
> + */
> +static int miscattr_set_prepare(struct inode *inode,
> +   const struct miscattr *old_ma,
> +   struct miscattr *ma)
> +{
> + int err;
> +
> + /*
> +  * The IMMUTABLE and APPEND_ONLY flags can only be changed by
> +  * the relevant capability.
> +  */
> + if ((ma->flags ^ old_ma->flags) & (FS_APPEND_FL | FS_IMMUTABLE_FL) &&
> + !capable(CAP_LINUX_IMMUTABLE))
> + return -EPERM;
> +
> + err = fscrypt_prepare_setflags(inode, old_ma->flags, ma->flags);
> + if (err)
> + return err;
> +
> + /*
> +  * Project Quota ID state is only allowed to change from within the init
> +  * namespace. Enforce that restriction only if we are trying to change
> +  * the quota ID state. Everything else is allowed in user namespaces.
> +  */
> + if (current_user_ns() != &init_user_ns) {
> + if (old_ma->fsx_projid != ma->fsx_projid)
> + return -EINVAL;
> + if ((old_ma->fsx_xflags ^ ma->fsx_xflags) &
> + FS_XFLAG_PROJINHERIT)
> + return -EINVAL;
> + }
> +
> + /* Check extent size hints. */
> + if ((ma->fsx_xflags & FS_XFLAG_EXTSIZE) && !S_ISREG(inode->i_mode))
> + return -EINVAL;
> +
> + if ((ma->fsx_xflags & FS_XFLAG_EXTSZINHERIT) &&
> + !S_ISDIR(inode->i_mode))
> + return -EINVAL;
> +
> + if ((ma->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
> + !S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
> + return -EINVAL;
> +
> + /*
> +  * It is only valid to set the DAX flag on regular files and
> +  * directories on filesystems.
> +  */
> + if ((ma->fsx_xflags & FS_XFLAG_DAX) &&
> + !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
> + return -EINVAL;
> +
> + /* Extent size hints of zero turn off the flags. */
> + if (ma->fsx_extsize == 0)
> + ma->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
> + if (ma->fsx_cowextsize == 0)
> + ma->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
> +
> + return 0;
> +}
> +
> +/**
> + * vfs_miscattr_set - change miscellaneous inode attributes
> + * @dentry:  the object to change
> + * @ma:  miscattr pointer
> + *
> + * After verifying permissions, call i_op->miscattr_set() callback, if
> + * exists.
> + *
> + * Verifying attributes involves retrieving current attributes with
> + * i_op->miscattr_get(), this also allows initilaizing attributes that have
> + * not been set by the caller to current values.  Inode lock is held
> + * thoughout to prevent racing with another instance.
> + *
> + * Returns 0 on success, or a negative error on failure.
> + */
> +int vfs_miscattr_set(struct dentry *dentry, struct miscattr *ma)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct miscattr old_ma = {};
> + int err;
> +
> + if (d_is_special(dentry))
> + return -ENOTTY;
> +
> + if (!inode->i_op->miscattr_set)
> + return -ENOIOCTLCMD;
> +
> + if (!inode_owner_or_capable(inode))
> + return -EPERM;
> +
> + inode_lock(inode);
> + err = vfs_miscattr_get(dentry, &old_ma);
> + if (!err) {
> + /* initialize missing bits from old_ma */
> + if (ma->flags_valid) {
> + ma->fsx_xflags |= old_ma.fsx_xflags & ~FS_XFLAG_COMMON;
> + ma->fsx_extsize = old_ma.fsx_extsize;
> + ma->fsx_nextents = old_ma.fsx_nextents;
> + ma->fsx_projid = old_ma.fsx_projid;
> + ma->fsx_cowextsize = old_ma.fsx_cowextsize;
> + } else {
> + ma->flags |= old_ma.flags & ~FS_COMMON_FL;
> + }
> + err = miscattr_set_prepare(inode, &old_ma, ma);
> + if (!err)
> + err = inode->i_op->miscattr_set(dentry, ma);

So I somewhat wonder here - not all filesystems support all the xflags or
other extended attributes. Currently these would be just silently ignored
AFAICT. Which seems a bit dangerous to me - most notably because it makes
future extensions of these filesystems difficult. So how are we going to go
about this? Is every filesystem supposed to check what it supports and
refuse other stuff (but currently e.g. your ext2 conversion patch doesn't do
that AFAICT)? Shouldn't we make things easier for filesystems to provide a
bitmask of changing fields (instead of flags / xflags bools) so that they
can refuse unsupported stuff with a single mask check?

To make things more complex, ext2/4 has traditionally silently cleared
unknown flags for setflags but not for setxflags. Unlike e.g. XFS which
refuses unknown flags.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH BUGFIX/IMPROVEMENT 5/6] block, bfq: keep shared queues out of the waker mechanism

2021-02-03 Thread Jan Kara
On Tue 26-01-21 11:51:01, Paolo Valente wrote:
> Shared queues are likely to receive I/O at a high rate. This may
> deceptively let them be considered as wakers of other queues. But a
> false waker will unjustly steal bandwidth to its supposedly woken
> queue. So considering also shared queues in the waking mechanism may
> cause more control troubles than throughput benefits. This commit
> keeps shared queues out of the waker-detection mechanism.
> 
> Tested-by: Jan Kara 
> Signed-off-by: Paolo Valente 

Honestly this makes me somewhat nervous. It is just a band aid for a fact
that waker detection is unreliable? There's nothing which prevents
non-shared queue to submit high amounts of IO (e.g. when it uses AIO) as
well as there's nothing which says that shared queues have no wakers (e.g.
jbd2 thread can easily be a waker for a shared queue).

Honza

> ---
>  block/bfq-iosched.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 0c7e203085f1..23d0dd7bd90f 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -5825,7 +5825,17 @@ static void bfq_completed_request(struct bfq_queue 
> *bfqq, struct bfq_data *bfqd)
>   1UL<<(BFQ_RATE_SHIFT - 10))
>   bfq_update_rate_reset(bfqd, NULL);
>   bfqd->last_completion = now_ns;
> - bfqd->last_completed_rq_bfqq = bfqq;
> + /*
> +  * Shared queues are likely to receive I/O at a high
> +  * rate. This may deceptively let them be considered as wakers
> +  * of other queues. But a false waker will unjustly steal
> +  * bandwidth to its supposedly woken queue. So considering
> +  * also shared queues in the waking mechanism may cause more
> +  * control troubles than throughput benefits. Then do not set
> +  * last_completed_rq_bfqq to bfqq if bfqq is a shared queue.
> +  */
> + if (!bfq_bfqq_coop(bfqq))
> + bfqd->last_completed_rq_bfqq = bfqq;
>  
>   /*
>* If we are waiting to discover whether the request pattern
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH BUGFIX/IMPROVEMENT 2/6] block, bfq: put reqs of waker and woken in dispatch list

2021-02-03 Thread Jan Kara
On Thu 28-01-21 18:54:05, Paolo Valente wrote:
> 
> 
> > Il giorno 26 gen 2021, alle ore 17:18, Jens Axboe  ha 
> > scritto:
> > 
> > On 1/26/21 3:50 AM, Paolo Valente wrote:
> >> Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
> >> this happens, the only active bfq_queues are bfqq and either its waker
> >> bfq_queue or one of its woken bfq_queues, then there is no point in
> >> queueing this new I/O request in bfqq for service. In fact, the
> >> in-service queue and bfqq agree on serving this new I/O request as
> >> soon as possible. So this commit puts this new I/O request directly
> >> into the dispatch list.
> >> 
> >> Tested-by: Jan Kara 
> >> Signed-off-by: Paolo Valente 
> >> ---
> >> block/bfq-iosched.c | 17 -
> >> 1 file changed, 16 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> >> index a83149407336..e5b83910fbe0 100644
> >> --- a/block/bfq-iosched.c
> >> +++ b/block/bfq-iosched.c
> >> @@ -5640,7 +5640,22 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
> >> *hctx, struct request *rq,
> >> 
> >>spin_lock_irq(&bfqd->lock);
> >>bfqq = bfq_init_rq(rq);
> >> -  if (!bfqq || at_head || blk_rq_is_passthrough(rq)) {
> >> +
> >> +  /*
> >> +   * Additional case for putting rq directly into the dispatch
> >> +   * queue: the only active bfq_queues are bfqq and either its
> >> +   * waker bfq_queue or one of its woken bfq_queues. In this
> >> +   * case, there is no point in queueing rq in bfqq for
> >> +   * service. In fact, the in-service queue and bfqq agree on
> >> +   * serving this new I/O request as soon as possible.
> >> +   */
> >> +  if (!bfqq ||
> >> +  (bfqq != bfqd->in_service_queue &&
> >> +   bfqd->in_service_queue != NULL &&
> >> +   bfq_tot_busy_queues(bfqd) == 1 + bfq_bfqq_busy(bfqq) &&
> >> +   (bfqq->waker_bfqq == bfqd->in_service_queue ||
> >> +bfqd->in_service_queue->waker_bfqq == bfqq)) ||
> >> +  at_head || blk_rq_is_passthrough(rq)) {
> >>if (at_head)
> >>list_add(&rq->queuelist, &bfqd->dispatch);
> >>else
> >> 
> > 
> > This is unreadable... Just seems like you are piling heuristics in to
> > catch some case, and it's neither readable nor clean.
> > 
> 
> Yeah, these comments inappropriately assume that the reader knows the
> waker mechanism in depth.  And they do not stress at all how important
> this improvement is.
> 
> I'll do my best to improve these comments.
> 
> To try to do a better job, let me also explain the matter early here.
> Maybe you or others can give me some early feedback (or just tell me
> to proceed).
> 
> This change is one of the main improvements that boosted
> throughput in Jan's tests.  Here is the rationale:
> - consider a bfq_queue, say Q1, detected as a waker of another
>   bfq_queue, say Q2
> - by definition of a waker, Q1 blocks the I/O of Q2, i.e., some I/O of
>   of Q1 needs to be completed for new I/O of Q1 to arrive.  A notable
   ^^ Q2?

>   example is journald
> - so, Q1 and Q2 are in any respect two cooperating processes: if the
>   service of Q1's I/O is delayed, Q2 can only suffer from it.
>   Conversely, if Q2's I/O is delayed, the purpose of Q1 is just defeated.

What do you exactly mean by this last sentence?

> - as a consequence if some I/O of Q1/Q2 arrives while Q2/Q1 is the
>   only queue in service, there is absolutely no point in delaying the
>   service of such an I/O.  The only possible result is a throughput
>   loss, detected by Jan's test

If we are idling at that moment waiting for more IO from in service queue,
I agree. But that doesn't seem to be part of your condition above?

> - so, when the above condition holds, the most effective and efficient
>   action is to put the new I/O directly in the dispatch list
> - as an additional restriction, Q1 and Q2 must be the only busy queues
>   for this commit to put the I/O of Q2/Q1 in the dispatch list.  This is
>   necessary, because, if also other queues are waiting for service, then
>   putting new I/O directly in the dispatch list may evidently cause a
>   violation of service guarantees for the other queues

This last restriction is not ideal for cases like jbd2 thread since it may
still lead to pointless idling but I understand that without some
restriction like this several waking threads could just starve other ones.
So I guess it's fine for now.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] quota: Add mountpath based quota support

2021-02-02 Thread Jan Kara
On Thu 28-01-21 14:35:52, Christoph Hellwig wrote:
> > +   struct path path, *pathp = NULL;
> > +   struct path mountpath;
> > +   bool excl = false, thawed = false;
> > +   int ret;
> > +
> > +   cmds = cmd >> SUBCMDSHIFT;
> > +   type = cmd & SUBCMDMASK;
> 
> Personal pet peeve: it would be nice to just initialize cmds and
> type on their declaration line, or while we're at it declutter
> this a bit and remove the separate cmds variable:
> 
>   unsigned int type = cmd & SUBCMDMASK;
> 
>   cmd >>= SUBCMDSHIFT;

Yeah, whatever :)

> > +   /*
> > +* Path for quotaon has to be resolved before grabbing superblock
> > +* because that gets s_umount sem which is also possibly needed by path
> > +* resolution (think about autofs) and thus deadlocks could arise.
> > +*/
> > +   if (cmds == Q_QUOTAON) {
> > +   ret = user_path_at(AT_FDCWD, addr,
> > +  LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT, &path);
> > +   if (ret)
> > +   pathp = ERR_PTR(ret);
> > +   else
> > +   pathp = &path;
> > +   }
> > +
> > +   ret = user_path_at(AT_FDCWD, mountpoint,
> > +LOOKUP_FOLLOW | LOOKUP_AUTOMOUNT, &mountpath);
> > +   if (ret)
> > +   goto out;
> 
> I don't think we need two path lookups here, we can path the same path
> to the command for quotaon.

Hum, let me think out loud. The path we pass to Q_QUOTAON is a path to
quota file - unless the filesystem stores quota in hidden files in which
case this argument is just ignored. You're right we could require that
specifically for Q_QUOTAON, the mountpoint path would actually need to
point to the quota file if it is relevant, otherwise anywhere in the
appropriate filesystem. We don't allow quota file to reside on a different
filesystem (for a past decade or so) so it should work fine.

So the only problem I have is whether requiring the mountpoint argument to
point quota file for Q_QUOTAON isn't going to be somewhat confusing to
users. At the very least it would require some careful explanation in the
manpage to explain the difference between quotactl_path() and quotactl()
in this regard. But is saving the second path for Q_QUOTAON really worth the
bother?

> > +   if (quotactl_cmd_onoff(cmds)) {
> > +   excl = true;
> > +   thawed = true;
> > +   } else if (quotactl_cmd_write(cmds)) {
> > +   thawed = true;
> > +   }
> > +
> > +   if (thawed) {
> > +   ret = mnt_want_write(mountpath.mnt);
> > +   if (ret)
> > +   goto out1;
> > +   }
> > +
> > +   sb = mountpath.dentry->d_inode->i_sb;
> > +
> > +   if (excl)
> > +   down_write(&sb->s_umount);
> > +   else
> > +   down_read(&sb->s_umount);
> 
> Given how cheap quotactl_cmd_onoff and quotactl_cmd_write are we
> could probably simplify this down do:
> 
>   if (quotactl_cmd_write(cmd)) {

This needs to be (quotactl_cmd_write(cmd) || quotactl_cmd_onoff(cmd)).
Otherwise I agree what you suggest is somewhat more readable given how
small the function is.

>   ret = mnt_want_write(path.mnt);
>   if (ret)
>   goto out1;
>   }
>   if (quotactl_cmd_onoff(cmd))
>   down_write(&sb->s_umount);
>   else
>   down_read(&sb->s_umount);
> 
> and duplicate the checks after the do_quotactl call.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Revert "bfq: Fix computation of shallow depth"

2021-02-02 Thread Jan Kara
Hello!

On Fri 29-01-21 19:18:08, Lin Feng wrote:
> This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
> 
> bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
> sbitmap_get_shallow, which uses just the number to limit the scan depth of
> each bitmap word, formula:
> scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%

Looking at sbitmap_get_shallow() again more carefully, I agree that I
misunderstood how shallow_depth argument gets used and the original code
was correct and I broke it. Thanks for spotting this!

What I didn't notice is that shallow_depth indeed gets used for each bitmap
word separately and not for bitmap as a whole. I'd say this could use some
more documentation but that's unrelated to your revert. So feel free to add:

Reviewed-by: Jan Kara 

to your patch. Thanks.

Honza

> 
> That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
> But after commit patch 'bfq: Fix computation of shallow depth', we use
> sbitmap.depth instead, as a example in following case:
> 
> sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
> The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
> three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
> nothing. Do we really don't want limit depth for such workloads, or we
> just want to bump up the percentiles to 100%?
> 
> Please correct me if I miss something, thanks.
> 
> Signed-off-by: Lin Feng 
> ---
>  block/bfq-iosched.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 9e4eb0fc1c16..9e81d1052091 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -6332,13 +6332,13 @@ static unsigned int bfq_update_depths(struct bfq_data 
> *bfqd,
>* limit 'something'.
>*/
>   /* no more than 50% of tags for async I/O */
> - bfqd->word_depths[0][0] = max(bt->sb.depth >> 1, 1U);
> + bfqd->word_depths[0][0] = max((1U << bt->sb.shift) >> 1, 1U);
>   /*
>* no more than 75% of tags for sync writes (25% extra tags
>* w.r.t. async I/O, to prevent async I/O from starving sync
>* writes)
>*/
> - bfqd->word_depths[0][1] = max((bt->sb.depth * 3) >> 2, 1U);
> + bfqd->word_depths[0][1] = max(((1U << bt->sb.shift) * 3) >> 2, 1U);
>  
>   /*
>* In-word depths in case some bfq_queue is being weight-
> @@ -6348,9 +6348,9 @@ static unsigned int bfq_update_depths(struct bfq_data 
> *bfqd,
>* shortage.
>*/
>   /* no more than ~18% of tags for async I/O */
> - bfqd->word_depths[1][0] = max((bt->sb.depth * 3) >> 4, 1U);
> + bfqd->word_depths[1][0] = max(((1U << bt->sb.shift) * 3) >> 4, 1U);
>   /* no more than ~37% of tags for sync writes (~20% extra tags) */
> - bfqd->word_depths[1][1] = max((bt->sb.depth * 6) >> 4, 1U);
> + bfqd->word_depths[1][1] = max(((1U << bt->sb.shift) * 6) >> 4, 1U);
>  
>   for (i = 0; i < 2; i++)
>   for (j = 0; j < 2; j++)
> -- 
> 2.25.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 0/3] handle large user and group ID for isofs and udf

2021-01-28 Thread Jan Kara
On Thu 28-01-21 15:12:27, bingjingc wrote:
> From: BingJing Chang 
> 
> The uid/gid (unsigned int) of a domain user may be larger than INT_MAX.
> The parse_options of isofs and udf will return 0, and mount will fail
> with -EINVAL. These patches try to handle large user and group ID.
> 
> BingJing Chang (3):
>   parser: add unsigned int parser
>   isofs: handle large user and group ID
>   udf: handle large user and group ID

Thanks for your patches! Just two notes:

1) I don't think Matthew Wilcox gave you his Reviewed-by tag (at least I
didn't see such email). Generally the rule is that the developer has to
explicitely write in his email that you can attach his Reviewed-by tag for
it to be valid.

2) Please split the cleanup of kernel doc comments in lib/parser.c in a
separate patch. It is unrelated to adding parse_uint() function so it
belongs into a separate patch (we occasionally do include small unrelated
cleanups if they are on the same line we change anyway but your changes
definitely triggered my treshold of this should be a separate patch).

Thanks!

Honza

> 
>  fs/isofs/inode.c   |  9 +
>  fs/udf/super.c |  9 +
>  include/linux/parser.h |  1 +
>  lib/parser.c   | 44 +---
>  4 files changed, 44 insertions(+), 19 deletions(-)
> 
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 0/4] make jbd2 debug switch per device

2021-01-26 Thread Jan Kara
On Fri 22-01-21 14:43:18, Chunguang Xu wrote:
> On a multi-disk machine, because jbd2 debugging switch is global, this
> confuses the logs of multiple disks. It is not easy to distinguish the
> logs of each disk and the amount of generated logs is very large. Or a
> separate debugging switch for each disk would be better, so that you
> can easily distinguish the logs of a certain disk. 
> 
> We can enable jbd2 debugging of a device in the following ways:
> echo X > /proc/fs/jbd2/sdX/jbd2_debug
> 
> But there is a small disadvantage here. Because the debugging switch is
> placed in the journal_t object, the log before the object is initialized
> will be lost. However, usually this will not have much impact on
> debugging.

OK, I didn't look at the series yet but I'm wondering: How are you using
jbd2 debugging? I mean obviously it isn't meant for production use but
rather for debugging JBD2 bugs so I'm kind of wondering in which case too
many messages matter.

And if the problem is that there's a problem with distinguishing messages
from multiple filesystems, then it would be perhaps more useful to add
journal identification to each message similarly as we do it with ext4
messages (likely by using journal->j_dev) - which is very simple to do
after your patches 3 and 4.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: linux-next: Fixes tag needs some work in the block tree

2021-01-26 Thread Jan Kara
On Tue 26-01-21 08:47:55, Jens Axboe wrote:
> On 1/26/21 6:29 AM, Jan Kara wrote:
> > On Mon 25-01-21 11:39:50, Jens Axboe wrote:
> >> On 1/25/21 11:35 AM, Paolo Valente wrote:
> >>>
> >>>
> >>>> Il giorno 25 gen 2021, alle ore 10:40, Stephen Rothwell 
> >>>>  ha scritto:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> In commit
> >>>>
> >>>>  d4fc3640ff36 ("block, bfq: set next_rq to waker_bfqq->next_rq in waker 
> >>>> injection")
> >>>>
> >>>> Fixes tag
> >>>>
> >>>>  Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally 
> >>>> inject their I/O")
> >>>>
> >>>> has these problem(s):
> >>>>
> >>>>  - Target SHA1 does not exist
> >>>>
> >>>> Maybe you meant
> >>>>
> >>>> Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally 
> >>>> inject their I/O")
> >>>>
> >>>
> >>> Hi Jens,
> >>> how to proceed in such a case (with patches already applied by you)?
> >>> Shall I send you a v2 with only this change?
> >>
> >> We just have to ignore it... But in the future, always double check that
> >> you are using the right shas, not some sha from an internal tree.
> > 
> > FWIW I have a commit hook in my git tree that just refuses a commit with
> > unknown Fixes tag SHA. Exactly to catch such mishaps in the patches I
> > merge...
> 
> That's not a bad idea, would help catch it upfront. Can you share the
> hook?

Sure, attached. Note that the hook just gets commit ID from the Fixes tag
and formats it with 12 commit ID digits and appropriate commit subject.

Honza
-- 
Jan Kara 
SUSE Labs, CR
#!/bin/sh
#
# Called by "git commit" with one argument, the name of the file
# that has the commit message.  The hook should exit with non-zero
# status after issuing an appropriate message if it wants to stop the
# commit.  The hook is allowed to edit the commit message file.

# Process all Fixes tags, check commit IDs and set appropriate commit titles.

for COMMIT in $(sed -n -e 's/^Fixes: \([0-9a-z]*\).*/\1/p' "$1"); do
GOOD=$(git show -s --abbrev-commit --abbrev=12 --pretty=format:"%h 
(\"%s\")%n" $COMMIT -- 2>/dev/null)
if [ -z "$GOOD" ]; then
echo "Unknown commit: $COMMIT"
exit 1
fi
echo "Setting fixes tag: $GOOD"
sed -i -e "s/^Fixes: $COMMIT.*/Fixes: $GOOD/" "$1"
done

exit 0


Re: linux-next: Fixes tag needs some work in the block tree

2021-01-26 Thread Jan Kara
On Mon 25-01-21 11:39:50, Jens Axboe wrote:
> On 1/25/21 11:35 AM, Paolo Valente wrote:
> > 
> > 
> >> Il giorno 25 gen 2021, alle ore 10:40, Stephen Rothwell 
> >>  ha scritto:
> >>
> >> Hi all,
> >>
> >> In commit
> >>
> >>  d4fc3640ff36 ("block, bfq: set next_rq to waker_bfqq->next_rq in waker 
> >> injection")
> >>
> >> Fixes tag
> >>
> >>  Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally 
> >> inject their I/O")
> >>
> >> has these problem(s):
> >>
> >>  - Target SHA1 does not exist
> >>
> >> Maybe you meant
> >>
> >> Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject 
> >> their I/O")
> >>
> > 
> > Hi Jens,
> > how to proceed in such a case (with patches already applied by you)?
> > Shall I send you a v2 with only this change?
> 
> We just have to ignore it... But in the future, always double check that
> you are using the right shas, not some sha from an internal tree.

FWIW I have a commit hook in my git tree that just refuses a commit with
unknown Fixes tag SHA. Exactly to catch such mishaps in the patches I
merge...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] isofs: release buffer head before return

2021-01-25 Thread Jan Kara
On Mon 18-01-21 15:41:23, Jan Kara wrote:
> On Mon 18-01-21 04:04:55, Pan Bian wrote:
> > Release the buffer header before returning error code.
> > 
> > Fixes: 2deb1acc653c ("isofs: fix access to unallocated memory when reading 
> > corrupted filesystem")
> > Signed-off-by: Pan Bian 
> 
> OK, good spotting. But the other hunk in commit 2deb1acc653c seems to have
> the same problem so we might fix it as well when we are at it?

OK, I did this myself and picked the patch to my tree.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v2 2/4] jbd2: introduce some new log interfaces

2021-01-25 Thread Jan Kara
On Sat 23-01-21 20:00:44, Chunguang Xu wrote:
> From: Chunguang Xu 
> 
> Compared to directly using numbers to indicate levels, using abstract
> error, warn, notice, info, debug to indicate levels may be more
> convenient for code reading and writing. Similar to other kernel
> modules, some basic log interfaces are introduced.
> 
> Signed-off-by: Chunguang Xu 

One more thing I've noticed when reading this patch:

> +
> +#ifdef CONFIG_JBD2_DEBUG
> +/*
> + * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
> + * consistency checks.  By default we don't do this unless
> + * CONFIG_JBD2_DEBUG is on.
> + */
> +#define JBD2_EXPENSIVE_CHECKING
> +extern ushort jbd2_journal_enable_debug;
> +void jbd2_log(int level, journal_t *j, const char *file, const char *func,
> +   unsigned int line, const char *fmt, ...);
> +
> +#define JBD2_ERR 1   /* error conditions */
> +#define JBD2_WARN2   /* warning conditions */
> +#define JBD2_NOTICE  3   /* normal but significant condition */
> +#define JBD2_INFO4   /* informational */
> +#define JBD2_DEBUG   5   /* debug-level messages */

This is actually not true. All the jbd_debug() messages are really debug
messages, not errors, not warnings. It is just a different level of detail.
Honestly, these days, I'd rather discard all the levels, use pr_debug()
function to print these messages inside jdb2_debug() and defer to
CONFIG_DYNAMIC_DEBUG framework for configuration of which messages are
interesting for a particular debug session.

Honza

> +
> +#define jbd2_err(j, fmt, a...)   
> \
> + jbd2_log(JBD2_ERR, j, __FILE__, __func__, __LINE__, (fmt), ##a)
> +
> +#define jbd2_warn(j, fmt, a...)  
> \
> + jbd2_log(JBD2_WARN, j, __FILE__, __func__, __LINE__, (fmt), ##a)
> +
> +#define jbd2_notice(j, fmt, a...)\
> + jbd2_log(JBD2_NOTICE, j, __FILE__, __func__, __LINE__, (fmt), ##a)
> +
> +#define jbd2_info(j, fmt, a...)  
> \
> + jbd2_log(JBD2_INFO, j, __FILE__, __func__, __LINE__, (fmt), ##a)
> +
> +#define jbd2_debug(j, fmt, a...) \
> + jbd2_log(JBD2_DEBUG, j, __FILE__, __func__, __LINE__, (fmt), ##a)
> +
> +#else
> +
> +#define jbd2_err(j, fmt, a...)
> +#define jbd2_warn(j, fmt, a...)
> +#define jbd2_notice(j, fmt, a...)
> +#define jbd2_info(j, fmt, a...)
> +#define jbd2_debug(j, fmt, a...)
> +
> +#endif
>  #endif
>  
>  /*
> -- 
> 2.30.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Expense of read_iter

2021-01-20 Thread Jan Kara
On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > On Thu, 14 Jan 2021, Mikulas wrote:
> > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > compared to the overhead of struct iov_iter. Although methods
> > proposed by Mikulas can eliminate the overhead of iov_iter
> > well, they can not be applied in Ext4-dax unless we implement an
> > internal "read" method in Ext4-dax.
> > 
> > For Ext4-dax, there could be two approaches to optimizing:
> > 1) implementing the internal "read" method without the complexity
> > of iterators and dax_iomap_rw;
> 
> Please do not go an re-invent the wheel just for ext4. If there's a
> problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> as well, so any improvements to that path benefit all DAX users, not
> just ext4.
> 
> > 2) optimizing how dax_iomap_rw works.
> > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > the iomap structure and others (e.g., journaling status locks in Ext4),
> > we think implementing the internal "read" method would be easier.
> 
> Maybe it is, but it's also very selfish. The DAX iomap path was
> written to be correct for all users, not inecessarily provide
> optimal performance. There will be lots of things that could be done
> to optimise it, so rather than creating a special snowflake in ext4
> that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> developers, please work to improve the common DAX IO path and so
> provide the same benefit to all the filesystems that use it.

Yeah, I agree. I'm against ext4 private solution for this read problem. And
I'm also against duplicating ->read_iter functionatily in ->read handler.
The maintenance burden of this code duplication is IMHO just too big. We
rather need to improve the generic code so that the fast path is faster.
And every filesystem will benefit because this is not ext4 specific
problem.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] [v3] block: Fix an error handling in add_partition

2021-01-18 Thread Jan Kara
On Sun 17-01-21 16:53:42, Dinghao Liu wrote:
> Once we have called device_initialize(), we should use put_device() to
> give up the reference on error, just like what we have done on failure
> of device_add().
> 
> Signed-off-by: Dinghao Liu 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> 
> Changelog:
> 
> v2: - Refine commit message.
> 
> v3: - Add '[v3]' to the title.
> ---
>  block/partitions/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/partitions/core.c b/block/partitions/core.c
> index e7d776db803b..23460cee9de5 100644
> --- a/block/partitions/core.c
> +++ b/block/partitions/core.c
> @@ -384,7 +384,7 @@ static struct block_device *add_partition(struct gendisk 
> *disk, int partno,
>  
>   err = blk_alloc_devt(bdev, &devt);
>   if (err)
> - goto out_bdput;
> + goto out_put;
>   pdev->devt = devt;
>  
>   /* delay uevent until 'holders' subdir is created */
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] isofs: release buffer head before return

2021-01-18 Thread Jan Kara
On Mon 18-01-21 04:04:55, Pan Bian wrote:
> Release the buffer header before returning error code.
> 
> Fixes: 2deb1acc653c ("isofs: fix access to unallocated memory when reading 
> corrupted filesystem")
> Signed-off-by: Pan Bian 

OK, good spotting. But the other hunk in commit 2deb1acc653c seems to have
the same problem so we might fix it as well when we are at it?

Honza

> ---
>  fs/isofs/dir.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/isofs/dir.c b/fs/isofs/dir.c
> index f0fe641893a5..b9e6a7ec78be 100644
> --- a/fs/isofs/dir.c
> +++ b/fs/isofs/dir.c
> @@ -152,6 +152,7 @@ static int do_isofs_readdir(struct inode *inode, struct 
> file *file,
>   printk(KERN_NOTICE "iso9660: Corrupted directory entry"
>  " in block %lu of inode %lu\n", block,
>  inode->i_ino);
> + brelse(bh);
>   return -EIO;
>   }
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy

2021-01-15 Thread Jan Kara
On Sat 09-01-21 11:46:46, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox  wrote:
> >
> > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote:
> > > Side note, and not really related to UFFD, but the mmap_sem in
> > > general: I was at one point actually hoping that we could make the
> > > mmap_sem a spinlock, or at least make the rule be that we never do any
> > > IO under it. At which point a write lock hopefully really shouldn't be
> > > such a huge deal.
> >
> > There's a (small) group of us working towards that.  It has some
> > prerequisites, but where we're hoping to go currently:
> >
> >  - Replace the vma rbtree with a b-tree protected with a spinlock
> >  - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset
> >  - If we need to do I/O, take a refcount on the VMA
> >
> > After that, we can gradually move things out from mmap_sem protection
> > to just the vma tree spinlock, or whatever makes sense for them.  In a
> > very real way the mmap_sem is the MM layer's BKL.
> 
> Well, we could do the "no IO" part first, and keep the semaphore part.
> 
> Some people actually prefer a semaphore to a spinlock, because it
> doesn't end up causing preemption issues.
> 
> As long as you don't do IO (or memory allocations) under a semaphore
> (ok, in this case it's a rwsem, same difference), it might even be
> preferable to keep it as a semaphore rather than as a spinlock.
> 
> So it doesn't necessarily have to go all the way - we _could_ just try
> something like "when taking the mmap_sem, set a thread flag" and then
> have a "warn if doing allocations or IO under that flag".
> 
> And since this is about performance, not some hard requirement, it
> might not even matter if we catch all cases.  If we fix it so that any
> regular load on most normal filesystems never see the warning, we'd
> already be golden.

Honestly, I'd *love* if a filesystem can be guaranteed that ->fault and
->mkwrite callbacks do not happen under mmap_sem (or if at least fs would
be free to drop mmap_sem if it finds the page is not already cached /
prepared for writing). Because for filesystems the locking of page fault is
really painful as the lock ordering wrt mmap_sem is exactly oposite
compared to read / write path (read & write path must be designed so that
mmap_sem can be taken inside it to copy user data, fault path may be all
happening under mmap_sem). As a result this has been a long term source of
deadlocks, stale data exposure issues, and filesystem corruption issues due
to insufficient locking for multiple filesystems.

But when I was looking at what it would take to achieve this several years
ago, fixing all GUP users to deal with mmap_sem being dropped during a
fault was a gigantic task because there were users of GUP relying on
mmap_sem being held for large code sections around the GUP call...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

2021-01-15 Thread Jan Kara
On Thu 07-01-21 13:53:18, John Hubbard wrote:
> On 1/7/21 1:29 PM, Linus Torvalds wrote:
> > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli  
> > wrote:
> > > 
> > > The problem is it's not even possible to detect reliably if there's
> > > really a long term GUP pin because of speculative pagecache lookups.
> > 
> > So none of the normal code _needs_ that any more these days, which is
> > what I think is so nice. Any pinning will do the COW, and then we have
> > the logic to make sure it stays writable, and that keeps everything
> > nicely coherent and is all fairly simple.
> > 
> > And yes, it does mean that if somebody then explicitly write-protects
> > a page, it may end up being COW'ed after all, but if you first pinned
> > it, and then started playing with the protections of that page, why
> > should you be surprised?
> > 
> > So to me, this sounds like a "don't do that then" situation.
> > 
> > Anybody who does page pinning and wants coherency should NOT TOUCH THE
> > MAPPING IT PINNED.
> > 
> > (And if you do touch it, it's your own fault, and you get to keep both
> > of the broken pieces)
> > 
> > Now, I do agree that from a QoI standpoint, it would be really lovely
> > if we actually enforced it. I'm not entirely sure we can, but maybe it
> > would be reasonable to use that
> > 
> >mm->has_pinned && page_maybe_dma_pinned(page)
> > 
> > at least as the beginning of a heuristic.
> > 
> > In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> > made stronger than it is. Because at *THAT* point, we might say "we
> 
> What exactly did you have in mind, to make it stronger? I think the
> answer is in this email but I don't quite see it yet...
> 
> Also, now seems to be a good time to mention that I've been thinking about
> a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer
> to peer with GUP/NIC, and HMM support for atomic operations from a device).
> And it seems like the following approach would help:
> 
> * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks
> to Jason for this point) means "user space owns the lifetime". We might
> even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would
> make them mean the same thing. The idea is that there are no "short term"
> pins of this kind of memory.
> 
> * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans,
> because several of us had thought that Direct IO needs FOLL_PIN. However, this
> recent conversation, plus my list of cases above, seems to indicate otherwise.
> That's because we only have one refcount approach for marking pages in this 
> way,
> and we should spend it on the long-term pinned pages. Those are both hard to
> identify otherwise, and actionable once we identify them.

Somewhat late to the game but I disagree here. I think direct IO still
needs FOLL_PIN so that page_may_be_dma_pinned() returns true for it. At
least for shared pages. Because filesystems/mm in the writeback path need
to detect whether the page is pinned and thus its contents can change
anytime without noticing, the page can be dirtied at random times etc. In
that case we need to bounce the page during writeback (to avoid checksum
failures), keep page as dirty in internal filesystem bookkeeping (and in MM
as well) etc...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] udf: fix the problem that the disc content is not displayed

2021-01-14 Thread Jan Kara
On Thu 14-01-21 21:26:15, lianzhi chang wrote:
> When the capacity of the disc is too large (assuming the 4.7G
> specification), the disc (UDF file system) will be burned
> multiple times in the windows (Multisession Usage). When the
> remaining capacity of the CD is less than 300M (estimated
> value, for reference only), open the CD in the Linux system,
> the content of the CD is displayed as blank (the kernel will
> say "No VRS found"). Windows can display the contents of the
> CD normally.
> Through analysis, in the "fs/udf/super.c": udf_check_vsd
> function, the actual value of VSD_MAX_SECTOR_OFFSET may
> be much larger than 0x80. According to the current code
> logic, it is found that the type of sbi->s_session is "__s32",
>  when the remaining capacity of the disc is less than 300M
> (take a set of test values: sector=3154903040,
> sbi->s_session=1540464, sb->s_blocksize_bits=11 ), the
> calculation result of "sbi->s_session << sb->s_blocksize_bits"
>  will overflow. Therefore, it is necessary to convert the
> type of s_session to "loff_t" (when udf_check_vsd starts,
> assign a value to _sector, which is also converted in this
> way), so that the result will not overflow, and then the
> content of the disc can be displayed normally.
> 
> Signed-off-by: lianzhi chang 

There was no need to resend the patch (I've fixed up the problem locally -
it was easy enough) but thanks anyway :).

Honza

> ---
>  fs/udf/super.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/udf/super.c b/fs/udf/super.c
> index 5bef3a68395d..f2ff98f393fb 100644
> --- a/fs/udf/super.c
> +++ b/fs/udf/super.c
> @@ -705,6 +705,7 @@ static int udf_check_vsd(struct super_block *sb)
>   struct buffer_head *bh = NULL;
>   int nsr = 0;
>   struct udf_sb_info *sbi;
> + loff_t sector_offset;
>  
>   sbi = UDF_SB(sb);
>   if (sb->s_blocksize < sizeof(struct volStructDesc))
> @@ -712,7 +713,8 @@ static int udf_check_vsd(struct super_block *sb)
>   else
>   sectorsize = sb->s_blocksize;
>  
> - sector += (((loff_t)sbi->s_session) << sb->s_blocksize_bits);
> + sector_offset = (loff_t)sbi->s_session << sb->s_blocksize_bits;
> + sector += sector_offset;
>  
>   udf_debug("Starting at sector %u (%lu byte sectors)\n",
> (unsigned int)(sector >> sb->s_blocksize_bits),
> @@ -757,8 +759,7 @@ static int udf_check_vsd(struct super_block *sb)
>  
>   if (nsr > 0)
>   return 1;
> -     else if (!bh && sector - (sbi->s_session << sb->s_blocksize_bits) ==
> - VSD_FIRST_SECTOR_OFFSET)
> + else if (!bh && sector - sector_offset == VSD_FIRST_SECTOR_OFFSET)
>   return -1;
>   else
>   return 0;
> -- 
> 2.20.1
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] udf: fix hole append when File Tail exists

2021-01-14 Thread Jan Kara
On Sat 09-01-21 16:40:53, Steve Magnani wrote:
> From: Steven J. Magnani 
> 
> When an ALLOCATED_NOT_RECORDED extent ("File Tail") follows the
> extents describing a file body, don't (improperly) try to make use of it
> as part of appending a hole to the file.
> 
> Fixes: fa33cdbf3ece ("udf: Fix incorrect final NOT_ALLOCATED (hole) extent 
> length")
> Signed-off-by: Steven J. Magnani 

Thanks for the fix! One comment below.

> --- a/fs/udf/inode.c  2020-12-28 20:51:29.0 -0600
> +++ b/fs/udf/inode.c  2021-01-02 17:00:39.794157840 -0600
> @@ -520,8 +520,7 @@ static int udf_do_extend_file(struct ino
>   prealloc_loc = last_ext->extLocation;
>   prealloc_len = last_ext->extLength;
>   /* Mark the extent as a hole */
> - last_ext->extLength = EXT_NOT_RECORDED_NOT_ALLOCATED |
> - (last_ext->extLength & UDF_EXTENT_LENGTH_MASK);
> + last_ext->extLength = EXT_NOT_RECORDED_NOT_ALLOCATED;
>   last_ext->extLocation.logicalBlockNum = 0;
>   last_ext->extLocation.partitionReferenceNum = 0;
>   }
> @@ -626,7 +625,6 @@ static void udf_do_extend_final_block(st
>  
>  static int udf_extend_file(struct inode *inode, loff_t newsize)
>  {
> -
>   struct extent_position epos;
>   struct kernel_lb_addr eloc;
>   uint32_t elen;
> @@ -639,6 +637,7 @@ static int udf_extend_file(struct inode
>   struct kernel_long_ad extent;
>   int err = 0;
>   int within_final_block;
> + loff_t hole_size = 0;
>  
>   if (iinfo->i_alloc_type == ICBTAG_FLAG_AD_SHORT)
>   adsize = sizeof(struct short_ad);
> @@ -648,7 +647,8 @@ static int udf_extend_file(struct inode
>   BUG();
>  
>   etype = inode_bmap(inode, first_block, &epos, &eloc, &elen, &offset);
> - within_final_block = (etype != -1);
> + within_final_block = (etype == (EXT_RECORDED_ALLOCATED >> 30)) ||
> +  (etype == (EXT_NOT_RECORDED_NOT_ALLOCATED >> 30));
>  
>   if ((!epos.bh && epos.offset == udf_file_entry_alloc_offset(inode)) ||
>   (epos.bh && epos.offset == sizeof(struct allocExtDesc))) {
> @@ -658,9 +658,15 @@ static int udf_extend_file(struct inode
>   extent.extLocation.partitionReferenceNum = 0;
>   extent.extLength = EXT_NOT_RECORDED_NOT_ALLOCATED;
>   } else {
> + int8_t bmap_etype = etype;
>   epos.offset -= adsize;
>   etype = udf_next_aext(inode, &epos, &extent.extLocation,
> &extent.extLength, 0);
> + if ((bmap_etype == -1) &&
> + (etype == (EXT_NOT_RECORDED_ALLOCATED >> 30))) {
> + /* offset is relative to prealloc extent end */
> + hole_size = extent.extLength;

Rather than introducing 'hole_size', I think we can just update 'offset'
here?

> + }
>   extent.extLength |= etype << 30;
>   }
>  
...

> @@ -729,14 +735,22 @@ static sector_t inode_getblk(struct inod
>   cur_epos.bh = next_epos.bh;
>   }
>  
> - lbcount += elen;
> -
>   prev_epos.block = cur_epos.block;
>   cur_epos.block = next_epos.block;
>  
>   prev_epos.offset = cur_epos.offset;
>   cur_epos.offset = next_epos.offset;
>  
> + /* Corner case: preallocated extent that stops short of
> +  * desired block
> +  */
> + if ((etype == (EXT_NOT_RECORDED_ALLOCATED >> 30)) &&
> + ((lbcount + elen) <= b_off)) {
> + etype = -1;
> + break;
> + }
> +
> + lbcount += elen;

Honestly, I don't like this special case and it seems rather arbitrary.
Also any EXT_NOT_RECORDED_ALLOCATED inside the file would now break the
block lookup while previously it was gracefully handled. Also I'm not sure
why it's event needed - even if inode_getblk() returns returns -1 and the
'extent' loaded by udf_next_aext() ends up being EXT_NOT_RECORDED_ALLOCATED,
we will end up passing it to udf_do_extend_file() which recognizes it as
preallocation extent and will insert a hole extent before it? Am I missing
something?

Honza


>   etype = udf_next_aext(inode, &next_epos, &eloc, &elen, 1);
>   if (etype == -1)
>   break;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/1] udf: fix silent AED tagLocation corruption

2021-01-14 Thread Jan Kara
On Thu 07-01-21 17:41:16, Steve Magnani wrote:
> From: Steven J. Magnani 
> 
> When extending a file, make sure that the pointer to the last written
> extent does not get adjusted into the header (tag) portion of an
> Allocation Extent Descriptor. Otherwise, a subsequent call to
> udf_update_exents() can clobber the AED's tagLocation field, replacing
> it with the logical block number of a file extent.
> 
> Signed-off-by: Steven J. Magnani 

Thanks! It took me some time to understand what's the actual problem but
eventually I've understood - I've updated the changelog and a comment to
explain a bit more and merged the patch into my tree.

Honza

> ---
> --- a/fs/udf/inode.c  2020-12-28 20:51:29.0 -0600
> +++ b/fs/udf/inode.c  2021-01-01 19:20:37.163767576 -0600
> @@ -547,11 +547,14 @@ static int udf_do_extend_file(struct ino
>  
>   udf_write_aext(inode, last_pos, &last_ext->extLocation,
>   last_ext->extLength, 1);
> - /*
> -  * We've rewritten the last extent but there may be empty
> -  * indirect extent after it - enter it.
> -  */
> - udf_next_aext(inode, last_pos, &tmploc, &tmplen, 0);
> +
> + if (new_block_bytes || prealloc_len) {
> + /*
> +  * We've rewritten the last extent but there may be 
> empty
> +  * indirect extent after it - enter it.
> +  */
> + udf_next_aext(inode, last_pos, &tmploc, &tmplen, 0);
> + }
>   }
>  
>   /* Managed to do everything necessary? */
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] udf: fix the problem that the disc content is not displayed

2021-01-14 Thread Jan Kara
On Thu 14-01-21 15:57:41, lianzhi chang wrote:
> When the capacity of the disc is too large (assuming the 4.7G
> specification), the disc (UDF file system) will be burned
> multiple times in the windows (Multisession Usage). When the
> remaining capacity of the CD is less than 300M (estimated
> value, for reference only), open the CD in the Linux system,
> the content of the CD is displayed as blank (the kernel will
> say "No VRS found"). Windows can display the contents of the
> CD normally.
> Through analysis, in the "fs/udf/super.c": udf_check_vsd
> function, the actual value of VSD_MAX_SECTOR_OFFSET may
> be much larger than 0x80. According to the current code
> logic, it is found that the type of sbi->s_session is "__s32",
>  when the remaining capacity of the disc is less than 300M
> (take a set of test values: sector=3154903040,
> sbi->s_session=1540464, sb->s_blocksize_bits=11 ), the
> calculation result of "sbi->s_session << sb->s_blocksize_bits"
>  will overflow. Therefore, it is necessary to convert the
> type of s_session to "loff_t" (when udf_check_vsd starts,
> assign a value to _sector, which is also converted in this
> way), so that the result will not overflow, and then the
> content of the disc can be displayed normally.
> 
> Signed-off-by: lianzhi chang 
> ---
>  fs/udf/super.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/udf/super.c b/fs/udf/super.c
> index 5bef3a68395d..f2ff98f393fb 100644
> --- a/fs/udf/super.c
> +++ b/fs/udf/super.c
> @@ -705,6 +705,7 @@ static int udf_check_vsd(struct super_block *sb)
>   struct buffer_head *bh = NULL;
>   int nsr = 0;
>   struct udf_sb_info *sbi;
> + loff_t sector_offset;
>  
>   sbi = UDF_SB(sb);
>   if (sb->s_blocksize < sizeof(struct volStructDesc))
> @@ -712,7 +713,8 @@ static int udf_check_vsd(struct super_block *sb)
>   else
>   sectorsize = sb->s_blocksize;
>  
> - sector += (((loff_t)sbi->s_session) << sb->s_blocksize_bits);
> + sector_offset = (loff_t)sbi->s_session << sb->s_blocksize_bits);

There's imbalanced parentheses here, I'll fix it up on commit. Otherwise
the fix looks good to me. Thanks!

Honza

> + sector += sector_offset;
>  
>   udf_debug("Starting at sector %u (%lu byte sectors)\n",
> (unsigned int)(sector >> sb->s_blocksize_bits),
> @@ -757,8 +759,7 @@ static int udf_check_vsd(struct super_block *sb)
>  
>   if (nsr > 0)
>   return 1;
> -     else if (!bh && sector - (sbi->s_session << sb->s_blocksize_bits) ==
> - VSD_FIRST_SECTOR_OFFSET)
> + else if (!bh && sector - sector_offset == VSD_FIRST_SECTOR_OFFSET)
>   return -1;
>   else
>   return 0;
> -- 
> 2.20.1
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/1] udf: fix silent AED tagLocation corruption

2021-01-13 Thread Jan Kara
ta > /tmp/provoker.$$
> make_test_udf $TEST_UDFFS
> 
> # Verify hardcoded LSNs involved in testing
> require_tag_id  $PD_LSN   5  # PD
> require_tag_id $AED_LSN 258  # AED
> 
> LBN_BASE=`extract32 $PD_LSN 188`   # Partition Starting Location
> AED_LBN=`expr $AED_LSN - $LBN_BASE`
> AED_TAG_LOCATION=`extract32 $AED_LSN 12`
> 
> if [ $AED_TAG_LOCATION -ne $AED_LBN ] ; then
> echo Test FAILED: expected AED tag location $AED_LBN, actual is 
> $AED_TAG_LOCATION
> exit 1
> else
> echo Test PASSED. AED tag location field is correct.
> fi
> 
>  Steven J. Magnani   "I claim this network for MARS!
>   Earthling, return my space modulator!"
>  #include 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: kernel BUG at mm/page-writeback.c:LINE!

2021-01-12 Thread Jan Kara
On Fri 08-01-21 18:04:21, Linus Torvalds wrote:
> On Tue, Jan 5, 2021 at 11:53 AM Linus Torvalds
>  wrote:
> >
> > I took your "way to go" statement as an ack, and made it all be commit
> > c2407cf7d22d ("mm: make wait_on_page_writeback() wait for multiple
> > pending writebacks").
> 
> Oh, and Michael Larabel (of phoronix) reports that that one-liner does
> something bad to a few PostgreSQL tests, on the order of 5-10%
> regression on some machines (but apparently not others).

Do you have more details? From my experience (we do regular pgbench runs
for various kernels in various configs in SUSE) PostgreSQL numbers tend to
be somewhat noisy and more dependent on CPU scheduling and NUMA locality
than anything else. But it very much depends on the exact config passed to
pgbench so that's why I'm asking...

> I suspect that's a sign of instability in the benchmark numbers, but
> it probably also means that we have some silly condition where
> multiple threads want to clean the same page.
> 
> I sent him a patch to try if it ends up being better to just not wake
> things up early at all (instead of the "if" -> "while") conversion.
> That trivial patch appended here in case anybody has comments.
> 
> Just the fact that that one-liner made a performance impact makes me
> go "hmm", though. Michael didn't see the BUG_ON(), so it's presumably
> some _other_ user of wait_on_page_writeback() than the
> write_cache_pages() one that causes issues.
> 
> Anybody got any suspicions? Honestly, when working on the page wait
> queues, I was working under the assumption that it's really just the
> page lock that truly matters.
> 
> I'm thinking things like __filemap_fdatawait_range(), which doesn't
> hold the page lock at all, so it's all kinds of non-serialized, and
> could now be waiting for any number of IO's ro complete..
> 
> Oh well. This email doesn't really have a point, it's more of a
> heads-up that that "wait to see one or multiple writebacks" thing
> seems to matter more than I would have expected for some loads..

Honestly I'm surprised your patch made a difference as well. It is pretty
common a page gets redirtied while it's being written back but usually it
takes time before next writeback of the page is started. But I guess with
the DB load it is possible e.g. if we frequently flush out some page for
data consistency reasons (I know PostgreSQL is using sync_file_range(2)
interface to start flushing pages early and then uses fsync(2) when it
really needs the pages written which could create a situation with unfair
treatment of PageWriteback bit).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] block: fallocate: avoid false positive on collision detection

2021-01-07 Thread Jan Kara
On Thu 07-01-21 14:40:22, Maxim Levitsky wrote:
> Align start and end on page boundaries before calling
> invalidate_inode_pages2_range.
> 
> This might allow us to miss a collision if the write and the discard were done
> to the same page and do overlap but it is still better than returning -EBUSY
> if those writes didn't overlap.
> 
> Signed-off-by: Maxim Levitsky 

Thanks for getting back to this and I'm sorry I didn't get to this earlier
myself! I actually think the fix should be different as we discussed with
Darrick. Attached patch should fix the issue for you (I'll also post it
formally for inclusion).

Honza

> ---
>  fs/block_dev.c | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9e84b1928b94..97f0d16661b5 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1970,6 +1970,7 @@ static long blkdev_fallocate(struct file *file, int 
> mode, loff_t start,
>   loff_t end = start + len - 1;
>   loff_t isize;
>   int error;
> + pgoff_t invalidate_first_page, invalidate_last_page;
>  
>   /* Fail if we don't recognize the flags. */
>   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> @@ -2020,12 +2021,23 @@ static long blkdev_fallocate(struct file *file, int 
> mode, loff_t start,
>  
>   /*
>* Invalidate again; if someone wandered in and dirtied a page,
> -  * the caller will be given -EBUSY.  The third argument is
> -  * inclusive, so the rounding here is safe.
> +  * the caller will be given -EBUSY.
> +  *
> +  * If the start/end of the range is not page aligned, exclude the
> +  * non aligned regions to avoid false positives.
>*/
> + invalidate_first_page = DIV_ROUND_UP(start, PAGE_SIZE);
> + invalidate_last_page = end >> PAGE_SHIFT;
> +
> + if ((end + 1) & PAGE_MASK)
> + invalidate_last_page--;
> +
> + if (invalidate_last_page < invalidate_first_page)
> + return 0;
> +
>   return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
> -  start >> PAGE_SHIFT,
> -  end >> PAGE_SHIFT);
> +          invalidate_first_page,
> +      invalidate_last_page);
>  }
>  
>  const struct file_operations def_blk_fops = {
> -- 
> 2.26.2
> 
-- 
Jan Kara 
SUSE Labs, CR
>From 36f751ac88420a6bda8a3c161986455629dc80d4 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Thu, 7 Jan 2021 16:26:52 +0100
Subject: [PATCH] bdev: Do not return EBUSY if bdev discard races with write

blkdev_fallocate() tries to detect whether a discard raced with an
overlapping write by calling invalidate_inode_pages2_range(). However
this check can give both false negatives (when writing using direct IO
or when writeback already writes out the written pagecache range) and
false positives (when write is not actually overlapping but ends in the
same page when blocksize < pagesize). This actually causes issues for
qemu which is getting confused by EBUSY errors.

Fix the problem by removing this conflicting write detection since it is
inherently racy and thus of little use anyway.

Reported-by: Maxim Levitsky 
CC: "Darrick J. Wong" 
Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..a97f43b49839 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1797,13 +1797,11 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		return error;
 
 	/*
-	 * Invalidate again; if someone wandered in and dirtied a page,
-	 * the caller will be given -EBUSY.  The third argument is
-	 * inclusive, so the rounding here is safe.
+	 * Invalidate the page cache again; if someone wandered in and dirtied
+	 * a page, we just discard it - userspace has no way of knowing whether
+	 * the write happened before or after discard completing...
 	 */
-	return invalidate_inode_pages2_range(bdev->bd_inode->i_mapping,
-	 start >> PAGE_SHIFT,
-	 end >> PAGE_SHIFT);
+	return truncate_bdev_range(bdev, file->f_mode, start, end);
 }
 
 const struct file_operations def_blk_fops = {
-- 
2.26.2



Re: [PATCH 08/10] md: Implement ->corrupted_range()

2021-01-06 Thread Jan Kara
On Thu 31-12-20 00:55:59, Shiyang Ruan wrote:
> With the support of ->rmap(), it is possible to obtain the superblock on
> a mapped device.
> 
> If a pmem device is used as one target of mapped device, we cannot
> obtain its superblock directly.  With the help of SYSFS, the mapped
> device can be found on the target devices.  So, we iterate the
> bdev->bd_holder_disks to obtain its mapped device.
> 
> Signed-off-by: Shiyang Ruan 

Thanks for the patch. Two comments below.

> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4688bff19c20..9f9a2f3bf73b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -256,21 +256,16 @@ static int pmem_rw_page(struct block_device *bdev, 
> sector_t sector,
>  static int pmem_corrupted_range(struct gendisk *disk, struct block_device 
> *bdev,
>   loff_t disk_offset, size_t len, void *data)
>  {
> - struct super_block *sb;
>   loff_t bdev_offset;
>   sector_t disk_sector = disk_offset >> SECTOR_SHIFT;
> - int rc = 0;
> + int rc = -ENODEV;
>  
>   bdev = bdget_disk_sector(disk, disk_sector);
>   if (!bdev)
> - return -ENODEV;
> + return rc;
>  
>   bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
> - sb = get_super(bdev);
> - if (sb && sb->s_op->corrupted_range) {
> - rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
> data);
> - drop_super(sb);
> - }
> + rc = bd_corrupted_range(bdev, bdev_offset, bdev_offset, len, data);
>  
>   bdput(bdev);
>   return rc;

This (and the fs/block_dev.c change below) is just refining the function
you've implemented in the patch 6. I think it's confusing to split changes
like this - why not implement things correctly from the start in patch 6?

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 9e84b1928b94..0e50f0e8e8af 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1171,6 +1171,27 @@ struct bd_holder_disk {
>   int refcnt;
>  };
>  
> +static int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_t 
> off,
> +   size_t len, void *data)
> +{
> + struct bd_holder_disk *holder;
> + struct gendisk *disk;
> + int rc = 0;
> +
> + if (list_empty(&(bdev->bd_holder_disks)))
> + return -ENODEV;

This will not compile for !CONFIG_SYSFS kernels. Not that it would be
common but still. Also I'm not sure whether using bd_holder_disks like this
is really the right thing to do (when it seems to be only a sysfs thing),
although admittedly I'm not aware of a better way of getting this
information.

Honza

> +
> + list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
> + disk = holder->disk;
> + if (disk->fops->corrupted_range) {
> + rc = disk->fops->corrupted_range(disk, bdev, off, len, 
> data);
> + if (rc != -ENODEV)
> + break;
> + }
> + }
> + return rc;
> +}
> +
>  static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
> struct gendisk *disk)
>  {
> @@ -1378,6 +1399,22 @@ void bd_set_nr_sectors(struct block_device *bdev, 
> sector_t sectors)
>  }
>  EXPORT_SYMBOL(bd_set_nr_sectors);
>  
> +int bd_corrupted_range(struct block_device *bdev, loff_t disk_off, loff_t 
> bdev_off, size_t len, void *data)
> +{
> + struct super_block *sb = get_super(bdev);
> + int rc = 0;
> +
> + if (!sb) {
> + rc = bd_disk_holder_corrupted_range(bdev, disk_off, len, data);
> + return rc;
> + } else if (sb->s_op->corrupted_range)
> + rc = sb->s_op->corrupted_range(sb, bdev, bdev_off, len, data);
> + drop_super(sb);
> +
> + return rc;
> +}
> +EXPORT_SYMBOL(bd_corrupted_range);
> +
>  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int 
> for_part);
>  
>  int bdev_disk_changed(struct block_device *bdev, bool invalidate)
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index ed06209008b8..42290470810d 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -376,6 +376,8 @@ void revalidate_disk_size(struct gendisk *disk, bool 
> verbose);
>  bool bdev_check_media_change(struct block_device *bdev);
>  int __invalidate_device(struct block_device *bdev, bool kill_dirty);
>  void bd_set_nr_sectors(struct block_device *bdev, sector_t sectors);
> +int bd_corrupted_range(struct block_device *bdev, loff_t disk_off,
> +loff_t bdev_off, size_t len, void *data);
>  
>  /* for drivers/char/raw.c: */
>  int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
> -- 
> 2.29.2
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 05/10] mm, pmem: Implement ->memory_failure() in pmem driver

2021-01-06 Thread Jan Kara
lure */
> + if (pgmap->ops->memory_failure)
> + rc = pgmap->ops->memory_failure(pgmap, pfn, flags);
>  
> - list_for_each_entry(tk, &to_kill, nd)
> - if (tk->size_shift)
> - size = max(size, 1UL << tk->size_shift);
> - if (size) {
> - /*
> -  * Unmap the largest mapping to avoid breaking up
> -  * device-dax mappings which are constant size. The
> -  * actual size of the mapping being torn down is
> -  * communicated in siginfo, see kill_proc()
> -  */
> - start = (page->index << PAGE_SHIFT) & ~(size - 1);
> - unmap_mapping_range(page->mapping, start, start + size, 0);
> - }
> - kill_procs(&to_kill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
> - rc = 0;
> -unlock:
> - dax_unlock_page(page, cookie);
>  out:
>   /* drop pgmap ref acquired in caller */
>   put_dev_pagemap(pgmap);
> -- 
> 2.29.2
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-06 Thread Jan Kara
*page,
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   */
>  static void add_to_kill(struct task_struct *tsk, struct page *p,
> -struct vm_area_struct *vma,
> -struct list_head *to_kill)
> + struct address_space *mapping, pgoff_t pgoff,
> + struct vm_area_struct *vma, struct list_head *to_kill)
>  {
>   struct to_kill *tk;
>  
> @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct *tsk, struct 
> page *p,
>   }
>  
>   tk->addr = page_address_in_vma(p, vma);
> - if (is_zone_device_page(p))
> - tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> - else
> + if (is_zone_device_page(p)) {
> + if (is_device_fsdax_page(p))
> + tk->addr = vma->vm_start +
> + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

It seems strange to use 'pgoff' for dax pages and not for any other page.
Why? I'd rather pass correct pgoff from all callers of add_to_kill() and
avoid this special casing...

> + tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
> + } else
>   tk->size_shift = page_shift(compound_head(p));
>  
>   /*
> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page *page, struct 
> list_head *to_kill,
>   if (!page_mapped_in_vma(page, vma))
>   continue;
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, NULL, 0, vma, to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page *page, 
> struct list_head *to_kill,
>  /*
>   * Collect processes when the error hit a file mapped page.
>   */
> -static void collect_procs_file(struct page *page, struct list_head *to_kill,
> - int force_early)
> +static void collect_procs_file(struct page *page, struct address_space 
> *mapping,
> + pgoff_t pgoff, struct list_head *to_kill, int force_early)
>  {
>   struct vm_area_struct *vma;
>   struct task_struct *tsk;
> - struct address_space *mapping = page->mapping;
> - pgoff_t pgoff;
>  
>   i_mmap_lock_read(mapping);
>   read_lock(&tasklist_lock);
> - pgoff = page_to_pgoff(page);
>   for_each_process(tsk) {
>   struct task_struct *t = task_early_kill(tsk, force_early);
> -
>   if (!t)
>   continue;
> - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
> -   pgoff) {
> + vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>   /*
>* Send early kill signal to tasks where a vma covers
>* the page but the corrupted page is not necessarily
> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page *page, struct 
> list_head *to_kill,
>* to be informed of all such data corruptions.
>*/
>   if (vma->vm_mm == t->mm)
> - add_to_kill(t, page, vma, to_kill);
> + add_to_kill(t, page, mapping, pgoff, vma, 
> to_kill);
>   }
>   }
>   read_unlock(&tasklist_lock);
> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, struct 
> list_head *tokill,
>   if (PageAnon(page))
>   collect_procs_anon(page, tokill, force_early);
>   else
> - collect_procs_file(page, tokill, force_early);
> + collect_procs_file(page, page->mapping, page_to_pgoff(page),

Why not use page_mapping() helper here? It would be safer for THPs if they
ever get here...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] Use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR.

2021-01-06 Thread Jan Kara
On Wed 30-12-20 11:38:27, Yi Li wrote:
> 1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
> instead of IS_ERR_OR_NULL to fix this.
> 
> 2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
> and go to call iput properly.
> 
> Signed-off-by: Yi Li 

Thanks for the patch! It looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/fast_commit.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 4fcc21c25e79..6b5489273c85 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -1318,14 +1318,14 @@ static int ext4_fc_replay_unlink(struct super_block 
> *sb, struct ext4_fc_tl *tl)
>   entry.len = darg.dname_len;
>   inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
>  
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "Inode %d not found", darg.ino);
>   return 0;
>   }
>  
>   old_parent = ext4_iget(sb, darg.parent_ino,
>   EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(old_parent)) {
> + if (IS_ERR(old_parent)) {
>   jbd_debug(1, "Dir with inode  %d not found", darg.parent_ino);
>   iput(inode);
>   return 0;
> @@ -1410,7 +1410,7 @@ static int ext4_fc_replay_link(struct super_block *sb, 
> struct ext4_fc_tl *tl)
>   darg.parent_ino, darg.dname_len);
>  
>   inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "Inode not found.");
>   return 0;
>   }
> @@ -1466,10 +1466,11 @@ static int ext4_fc_replay_inode(struct super_block 
> *sb, struct ext4_fc_tl *tl)
>   trace_ext4_fc_replay(sb, tag, ino, 0, 0);
>  
>   inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
> - if (!IS_ERR_OR_NULL(inode)) {
> + if (!IS_ERR(inode)) {
>   ext4_ext_clear_bb(inode);
>   iput(inode);
>   }
> + inode = NULL;
>  
>   ext4_fc_record_modified_inode(sb, ino);
>  
> @@ -1512,7 +1513,7 @@ static int ext4_fc_replay_inode(struct super_block *sb, 
> struct ext4_fc_tl *tl)
>  
>   /* Given that we just wrote the inode on disk, this SHOULD succeed. */
>   inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "Inode not found.");
>   return -EFSCORRUPTED;
>   }
> @@ -1564,7 +1565,7 @@ static int ext4_fc_replay_create(struct super_block 
> *sb, struct ext4_fc_tl *tl)
>   goto out;
>  
>   inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "inode %d not found.", darg.ino);
>   inode = NULL;
>   ret = -EINVAL;
> @@ -1577,7 +1578,7 @@ static int ext4_fc_replay_create(struct super_block 
> *sb, struct ext4_fc_tl *tl)
>* dot and dot dot dirents are setup properly.
>*/
>   dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(dir)) {
> + if (IS_ERR(dir)) {
>   jbd_debug(1, "Dir %d not found.", darg.ino);
>   goto out;
>   }
> @@ -1653,7 +1654,7 @@ static int ext4_fc_replay_add_range(struct super_block 
> *sb,
>  
>   inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
>   EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "Inode not found.");
>   return 0;
>   }
> @@ -1777,7 +1778,7 @@ ext4_fc_replay_del_range(struct super_block *sb, struct 
> ext4_fc_tl *tl)
>   le32_to_cpu(lrange->fc_ino), cur, remaining);
>  
>   inode = ext4_iget(sb, le32_to_cpu(lrange->fc_ino), EXT4_IGET_NORMAL);
> - if (IS_ERR_OR_NULL(inode)) {
> + if (IS_ERR(inode)) {
>   jbd_debug(1, "Inode %d not found", le32_to_cpu(lrange->fc_ino));
>   return 0;
>   }
> @@ -1832,7 +1833,7 @@ static void ext4_fc_set_bitmaps_and_counters(struct 
> super_block *sb)
>   for (i = 0; i < state->fc_modified_inodes_used; i++) {
>   inode = ext4_iget(sb, state->fc_modified_inodes[i],
>   EXT4_IGET_NORMAL);
> - if (IS_

Re: [RFC PATCH kernel] block: initialize block_device::bd_bdi for bdev_cache

2021-01-06 Thread Jan Kara
On Wed 06-01-21 20:29:00, Alexey Kardashevskiy wrote:
> This is a workaround to fix a null derefence crash:
> 
> [cb01f840] cb01f880 (unreliable)
> [cb01f880] c0769a3c bdev_evict_inode+0x21c/0x370
> [cb01f8c0] c070bacc evict+0x11c/0x230
> [cb01f900] c070c138 iput+0x2a8/0x4a0
> [cb01f970] c06ff030 dentry_unlink_inode+0x220/0x250
> [cb01f9b0] c07001c0 __dentry_kill+0x190/0x320
> [cb01fa00] c0701fb8 dput+0x5e8/0x860
> [cb01fa80] c0705848 shrink_dcache_for_umount+0x58/0x100
> [cb01fb00] c06cf864 generic_shutdown_super+0x54/0x200
> [cb01fb80] c06cfd48 kill_anon_super+0x38/0x60
> [cb01fbc0] c06d12cc deactivate_locked_super+0xbc/0x110
> [cb01fbf0] c06d13bc deactivate_super+0x9c/0xc0
> [cb01fc20] c071a340 cleanup_mnt+0x1b0/0x250
> [cb01fc80] c0278fa8 task_work_run+0xf8/0x180
> [cb01fcd0] c002b4ac do_notify_resume+0x4dc/0x5d0
> [cb01fda0] c004ba0c syscall_exit_prepare+0x28c/0x370
> [cb01fe10] c000e06c system_call_common+0xfc/0x27c
> --- Exception: c00 (System Call) at 10034890
> 
> Is this fixed properly already somewhere? Thanks,
> 
> Fixes: e6cb53827ed6 ("block: initialize struct block_device in bdev_alloc")

I don't think it's fixed anywhere and I've seen the syzbot report and I was
wondering how this can happen when bdev_alloc() initializes bdev->bd_bdi
and it also wasn't clear to me whether bd_bdi is really the only field that
is problematic - if we can get to bdev_evict_inode() without going through
bdev_alloc(), we are probably missing initialization of other fields in
that place as well...

But now I've realized that probably the inode is a root inode for bdev
superblock which is allocated by VFS through new_inode() and thus doesn't
undergo the initialization in bdev_alloc(). And AFAICT the root inode on
bdev superblock can get only to bdev_evict_inode() and bdev_free_inode().
Looking at bdev_evict_inode() the only thing that's used there from struct
block_device is really bd_bdi. bdev_free_inode() will also access
bdev->bd_stats and bdev->bd_meta_info. So we need to at least initialize
these to NULL as well. IMO the most logical place for all these
initializations is in bdev_alloc_inode()...

Honza

> ---
>  fs/block_dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 3e5b02f6606c..86fdc28d565e 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -792,8 +792,10 @@ static void bdev_free_inode(struct inode *inode)
>  static void init_once(void *data)
>  {
>   struct bdev_inode *ei = data;
> + struct block_device *bdev = &ei->bdev;
>  
>   inode_init_once(&ei->vfs_inode);
> + bdev->bd_bdi = &noop_backing_dev_info;
>  }
>  
>  static void bdev_evict_inode(struct inode *inode)
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] writeback: add warning messages for not registered bdi

2020-12-22 Thread Jan Kara
On Thu 17-12-20 19:28:01, Yanjun Zhang wrote:
> The device name is only printed for the warning case, that bdi is not
> registered detected by the function __mark_inode_dirty. Besides, the
> device name returned by bdi_dev_name may be "(unknown)" in some cases.
> 
> This patch add printed messages about the inode and super block. Once
> trigging this warning, we could make more direct analysis.
> 
> Signed-off-by: Yanjun Zhang 

Thanks for the patch but I've just sent a patch to remove this warning from
the kernel a few days ago to Linus because it could result in false
positive... So your patch is not needed anymore.

Honza

> ---
>  fs/fs-writeback.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e6005c78b..825160cf4 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2323,7 +2323,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  
>   WARN((wb->bdi->capabilities & BDI_CAP_WRITEBACK) &&
>!test_bit(WB_registered, &wb->state),
> -  "bdi-%s not registered\n", bdi_dev_name(wb->bdi));
> +  "bdi-%s not registered, dirtied inode %lu on %s\n",
> +  bdi_dev_name(wb->bdi), inode->i_ino, sb->s_id);
>  
>   inode->dirtied_when = jiffies;
>   if (dirtytime)
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in ext4_commit_super

2020-12-22 Thread Jan Kara
On Tue 22-12-20 18:09:08, Hillf Danton wrote:
> Tue, 22 Dec 2020 00:54:15 -0800
> > syzbot found the following issue on:
> > 
> > HEAD commit:0d52778b Add linux-next specific files for 20201218
> > git tree:   linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=15b4aecb50
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=5c81cc44aa25b5b3
> > dashboard link: https://syzkaller.appspot.com/bug?extid=9043030c040ce1849a60
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1798348750
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10c0293750
> > 
> > The issue was bisected to:
> > 
> > commit e810c942a325cf749e859d7aa3a43dc219cea299
> > Author: Jan Kara 
> > Date:   Wed Dec 16 10:18:40 2020 +
> > 
> > ext4: save error info to sb through journal if available
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1282f13750
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=1182f13750
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1682f13750
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+9043030c040ce1849...@syzkaller.appspotmail.com
> > Fixes: e810c942a325 ("ext4: save error info to sb through journal if 
> > available")
> > 
> > general protection fault, probably for non-canonical address 
> > 0xdc0c:  [#1] PREEMPT SMP KASAN
> > KASAN: null-ptr-deref in range [0x0060-0x0067]
> > CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.10.0-next-20201218-syzkaller 
> > #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > Workqueue: events flush_stashed_error_work
> > RIP: 0010:ext4_commit_super+0x52/0x510 fs/ext4/super.c:5548
> > Code: 48 c1 ea 03 80 3c 02 00 0f 85 21 04 00 00 48 8b 9d 78 06 00 00 48 b8 
> > 00 00 00 00 00 fc ff df 48 8d 7b 60 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 
> > 85 f1 03 00 00 48 8b 5b 60 48 85 db 0f 84 13 01 00
> > RSP: 0018:c9cc7cb8 EFLAGS: 00010206
> > RAX: dc00 RBX:  RCX: 
> > RDX: 000c RSI: 8217a0db RDI: 0060
> > RBP: 88802441c000 R08:  R09: 
> > R10: 8217a7fa R11:  R12: 88802441c000
> > R13: 88802441c678 R14: 8880109a5a00 R15: 8880b9c34440
> > FS:  () GS:8880b9c0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 00400200 CR3: 14038000 CR4: 001506f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  flush_stashed_error_work+0x1c9/0x2a0 fs/ext4/super.c:727
> >  process_one_work+0x98d/0x1630 kernel/workqueue.c:2275
> >  worker_thread+0x64c/0x1120 kernel/workqueue.c:2421
> >  kthread+0x3b1/0x4a0 kernel/kthread.c:292
> >  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
> 
> Fix e810c942a325 ("ext4: save error info to sb through journal if available")
> by flushing work as part of rollback.

Thanks for having a look. I don't think the fix is quite correct though. The
flush_work() should be at failed_mount3: label. So something like attached
fixup. Ted, can you please fold it into the buggy commit?

Honza

> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5171,6 +5171,7 @@ out_fail:
>   sb->s_fs_info = NULL;
>   kfree(sbi->s_blockgroup_lock);
>  out_free_base:
> + flush_work(&sbi->s_error_work);
>   kfree(sbi);
>   kfree(orig_data);
>   fs_put_dax(dax_dev);
-- 
Jan Kara 
SUSE Labs, CR
>From b3f87141c5944d4dba8e462e1c9ce0c723434fbb Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Tue, 22 Dec 2020 12:26:26 +0100
Subject: [PATCH] Fixup error handling on sb load

Reported-by: syzbot+9043030c040ce1849...@syzkaller.appspotmail.com
Signed-off-by: Jan Kara 
---
 fs/ext4/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 502ae491d07d..c2311978f9b0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5142,6 +5142,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	ext4_es_unregister_shrinker(sbi);
 failed_mount3:
 	del_timer_sync(&sbi->s_err_report);
+	flush_work(&sbi->s_error_work);
 	if (sbi->s_mmp_tsk)
 		kthread_stop(sbi->s_mmp_tsk);
 failed_mount2:
-- 
2.16.4



Re: WARNING: ODEBUG bug in ext4_fill_super (2)

2020-12-22 Thread Jan Kara
#syz dup: general protection fault in ext4_commit_super

On Mon 21-12-20 23:54:18, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:0d52778b Add linux-next specific files for 20201218
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1619061350
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5c81cc44aa25b5b3
> dashboard link: https://syzkaller.appspot.com/bug?extid=3002ac6b4fd242a64228
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=128f512350
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14f9f30f50
> 
> The issue was bisected to:
> 
> commit e810c942a325cf749e859d7aa3a43dc219cea299
> Author: Jan Kara 
> Date:   Wed Dec 16 10:18:40 2020 +
> 
> ext4: save error info to sb through journal if available
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12d9df0750
> final oops: https://syzkaller.appspot.com/x/report.txt?x=11d9df0750
> console output: https://syzkaller.appspot.com/x/log.txt?x=16d9df0750
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3002ac6b4fd242a64...@syzkaller.appspotmail.com
> Fixes: e810c942a325 ("ext4: save error info to sb through journal if 
> available")
> 
> EXT4-fs error (device loop3): ext4_fill_super:4943: inode #2: comm 
> syz-executor723: iget: root inode unallocated
> EXT4-fs (loop3): get root inode failed
> EXT4-fs (loop3): mount failed
> [ cut here ]
> ODEBUG: free active (active state 0) object type: work_struct hint: 
> flush_stashed_error_work+0x0/0x2a0 fs/ext4/ext4.h:2040
> WARNING: CPU: 0 PID: 13670 at lib/debugobjects.c:505 
> debug_print_object+0x16e/0x250 lib/debugobjects.c:505
> Modules linked in:
> CPU: 0 PID: 13670 Comm: syz-executor723 Not tainted 
> 5.10.0-next-20201218-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
> Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd 20 
> af bf 89 4c 89 ee 48 c7 c7 20 a3 bf 89 e8 30 78 05 05 <0f> 0b 83 05 55 8a b9 
> 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
> RSP: 0018:c90001d0f9b8 EFLAGS: 00010282
> RAX:  RBX: 0003 RCX: 
> RDX: 888024b63600 RSI: 815b95f5 RDI: f520003a1f29
> RBP: 0001 R08:  R09: 
> R10: 815b7a4b R11:  R12: 896ae040
> R13: 89bfa920 R14: 814911f0 R15: dc00
> FS:  0163a940() GS:8880b9c0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0043fb60 CR3: 1226b000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  __debug_check_no_obj_freed lib/debugobjects.c:987 [inline]
>  debug_check_no_obj_freed+0x309/0x430 lib/debugobjects.c:1018
>  slab_free_hook mm/slub.c:1540 [inline]
>  slab_free_freelist_hook+0x12b/0x1d0 mm/slub.c:1586
>  slab_free mm/slub.c:3157 [inline]
>  kfree+0xdb/0x3c0 mm/slub.c:4156
>  ext4_fill_super+0x86c/0xdf40 fs/ext4/super.c:5174
>  mount_bdev+0x34d/0x410 fs/super.c:1366
>  legacy_get_tree+0x105/0x220 fs/fs_context.c:592
>  vfs_get_tree+0x89/0x2f0 fs/super.c:1496
>  do_new_mount fs/namespace.c:2896 [inline]
>  path_mount+0x12ae/0x1e70 fs/namespace.c:3227
>  do_mount fs/namespace.c:3240 [inline]
>  __do_sys_mount fs/namespace.c:3448 [inline]
>  __se_sys_mount fs/namespace.c:3425 [inline]
>  __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x44873a
> Code: b8 08 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 cd a2 fb ff c3 66 2e 0f 1f 
> 84 00 00 00 00 00 66 90 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 aa a2 fb ff c3 66 0f 1f 84 00 00 00 00 00
> RSP: 002b:7fff16496da8 EFLAGS: 0202 ORIG_RAX: 00a5
> RAX: ffda RBX: 7fff16496e00 RCX: 0044873a
> RDX: 2000 RSI: 2100 RDI: 7fff16496dc0
> RBP: 7fff16496dc0 R08: 7fff16496e00 R09: 
> R10:  R11: 0202 R12: 0006
> R13: 0005 R14: 0004 R15: 0004
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
>

Re: memory leak in v2_read_file_info

2020-12-22 Thread Jan Kara
Hello,

attached patch should fix this. I'll queue it to my tree.

Honza

On Tue 22-12-20 02:24:18, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:8653b778 Merge tag 'clk-for-linus' of git://git.kernel.org..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=153fc4db50
> kernel config:  https://syzkaller.appspot.com/x/.config?x=faf2996955887e91
> dashboard link: https://syzkaller.appspot.com/bug?extid=9c9b52ab78154b08
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11c44960d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13bc8c0b50
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+9c9b52ab78154...@syzkaller.appspotmail.com
> 
> BUG: memory leak
> unreferenced object 0x888110974f00 (size 64):
>   comm "syz-executor849", pid 8516, jiffies 4294942501 (age 13.960s)
>   hex dump (first 32 bytes):
> 00 30 ee 0d 81 88 ff ff 00 00 00 00 00 00 00 00  .0..
> 00 00 00 00 00 00 00 00 0a 00 00 00 48 00 00 00  H...
>   backtrace:
> [<18aa1939>] kmalloc include/linux/slab.h:552 [inline]
> [<18aa1939>] v2_read_file_info+0x1ae/0x430 fs/quota/quota_v2.c:122
> [<1061252b>] dquot_load_quota_sb+0x351/0x650 fs/quota/dquot.c:2387
> [<6c1f70f9>] dquot_load_quota_inode fs/quota/dquot.c:2423 [inline]
> [<6c1f70f9>] dquot_load_quota_inode+0xda/0x160 
> fs/quota/dquot.c:2415
> [<abace495>] ext4_quota_enable fs/ext4/super.c:6362 [inline]
> [<abace495>] ext4_enable_quotas+0x1b2/0x2f0 fs/ext4/super.c:6388
> [<b6d6a975>] ext4_fill_super+0x3bc5/0x5ac0 fs/ext4/super.c:5046
> [<03a869bd>] mount_bdev+0x223/0x260 fs/super.c:1366
> [<2138e18c>] legacy_get_tree+0x2b/0x90 fs/fs_context.c:592
> [<96e90d3d>] vfs_get_tree+0x28/0x100 fs/super.c:1496
> [<eddeeb8e>] do_new_mount fs/namespace.c:2875 [inline]
> [<eddeeb8e>] path_mount+0xc5e/0x1170 fs/namespace.c:3205
> [<c52e2f18>] do_mount fs/namespace.c:3218 [inline]
> [<c52e2f18>] __do_sys_mount fs/namespace.c:3426 [inline]
> [<c52e2f18>] __se_sys_mount fs/namespace.c:3403 [inline]
> [<c52e2f18>] __x64_sys_mount+0x18e/0x1d0 fs/namespace.c:3403
> [<e70a31f4>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [<7f651b8c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches
> 
-- 
Jan Kara 
SUSE Labs, CR
>From 4d76181dc109d8a0c8ce74325ee5e885734d5ab8 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Tue, 22 Dec 2020 12:09:53 +0100
Subject: [PATCH] quota: Fix memory leak when handling corrupted quota file

When checking corrupted quota file we can bail out and leak allocated
info structure. Properly free info structure on error return.

Reported-by: syzbot+9c9b52ab78154...@syzkaller.appspotmail.com
Fixes: 11c514a99bb9 ("quota: Sanity-check quota file headers on load")
Signed-off-by: Jan Kara 
---
 fs/quota/quota_v2.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
index c21106557a37..b1467f3921c2 100644
--- a/fs/quota/quota_v2.c
+++ b/fs/quota/quota_v2.c
@@ -164,19 +164,24 @@ static int v2_read_file_info(struct super_block *sb, int type)
 		quota_error(sb, "Number of blocks too big for quota file size (%llu > %llu).",
 		(loff_t)qinfo->dqi_blocks << qinfo->dqi_blocksize_bits,
 		i_size_read(sb_dqopt(sb)->files[type]));
-		goto out;
+		goto out_free;
 	}
 	if (qinfo->dqi_free_blk >= qinfo->dqi_blocks) {
 		quota_error(sb, "Free block number too big (%u >= %u).",
 			qinfo->dqi_free_blk, qinfo->dqi_blocks);
-		goto out;
+		goto out_free;
 	}
 	if (qinfo->dqi_free_entry >= qinfo->dqi_blocks) {
 		quota_error(sb, "Block with free entry too big (%u >= %u).",
 			qinfo->dqi_free_entry, qinfo->dqi_blocks);
-		goto out;
+		goto out_free;
 	}
 	ret = 0;
+out_free:
+	if (ret) {
+		kfree(info->dqi_priv);
+		info->dqi_priv = NULL;
+	}
 out:
 	up_read(&dqopt->dqio_sem);
 	return ret;
-- 
2.16.4



Re: [PATCH v2] inotify, memcg: account inotify instances to kmemcg

2020-12-21 Thread Jan Kara
66d17e8ecb9 100644
> > --- a/fs/notify/inotify/inotify_user.c
> > +++ b/fs/notify/inotify/inotify_user.c
> > @@ -632,11 +632,11 @@ static struct fsnotify_group 
> > *inotify_new_group(unsigned int max_events)
> > struct fsnotify_group *group;
> > struct inotify_event_info *oevent;
> >
> > -   group = fsnotify_alloc_group(&inotify_fsnotify_ops);
> > +   group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
> > if (IS_ERR(group))
> > return group;
> >
> > -   oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
> > +   oevent = kmalloc(sizeof(struct inotify_event_info), 
> > GFP_KERNEL_ACCOUNT);
> > if (unlikely(!oevent)) {
> > fsnotify_destroy_group(group);
> > return ERR_PTR(-ENOMEM);
> > diff --git a/include/linux/fsnotify_backend.h 
> > b/include/linux/fsnotify_backend.h
> > index a2e42d3cd87c..e5409b83e731 100644
> > --- a/include/linux/fsnotify_backend.h
> > +++ b/include/linux/fsnotify_backend.h
> > @@ -470,6 +470,7 @@ static inline void fsnotify_update_flags(struct dentry 
> > *dentry)
> >
> >  /* create a new group */
> >  extern struct fsnotify_group *fsnotify_alloc_group(const struct 
> > fsnotify_ops *ops);
> > +extern struct fsnotify_group *fsnotify_alloc_user_group(const struct 
> > fsnotify_ops *ops);
> >  /* get reference to a group */
> >  extern void fsnotify_get_group(struct fsnotify_group *group);
> >  /* drop reference on a group from fsnotify_alloc_group */
> > --
> > 2.29.2.684.gfbc64c5ab5-goog
> >
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/3] vfs: add new f_op->syncfs vector

2020-12-17 Thread Jan Kara
On Thu 17-12-20 00:49:35, Al Viro wrote:
> [Christoph added to Cc...]
> On Wed, Dec 16, 2020 at 06:31:47PM -0500, Vivek Goyal wrote:
> > Current implementation of __sync_filesystem() ignores the return code
> > from ->sync_fs(). I am not sure why that's the case. There must have
> > been some historical reason for this.
> > 
> > Ignoring ->sync_fs() return code is problematic for overlayfs where
> > it can return error if sync_filesystem() on upper super block failed.
> > That error will simply be lost and sycnfs(overlay_fd), will get
> > success (despite the fact it failed).
> > 
> > If we modify existing implementation, there is a concern that it will
> > lead to user space visible behavior changes and break things. So
> > instead implement a new file_operations->syncfs() call which will
> > be called in syncfs() syscall path. Return code from this new
> > call will be captured. And all the writeback error detection
> > logic can go in there as well. Only filesystems which implement
> > this call get affected by this change. Others continue to fallback
> > to existing mechanism.
> 
> That smells like a massive source of confusion down the road.  I'd just
> looked through the existing instances; many always return 0, but quite
> a few sometimes try to return an error:
> fs/btrfs/super.c:2412:  .sync_fs= btrfs_sync_fs,
> fs/exfat/super.c:204:   .sync_fs= exfat_sync_fs,
> fs/ext4/super.c:1674:   .sync_fs= ext4_sync_fs,
> fs/f2fs/super.c:2480:   .sync_fs= f2fs_sync_fs,
> fs/gfs2/super.c:1600:   .sync_fs= gfs2_sync_fs,
> fs/hfsplus/super.c:368: .sync_fs= hfsplus_sync_fs,
> fs/nilfs2/super.c:689:  .sync_fs= nilfs_sync_fs,
> fs/ocfs2/super.c:139:   .sync_fs= ocfs2_sync_fs,
> fs/overlayfs/super.c:399:   .sync_fs= ovl_sync_fs,
> fs/ubifs/super.c:2052:  .sync_fs   = ubifs_sync_fs,
> is the list of such.  There are 4 method callers:
> dquot_quota_sync(), dquot_disable(), __sync_filesystem() and
> sync_fs_one_sb().  For sync_fs_one_sb() we want to ignore the
> return value; for __sync_filesystem() we almost certainly
> do *not* - it ends with return __sync_blockdev(sb->s_bdev, wait),
> after all.  The question for that one is whether we want
> __sync_blockdev() called even in case of ->sync_fs() reporting
> a failure, and I suspect that it's safer to call it anyway and
> return the first error value we'd got.  No idea about quota
> situation.

WRT quota situation: All the ->sync_fs() calls there are due to cache
coherency reasons (we need to get quota changes to disk, then prune quota
files's page cache, and then userspace can read current quota structures
from the disk). We don't want to fail dquot_disable() just because caches
might be incoherent so ignoring ->sync_fs() return value there is fine.
With dquot_quota_sync() it might make some sense to return the error -
that's just a backend for Q_SYNC quotactl(2). OTOH I'm not sure anybody
really cares - Q_SYNC is rarely used.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: linux-next fsnotify mod breaks tail -f

2020-12-11 Thread Jan Kara
On Fri 11-12-20 10:42:16, Amir Goldstein wrote:
> On Fri, Dec 11, 2020 at 1:45 AM Hugh Dickins  wrote:
> >
> > Hi Jan, Amir,
> >
> > There's something wrong with linux-next commit ca7fbf0d29ab
> > ("fsnotify: fix events reported to watching parent and child").
> >
> > If I revert that commit, no problem;
> > but here's a one-line script "tailed":
> >
> > for i in 1 2 3 4 5; do date; sleep 1; done &
> >
> > Then if I run that (same result doing ./tailed after chmod a+x):
> >
> > sh tailed >log; tail -f log
> >
> > the "tail -f log" behaves in one of three ways:
> >
> > 1) On a console, before graphical screen, no problem,
> >it shows the five lines coming from "date" as you would expect.
> > 2) From xterm or another tty, shows just the first line from date,
> >but after I wait and Ctrl-C out, "cat log" shows all five lines.
> > 3) From xterm or another tty, doesn't even show that first line.
> >
> > The before/after graphical screen thing seems particularly weird:
> > I expect you'll end up with a simpler explanation for what's
> > causing that difference.
> >
> > tailed and log are on ext4, if that's relevant;
> > ah, I just tried on tmpfs, and saw no problem there.
> 
> Nice riddle Hugh :)
> Thanks for this early testing!
> 
> I was able to reproduce this.
> The outcome does not depend on the type of terminal or filesystem
> it depends on the existence of a watch on the parent dir of the log file.
> Running ' inotifywait -m . &' will stop tail from getting notifications:
> 
> echo > log
> tail -f log &
> sleep 1
> echo "can you see this?" >> log
> inotifywait -m . &
> sleep 1
> echo "how about this?" >> log
> kill $(jobs -p)
> 
> I suppose with a graphical screen you have systemd or other services
> in the system watching the logs/home dir in your test env.
> 
> Attached fix patch. I suppose Jan will want to sqhash it.
> 
> We missed a subtle logic change in the switch from inode/child marks
> to parent/inode marks terminology.
> 
> Before the change (!inode_mark && child_mark) meant that name
> was not NULL and should be discarded (which the old code did).
> After the change (!parent_mark && inode_mark) is not enough to
> determine if name should be discarded (it should be discarded only
> for "events on child"), so another check is needed.

Thanks for testing Hugh and for a quick fix Amir! I've folded it into the
original buggy commit.

Honza

> 
> Thanks,
> Amir.

> From c7ea57c66c8c9f9607928bf7c55fc409eecc3e57 Mon Sep 17 00:00:00 2001
> From: Amir Goldstein 
> Date: Fri, 11 Dec 2020 10:19:36 +0200
> Subject: [PATCH] fsnotify: fix for fix events reported to watching parent and
>  child
> 
> The child watch is expecting an event without file name and without
> the ON_CHILD flag.
> 
> Reported-by: Hugh Dickins 
> Signed-off-by: Amir Goldstein 
> ---
>  fs/notify/fsnotify.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index a0da9e766992..30d422b8c0fc 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -291,13 +291,18 @@ static int fsnotify_handle_event(struct fsnotify_group 
> *group, __u32 mask,
>   }
>   if (!inode_mark)
>   return 0;
> + }
>  
> + if (mask & FS_EVENT_ON_CHILD) {
>   /*
>* Some events can be sent on both parent dir and child marks
>* (e.g. FS_ATTRIB).  If both parent dir and child are
>* watching, report the event once to parent dir with name (if
>* interested) and once to child without name (if interested).
> +  * The child watcher is expecting an event without a file name
> +  * and without the FS_EVENT_ON_CHILD flag.
>*/
> + mask &= ~FS_EVENT_ON_CHILD;
>   dir = NULL;
>   name = NULL;
>   }
> -- 
> 2.25.1
> 

-- 
Jan Kara 
SUSE Labs, CR


Re: kernel BUG at fs/notify/dnotify/dnotify.c:LINE! (2)

2020-12-09 Thread Jan Kara
On Wed 09-12-20 17:15:02, Miklos Szeredi wrote:
> On Wed, Dec 9, 2020 at 2:59 PM Jan Kara  wrote:
> >
> > On Wed 09-12-20 14:38:42, Jan Kara wrote:
> > > Hello!
> > >
> > > so I was debugging the dnotify crash below (it's 100% reproducible for me)
> > > and I came to the following. The reproducer opens 'file0' on FUSE
> > > filesystem which is a directory at that point. Then it attached dnotify
> > > mark to the directory 'file0' and then it does something to the FUSE fs
> > > which I don't understand but the result is that when FUSE is unmounted the
> > > 'file0' inode is actually a regular file (note that I've verified this is
> > > really the same inode pointer). This then confuses dnotify which doesn't
> > > tear down its structures properly and eventually crashes. So my question
> > > is: How can an inode on FUSE filesystem morph from a dir to a regular 
> > > file?
> > > I presume this could confuse much more things than just dnotify?
> > >
> > > Before I dwelve more into FUSE internals, any idea Miklos what could have
> > > gone wrong and how to debug this further?
> >
> > I've got an idea where to look and indeed it is the fuse_do_getattr() call
> > that finds attributes returned by the server are inconsistent so it calls
> > make_bad_inode() which, among other things, does:
> >
> > inode->i_mode = S_IFREG;
> >
> > Indeed calling make_bad_inode() on a live inode doesn't look like a good
> > idea. IMHO FUSE needs to come up with some other means of marking the inode
> > as stale. Miklos?
> 
> Something like the attached.  It's untested and needs the
> fuse_is_bad() test in more ops...

The patch fixes the problem for me (the reproducer no longer crashes the
kernel). So feel free to add:

Tested-by: Jan Kara 

Honza

> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index ff7dbeb16f88..1172179c9fba 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -202,7 +202,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, 
> unsigned int flags)
>   int ret;
>  
>   inode = d_inode_rcu(entry);
> - if (inode && is_bad_inode(inode))
> + if (inode && fuse_is_bad(inode))
>   goto invalid;
>   else if (time_before64(fuse_dentry_time(entry), get_jiffies_64()) ||
>(flags & LOOKUP_REVAL)) {
> @@ -1030,7 +1030,7 @@ static int fuse_do_getattr(struct inode *inode, struct 
> kstat *stat,
>   if (!err) {
>   if (fuse_invalid_attr(&outarg.attr) ||
>   (inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
> - make_bad_inode(inode);
> + fuse_make_bad(inode);
>   err = -EIO;
>   } else {
>   fuse_change_attributes(inode, &outarg.attr,
> @@ -1327,7 +1327,7 @@ static const char *fuse_get_link(struct dentry *dentry, 
> struct inode *inode,
>   int err;
>  
>   err = -EIO;
> - if (is_bad_inode(inode))
> + if (fuse_is_bad(inode))
>   goto out_err;
>  
>   if (fc->cache_symlinks)
> @@ -1375,7 +1375,7 @@ static int fuse_dir_fsync(struct file *file, loff_t 
> start, loff_t end,
>   struct fuse_conn *fc = get_fuse_conn(inode);
>   int err;
>  
> - if (is_bad_inode(inode))
> + if (fuse_is_bad(inode))
>   return -EIO;
>  
>   if (fc->no_fsyncdir)
> @@ -1664,7 +1664,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr 
> *attr,
>  
>   if (fuse_invalid_attr(&outarg.attr) ||
>   (inode->i_mode ^ outarg.attr.mode) & S_IFMT) {
> - make_bad_inode(inode);
> + fuse_make_bad(inode);
>   err = -EIO;
>   goto error;
>   }
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index c03034e8c152..30fdb3adf9b9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -463,7 +463,7 @@ static int fuse_flush(struct file *file, fl_owner_t id)
>   FUSE_ARGS(args);
>   int err;
>  
> - if (is_bad_inode(inode))
> + if (fuse_is_bad(inode))
>   return -EIO;
>  
>   err = write_inode_now(inode, 1);
> @@ -535,7 +535,7 @@ static int fuse_fsync(struct file *file, loff_t start, 
> loff_t end,
>   struct fuse_conn *fc = get_fuse_conn(inode);
>   int err;
>  
> - if (is_bad_inode(inode))
> + if (fuse_is_bad(inode))
>   return -EIO;
>  
>   inode_lock(inode);
> @

Re: kernel BUG at fs/notify/dnotify/dnotify.c:LINE! (2)

2020-12-09 Thread Jan Kara
On Wed 09-12-20 14:38:42, Jan Kara wrote:
> Hello!
> 
> so I was debugging the dnotify crash below (it's 100% reproducible for me)
> and I came to the following. The reproducer opens 'file0' on FUSE
> filesystem which is a directory at that point. Then it attached dnotify
> mark to the directory 'file0' and then it does something to the FUSE fs
> which I don't understand but the result is that when FUSE is unmounted the
> 'file0' inode is actually a regular file (note that I've verified this is
> really the same inode pointer). This then confuses dnotify which doesn't
> tear down its structures properly and eventually crashes. So my question
> is: How can an inode on FUSE filesystem morph from a dir to a regular file?
> I presume this could confuse much more things than just dnotify?
> 
> Before I dwelve more into FUSE internals, any idea Miklos what could have
> gone wrong and how to debug this further?

I've got an idea where to look and indeed it is the fuse_do_getattr() call
that finds attributes returned by the server are inconsistent so it calls
make_bad_inode() which, among other things, does:

inode->i_mode = S_IFREG;

Indeed calling make_bad_inode() on a live inode doesn't look like a good
idea. IMHO FUSE needs to come up with some other means of marking the inode
as stale. Miklos?

Honza

> On Mon 23-11-20 02:05:16, syzbot wrote:
> > syzbot found the following issue on:
> > 
> > HEAD commit:27bba9c5 Merge tag 'scsi-fixes' of git://git.kernel.org/pu..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=11b8222550
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=330f3436df12fd44
> > dashboard link: https://syzkaller.appspot.com/bug?extid=f427adf9324b92652ccc
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11d3f01550
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17162d4d50
> > 
> > Bisection is inconclusive: the issue happens on the oldest tested release.
> > 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1657052550
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=1557052550
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1157052550
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+f427adf9324b92652...@syzkaller.appspotmail.com
> > 
> > wlan1: Creating new IBSS network, BSSID 50:50:50:50:50:50
> > [ cut here ]
> > kernel BUG at fs/notify/dnotify/dnotify.c:118!
> > invalid opcode:  [#1] PREEMPT SMP KASAN
> > CPU: 1 PID: 648 Comm: kworker/u4:4 Not tainted 5.10.0-rc4-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > Workqueue: events_unbound fsnotify_mark_destroy_workfn
> > RIP: 0010:dnotify_free_mark fs/notify/dnotify/dnotify.c:118 [inline]
> > RIP: 0010:dnotify_free_mark+0x4b/0x60 fs/notify/dnotify/dnotify.c:112
> > Code: 80 3c 02 00 75 26 48 83 bd 80 00 00 00 00 75 15 e8 0a d3 a0 ff 48 89 
> > ee 48 8b 3d 68 8c 1d 0b 5d e9 aa 06 e2 ff e8 f5 d2 a0 ff <0f> 0b e8 ae 4d 
> > e2 ff eb d3 66 90 66 2e 0f 1f 84 00 00 00 00 00 41
> > RSP: 0018:c90002f1fc38 EFLAGS: 00010293
> > RAX:  RBX: 8958ae60 RCX: 1920005e3f95
> > RDX: 888012601a40 RSI: 81cf5ceb RDI: 88801aea2080
> > RBP: 88801aea2000 R08: 0001 R09: 8ebb170f
> > R10:  R11:  R12: 8880171a2000
> > R13: c90002f1fc98 R14: 88801aea2010 R15: 88801aea2018
> > FS:  () GS:8880b9f0() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 56045fa95978 CR3: 12121000 CR4: 001506e0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  fsnotify_final_mark_destroy+0x71/0xb0 fs/notify/mark.c:205
> >  fsnotify_mark_destroy_workfn+0x1eb/0x340 fs/notify/mark.c:840
> >  process_one_work+0x933/0x15a0 kernel/workqueue.c:2272
> >  worker_thread+0x64c/0x1120 kernel/workqueue.c:2418
> >  kthread+0x3af/0x4a0 kernel/kthread.c:292
> >  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
> > Modules linked in:
> > 
> > 
> > ---
> > This report is generated by a bot. It may contain errors.
>

Re: kernel BUG at fs/notify/dnotify/dnotify.c:LINE! (2)

2020-12-09 Thread Jan Kara
Hello!

so I was debugging the dnotify crash below (it's 100% reproducible for me)
and I came to the following. The reproducer opens 'file0' on FUSE
filesystem which is a directory at that point. Then it attached dnotify
mark to the directory 'file0' and then it does something to the FUSE fs
which I don't understand but the result is that when FUSE is unmounted the
'file0' inode is actually a regular file (note that I've verified this is
really the same inode pointer). This then confuses dnotify which doesn't
tear down its structures properly and eventually crashes. So my question
is: How can an inode on FUSE filesystem morph from a dir to a regular file?
I presume this could confuse much more things than just dnotify?

Before I dwelve more into FUSE internals, any idea Miklos what could have
gone wrong and how to debug this further?

Honza

On Mon 23-11-20 02:05:16, syzbot wrote:
> syzbot found the following issue on:
> 
> HEAD commit:27bba9c5 Merge tag 'scsi-fixes' of git://git.kernel.org/pu..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11b8222550
> kernel config:  https://syzkaller.appspot.com/x/.config?x=330f3436df12fd44
> dashboard link: https://syzkaller.appspot.com/bug?extid=f427adf9324b92652ccc
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11d3f01550
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17162d4d50
> 
> Bisection is inconclusive: the issue happens on the oldest tested release.
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1657052550
> final oops: https://syzkaller.appspot.com/x/report.txt?x=1557052550
> console output: https://syzkaller.appspot.com/x/log.txt?x=1157052550
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+f427adf9324b92652...@syzkaller.appspotmail.com
> 
> wlan1: Creating new IBSS network, BSSID 50:50:50:50:50:50
> [ cut here ]
> kernel BUG at fs/notify/dnotify/dnotify.c:118!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 648 Comm: kworker/u4:4 Not tainted 5.10.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Workqueue: events_unbound fsnotify_mark_destroy_workfn
> RIP: 0010:dnotify_free_mark fs/notify/dnotify/dnotify.c:118 [inline]
> RIP: 0010:dnotify_free_mark+0x4b/0x60 fs/notify/dnotify/dnotify.c:112
> Code: 80 3c 02 00 75 26 48 83 bd 80 00 00 00 00 75 15 e8 0a d3 a0 ff 48 89 ee 
> 48 8b 3d 68 8c 1d 0b 5d e9 aa 06 e2 ff e8 f5 d2 a0 ff <0f> 0b e8 ae 4d e2 ff 
> eb d3 66 90 66 2e 0f 1f 84 00 00 00 00 00 41
> RSP: 0018:c90002f1fc38 EFLAGS: 00010293
> RAX:  RBX: 8958ae60 RCX: 1920005e3f95
> RDX: 888012601a40 RSI: 81cf5ceb RDI: 88801aea2080
> RBP: 88801aea2000 R08: 0001 R09: 8ebb170f
> R10:  R11:  R12: 8880171a2000
> R13: c90002f1fc98 R14: 88801aea2010 R15: 88801aea2018
> FS:  () GS:8880b9f0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 56045fa95978 CR3: 12121000 CR4: 001506e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  fsnotify_final_mark_destroy+0x71/0xb0 fs/notify/mark.c:205
>  fsnotify_mark_destroy_workfn+0x1eb/0x340 fs/notify/mark.c:840
>  process_one_work+0x933/0x15a0 kernel/workqueue.c:2272
>  worker_thread+0x64c/0x1120 kernel/workqueue.c:2418
>  kthread+0x3af/0x4a0 kernel/kthread.c:292
>  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
> Modules linked in:
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode()

2020-12-09 Thread Jan Kara
On Wed 09-12-20 01:13:38, Anant Thazhemadam wrote:
> When dquot_resume() was last updated, the argument that got passed
> to vfs_cleanup_quota_inode was incorrectly set.
> 
> If type = -1 and dquot_load_quota_sb() returns a negative value,
> then vfs_cleanup_quota_inode() gets called with -1 passed as an
> argument, and this leads to an array-index-out-of-bounds bug.
> 
> Fix this issue by correctly passing the arguments.
> 
> Fixes: ae45f07d47cc ("quota: Simplify dquot_resume()")
> Reported-by: syzbot+2643e825238d7aabb...@syzkaller.appspotmail.com
> Tested-by: syzbot+2643e825238d7aabb...@syzkaller.appspotmail.com
> Signed-off-by: Anant Thazhemadam 

Thanks for the fix! I've just queued the very same fix I wrote yesterday to
my tree. But yours has better changelog so let me pick your patch instead
;)

For next time, how can we avoid collisions like this? Did you work on the fix
based on the syzbot email sent to the list so if I actually reply to the
syzbot email that I'm working on / already have a fix you'd see it?

Honza

> ---
> If type = -1 is passed as an argument to vfs_cleanup_quota_inode(),
> it causes an array-index-out-of-bounds error since dqopt->files[-1]
> can be potentially attempted to be accessed.
> Before the bisected commit introduced this bug, vfs_load_quota_inode()
> was being directly called in dquot_resume(), and subsequently 
> vfs_cleanup_quota_inode() was called with the cnt value as argument.
> 
>  fs/quota/dquot.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index bb02989d92b6..4f1373463766 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2455,7 +2455,7 @@ int dquot_resume(struct super_block *sb, int type)
>   ret = dquot_load_quota_sb(sb, cnt, dqopt->info[cnt].dqi_fmt_id,
> flags);
>   if (ret < 0)
> - vfs_cleanup_quota_inode(sb, type);
> + vfs_cleanup_quota_inode(sb, cnt);
>   }
>  
>   return ret;
> -- 
> 2.25.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fanotify: Fix sys_fanotify_mark() on native x86-32

2020-12-02 Thread Jan Kara
On Tue 01-12-20 16:51:26, Borislav Petkov wrote:
> On Tue, Dec 01, 2020 at 10:48:10AM +0100, Jan Kara wrote:
> > On Mon 30-11-20 17:30:59, Brian Gerst wrote:
> > > Commit 121b32a58a3a converted native x86-32 which take 64-bit arguments to
> > > use the compat handlers to allow conversion to passing args via pt_regs.
> > > sys_fanotify_mark() was however missed, as it has a general compat 
> > > handler.
> > > Add a config option that will use the syscall wrapper that takes the split
> > > args for native 32-bit.
> > > 
> > > Reported-by: Paweł Jasiak 
> > > Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for 
> > > syscalls taking 64-bit arguments")
> > > Signed-off-by: Brian Gerst 
> > 
> > Thanks for the patch! It looks good to me. Feel free to add:
> > 
> > Acked-by: Jan Kara 
> > 
> > I assume you plan to push this via x86 tree given the changes are mostly
> > there, don't you?
> 
> Looks sane to me too, I guess I can send it to Linus even now so that it
> lands in 5.10. Is that what you'd prefer Jan?

Yes, that would be fine by me. Although I don't think there's a huge rush.
The thing is broken for some time already so if it goes in later with CC to
stable, that would also work OK.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fanotify: Fix sys_fanotify_mark() on native x86-32

2020-12-01 Thread Jan Kara
On Mon 30-11-20 17:30:59, Brian Gerst wrote:
> Commit 121b32a58a3a converted native x86-32 which take 64-bit arguments to
> use the compat handlers to allow conversion to passing args via pt_regs.
> sys_fanotify_mark() was however missed, as it has a general compat handler.
> Add a config option that will use the syscall wrapper that takes the split
> args for native 32-bit.
> 
> Reported-by: Paweł Jasiak 
> Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls 
> taking 64-bit arguments")
> Signed-off-by: Brian Gerst 

Thanks for the patch! It looks good to me. Feel free to add:

Acked-by: Jan Kara 

I assume you plan to push this via x86 tree given the changes are mostly
there, don't you?

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] reiserfs: add check for an invalid ih_entry_count

2020-11-26 Thread Jan Kara
On Sun 01-11-20 06:09:58, Rustam Kovhaev wrote:
> when directory item has an invalid value set for ih_entry_count it might
> trigger use-after-free or out-of-bounds read in bin_search_in_dir_item()
> 
> ih_entry_count * IH_SIZE for directory item should not be larger than
> ih_item_len
> 
> Reported-and-tested-by: syzbot+83b6f7cf9922cae5c...@syzkaller.appspotmail.com
> Link: https://syzkaller.appspot.com/bug?extid=83b6f7cf9922cae5c4d7
> Signed-off-by: Rustam Kovhaev 

Thanks! I've added the patch to my tree and will push it to Linus.

Honza

> ---
>  fs/reiserfs/stree.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
> index 8bf88d690729..476a7ff49482 100644
> --- a/fs/reiserfs/stree.c
> +++ b/fs/reiserfs/stree.c
> @@ -454,6 +454,12 @@ static int is_leaf(char *buf, int blocksize, struct 
> buffer_head *bh)
>"(second one): %h", ih);
>   return 0;
>   }
> + if (is_direntry_le_ih(ih) && (ih_item_len(ih) < 
> (ih_entry_count(ih) * IH_SIZE))) {
> + reiserfs_warning(NULL, "reiserfs-5093",
> +  "item entry count seems wrong %h",
> +  ih);
> +     return 0;
> + }
>   prev_location = ih_location(ih);
>   }
>  
> -- 
> 2.28.0
> 
-- 
Jan Kara 
SUSE Labs, CR


  1   2   3   4   5   6   7   8   9   10   >