Re: mm: shm: hang in shmem_fallocate

2014-07-09 Thread Hugh Dickins
On Wed, 9 Jul 2014, Hugh Dickins wrote:
> On Wed, 9 Jul 2014, Johannes Weiner wrote:
> > On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> > > Hannes, a question for you please, I just could not make up my mind.
> > > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > > with a failed clear_exceptional_entry() in the case of hole-punch?
> > > Is that case currently depending on the rescan loop (that I'm about
> > > to revert) to remove a new page, so I would need to add a retry for
> > > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > > and can stay unchanged as below?  I've veered back and forth,
> > > thinking first one and then the other.
> > 
> > I realize you have given up on changing truncate.c in the meantime,
> > but I'm still asking myself about the swap retry case: why retry for
> > swap-to-page changes, yet not for page-to-page changes?
> > 
> > In case faults are disabled through i_size, concurrent swapin could
> > still turn swap entries into pages, so I can see the need to retry.
> > There is no equivalent for shadow entries, though, and they can only
> > be turned through page faults, so no retry necessary in that case.
> > 
> > However, you explicitly mentioned the hole-punch case above: if that
> > can't guarantee the hole will be reliably cleared under concurrent
> > faults, I'm not sure why it would put in more effort to free it of
> > swap (or shadow) entries than to free it of pages.
> > 
> > What am I missing?
> 
> In dropping the pincer effect, I am conceding that data written (via
> mmap) racily into the hole, behind the punching cursor, between the
> starting and the ending of the punch operation, may be allowed to
> remain.  It will not often happen (given the two loops), but it might.
> 
> But I insist that all data in the hole at the starting of the punch
> operation must be removed by the ending of the punch operation (though
> of course, given the paragraph above, identical data might be written
> in its place concurrently, via mmap, if the application chooses).
> 
> I think you probably agree with both of those propositions.
> 
> As the punching cursor moves along the radix_tree, it gathers page
> pointers and swap entries (the empty slots are already skipped at
> the level below; and tmpfs takes care that there is no instant in
> switching between page and swap when the slot appears empty).
> 
> Dealing with the page pointers is easy: a reference is already held,
> then shmem_undo_range takes the page lock which prevents swizzling
> to swap, then truncates that page out of the tree.
> 
> But dealing with swap entries is slippery: there is no reference
> held, and no lock to prevent swizzling to page (outside of the
> tree_lock taken in shmem_free_swap).
> 
> So, as I see it, the page lock ensures that any pages present at
> the starting of the punch operation will be removed, without any
> need to go back and retry.  But a swap entry present at the starting
> of the punch operation might be swizzled back to page (and, if we
> imagine massive preemption, even back to swap again, and to page
> again, etc) at the wrong moment: so for swap we do need to retry.
> 
> (What I said there is not quite correct: that swap would actually
> have to be a locked page at the time when the first loop meets it.)
> 
> Does that make sense?

Allow me to disagree with myself: no, I now think you were right
to press me on this, and we need also the patch below.  Do you
agree, or do you see something more is needed?

Note to onlookers: this would have no bearing on the shmem_fallocate
hang which Sasha reported seeing on -next yesterday (in another thread),
but is nonetheless a correction to the patch we have there - I think.

Hugh

--- 3.16-rc3-mm1/mm/shmem.c	2014-07-02 15:32:22.220311543 -0700
+++ linux/mm/shmem.c	2014-07-09 17:38:49.972818635 -0700
@@ -516,6 +516,11 @@ static void shmem_undo_range(struct inod
 			if (page->mapping == mapping) {
 				VM_BUG_ON_PAGE(PageWriteback(page), page);
 				truncate_inode_page(mapping, page);
+			} else {
+				/* Page was replaced by swap: retry */
+				unlock_page(page);
+				index--;
+				break;
 			}
 		}
 		unlock_page(page);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: shm: hang in shmem_fallocate

2014-07-09 Thread Hugh Dickins
On Wed, 9 Jul 2014, Johannes Weiner wrote:
> On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> > Hannes, a question for you please, I just could not make up my mind.
> > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > with a failed clear_exceptional_entry() in the case of hole-punch?
> > Is that case currently depending on the rescan loop (that I'm about
> > to revert) to remove a new page, so I would need to add a retry for
> > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > and can stay unchanged as below?  I've veered back and forth,
> > thinking first one and then the other.
> 
> I realize you have given up on changing truncate.c in the meantime,
> but I'm still asking myself about the swap retry case: why retry for
> swap-to-page changes, yet not for page-to-page changes?
> 
> In case faults are disabled through i_size, concurrent swapin could
> still turn swap entries into pages, so I can see the need to retry.
> There is no equivalent for shadow entries, though, and they can only
> be turned through page faults, so no retry necessary in that case.
> 
> However, you explicitly mentioned the hole-punch case above: if that
> can't guarantee the hole will be reliably cleared under concurrent
> faults, I'm not sure why it would put in more effort to free it of
> swap (or shadow) entries than to free it of pages.
> 
> What am I missing?

In dropping the pincer effect, I am conceding that data written (via
mmap) racily into the hole, behind the punching cursor, between the
starting and the ending of the punch operation, may be allowed to
remain.  It will not often happen (given the two loops), but it might.

But I insist that all data in the hole at the starting of the punch
operation must be removed by the ending of the punch operation (though
of course, given the paragraph above, identical data might be written
in its place concurrently, via mmap, if the application chooses).

I think you probably agree with both of those propositions.
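The two propositions can be illustrated with a toy model (ordinary userspace C, not kernel code; every name here is invented for illustration): the mapping is an array of slots, the punch is a single pass of a cursor, and a racing fault may refill a slot the cursor has already passed.

```c
#include <stddef.h>
#include <assert.h>

#define NSLOTS 8

enum slot { EMPTY, PAGE };

struct mapping { enum slot slot[NSLOTS]; };

/* Models a page fault writing into the hole via mmap. */
static void fault_at(struct mapping *m, size_t index) { m->slot[index] = PAGE; }

/* Models a racing fault that refills the slot just behind the cursor. */
static void refill_behind(struct mapping *m, size_t cursor)
{
	if (cursor > 0)
		fault_at(m, cursor - 1);
}

/* One pass of the punching cursor over [0, NSLOTS): every slot that is
 * occupied when the cursor reaches it is cleared (the analogue of
 * truncate_inode_page()); slots refilled behind the cursor are never
 * revisited, so racily rewritten data may remain. */
static void punch_pass(struct mapping *m,
		       void (*race)(struct mapping *, size_t))
{
	for (size_t index = 0; index < NSLOTS; index++) {
		m->slot[index] = EMPTY;
		if (race)
			race(m, index);
	}
}
```

With no racing faults the single pass leaves the hole fully clear; with `refill_behind` every slot but the last ends up repopulated, which is exactly the concession being made.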

As the punching cursor moves along the radix_tree, it gathers page
pointers and swap entries (the empty slots are already skipped at
the level below; and tmpfs takes care that there is no instant in
switching between page and swap when the slot appears empty).
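As a rough sketch of how one radix-tree slot can hold either a page pointer or a swap entry without ever appearing empty (illustrative userspace C; the kernel's real test is radix_tree_exceptional_entry(), and the shift and bit values below are simplified assumptions): page pointers are word-aligned, so a set low bit can tag a swap entry, and a slot is empty only when it is NULL.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* A set low bit marks an exceptional (swap) entry; aligned page
 * pointers can never have it set.  Switching a slot between the two
 * forms is a single store, so a scan never sees a transient empty. */
#define EXCEPTIONAL_BIT 0x2u

typedef uintptr_t slot_t;

static slot_t make_swap_entry(unsigned long swp)
{
	return ((slot_t)swp << 2) | EXCEPTIONAL_BIT;
}

static bool slot_is_exceptional(slot_t s)
{
	return (s & EXCEPTIONAL_BIT) != 0;
}

static unsigned long swap_entry_value(slot_t s)
{
	return (unsigned long)(s >> 2);
}
```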

Dealing with the page pointers is easy: a reference is already held,
then shmem_undo_range takes the page lock which prevents swizzling
to swap, then truncates that page out of the tree.

But dealing with swap entries is slippery: there is no reference
held, and no lock to prevent swizzling to page (outside of the
tree_lock taken in shmem_free_swap).

So, as I see it, the page lock ensures that any pages present at
the starting of the punch operation will be removed, without any
need to go back and retry.  But a swap entry present at the starting
of the punch operation might be swizzled back to page (and, if we
imagine massive preemption, even back to swap again, and to page
again, etc) at the wrong moment: so for swap we do need to retry.

(What I said there is not quite correct: that swap would actually
have to be a locked page at the time when the first loop meets it.)
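The retry can be modeled in miniature (plain userspace C with invented names; shmem_free_swap()'s compare-and-delete under tree_lock is reduced to a plain compare): if the entry was swizzled back to a page between the scan and the delete, the delete fails, and the cursor steps back to rescan the slot, which it then handles as a page.

```c
#include <stdint.h>
#include <stddef.h>
#include <errno.h>
#include <assert.h>

#define NSLOTS 4
#define SWAP_BIT 0x2u		/* low tag bit marking a swap entry */

typedef uintptr_t slot_t;

/* Compare-and-delete, as shmem_free_swap() does under tree_lock:
 * succeeds only if the slot still holds the entry the scan saw. */
static int free_swap(slot_t *slots, size_t index, slot_t expected)
{
	if (slots[index] != expected)
		return -ENOENT;		/* swizzled meanwhile */
	slots[index] = 0;
	return 0;
}

/* One punch pass with retry.  swizzle_at models a single badly timed
 * concurrent swapin turning that swap entry back into a page. */
static void punch_swap(slot_t *slots, size_t swizzle_at, slot_t page)
{
	int swizzled = 0;

	for (size_t index = 0; index < NSLOTS; index++) {
		slot_t entry = slots[index];

		if (!entry)
			continue;
		if (entry & SWAP_BIT) {
			if (index == swizzle_at && !swizzled) {
				slots[index] = page;	/* racing swapin */
				swizzled = 1;
			}
			if (free_swap(slots, index, entry)) {
				index--;	/* lost the race: rescan slot */
				continue;
			}
		} else {
			slots[index] = 0;	/* page: lock, then truncate */
		}
	}
}
```

Without the `index--` retry, the swizzled slot would be left holding a page after the pass; with it, every slot occupied at the start of the pass ends up cleared.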

Does that make sense?

Hugh


Re: mm: shm: hang in shmem_fallocate

2014-07-09 Thread Johannes Weiner
Hi Hugh,

On Thu, Jun 26, 2014 at 10:36:20PM -0700, Hugh Dickins wrote:
> Hannes, a question for you please, I just could not make up my mind.
> In mm/truncate.c truncate_inode_pages_range(), what should be done
> with a failed clear_exceptional_entry() in the case of hole-punch?
> Is that case currently depending on the rescan loop (that I'm about
> to revert) to remove a new page, so I would need to add a retry for
> that rather like the shmem_free_swap() one?  Or is it irrelevant,
> and can stay unchanged as below?  I've veered back and forth,
> thinking first one and then the other.

I realize you have given up on changing truncate.c in the meantime,
but I'm still asking myself about the swap retry case: why retry for
swap-to-page changes, yet not for page-to-page changes?

In case faults are disabled through i_size, concurrent swapin could
still turn swap entries into pages, so I can see the need to retry.
There is no equivalent for shadow entries, though, and they can only
be turned through page faults, so no retry necessary in that case.

However, you explicitly mentioned the hole-punch case above: if that
can't guarantee the hole will be reliably cleared under concurrent
faults, I'm not sure why it would put in more effort to free it of
swap (or shadow) entries than to free it of pages.

What am I missing?


Re: mm: shm: hang in shmem_fallocate

2014-07-01 Thread Hugh Dickins
On Tue, 1 Jul 2014, Vlastimil Babka wrote:
> On 06/27/2014 07:36 AM, Hugh Dickins wrote:
> > [Cc Johannes: at the end I have a particular question for you]
> > On Thu, 26 Jun 2014, Vlastimil Babka wrote:
> > > 
> > > Thanks, I didn't notice that. Do I understand correctly that this could
> > > mean
> > > info leak for the punch hole call, but wouldn't be a problem for madvise?
> > > (In
> > > any case, that means the solution is not general enough for all kernels,
> > > so
> > > I'm asking just to be sure).
> > 
> > It's exactly the same issue for the madvise as for the fallocate:
> > data that is promised to have been punched out would still be there.
> 
> AFAIK madvise doesn't promise anything. But nevermind.

Good point.  I was looking at it from an implementation point of
view, that the implementation is the same for both, so therefore the
same issue for both.  But you are right, madvise makes no promise,
so we can therefore excuse it.  You'd make a fine lawyer :)

> > 
> > So let's all forget that patch, although it does help to highlight my
> > mistake in d0823576bf4b.  (Oh, hey, let's all forget my mistake too!)
> 
> What patch? What mistake? :)

Yes, which of my increasingly many? :(

> 
> > Here's the 3.16-rc2 patch that I've now settled on (which will also
> > require a revert of current git's f00cdc6df7d7; well, not require the
> > revert, but this makes that redundant, and cannot be tested with it in).
> > 
> > I've not yet had time to write up the patch description, nor to test
> > it fully; but thought I should get the patch itself into the open for
> > review and testing before then.
> 
> It seems to work here (tested 3.16-rc1 which didn't have f00cdc6df7d7 yet).
> Checking for end != -1 is indeed a much more elegant solution than i_size.
> Thanks. So you can add my Tested-by.

Thanks a lot for the testing, Vlastimil.

Though I'm happy with the new shmem.c patch, I've thought more about my
truncate.c patch meanwhile, and grown unhappy with it for two reasons.

One was remembering that XFS still uses lend -1 even when punching a
hole: it writes out dirty pages, then throws away the pagecache from
start of hole to end of file; not brilliant (and a violation of mlock),
but that's how it is, and I'm not about to become an XFS hacker to fix
it (I did long ago send a patch I thought fixed it, but it never went
in, and I could easily have overlooked all kinds of XFS subtleties).

So although the end -1 test is more satisfying in tmpfs, and I don't
particularly like making assumptions in truncate_inode_pages_range()
about what i_size will show at that point, XFS would probably push
me back to using your original i_size test in truncate.c.

That is, if we are to stop the endless pincer in truncate.c as we did in shmem.c.

But the other reason I'm unhappy with it, is really a generalization
of that.  Starting from the question I asked Hannes below, I came to
realize that truncate_inode_pages_range() is serving many filesystems,
and I don't know what all their assumptions are; and even if I spent
days researching what each requires of truncate_inode_pages_range(),
chances are that I wouldn't get the right answer on all of them.
Maybe there is a filesystem which now depends upon it to clean out
that hole completely: obviously not before I made the change, but
perhaps in the years since.

So, although I dislike tmpfs behaviour diverging from the others here,
we do have Sasha's assurance that tmpfs was the only one to show the
problem, and no intention of implementing hole-punch on ramfs: so I
think the safest course is for me not to interfere with the other
filesystems, just fix the pessimization I introduced back then.

And now that we have hard evidence that my "fix" there in -rc3
must be reverted, I should move forward with the alternative.

Hugh

> 
> > I've checked against v3.1 to see how it works out there: certainly
> > wouldn't apply cleanly (and beware: prior to v3.5's shmem_undo_range,
> > "end" was included in the range, not excluded), but the same
> > principles apply.  Haven't checked the intermediates yet, will
> > probably leave those until each stable wants them - but if you've a
> > particular release in mind, please ask, or ask me to check your port.
> 
> I will try, thanks.
> 
> > I've included the mm/truncate.c part of it here, but that will be a
> > separate (not for -stable) patch when I post the finalized version.
> > 
> > Hannes, a question for you please, I just could not make up my mind.
> > In mm/truncate.c truncate_inode_pages_range(), what should be done
> > with a failed clear_exceptional_entry() in the case of hole-punch?
> > Is that case currently depending on the rescan loop (that I'm about
> > to revert) to remove a new page, so I would need to add a retry for
> > that rather like the shmem_free_swap() one?  Or is it irrelevant,
> > and can stay unchanged as below?  I've veered back and forth,
> > thinking first one and then the other.
> > 
> > Thanks,
> > Hugh
> > 
> > ---
> > 
> >   mm

Re: mm: shmem: hang in shmem_fault (WAS: mm: shm: hang in shmem_fallocate)

2014-07-01 Thread Hugh Dickins
On Tue, 1 Jul 2014, Sasha Levin wrote:

> Hi Hugh,
> 
> I've been observing a very nonspecific hang involving some mutexes from fs/ 
> but
> without any lockdep output or a concrete way to track it down.
> 
> It seems that today was my lucky day, and after enough tinkering I've managed
> to get output out of lockdep, which pointed me to shmem:
> 
> [ 1871.989131] =
> [ 1871.990028] [ INFO: possible recursive locking detected ]
> [ 1871.992591] 3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758 
> Tainted: GW
> [ 1871.992591] -
> [ 1871.992591] trinity-c84/27757 is trying to acquire lock:
> [ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: shmem_fault 
> (mm/shmem.c:1289)
> [ 1871.992591]
> [ 1871.992591] but task is already holding lock:
> [ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: 
> generic_file_write_iter (mm/filemap.c:2633)
> [ 1871.992591]
> [ 1871.992591] other info that might help us debug this:
> [ 1871.992591]  Possible unsafe locking scenario:
> [ 1871.992591]
> [ 1871.992591]CPU0
> [ 1871.992591]
> [ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
> [ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
> [ 1872.013889]
> [ 1872.013889]  *** DEADLOCK ***
> [ 1872.013889]
> [ 1872.013889]  May be due to missing lock nesting notation
> [ 1872.013889]
> [ 1872.013889] 3 locks held by trinity-c84/27757:
> [ 1872.013889] #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
> [ 1872.030221] #1: (sb_writers#13){.+.+.+}, at: do_readv_writev 
> (include/linux/fs.h:2264 fs/read_write.c:830)
> [ 1872.030221] #2: (&sb->s_type->i_mutex_key#17){+.+.+.}, at: 
> generic_file_write_iter (mm/filemap.c:2633)
> [ 1872.030221]
> [ 1872.030221] stack backtrace:
> [ 1872.030221] CPU: 6 PID: 27757 Comm: trinity-c84 Tainted: GW  
> 3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758
> [ 1872.030221]  9fc112b0 8803c844f5d8 9c531022 
> 0002
> [ 1872.030221]  9fc112b0 8803c844f6d8 991d1a8d 
> 8803c5da3000
> [ 1872.030221]  8803c5da3d70 88030001 8803c5da3000 
> 8803c5da3da8
> [ 1872.030221] Call Trace:
> [ 1872.030221] dump_stack (lib/dump_stack.c:52)
> [ 1872.030221] __lock_acquire (kernel/locking/lockdep.c:3034 
> kernel/locking/lockdep.c:3180)
> [ 1872.030221] lock_acquire (./arch/x86/include/asm/current.h:14 
> kernel/locking/lockdep.c:3602)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] mutex_lock_nested (kernel/locking/mutex.c:486 
> kernel/locking/mutex.c:587)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1288)
> [ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] shmem_fault (mm/shmem.c:1289)
> [ 1872.030221] __do_fault (mm/memory.c:2705)
> [ 1872.030221] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 
> include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [ 1872.030221] do_read_fault.isra.40 (mm/memory.c:2896)
> [ 1872.030221] ? get_parent_ip (kernel/sched/core.c:2550)
> [ 1872.030221] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 
> mm/memory.c:3322)
> [ 1872.030221] handle_mm_fault (mm/memory.c:3345)
> [ 1872.030221] __do_page_fault (arch/x86/mm/fault.c:1230)
> [ 1872.030221] ? retint_restore_args (arch/x86/kernel/entry_64.S:829)
> [ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 1872.030221] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 
> kernel/locking/lockdep.c:2599)
> [ 1872.030221] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 1872.030221] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 
> (discriminator 2))
> [ 1872.030221] trace_do_page_fault (arch/x86/mm/fault.c:1313 
> include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 
> include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
> [ 1872.030221] do_async_page_fault (arch/x86/kernel/kvm.c:264)
> [ 1872.030221] async_page_fault (arch/x86/kernel/entry_64.S:1322)
> [ 1872.030221] ? iov_iter_fault_in_readable (include/linux/pagemap.h:598 
> mm/iov_iter.c:267)
> [ 1872.030221] generic_perform_write (mm/filemap.c:2461)
> [ 1872.030221] ? __mnt_drop_write (./arch/x86/include/asm/preempt.h:98 
> fs/namespace.c:455)
> [ 1872.030221] __generic_file_write_iter (mm/filemap.c:2608)
> [ 1872.030221] ? generic_file_llseek (fs/read_write.c:467)
> [ 1872.030221] generic_file_write_iter (mm/filemap.c:2634)
> [ 1872.030221] do_iter_readv_writev (fs/read_write.c:666)
> [ 1872.030221] do_readv_writev (fs/read_write.c:834)
> [ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
> [ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
> [ 1872.030221] ? mutex_lock_nested (./arch/x86/include/asm/preempt.h:98 
> kernel/locking/mutex.c:570 kernel/locki

mm: shmem: hang in shmem_fault (WAS: mm: shm: hang in shmem_fallocate)

2014-07-01 Thread Sasha Levin
Hi Hugh,

I've been observing a very nonspecific hang involving some mutexes from fs/ but
without any lockdep output or a concrete way to track it down.

It seems that today was my lucky day, and after enough tinkering I've managed
to get output out of lockdep, which pointed me to shmem:

[ 1871.989131] =
[ 1871.990028] [ INFO: possible recursive locking detected ]
[ 1871.992591] 3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758 
Tainted: GW
[ 1871.992591] -
[ 1871.992591] trinity-c84/27757 is trying to acquire lock:
[ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: shmem_fault 
(mm/shmem.c:1289)
[ 1871.992591]
[ 1871.992591] but task is already holding lock:
[ 1871.992591] (&sb->s_type->i_mutex_key#17){+.+.+.}, at: 
generic_file_write_iter (mm/filemap.c:2633)
[ 1871.992591]
[ 1871.992591] other info that might help us debug this:
[ 1871.992591]  Possible unsafe locking scenario:
[ 1871.992591]
[ 1871.992591]CPU0
[ 1871.992591]
[ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
[ 1871.992591]   lock(&sb->s_type->i_mutex_key#17);
[ 1872.013889]
[ 1872.013889]  *** DEADLOCK ***
[ 1872.013889]
[ 1872.013889]  May be due to missing lock nesting notation
[ 1872.013889]
[ 1872.013889] 3 locks held by trinity-c84/27757:
[ 1872.013889] #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
[ 1872.030221] #1: (sb_writers#13){.+.+.+}, at: do_readv_writev 
(include/linux/fs.h:2264 fs/read_write.c:830)
[ 1872.030221] #2: (&sb->s_type->i_mutex_key#17){+.+.+.}, at: 
generic_file_write_iter (mm/filemap.c:2633)
[ 1872.030221]
[ 1872.030221] stack backtrace:
[ 1872.030221] CPU: 6 PID: 27757 Comm: trinity-c84 Tainted: GW  
3.16.0-rc3-next-20140630-sasha-00023-g44434d4-dirty #758
[ 1872.030221]  9fc112b0 8803c844f5d8 9c531022 
0002
[ 1872.030221]  9fc112b0 8803c844f6d8 991d1a8d 
8803c5da3000
[ 1872.030221]  8803c5da3d70 88030001 8803c5da3000 
8803c5da3da8
[ 1872.030221] Call Trace:
[ 1872.030221] dump_stack (lib/dump_stack.c:52)
[ 1872.030221] __lock_acquire (kernel/locking/lockdep.c:3034 
kernel/locking/lockdep.c:3180)
[ 1872.030221] lock_acquire (./arch/x86/include/asm/current.h:14 
kernel/locking/lockdep.c:3602)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] mutex_lock_nested (kernel/locking/mutex.c:486 
kernel/locking/mutex.c:587)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1288)
[ 1872.030221] ? shmem_fault (mm/shmem.c:1289)
[ 1872.030221] shmem_fault (mm/shmem.c:1289)
[ 1872.030221] __do_fault (mm/memory.c:2705)
[ 1872.030221] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 
include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 1872.030221] do_read_fault.isra.40 (mm/memory.c:2896)
[ 1872.030221] ? get_parent_ip (kernel/sched/core.c:2550)
[ 1872.030221] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 
mm/memory.c:3322)
[ 1872.030221] handle_mm_fault (mm/memory.c:3345)
[ 1872.030221] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 1872.030221] ? retint_restore_args (arch/x86/kernel/entry_64.S:829)
[ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1872.030221] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 
kernel/locking/lockdep.c:2599)
[ 1872.030221] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 1872.030221] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1872.030221] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 
(discriminator 2))
[ 1872.030221] trace_do_page_fault (arch/x86/mm/fault.c:1313 
include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 
include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 1872.030221] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 1872.030221] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 1872.030221] ? iov_iter_fault_in_readable (include/linux/pagemap.h:598 
mm/iov_iter.c:267)
[ 1872.030221] generic_perform_write (mm/filemap.c:2461)
[ 1872.030221] ? __mnt_drop_write (./arch/x86/include/asm/preempt.h:98 
fs/namespace.c:455)
[ 1872.030221] __generic_file_write_iter (mm/filemap.c:2608)
[ 1872.030221] ? generic_file_llseek (fs/read_write.c:467)
[ 1872.030221] generic_file_write_iter (mm/filemap.c:2634)
[ 1872.030221] do_iter_readv_writev (fs/read_write.c:666)
[ 1872.030221] do_readv_writev (fs/read_write.c:834)
[ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
[ 1872.030221] ? __generic_file_write_iter (mm/filemap.c:2627)
[ 1872.030221] ? mutex_lock_nested (./arch/x86/include/asm/preempt.h:98 
kernel/locking/mutex.c:570 kernel/locking/mutex.c:587)
[ 1872.030221] ? __fdget_pos (fs/file.c:714)
[ 1872.030221] ? __fdget_pos (fs/file.c:714)
[ 1872.030221] ? __fget_light (include/linux/rcupdate.h:402 
include/linux/fdtable.h:80 fs/file.c:684)
[ 1872.101905] vfs_writev (fs/

Re: mm: shm: hang in shmem_fallocate

2014-07-01 Thread Vlastimil Babka
On 06/27/2014 07:36 AM, Hugh Dickins wrote:
> [Cc Johannes: at the end I have a particular question for you]


On Thu, 26 Jun 2014, Vlastimil Babka wrote:

On 06/26/2014 12:36 AM, Hugh Dickins wrote:

On Tue, 24 Jun 2014, Vlastimil Babka wrote:

Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.


Thanks a lot for looking into it!



since this got a CVE,


Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.


Sorry, I should have mentioned it explicitly.


Looks overrated to me


I'd bet it would pass unnoticed if you didn't use the sentence "but whether
it's a serious matter in the scale of denials of service, I'm not so sure" in
your first reply to Sasha's report :) I wouldn't be surprised if people grep
for this.


Hah, you're probably right,
I better choose my words more carefully in future.




(and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.


I've been looking at a backport to an older kernel where


Thanks a lot for looking into it.  I didn't think it was worth a
Cc: sta...@vger.kernel.org myself, but admit to being both naive
and inconsistent about that.


fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE) and
since


Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.


it doesn't guarantee anything, it seems simpler just to give up retrying
to


Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.


truncate really everything. Then I realized that maybe it would work for
current kernel as well, without having to add any checks in the page
fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look
different
from madvise(MADV_REMOVE), but it seems to me that as long as it does
discard
the old data from the range, it's fine from any information leak point of
view.
If someone races page faulting, it IMHO doesn't matter if he gets a new
zeroed
page before the parallel truncate has ended, or right after it has ended.


Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)


I was looking at our 3.0 based kernel, but it could be due to backported
patches on top.


And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.




If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.


So I'm posting it here as a RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't
satisfy


My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object wouldn't even be interestingly accessible to the user.


the new "lstart < inode->i_size" condition, but I don't know if it's
"vulnerable"
to the problem.


I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.



-----8<-----
From: Vlastimil Babka 
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole punching

---
 mm/shmem.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..6d6005c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
+			/*
+			 * When this condition is true, it means we were
+			 * called from fallocate(FALLOC_FL_PUNCH_HOLE).
+			 * To prevent a livelock when someone else is faulting
+			 * pages back, we are content with single pass and do
+			 * not retry with index = start. It's important that
+			 * previous page content has been discarded, and
+			 * faulter(s) got new zeroed pages.
+			 *
+			 * The other callsites are shmem_setattr (for
+			 * tr

Re: mm: shm: hang in shmem_fallocate

2014-06-28 Thread Sasha Levin
On 06/27/2014 02:03 PM, Hugh Dickins wrote:
> On Fri, 27 Jun 2014, Sasha Levin wrote:
>> On 06/27/2014 01:59 AM, Hugh Dickins wrote:
> First, this:
>
> [  681.267487] BUG: unable to handle kernel paging request at 
> ea0003480048
> [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
>>> Weird, I don't think we've seen anything like that before, have we?
>>> I'm pretty sure it's not a consequence of my "index = min(index, end)",
>>> but what it portends I don't know.  Please confirm mm/memory.c:1132 -
>>> that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
>>> the code below.  So accessing page->mapping is causing an oops...
>>
>> Right, that's the correct line.
>>
>> At this point I'm pretty sure that it's somehow related to that one line
>> patch since it reproduced fairly quickly after applying it, and when I
>> removed it I didn't see it happening again during the overnight fuzzing.
> 
> Oh, I assumed it was a one-off: you're saying that you saw it more than
> once with the min(index, end) patch in?  But not since removing it (did
> you replace that by the newer patch? or by the older? or by nothing?).

It reproduced exactly twice, can't say it happens too often.

What I did was revert your original fix for the issue and apply the one-liner.

I've spent most of yesterday chasing a different bug with a "clean" -next
tree (without the revert and the one-line patch) and didn't see any mm/
issues.

However, about 2 hours after doing the revert and applying the one-line
patch I've encountered the following:

[ 3686.797859] BUG: unable to handle kernel paging request at 88028a488f98
[ 3686.805732] IP: do_read_fault.isra.40 (mm/memory.c:2856 mm/memory.c:2889)
[ 3686.805732] PGD 12b82067 PUD 704d49067 PMD 704cf6067 PTE 80028a488060
[ 3686.805732] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3686.815852] Dumping ftrace buffer:
[ 3686.815852](ftrace buffer empty)
[ 3686.815852] Modules linked in:
[ 3686.815852] CPU: 10 PID: 8890 Comm: modprobe Not tainted 
3.16.0-rc2-next-20140627-sasha-00024-ga284b83-dirty #753
[ 3686.815852] task: 8801d1c2 ti: 8801c6a08000 task.ti: 
8801c6a08000
[ 3686.826134] RIP: do_read_fault.isra.40 (mm/memory.c:2856 mm/memory.c:2889)
[ 3686.826134] RSP: :8801c6a0bc78  EFLAGS: 00010297
[ 3686.826134] RAX:  RBX: 880288531200 RCX: 001f
[ 3686.826134] RDX: 0014 RSI: 7f22949f3000 RDI: 88028a488f98
[ 3686.826134] RBP: 8801c6a0bd18 R08: 7f2294a13000 R09: 000c
[ 3686.826134] R10:  R11: 00a8 R12: 7f2294a07c50
[ 3686.826134] R13: 880279fec4b0 R14: 7f22949f3000 R15: 88028ebbc528
[ 3686.826134] FS:  () GS:880292e0() 
knlGS:
[ 3686.826134] CS:  0010 DS:  ES:  CR0: 8005003b
[ 3686.826134] CR2: 88028a488f98 CR3: 00026b766000 CR4: 06a0
[ 3686.826134] Stack:
[ 3686.826134]  8801c6a0bc98 0001 880200a8 
0014
[ 3686.826134]  88028a489038  88026d40d000 
88028eafaee0
[ 3686.826134]  8801c6a0bcd8 8e572715 ea000a292240 
0028a489
[ 3686.826134] Call Trace:
[ 3686.826134] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 
include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 3686.826134] ? __pte_alloc (mm/memory.c:598 mm/memory.c:593)
[ 3686.826134] __handle_mm_fault (mm/memory.c:3037 mm/memory.c:3198 
mm/memory.c:3322)
[ 3686.826134] handle_mm_fault (include/linux/memcontrol.h:124 mm/memory.c:3348)
[ 3686.826134] ? __do_page_fault (arch/x86/mm/fault.c:1163)
[ 3686.826134] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 3686.826134] ? vtime_account_user (kernel/sched/cputime.c:687)
[ 3686.826134] ? get_parent_ip (kernel/sched/core.c:2550)
[ 3686.826134] ? context_tracking_user_exit (include/linux/vtime.h:89 
include/linux/jump_label.h:115 include/trace/events/context_tracking.h:47 
kernel/context_tracking.c:180)
[ 3686.826134] ? preempt_count_sub (kernel/sched/core.c:2606)
[ 3686.826134] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 3686.826134] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 3686.826134] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 
(discriminator 2))
[ 3686.826134] trace_do_page_fault (arch/x86/mm/fault.c:1313 
include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 
include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 3686.826134] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 3686.826134] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 3686.826134] Code: 89 c0 4c 8b 43 08 48 8d 4c 08 ff 49 01 c1 49 39 c9 4c 0f 
47 c9 4c 89 c1 4c 29 f1 48 c1 e9 0c 49 8d 4c 0a ff 49 39 c9 4c 0f 47 c9 <48> 83 
3f 00 74 3c 48 83 c0 01 4c 39 c8 77 74 48 81 c6 00 10 00
All code

   0:   89 c0   mov%eax,%eax
   2:   4c 8b 43 08 

Re: mm: shm: hang in shmem_fallocate

2014-06-27 Thread Hugh Dickins
On Fri, 27 Jun 2014, Sasha Levin wrote:
> On 06/27/2014 01:59 AM, Hugh Dickins wrote:
> >> > First, this:
> >> > 
> >> > [  681.267487] BUG: unable to handle kernel paging request at 
> >> > ea0003480048
> >> > [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
> > Weird, I don't think we've seen anything like that before, have we?
> > I'm pretty sure it's not a consequence of my "index = min(index, end)",
> > but what it portends I don't know.  Please confirm mm/memory.c:1132 -
> > that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
> > the code below.  So accessing page->mapping is causing an oops...
> 
> Right, that's the correct line.
> 
> At this point I'm pretty sure that it's somehow related to that one line
> patch since it reproduced fairly quickly after applying it, and when I
> removed it I didn't see it happening again during the overnight fuzzing.

Oh, I assumed it was a one-off: you're saying that you saw it more than
once with the min(index, end) patch in?  But not since removing it (did
you replace that by the newer patch? or by the older? or by nothing?).

I want to exclaim "That makes no sense!", but bugs don't make sense
anyway.  It's going to be a challenge to work out a connection though.
I think I want to ask for more attempts to reproduce, with and without
the min(index, end) patch (if you have enough time - there must be a
limit to the amount of time you can give me on this).

I rather hoped that the oops on PageAnon might shed light from another
direction on the outstanding page_mapped bug: both seem like page table
corruption of some kind (though I've not seen a plausible path to either).

And regarding the page_mapped bug: we've heard nothing since Dave
Hansen suggested a VM_BUG_ON_PAGE for that - has it gone away now?

Thanks,
Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: shm: hang in shmem_fallocate

2014-06-27 Thread Sasha Levin
On 06/27/2014 01:59 AM, Hugh Dickins wrote:
>> > First, this:
>> > 
>> > [  681.267487] BUG: unable to handle kernel paging request at 
>> > ea0003480048
>> > [  681.268621] IP: zap_pte_range (mm/memory.c:1132)
> Weird, I don't think we've seen anything like that before, have we?
> I'm pretty sure it's not a consequence of my "index = min(index, end)",
> but what it portends I don't know.  Please confirm mm/memory.c:1132 -
> that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
> the code below.  So accessing page->mapping is causing an oops...

Right, that's the correct line.

At this point I'm pretty sure that it's somehow related to that one line
patch since it reproduced fairly quickly after applying it, and when I
removed it I didn't see it happening again during the overnight fuzzing.


Thanks,
Sasha



Re: mm: shm: hang in shmem_fallocate

2014-06-26 Thread Hugh Dickins
On Thu, 26 Jun 2014, Sasha Levin wrote:
> On 06/25/2014 06:36 PM, Hugh Dickins wrote:
> > Sasha, may I trespass on your time, and ask you to revert the previous
> > patch from your tree, and give this patch below a try?  I am very
> > interested to learn if in fact it fixes it for you (as it did for me).
> 
> Hi Hugh,
> 
> Happy to help,

Thank you!  Though Vlastimil has made it clear that we cannot go
forward with that one-liner patch, so I've just proposed another.

> and as I often do I will answer with a question.
> 
> I've observed two different issues after reverting the original fix and
> applying this new patch. Both of them seem semi-related, but I'm not sure.

I've rather concentrated on putting the new patch together, and just
haven't had time to give these two much thought - nor shall tomorrow,
I'm afraid.

> 
> First, this:
> 
> [  681.267487] BUG: unable to handle kernel paging request at ea0003480048
> [  681.268621] IP: zap_pte_range (mm/memory.c:1132)

Weird, I don't think we've seen anything like that before, have we?
I'm pretty sure it's not a consequence of my "index = min(index, end)",
but what it portends I don't know.  Please confirm mm/memory.c:1132 -
that's the "if (PageAnon(page))" line, isn't it?  Which indeed matches
the code below.  So accessing page->mapping is causing an oops...

> [  681.269335] PGD 37fcc067 PUD 37fcb067 PMD 0
> [  681.269972] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [  681.270952] Dumping ftrace buffer:
> [  681.270952](ftrace buffer empty)
> [  681.270952] Modules linked in:
> [  681.270952] CPU: 7 PID: 1952 Comm: trinity-c29 Not tainted 
> 3.16.0-rc2-next-20140625-s
> asha-00025-g2e02e05-dirty #730
> [  681.270952] task: 8803e6f58000 ti: 8803df05 task.ti: 
> 8803df05
> [  681.270952] RIP: zap_pte_range (mm/memory.c:1132)
> [  681.270952] RSP: 0018:8803df053c58  EFLAGS: 00010246
> [  681.270952] RAX: ea0003480040 RBX: 8803edae7a70 RCX: 
> 03480040
> [  681.270952] RDX: d2001730 RSI:  RDI: 
> d2001730
> [  681.270952] RBP: 8803df053cf8 R08: 8815cc00 R09: 
> 
> [  681.270952] R10: 0001 R11:  R12: 
> ea0003480040
> [  681.270952] R13: 8803df053de8 R14: 7fc15014f000 R15: 
> 7fc15014e000
> [  681.270952] FS:  7fc15031b700() GS:8801ece0() 
> knlGS:0
> 000
> [  681.270952] CS:  0010 DS:  ES:  CR0: 8005003b
> [  681.270952] CR2: ea0003480048 CR3: 1a02e000 CR4: 
> 06a0
> [  681.270952] Stack:
> [  681.270952]  8803df053de8 d2001000 d2001fff 
> 8803e6f58000
> [  681.270952]   0001 880404dd8400 
> 8803e6e31900
> [  681.270952]  d2001730 8815cc00  
> 8804078f8000
> [  681.270952] Call Trace:
> [  681.270952] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 
> mm/memory.c:1301 mm/m
> emory.c:1346)
> [  681.270952] unmap_vmas (mm/memory.c:1375 (discriminator 1))
> [  681.270952] exit_mmap (mm/mmap.c:2797)
> [  681.270952] ? preempt_count_sub (kernel/sched/core.c:2606)
> [  681.270952] mmput (kernel/fork.c:638)
> [  681.270952] do_exit (kernel/exit.c:744)
> [  681.270952] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  681.270952] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 
> kernel/locking/
> lockdep.c:2599)
> [  681.270952] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [  681.270952] do_group_exit (kernel/exit.c:884)
> [  681.270952] SyS_exit_group (kernel/exit.c:895)
> [  681.270952] tracesys (arch/x86/kernel/entry_64.S:542)
> [ 681.270952] Code: e8 cf 39 25 03 49 8b 4c 24 10 48 39 c8 74 1c 48 8b 7d b8 
> 48 c1 e1 0c
>  48 89 da 48 83 c9 40 4c 89 fe e8 e5 db ff ff 0f 1f 44 00 00 <41> f6 44 24 08 
> 01 74 08 83 6d c8 01 eb 33 66 90 f6 45 a0 40 74
> All code
> 
>0:   e8 cf 39 25 03  callq  0x32539d4
>5:   49 8b 4c 24 10  mov0x10(%r12),%rcx
>a:   48 39 c8cmp%rcx,%rax
>d:   74 1c   je 0x2b
>f:   48 8b 7d b8 mov-0x48(%rbp),%rdi
>   13:   48 c1 e1 0c shl$0xc,%rcx
>   17:   48 89 damov%rbx,%rdx
>   1a:   48 83 c9 40 or $0x40,%rcx
>   1e:   4c 89 femov%r15,%rsi
>   21:   e8 e5 db ff ff  callq  0xdc0b
>   26:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
>   2b:*  41 f6 44 24 08 01   testb  $0x1,0x8(%r12)   <-- trapping 
> instruction
>   31:   74 08   je 0x3b
>   33:   83 6d c8 01 subl   $0x1,-0x38(%rbp)
>   37:   eb 33   jmp0x6c
>   39:   66 90   xchg   %ax,%ax
>   3b:   f6 45 a0 40 testb  $0x40,-0x60(%rbp)
>   3f:   74 00   je 0x41
> 
> Code starting with the faulting instruction
> ==

Re: mm: shm: hang in shmem_fallocate

2014-06-26 Thread Hugh Dickins
[Cc Johannes: at the end I have a particular question for you]

On Thu, 26 Jun 2014, Vlastimil Babka wrote:
> On 06/26/2014 12:36 AM, Hugh Dickins wrote:
> > On Tue, 24 Jun 2014, Vlastimil Babka wrote:
> > 
> > Sorry for the slow response: I have got confused, learnt more, and
> > changed my mind, several times in the course of replying to you.
> > I think this reply will be stable... though not final.
> 
> Thanks a lot for looking into it!
> 
> > > 
> > > since this got a CVE,
> > 
> > Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
> 
> Sorry, I should have mentioned it explicitly.
> 
> > Looks overrated to me
> 
> I'd bet it would pass unnoticed if you didn't use the sentence "but whether
> it's a serious matter in the scale of denials of service, I'm not so sure" in
> your first reply to Sasha's report :) I wouldn't be surprised if people grep
> for this.

Hah, you're probably right, I'd better choose my words more carefully in future.

> 
> > (and amusing to see my pompous words about a
> > "range notification mechanism" taken too seriously), but of course
> > we do need to address it.
> > 
> > > I've been looking at backport to an older kernel where
> > 
> > Thanks a lot for looking into it.  I didn't think it was worth a
> > Cc: sta...@vger.kernel.org myself, but admit to being both naive
> > and inconsistent about that.
> > 
> > > fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
> > > range notification mechanism yet. There's just madvise(MADV_REMOVE) and
> > > since
> > 
> > Yes, that mechanism could be ported back pre-v3.5,
> > but I agree with your preference not to.
> > 
> > > it doesn't guarantee anything, it seems simpler just to give up retrying
> > > to
> > 
> > Right, I don't think we have formally documented the instant of "full hole"
> > that I strove for there, and it's probably not externally verifiable, nor
> > guaranteed by other filesystems.  I just thought it a good QoS aim, but
> > it has given us this problem.
> > 
> > > truncate really everything. Then I realized that maybe it would work for
> > > current kernel as well, without having to add any checks in the page
> > > fault
> > > path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look
> > > different
> > > from madvise(MADV_REMOVE), but it seems to me that as long as it does
> > > discard
> > > the old data from the range, it's fine from any information leak point of
> > > view.
> > > If someone races page faulting, it IMHO doesn't matter if he gets a new
> > > zeroed
> > > page before the parallel truncate has ended, or right after it has ended.
> > 
> > Yes.  I disagree with your actual patch, for more than one reason,
> > but it's in the right area; and I found myself growing to agree with
> > you, that it's better to have one kind of fix for all these releases,
> > than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
> > v3.0 too, I'm sceptical about that, but haven't tried it as yet.)
> 
> I was looking at our 3.0 based kernel, but it could be due to backported
> patches on top.

And later you confirm that 3.0.101 vanilla is okay: thanks, that fits.

> 
> > If I'd realized that we were going to have to backport, I'd have spent
> > longer looking for a patch like yours originally.  So my inclination
> > now is to go your route, make a new patch for v3.16 and backports,
> > and revert the f00cdc6df7d7 that has already gone in.
> > 
> > > So I'm posting it here as a RFC. I haven't thought about the
> > > i915_gem_object_truncate caller yet. I think that this path wouldn't
> > > satisfy
> > 
> > My understanding is that i915_gem_object_truncate() is not a problem,
> > that i915's dev->struct_mutex serializes all its relevant transitions,
> > plus the object wouldn't even be interestingly accessible to the user.
> > 
> > > the new "lstart < inode->i_size" condition, but I don't know if it's
> > > "vulnerable"
> > > to the problem.
> > 
> > I don't think i915 is vulnerable, but if it is, that condition would
> > be fine for it, as would be the patch I'm now thinking of.
> > 
> > > 
> > > -8<-
> > > From: Vlastimil Babka 
> > > Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole
> > > punching
> > > 
> > > ---
> > >   mm/shmem.c | 19 +++
> > >   1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index f484c27..6d6005c 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode,
> > > loff_t lstart, loff_t lend,
> > >   if (!pvec.nr) {
> > >   if (index == start || unfalloc)
> > >   break;
> > > +/*
> > > + * When this condition is true, it means we were
> > > + * called from fallocate(FALLOC_FL_PUNCH_HOLE).
> > > + * To prevent a livelock when s

Re: mm: shm: hang in shmem_fallocate

2014-06-26 Thread Vlastimil Babka

On 06/26/2014 11:14 AM, Vlastimil Babka wrote:

On 06/26/2014 12:36 AM, Hugh Dickins wrote:

On Tue, 24 Jun 2014, Vlastimil Babka wrote:

On 06/16/2014 04:29 AM, Hugh Dickins wrote:

On Thu, 12 Jun 2014, Sasha Levin wrote:

On 02/09/2014 08:41 PM, Sasha Levin wrote:

On 02/08/2014 10:25 PM, Hugh Dickins wrote:

Would trinity be likely to have a thread or process repeatedly faulting
in pages from the hole while it is being punched?


I can see how trinity would do that, but just to be certain - Cc davej.

On 02/08/2014 10:25 PM, Hugh Dickins wrote:

Does this happen with other holepunch filesystems?  If it does not,
I'd suppose it's because the tmpfs fault-in-newly-created-page path
is lighter than a consistent disk-based filesystem's has to be.
But we don't want to make the tmpfs path heavier to match them.


No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
punching in other filesystems and I make sure to get a bunch of those
mounted before starting testing.


Just pinging this one again. I still see hangs in -next where the hang
location looks same as before:



Please give this patch a try.  It fixes what I can reproduce, but given
your unexplained page_mapped() BUG in this area, we know there's more
yet to be understood, so perhaps this patch won't do enough for you.



Hi,


Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.


Thanks a lot for looking into it!



since this got a CVE,


Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.


Sorry, I should have mentioned it explicitly.


Looks overrated to me


I'd bet it would pass unnoticed if you didn't use the sentence "but
whether it's a serious matter in the scale of denials of service, I'm
not so sure" in your first reply to Sasha's report :) I wouldn't be
surprised if people grep for this.


(and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.


I've been looking at backport to an older kernel where


Thanks a lot for looking into it.  I didn't think it was worth a
Cc: sta...@vger.kernel.org myself, but admit to being both naive
and inconsistent about that.


fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE) and since


Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.


it doesn't guarantee anything, it seems simpler just to give up retrying to


Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.


truncate really everything. Then I realized that maybe it would work for
current kernel as well, without having to add any checks in the page fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
the old data from the range, it's fine from any information leak point of view.
If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
page before the parallel truncate has ended, or right after it has ended.


Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)


I was looking at our 3.0 based kernel, but it could be due to backported
patches on top.


OK, seems I cannot reproduce this on 3.0.101 vanilla.


If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.


So I'm posting it here as a RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy


My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object wouldn't even be interestingly accessible to the user.


the new "lstart < inode->i_size" condition, but I don't know if it's 
"vulnerable"
to the problem.


I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.



-8<-
From: Vlastimil Babka 
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole 
punching

---
   mm/shmem.c | 19 +++
   1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/

Re: mm: shm: hang in shmem_fallocate

2014-06-26 Thread Vlastimil Babka

On 06/26/2014 12:36 AM, Hugh Dickins wrote:

On Tue, 24 Jun 2014, Vlastimil Babka wrote:

On 06/16/2014 04:29 AM, Hugh Dickins wrote:

On Thu, 12 Jun 2014, Sasha Levin wrote:

On 02/09/2014 08:41 PM, Sasha Levin wrote:

On 02/08/2014 10:25 PM, Hugh Dickins wrote:

Would trinity be likely to have a thread or process repeatedly faulting
in pages from the hole while it is being punched?


I can see how trinity would do that, but just to be certain - Cc davej.

On 02/08/2014 10:25 PM, Hugh Dickins wrote:

Does this happen with other holepunch filesystems?  If it does not,
I'd suppose it's because the tmpfs fault-in-newly-created-page path
is lighter than a consistent disk-based filesystem's has to be.
But we don't want to make the tmpfs path heavier to match them.


No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
punching in other filesystems and I make sure to get a bunch of those
mounted before starting testing.


Just pinging this one again. I still see hangs in -next where the hang
location looks same as before:



Please give this patch a try.  It fixes what I can reproduce, but given
your unexplained page_mapped() BUG in this area, we know there's more
yet to be understood, so perhaps this patch won't do enough for you.



Hi,


Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.


Thanks a lot for looking into it!



since this got a CVE,


Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.


Sorry, I should have mentioned it explicitly.


Looks overrated to me


I'd bet it would pass unnoticed if you didn't use the sentence "but 
whether it's a serious matter in the scale of denials of service, I'm 
not so sure" in your first reply to Sasha's report :) I wouldn't be 
surprised if people grep for this.



(and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.


I've been looking at backport to an older kernel where


Thanks a lot for looking into it.  I didn't think it was worth a
Cc: sta...@vger.kernel.org myself, but admit to being both naive
and inconsistent about that.


fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE) and since


Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.


it doesn't guarantee anything, it seems simpler just to give up retrying to


Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.


truncate really everything. Then I realized that maybe it would work for
current kernel as well, without having to add any checks in the page fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
the old data from the range, it's fine from any information leak point of view.
If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
page before the parallel truncate has ended, or right after it has ended.


Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)


I was looking at our 3.0 based kernel, but it could be due to backported 
patches on top.



If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.


So I'm posting it here as a RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy


My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object wouldn't even be interestingly accessible to the user.


the new "lstart < inode->i_size" condition, but I don't know if it's 
"vulnerable"
to the problem.


I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.



-8<-
From: Vlastimil Babka 
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole 
punching

---
  mm/shmem.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..6d6005c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,25 @@ static 

Re: mm: shm: hang in shmem_fallocate

2014-06-25 Thread Hugh Dickins
On Tue, 24 Jun 2014, Vlastimil Babka wrote:
> On 06/16/2014 04:29 AM, Hugh Dickins wrote:
> > On Thu, 12 Jun 2014, Sasha Levin wrote:
> >> On 02/09/2014 08:41 PM, Sasha Levin wrote:
> >>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>  Would trinity be likely to have a thread or process repeatedly faulting
>  in pages from the hole while it is being punched?
> >>>
> >>> I can see how trinity would do that, but just to be certain - Cc davej.
> >>>
> >>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>  Does this happen with other holepunch filesystems?  If it does not,
>  I'd suppose it's because the tmpfs fault-in-newly-created-page path
>  is lighter than a consistent disk-based filesystem's has to be.
>  But we don't want to make the tmpfs path heavier to match them.
> >>>
> >>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> >>> punching in other filesystems and I make sure to get a bunch of those
> >>> mounted before starting testing.
> >>
> >> Just pinging this one again. I still see hangs in -next where the hang
> >> location looks same as before:
> >>
> > 
> > Please give this patch a try.  It fixes what I can reproduce, but given
> > your unexplained page_mapped() BUG in this area, we know there's more
> > yet to be understood, so perhaps this patch won't do enough for you.
> > 
> 
> Hi,

Sorry for the slow response: I have got confused, learnt more, and
changed my mind, several times in the course of replying to you.
I think this reply will be stable... though not final.

> 
> since this got a CVE,

Oh.  CVE-2014-4171.  Couldn't locate that yesterday but see it now.
Looks overrated to me (and amusing to see my pompous words about a
"range notification mechanism" taken too seriously), but of course
we do need to address it.

> I've been looking at backport to an older kernel where

Thanks a lot for looking into it.  I didn't think it was worth a
Cc: sta...@vger.kernel.org myself, but admit to being both naive
and inconsistent about that.

> fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
> range notification mechanism yet. There's just madvise(MADV_REMOVE) and since

Yes, that mechanism could be ported back pre-v3.5,
but I agree with your preference not to.

> it doesn't guarantee anything, it seems simpler just to give up retrying to

Right, I don't think we have formally documented the instant of "full hole"
that I strove for there, and it's probably not externally verifiable, nor
guaranteed by other filesystems.  I just thought it a good QoS aim, but
it has given us this problem.

> truncate really everything. Then I realized that maybe it would work for
> current kernel as well, without having to add any checks in the page fault
> path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
> from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
> the old data from the range, it's fine from any information leak point of 
> view.
> If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
> page before the parallel truncate has ended, or right after it has ended.

Yes.  I disagree with your actual patch, for more than one reason,
but it's in the right area; and I found myself growing to agree with
you, that it's better to have one kind of fix for all these releases,
than one for v3.5..v3.15 and another for v3.1..v3.4.  (The CVE cites
v3.0 too, I'm sceptical about that, but haven't tried it as yet.)

If I'd realized that we were going to have to backport, I'd have spent
longer looking for a patch like yours originally.  So my inclination
now is to go your route, make a new patch for v3.16 and backports,
and revert the f00cdc6df7d7 that has already gone in.

> So I'm posting it here as a RFC. I haven't thought about the
> i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy

My understanding is that i915_gem_object_truncate() is not a problem,
that i915's dev->struct_mutex serializes all its relevant transitions,
plus the object wouldn't even be interestingly accessible to the user.

> the new "lstart < inode->i_size" condition, but I don't know if it's 
> "vulnerable"
> to the problem.

I don't think i915 is vulnerable, but if it is, that condition would
be fine for it, as would be the patch I'm now thinking of.

> 
> -8<-
> From: Vlastimil Babka 
> Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole 
> punching
> 
> ---
>  mm/shmem.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f484c27..6d6005c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t 
> lstart, loff_t lend,
>   if (!pvec.nr) {
>   if (index == start || unfalloc)
>   break;
> +/* 
> + * When this condition is true

Re: mm: shm: hang in shmem_fallocate

2014-06-24 Thread Vlastimil Babka
On 06/16/2014 04:29 AM, Hugh Dickins wrote:
> On Thu, 12 Jun 2014, Sasha Levin wrote:
>> On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 Would trinity be likely to have a thread or process repeatedly faulting
 in pages from the hole while it is being punched?
>>>
>>> I can see how trinity would do that, but just to be certain - Cc davej.
>>>
>>> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 Does this happen with other holepunch filesystems?  If it does not,
 I'd suppose it's because the tmpfs fault-in-newly-created-page path
 is lighter than a consistent disk-based filesystem's has to be.
 But we don't want to make the tmpfs path heavier to match them.
>>>
>>> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>> punching in other filesystems and I make sure to get a bunch of those
>>> mounted before starting testing.
>>
>> Just pinging this one again. I still see hangs in -next where the hang
>> location looks same as before:
>>
> 
> Please give this patch a try.  It fixes what I can reproduce, but given
> your unexplained page_mapped() BUG in this area, we know there's more
> yet to be understood, so perhaps this patch won't do enough for you.
> 

Hi,

since this got a CVE, I've been looking at backport to an older kernel where
fallocate(FALLOC_FL_PUNCH_HOLE) is not yet supported, and there's also no
range notification mechanism yet. There's just madvise(MADV_REMOVE) and since
it doesn't guarantee anything, it seems simpler just to give up retrying to
truncate really everything. Then I realized that maybe it would work for
current kernel as well, without having to add any checks in the page fault
path. The semantics of fallocate(FALLOC_FL_PUNCH_HOLE) might look different
from madvise(MADV_REMOVE), but it seems to me that as long as it does discard
the old data from the range, it's fine from any information leak point of view.
If someone races page faulting, it IMHO doesn't matter if he gets a new zeroed
page before the parallel truncate has ended, or right after it has ended.
So I'm posting it here as a RFC. I haven't thought about the
i915_gem_object_truncate caller yet. I think that this path wouldn't satisfy
the new "lstart < inode->i_size" condition, but I don't know if it's 
"vulnerable"
to the problem.

-8<-
From: Vlastimil Babka 
Subject: [RFC PATCH] shmem: prevent livelock between page fault and hole 
punching

---
 mm/shmem.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index f484c27..6d6005c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -476,6 +476,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		if (!pvec.nr) {
 			if (index == start || unfalloc)
 				break;
+			/*
+			 * When this condition is true, it means we were
+			 * called from fallocate(FALLOC_FL_PUNCH_HOLE).
+			 * To prevent a livelock when someone else is faulting
+			 * pages back, we are content with a single pass and do
+			 * not retry with index = start. It's important that
+			 * previous page content has been discarded, and
+			 * faulter(s) got new zeroed pages.
+			 *
+			 * The other callsites are shmem_setattr (for
+			 * truncation) and shmem_evict_inode, which set i_size
+			 * to truncated size or 0, respectively, and then call
+			 * us with lstart == inode->i_size. There we do want to
+			 * retry, and livelock cannot happen for other reasons.
+			 *
+			 * XXX what about i915_gem_object_truncate?
+			 */
+			if (lstart < inode->i_size)
+				break;
 			index = start;
 			continue;
 		}
-- 
1.8.4.5




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mm: shm: hang in shmem_fallocate

2014-06-17 Thread Sasha Levin
On 06/15/2014 10:29 PM, Hugh Dickins wrote:
> On Thu, 12 Jun 2014, Sasha Levin wrote:
>> > On 02/09/2014 08:41 PM, Sasha Levin wrote:
>>> > > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 > >> Would trinity be likely to have a thread or process repeatedly faulting
 > >> in pages from the hole while it is being punched?
>>> > > 
>>> > > I can see how trinity would do that, but just to be certain - Cc davej.
>>> > > 
>>> > > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
 > >> Does this happen with other holepunch filesystems?  If it does not,
 > >> I'd suppose it's because the tmpfs fault-in-newly-created-page path
 > >> is lighter than a consistent disk-based filesystem's has to be.
 > >> But we don't want to make the tmpfs path heavier to match them.
>>> > > 
>>> > > No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
>>> > > punching in other filesystems and I make sure to get a bunch of those
>>> > > mounted before starting testing.
>> > 
>> > Just pinging this one again. I still see hangs in -next where the hang
>> > location looks same as before:
>> > 
> Please give this patch a try.  It fixes what I can reproduce, but given
> your unexplained page_mapped() BUG in this area, we know there's more
> yet to be understood, so perhaps this patch won't do enough for you.
> 
> 
> [PATCH] shmem: fix faulting into a hole while it's punched
> 
> Trinity finds that mmap access to a hole while it's punched from shmem
> can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
> from completing, until the reader chooses to stop; with the puncher's
> hold on i_mutex locking out all other writers until it can complete.
> 
> It appears that the tmpfs fault path is too light in comparison with
> its hole-punching path, lacking an i_data_sem to obstruct it; but we
> don't want to slow down the common case.
> 
> Extend shmem_fallocate()'s existing range notification mechanism, so
> shmem_fault() can refrain from faulting pages into the hole while it's
> punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
> faulting when not).
> 
> Signed-off-by: Hugh Dickins 

No shmem_fallocate issues observed in the past day, works for me. Thanks Hugh!


Thanks,
Sasha


Re: mm: shm: hang in shmem_fallocate

2014-06-15 Thread Hugh Dickins
On Thu, 12 Jun 2014, Sasha Levin wrote:
> On 02/09/2014 08:41 PM, Sasha Levin wrote:
> > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >> Would trinity be likely to have a thread or process repeatedly faulting
> >> in pages from the hole while it is being punched?
> > 
> > I can see how trinity would do that, but just to be certain - Cc davej.
> > 
> > On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> >> Does this happen with other holepunch filesystems?  If it does not,
> >> I'd suppose it's because the tmpfs fault-in-newly-created-page path
> >> is lighter than a consistent disk-based filesystem's has to be.
> >> But we don't want to make the tmpfs path heavier to match them.
> > 
> > No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> > punching in other filesystems and I make sure to get a bunch of those
> > mounted before starting testing.
> 
> Just pinging this one again. I still see hangs in -next where the hang
> location looks same as before:
> 

Please give this patch a try.  It fixes what I can reproduce, but given
your unexplained page_mapped() BUG in this area, we know there's more
yet to be understood, so perhaps this patch won't do enough for you.


[PATCH] shmem: fix faulting into a hole while it's punched

Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the reader chooses to stop; with the puncher's
hold on i_mutex locking out all other writers until it can complete.

It appears that the tmpfs fault path is too light in comparison with
its hole-punching path, lacking an i_data_sem to obstruct it; but we
don't want to slow down the common case.

Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).

Signed-off-by: Hugh Dickins 
---

 mm/shmem.c |   55 +++
 1 file changed, 51 insertions(+), 4 deletions(-)

--- 3.16-rc1/mm/shmem.c	2014-06-12 11:20:43.21098 -0700
+++ linux/mm/shmem.c	2014-06-15 18:32:00.049039969 -0700
@@ -80,11 +80,12 @@ static struct vfsmount *shm_mnt;
 #define SHORT_SYMLINK_LEN 128
 
 /*
- * shmem_fallocate and shmem_writepage communicate via inode->i_private
- * (with i_mutex making sure that it has only one user at a time):
- * we would prefer not to enlarge the shmem inode just for that.
+ * shmem_fallocate communicates with shmem_fault or shmem_writepage via
+ * inode->i_private (with i_mutex making sure that it has only one user at
+ * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
+	int	mode;		/* FALLOC_FL mode currently operating */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -759,6 +760,7 @@ static int shmem_writepage(struct page *
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
+			    !shmem_falloc->mode &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1233,6 +1235,43 @@ static int shmem_fault(struct vm_area_st
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
+	/*
+	 * Trinity finds that probing a hole which tmpfs is punching can
+	 * prevent the hole-punch from ever completing: which in turn
+	 * locks writers out with its hold on i_mutex.  So refrain from
+	 * faulting pages into the hole while it's being punched, and
+	 * wait on i_mutex to be released if vmf->flags permits.
+	 */
+	if (unlikely(inode->i_private)) {
+		struct shmem_falloc *shmem_falloc;
+		spin_lock(&inode->i_lock);
+		shmem_falloc = inode->i_private;
+		if (!shmem_falloc ||
+		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
+		    vmf->pgoff < shmem_falloc->start ||
+		    vmf->pgoff >= shmem_falloc->next)
+			shmem_falloc = NULL;
+		spin_unlock(&inode->i_lock);
+		/*
+		 * i_lock has protected us from taking shmem_falloc seriously
+		 * once return from shmem_fallocate() went back up that stack.
+		 * i_lock does not serialize with i_mutex at all, but it does
+		 * not matter if sometimes we wait unnecessarily, or sometimes
+		 * miss out on waiting: we just need to make those cases rare.
+		 */
+		if (shmem_falloc) {
+  

Re: mm: shm: hang in shmem_fallocate

2014-06-12 Thread Sasha Levin
On 02/09/2014 08:41 PM, Sasha Levin wrote:
> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>> Would trinity be likely to have a thread or process repeatedly faulting
>> in pages from the hole while it is being punched?
> 
> I can see how trinity would do that, but just to be certain - Cc davej.
> 
> On 02/08/2014 10:25 PM, Hugh Dickins wrote:
>> Does this happen with other holepunch filesystems?  If it does not,
>> I'd suppose it's because the tmpfs fault-in-newly-created-page path
>> is lighter than a consistent disk-based filesystem's has to be.
>> But we don't want to make the tmpfs path heavier to match them.
> 
> No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
> punching in other filesystems and I make sure to get a bunch of those
> mounted before starting testing.

Just pinging this one again. I still see hangs in -next where the hang
location looks same as before:


[ 3602.443529] CPU: 6 PID: 1153 Comm: trinity-c35 Not tainted 
3.15.0-next-20140612-sasha-00022-g5e4db85-dirty #645
[ 3602.443529] task: 8801b45eb000 ti: 8801a0b9 task.ti: 
8801a0b9
[ 3602.443529] RIP: vtime_account_system (include/linux/seqlock.h:229 
include/linux/seqlock.h:234 include/linux/seqlock.h:301 
kernel/sched/cputime.c:664)
[ 3602.443529] RSP: 0018:8801b4e03ef8  EFLAGS: 0046
[ 3602.443529] RAX: b31a83b8 RBX: 8801b45eb000 RCX: 0001
[ 3602.443529] RDX: b31a80bb RSI: b7915a75 RDI: 0082
[ 3602.443529] RBP: 8801b4e03f28 R08: 0001 R09: 
[ 3602.443529] R10:  R11:  R12: 8801b45eb968
[ 3602.443529] R13: 8801b45eb938 R14: 0282 R15: 8801b45ebda0
[ 3602.443529] FS:  7f93ac8ec700() GS:8801b4e0() 
knlGS:
[ 3602.443529] CS:  0010 DS:  ES:  CR0: 80050033
[ 3602.443529] CR2: 7f93a8854c9f CR3: 00018a189000 CR4: 06a0
[ 3602.443529] Stack:
[ 3602.443529]  8801b45eb000 001d7800 b32bd749 
8801b45eb000
[ 3602.443529]  001d7800 b32bd749 8801b4e03f48 
b31a83b8
[ 3602.443529]  8801b4e03f48 8801b45eb000 8801b4e03f68 
b31666a0
[ 3602.443529] Call Trace:
[ 3602.443529]  
[ 3602.443529] vtime_common_account_irq_enter (kernel/sched/cputime.c:430)
[ 3602.443529] irq_enter (include/linux/vtime.h:63 include/linux/vtime.h:115 
kernel/softirq.c:334)
[ 3602.443529] scheduler_ipi (kernel/sched/core.c:1589 
include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 
include/linux/tick.h:168 include/linux/tick.h:199 kernel/sched/core.c:1590)
[ 3602.443529] smp_reschedule_interrupt (arch/x86/kernel/smp.c:266)
[ 3602.443529] reschedule_interrupt (arch/x86/kernel/entry_64.S:1046)
[ 3602.443529]  
[ 3602.443529] _raw_spin_unlock (include/linux/spinlock_api_smp.h:151 
kernel/locking/spinlock.c:183)
[ 3602.443529] zap_pte_range (mm/memory.c:1218)
[ 3602.443529] unmap_single_vma (mm/memory.c:1256 mm/memory.c:1277 
mm/memory.c:1302 mm/memory.c:1348)
[ 3602.443529] zap_page_range_single (include/linux/mmu_notifier.h:234 
mm/memory.c:1429)
[ 3602.443529] unmap_mapping_range (mm/memory.c:2316 mm/memory.c:2392)
[ 3602.443529] truncate_inode_page (mm/truncate.c:136 mm/truncate.c:180)
[ 3602.443529] shmem_undo_range (mm/shmem.c:429)
[ 3602.443529] shmem_truncate_range (mm/shmem.c:527)
[ 3602.443529] shmem_fallocate (mm/shmem.c:1740)
[ 3602.443529] do_fallocate (include/linux/fs.h:1281 fs/open.c:299)
[ 3602.443529] SyS_madvise (mm/madvise.c:335 mm/madvise.c:384 mm/madvise.c:534 
mm/madvise.c:465)
[ 3602.443529] tracesys (arch/x86/kernel/entry_64.S:542)
[ 3602.443529] Code: 09 00 00 48 89 5d e8 48 89 fb 4c 89 e7 4c 89 6d f8 e8 25 
69 3b 03 83 83 30 09 00 00 01 48 8b 45 08 4c 8d ab 38 09 00 00 45 31 c9 <41> b8 
01 00 00 00 31 c9 31 d2 31 f6 4c 89 ef 48 89 04 24 e8 e8
All code

   0:   09 00   or %eax,(%rax)
   2:   00 48 89add%cl,-0x77(%rax)
   5:   5d  pop%rbp
   6:   e8 48 89 fb 4c  callq  0x4cfb8953
   b:   89 e7   mov%esp,%edi
   d:   4c 89 6d f8 mov%r13,-0x8(%rbp)
  11:   e8 25 69 3b 03  callq  0x33b693b
  16:   83 83 30 09 00 00 01addl   $0x1,0x930(%rbx)
  1d:   48 8b 45 08 mov0x8(%rbp),%rax
  21:   4c 8d ab 38 09 00 00lea0x938(%rbx),%r13
  28:   45 31 c9xor%r9d,%r9d
  2b:*  41 b8 01 00 00 00   mov$0x1,%r8d<-- trapping 
instruction
  31:   31 c9   xor%ecx,%ecx
  33:   31 d2   xor%edx,%edx
  35:   31 f6   xor%esi,%esi
  37:   4c 89 efmov%r13,%rdi
  3a:   48 89 04 24 mov%rax,(%rsp)
  3e:   e8  .byte 0xe8
  3f:   e8  .byte 0xe8
...

Code starting with the faulting instruction

Re: mm: shm: hang in shmem_fallocate

2014-02-09 Thread Sasha Levin

On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> Would trinity be likely to have a thread or process repeatedly faulting
> in pages from the hole while it is being punched?

I can see how trinity would do that, but just to be certain - Cc davej.

On 02/08/2014 10:25 PM, Hugh Dickins wrote:
> Does this happen with other holepunch filesystems?  If it does not,
> I'd suppose it's because the tmpfs fault-in-newly-created-page path
> is lighter than a consistent disk-based filesystem's has to be.
> But we don't want to make the tmpfs path heavier to match them.

No, this is strictly limited to tmpfs, and AFAIK trinity tests hole
punching in other filesystems and I make sure to get a bunch of those
mounted before starting testing.


Thanks,
Sasha


Re: mm: shm: hang in shmem_fallocate

2014-02-08 Thread Hugh Dickins
On Sat, 8 Feb 2014, Sasha Levin wrote:
> On 12/15/2013 11:01 PM, Sasha Levin wrote:
> > Hi all,
> > 
> > While fuzzing with trinity inside a KVM tools guest running latest -next,
> > I've noticed that
> > quite often there's a hang happening inside shmem_fallocate. There are
> > several processes stuck
> > trying to acquire inode->i_mutex (for more than 2 minutes), while the
> > process that holds it has
> > the following stack trace:
> 
> [snip]
> 
> This still happens. For the record, here's a better trace:

Thanks for the reminder, and for the better trace: I don't find those
traces where _every_ line is a "? " very useful (and whenever I puzzle
over one of those, I wonder if it's inevitable, or something we got
just slightly wrong in working out the frames... another time).

> 
> [  507.124903] CPU: 60 PID: 10864 Comm: trinity-c173 Tainted: GW
> 3.14.0-rc1-next-20140207-sasha-7-g03959f6-dirty #2
> [  507.124903] task: 8801f1e38000 ti: 8801f1e4 task.ti:
> 8801f1e4
> [  507.124903] RIP: 0010:[]  []
> __delay+0xf/0x20
> [  507.124903] RSP: :8801f1e418a8  EFLAGS: 0202
> [  507.124903] RAX: 0001 RBX: 880524cf9f40 RCX:
> e9adc2c3
> [  507.124903] RDX: 010f RSI: 8129813c RDI:
> 
> [  507.124903] RBP: 8801f1e418a8 R08:  R09:
> 
> [  507.124903] R10: 0001 R11:  R12:
> 000affe0
> [  507.124903] R13: 86c42710 R14: 8801f1e41998 R15:
> 8801f1e41ac8
> [  507.124903] FS:  7ff708073700() GS:88052b40()
> knlGS:
> [  507.124903] CS:  0010 DS:  ES:  CR0: 8005003b
> [  507.124903] CR2: 0089d010 CR3: 0001f1e2c000 CR4:
> 06e0
> [  507.124903] DR0: 00696000 DR1:  DR2:
> 
> [  507.124903] DR3:  DR6: 0ff0 DR7:
> 0600
> [  507.124903] Stack:
> [  507.124903]  8801f1e418d8 811af053 880524cf9f40
> 880524cf9f40
> [  507.124903]  880524cf9f58 8807275b1000 8801f1e41908
> 84447580
> [  507.124903]  8129813c 811ea882 3000
> 7ff705eb2000
> [  507.124903] Call Trace:
> [  507.124903]  [] do_raw_spin_lock+0xe3/0x170
> [  507.124903]  [] _raw_spin_lock+0x60/0x80
> [  507.124903]  [] ? zap_pte_range+0xec/0x580
> [  507.124903]  [] ? smp_call_function_single+0x242/0x270
> [  507.124903]  [] zap_pte_range+0xec/0x580
> [  507.124903]  [] ? flush_tlb_mm_range+0x280/0x280
> [  507.124903]  [] ? cpumask_next_and+0xa7/0xd0
> [  507.124903]  [] ? flush_tlb_mm_range+0x280/0x280
> [  507.124903]  [] unmap_page_range+0x3fe/0x410
> [  507.124903]  [] unmap_single_vma+0x101/0x120
> [  507.124903]  [] zap_page_range_single+0x119/0x160
> [  507.124903]  [] ? trace_hardirqs_on+0x8/0x10
> [  507.124903]  [] ? memcg_check_events+0x7a/0x170
> [  507.124903]  [] ? unmap_mapping_range+0x73/0x180
> [  507.124903]  [] unmap_mapping_range+0xfe/0x180
> [  507.124903]  [] truncate_inode_page+0x37/0x90
> [  507.124903]  [] shmem_undo_range+0x6aa/0x770
> [  507.124903]  [] ? unmap_mapping_range+0x168/0x180
> [  507.124903]  [] shmem_truncate_range+0x18/0x40
> [  507.124903]  [] shmem_fallocate+0x99/0x2f0
> [  507.124903]  [] ? madvise_vma+0xde/0x1c0
> [  507.124903]  [] ? __lock_release+0x1e2/0x200
> [  507.124903]  [] do_fallocate+0x126/0x170
> [  507.124903]  [] madvise_vma+0xf4/0x1c0
> [  507.124903]  [] SyS_madvise+0x188/0x250
> [  507.124903]  [] tracesys+0xdd/0xe2
> [  507.124903] Code: 66 66 66 66 90 48 c7 05 a4 66 04 05 e0 92 ae 81 c9 c3 66
> 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 ff 15 89 66 04 05 
> c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d 04
> 
> I'm still trying to figure it out. To me it seems like a series of calls to
> shmem_truncate_range() takes so long that one of the tasks triggers a hung
> task. We don't actually hang in any specific
> shmem_truncate_range() for too long though.

Okay, we're doing a FALLOC_FL_PUNCH_HOLE on tmpfs (via MADV_REMOVE).

This trace shows clearly that unmap_mapping_range is being called from
truncate_inode_range: that's supposed to be a rare inefficient fallback,
normally all the mapped pages being unmapped by the prior call to
unmap_mapping_range in shmem_fallocate itself.

Now it's conceivable that there's some kind of off-by-one wrap-around
case which doesn't behave as intended, but I was fairly careful there:
you have to be because the different functions involved have different
calling conventions and needs.  It would be interesting to know the
arguments to madvise() and to shmem_fallocate() to rule that out,
but my guess is that's not the problem.

Would trinity be likely to have a thread or process repeatedly faulting
in pages from the hole while it is being punched?

That's what it looks like to me, but I'm not sure what to do about it,
if anything.  It's a pity that shm

Re: mm: shm: hang in shmem_fallocate

2014-02-08 Thread Sasha Levin

On 12/15/2013 11:01 PM, Sasha Levin wrote:

Hi all,

While fuzzing with trinity inside a KVM tools guest running latest -next, I've
noticed that quite often there's a hang happening inside shmem_fallocate.
There are several processes stuck trying to acquire inode->i_mutex (for more
than 2 minutes), while the process that holds it has the following stack trace:


[snip]

This still happens. For the record, here's a better trace:

[  507.124903] CPU: 60 PID: 10864 Comm: trinity-c173 Tainted: GW 
3.14.0-rc1-next-20140207-sasha-7-g03959f6-dirty #2

[  507.124903] task: 8801f1e38000 ti: 8801f1e4 task.ti: 
8801f1e4
[  507.124903] RIP: 0010:[]  [] 
__delay+0xf/0x20
[  507.124903] RSP: :8801f1e418a8  EFLAGS: 0202
[  507.124903] RAX: 0001 RBX: 880524cf9f40 RCX: e9adc2c3
[  507.124903] RDX: 010f RSI: 8129813c RDI: 
[  507.124903] RBP: 8801f1e418a8 R08:  R09: 
[  507.124903] R10: 0001 R11:  R12: 000affe0
[  507.124903] R13: 86c42710 R14: 8801f1e41998 R15: 8801f1e41ac8
[  507.124903] FS:  7ff708073700() GS:88052b40() 
knlGS:
[  507.124903] CS:  0010 DS:  ES:  CR0: 8005003b
[  507.124903] CR2: 0089d010 CR3: 0001f1e2c000 CR4: 06e0
[  507.124903] DR0: 00696000 DR1:  DR2: 
[  507.124903] DR3:  DR6: 0ff0 DR7: 0600
[  507.124903] Stack:
[  507.124903]  8801f1e418d8 811af053 880524cf9f40 
880524cf9f40
[  507.124903]  880524cf9f58 8807275b1000 8801f1e41908 
84447580
[  507.124903]  8129813c 811ea882 3000 
7ff705eb2000
[  507.124903] Call Trace:
[  507.124903]  [] do_raw_spin_lock+0xe3/0x170
[  507.124903]  [] _raw_spin_lock+0x60/0x80
[  507.124903]  [] ? zap_pte_range+0xec/0x580
[  507.124903]  [] ? smp_call_function_single+0x242/0x270
[  507.124903]  [] zap_pte_range+0xec/0x580
[  507.124903]  [] ? flush_tlb_mm_range+0x280/0x280
[  507.124903]  [] ? cpumask_next_and+0xa7/0xd0
[  507.124903]  [] ? flush_tlb_mm_range+0x280/0x280
[  507.124903]  [] unmap_page_range+0x3fe/0x410
[  507.124903]  [] unmap_single_vma+0x101/0x120
[  507.124903]  [] zap_page_range_single+0x119/0x160
[  507.124903]  [] ? trace_hardirqs_on+0x8/0x10
[  507.124903]  [] ? memcg_check_events+0x7a/0x170
[  507.124903]  [] ? unmap_mapping_range+0x73/0x180
[  507.124903]  [] unmap_mapping_range+0xfe/0x180
[  507.124903]  [] truncate_inode_page+0x37/0x90
[  507.124903]  [] shmem_undo_range+0x6aa/0x770
[  507.124903]  [] ? unmap_mapping_range+0x168/0x180
[  507.124903]  [] shmem_truncate_range+0x18/0x40
[  507.124903]  [] shmem_fallocate+0x99/0x2f0
[  507.124903]  [] ? madvise_vma+0xde/0x1c0
[  507.124903]  [] ? __lock_release+0x1e2/0x200
[  507.124903]  [] do_fallocate+0x126/0x170
[  507.124903]  [] madvise_vma+0xf4/0x1c0
[  507.124903]  [] SyS_madvise+0x188/0x250
[  507.124903]  [] tracesys+0xdd/0xe2
[  507.124903] Code: 66 66 66 66 90 48 c7 05 a4 66 04 05 e0 92 ae 81 c9 c3 66 2e 0f 1f 84 00 00 00 
00 00 55 48 89 e5 66 66 66 66 90 ff 15 89 66 04 05  c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 
00 55 48 8d 04


I'm still trying to figure it out. To me it seems like a series of calls to
shmem_truncate_range() takes so long that one of the tasks triggers a hung
task. We don't actually hang in any specific shmem_truncate_range() for too
long though.


Thanks,
Sasha


mm: shm: hang in shmem_fallocate

2013-12-15 Thread Sasha Levin

Hi all,

While fuzzing with trinity inside a KVM tools guest running latest -next, I've
noticed that quite often there's a hang happening inside shmem_fallocate.
There are several processes stuck trying to acquire inode->i_mutex (for more
than 2 minutes), while the process that holds it has the following stack trace:

[ 2059.561282] Call Trace:
[ 2059.561557]  [] ? sched_clock_cpu+0x108/0x120
[ 2059.562444]  [] ? get_lock_stats+0x2a/0x60
[ 2059.563247]  [] ? put_lock_stats+0xe/0x30
[ 2059.563930]  [] ? get_lock_stats+0x2a/0x60
[ 2059.564646]  [] ? x2apic_send_IPI_mask+0x13/0x20
[ 2059.565431]  [] ? __rcu_read_unlock+0x44/0xb0
[ 2059.566161]  [] ? generic_exec_single+0x55/0x80
[ 2059.566992]  [] ? page_remove_rmap+0x295/0x320
[ 2059.567782]  [] ? _raw_spin_lock+0x6c/0x80
[ 2059.568390]  [] ? zap_pte_range+0xec/0x590
[ 2059.569157]  [] ? zap_pte_range+0x2f0/0x590
[ 2059.569907]  [] ? flush_tlb_mm_range+0x360/0x360
[ 2059.570855]  [] ? preempt_schedule+0x53/0x80
[ 2059.571613]  [] ? ___preempt_schedule+0x56/0xb0
[ 2059.572526]  [] ? flush_tlb_mm_range+0x336/0x360
[ 2059.573368]  [] ? tlb_flush_mmu+0x3b/0x90
[ 2059.574152]  [] ? tlb_finish_mmu+0x14/0x40
[ 2059.574951]  [] ? zap_page_range_single+0x146/0x160
[ 2059.575797]  [] ? trace_hardirqs_on+0x8/0x10
[ 2059.576629]  [] ? unmap_mapping_range+0x73/0x180
[ 2059.577362]  [] ? unmap_mapping_range+0xfe/0x180
[ 2059.578194]  [] ? truncate_inode_page+0x37/0x90
[ 2059.579013]  [] ? shmem_undo_range+0x711/0x830
[ 2059.579807]  [] ? unmap_mapping_range+0x168/0x180
[ 2059.580729]  [] ? shmem_truncate_range+0x18/0x40
[ 2059.581598]  [] ? shmem_fallocate+0x99/0x2f0
[ 2059.582325]  [] ? madvise_vma+0xde/0x1c0
[ 2059.583049]  [] ? __lock_release+0x1da/0x1f0
[ 2059.583816]  [] ? do_fallocate+0x126/0x170
[ 2059.584581]  [] ? madvise_vma+0xf4/0x1c0
[ 2059.585302]  [] ? SyS_madvise+0x188/0x250
[ 2059.586012]  [] ? tracesys+0xdd/0xe2
[ 2059.586689]  880f39bc3db8 0002 880fce4b 
880fce4b
[ 2059.587768]  880f39bc2010 001d78c0 001d78c0 
001d78c0
[ 2059.588840]  880fce6a 880fce4b 880fe5bd6d40 
880fa88e8ab0


Thanks,
Sasha