On Thu 14-12-17 18:55:27, Yan, Zheng wrote: > We recently got an Oops report: > > BUG: unable to handle kernel NULL pointer dereference at (null) > IP: jbd2__journal_start+0x38/0x1a2 > [...] > Call Trace: > ext4_page_mkwrite+0x307/0x52b > _ext4_get_block+0xd8/0xd8 > do_page_mkwrite+0x6e/0xd8 > handle_mm_fault+0x686/0xf9b > mntput_no_expire+0x1f/0x21e > __do_page_fault+0x21d/0x465 > dput+0x4a/0x2f7 > page_fault+0x22/0x30 > copy_user_generic_string+0x2c/0x40 > copy_page_to_iter+0x8c/0x2b8 > generic_file_read_iter+0x26e/0x845 > timerqueue_del+0x31/0x90 > ceph_read_iter+0x697/0xa33 [ceph] > hrtimer_cancel+0x23/0x41 > futex_wait+0x1c8/0x24d > get_futex_key+0x32c/0x39a > __vfs_read+0xe0/0x130 > vfs_read.part.1+0x6c/0x123 > handle_mm_fault+0x831/0xf9b > __fget+0x7e/0xbf > SyS_read+0x4d/0xb5 > > ceph_read_iter() uses current->journal_info to pass context info to > ceph_readpages(). Because ceph_readpages() needs to know if its caller > has already gotten capability of using page cache (distinguish read > from readahead/fadvise). ceph_read_iter() set current->journal_info, > then calls generic_file_read_iter(). > > In above Oops, page fault happened when copying data to userspace. > Page fault handler called ext4_page_mkwrite(). Ext4 code read > current->journal_info and assumed it is journal handle. > > I checked other filesystems, btrfs probably suffers similar problem > for its readpage. (page fault happens when write() copies data from > userspace memory and the memory is mapped to a file in btrfs. > verify_parent_transid() can be called during readpage) > > Cc: sta...@vger.kernel.org > Signed-off-by: "Yan, Zheng" <z...@redhat.com>
I am not an FS expert so (ab)using journal_info for unrelated purposes might be acceptable in general but hooking into the generic PF path like this is just too ugly to live. Can this be limited to a FS code so that not everybody has to pay additional cycles? With a big fat warning that (ab)users might want to find a better way to comunicate their internal stuff. > --- > mm/memory.c | 14 ++++++++++++++ > 1 file changed, 14 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index a728bed16c20..db2a50233c49 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, > unsigned long address, > unsigned int flags) > { > int ret; > + void *old_journal_info; > > __set_current_state(TASK_RUNNING); > > @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, > unsigned long address, > if (flags & FAULT_FLAG_USER) > mem_cgroup_oom_enable(); > > + /* > + * Fault can happen when filesystem A's read_iter()/write_iter() > + * copies data to/from userspace. Filesystem A may have set > + * current->journal_info. If the userspace memory is MAP_SHARED > + * mapped to a file in filesystem B, we later may call filesystem > + * B's vm operation. Filesystem B may also want to read/set > + * current->journal_info. > + */ > + old_journal_info = current->journal_info; > + current->journal_info = NULL; > + > if (unlikely(is_vm_hugetlb_page(vma))) > ret = hugetlb_fault(vma->vm_mm, vma, address, flags); > else > ret = __handle_mm_fault(vma, address, flags); > > + current->journal_info = old_journal_info; > + > if (flags & FAULT_FLAG_USER) { > mem_cgroup_oom_disable(); > /* > -- > 2.13.6 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a> -- Michal Hocko SUSE Labs