Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-05-16 Thread Guillaume Morin
On 02 May  5:59, Guillaume Morin wrote:
>
> On 30 Apr 21:25, David Hildenbrand wrote:
> > > I tried to get the hugepd stuff right but this was the first I heard
> > > about it :-) Afaict follow_huge_pmd and friends were already DTRT
> > 
> > I'll have to have a closer look at some details (the hugepd writability
> > check looks a bit odd), but it's mostly what I would have expected!
> 
> Ok in the meantime, here is the uprobe change on your current
> uprobes_cow trying to address the comments you made in your previous
> message. Some of them were not 100% clear to me, so it's a best effort
> patch :-) Again lightly tested

David, have you had a chance to take a look at both patches?

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-05-01 Thread Guillaume Morin
return PTR_ERR(page);
 
@@ -1182,9 +1290,12 @@ static int __uprobe_register(struct inode *inode, loff_t 
offset,
if (!uc->handler && !uc->ret_handler)
return -EINVAL;
 
-   /* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */
+   /* copy_insn() uses read_mapping_page() or shmem/hugetlbfs specific
+* logic
+*/
if (!inode->i_mapping->a_ops->read_folio &&
-   !shmem_mapping(inode->i_mapping))
+   !shmem_mapping(inode->i_mapping) &&
+   !hugetlbfs_mapping(inode->i_mapping))
return -EIO;
/* Racy, just to catch the obvious mistakes */
if (offset > i_size_read(inode))

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread Guillaume Morin
swap space. */
> > +   dec_mm_counter(mm, MM_ANONPAGES);
> > +   folio_remove_rmap_pte(folio, page, vma);
> > +   if (!folio_mapped(folio) && folio_test_swapcache(folio) &&
> > +   folio_trylock(folio)) {
> > +   folio_free_swap(folio);
> > +   folio_unlock(folio);
> > +   }
> > }
> > folio_put(folio);
> > @@ -461,11 +471,29 @@ static int __write_opcode_pte(pte_t *ptep, unsigned 
> > long vaddr,
> >  */
> > smp_wmb();
> > /* We modified the page. Make sure to mark the PTE dirty. */
> > -   set_pte_at(mm, vaddr, ptep, pte_mkdirty(pte));
> > +   if (folio_test_hugetlb(folio))
> > +   set_huge_pte_at(mm , vaddr, ptep, huge_pte_mkdirty(pte),
> > +   (~page_mask) + 1);
> > +   else
> > +   set_pte_at(mm, vaddr, ptep, pte_mkdirty(pte));
> > return UWO_DONE;
> >   }
> > +static int __write_opcode_hugetlb(pte_t *ptep, unsigned long hmask,
> > +   unsigned long vaddr,
> > +   unsigned long next, struct mm_walk *walk)
> > +{
> > +   return __write_opcode(ptep, vaddr, hmask, walk);
> > +}
> > +
> > +static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,
> > +   unsigned long next, struct mm_walk *walk)
> > +{
> > +   return __write_opcode(ptep, vaddr, PAGE_MASK, walk);
> > +}
> > +
> >   static const struct mm_walk_ops write_opcode_ops = {
> > +   .hugetlb_entry  = __write_opcode_hugetlb,
> > .pte_entry  = __write_opcode_pte,
> > .walk_lock  = PGWALK_WRLOCK,
> >   };
> > @@ -492,7 +520,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, 
> > struct vm_area_struct *vma,
> > unsigned long opcode_vaddr, uprobe_opcode_t opcode)
> >   {
> > struct uprobe *uprobe = container_of(auprobe, struct uprobe, arch);
> > -   const unsigned long vaddr = opcode_vaddr & PAGE_MASK;
> > +   unsigned long vaddr = opcode_vaddr & PAGE_MASK;
> > const bool is_register = !!is_swbp_insn();
> > struct uwo_data data = {
> > .opcode = opcode,
> > @@ -503,6 +531,7 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, 
> > struct vm_area_struct *vma,
> > struct mmu_notifier_range range;
> > int ret, ref_ctr_updated = 0;
> > struct page *page;
> > +   unsigned long page_size = PAGE_SIZE;
> > if (WARN_ON_ONCE(!is_cow_mapping(vma->vm_flags)))
> > return -EINVAL;
> > @@ -521,7 +550,14 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, 
> > struct vm_area_struct *vma,
> > if (ret != 1)
> > goto out;
> > -   ret = verify_opcode(page, opcode_vaddr, );
> > +
> > +   if (is_vm_hugetlb_page(vma)) {
> > +   struct hstate *h = hstate_vma(vma);
> > +   page_size = huge_page_size(h);
> > +   vaddr &= huge_page_mask(h);
> > +   page = compound_head(page);
> 
> I think we should only adjust the range we pass to the mmu notifier and for
> walking the VMA range. But we should not adjust vaddr.
> 
> Further, we should not adjust the page if possible ... ideally, we'll treat
> hugetlb folios just like large folios here and operate on subpages.
> 
> Inside __write_opcode(), we can derive the the page of interest from
> data->opcode_vaddr.

Here you mean __write_opcode_hugetlb(), right? Since we're going with
the 2 independent variants. Just want to 100% sure I am following

> find_get_page() might need some though, if it won't return a subpage of a
> hugetlb folio. Should be solvable by a wrapper, though.

We can zero out the subbits with the huge page mask in the
vaddr_to_offset() in the hugetlb variant like I do in __copy_insn() and
that should work, no? Or you prefer a wrapper?

Guillaume.

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread Guillaume Morin
On 26 Apr 21:55, Guillaume Morin wrote:
>
> On 26 Apr  9:19, David Hildenbrand wrote:
> > A couple of points:
> > 
> > a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
> > to check PageAnonExclusive.
> > 
> > b) If you're not following the can_follow_write_pte/_pmd model, you are
> > doing something wrong :)
> > 
> > c) The code was heavily changed in mm/mm-unstable. It was merged with t
> > the common code.
> > 
> > Likely, in mm/mm-unstable, the existing can_follow_write_pte and
> > can_follow_write_pmd checks will already cover what you want in most cases.
> > 
> > We'd need a can_follow_write_pud() to cover follow_huge_pud() and
> > (unfortunately) something to handle follow_hugepd() as well similarly.
> > 
> > Copy-pasting what we do in can_follow_write_pte() and adjusting for
> > different PTE types is the right thing to do. Maybe now it's time to factor
> > out the common checks into a separate helper.
> 
> I tried to get the hugepd stuff right but this was the first I heard
> about it :-) Afaict follow_huge_pmd and friends were already DTRT

I got it working on top of your uprobes-cow branch with the foll force
patch sent friday. Still pretty lightly tested

I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping) {
+   return mapping->a_ops == _aops;
+}
+
 /*
  * Mask used when checking the page offset value passed in via system
  * calls.  This value will be converted to a loff_t which is signed.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 68244bb3637a..66fdcc0b5db2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,8 @@ struct hugetlbfs_sb_info {
umode_t mode;
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping);
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
return sb->s_fs_info;
@@ -557,6 +559,8 @@ static inline struct hstate *hstate_inode(struct inode *i)
 {
return NULL;
 }
+
+static inline bool hugetlbfs_mapping(struct address_space *mapping) { return 
false; }
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4237a7477cca..e6e93a574c39 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include  /* read_mapping_page */
 #include 
 #include 
@@ -120,7 +121,7 @@ struct xol_area {
  */
 static bool valid_vma(struct vm_area_struct *vma, bool is_register)
 {
-   vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+   vm_flags_t flags = VM_MAYEXEC | VM_MAYSHARE;
 
if (is_register)
flags |= VM_WRITE;
@@ -163,21 +164,21 @@ bool __weak is_trap_insn(uprobe_opcode_t *insn)
return is_swbp_insn(insn);
 }
 
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len)
+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len, unsigned long page_mask)
 {
void *kaddr = kmap_atomic(page);
-   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
kunmap_atomic(kaddr);
 }
 
-static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len)
+static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len, unsigned long page_mask)
 {
void *kaddr = kmap_atomic(page);
-   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+   memcpy(kaddr + (vaddr & ~page_mask), src, len);
kunmap_atomic(kaddr);
 }
 
-static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode, unsigned long page_mask)
 {
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long 
vaddr, uprobe_opcode_t
 * is a trap variant; uprobes always wins over any other (gdb)
 * breakpoint.
 */
-   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE);
+   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE,
+  page_mask);
is_swbp = is_swbp_insn(_opcode);
 
if (is_swbp_insn(new_opcode)) {
@@ -376,8 +378,8 @@ struct uwo_data {
uprobe_opcode_t opcode;
 };
 
-sta

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-26 Thread Guillaume Morin
if (likely(vmf->pte && pte_same(huge_ptep_get(vmf->pte), pte))) {
-   pte_t newpte = make_huge_pte(vma, _folio->page, !unshare);
+   const bool writable = !unshare && (vma->vm_flags & VM_WRITE);
+   pte_t newpte = make_huge_pte(vma, _folio->page, writable);
 
/* Break COW or unshare */
huge_ptep_clear_flush(vma, vmf->address, vmf->pte);


-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread Guillaume Morin
On 25 Apr 21:56, David Hildenbrand wrote:
>
> On 25.04.24 17:19, Guillaume Morin wrote:
> > On 24 Apr 23:00, David Hildenbrand wrote:
> > > > One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
> > > > hugetlb mappings. However this was also on my TODO and I have a draft
> > > > patch that implements it.
> > > 
> > > Yes, I documented it back then and added sanity checks in GUP code to 
> > > fence
> > > it off. Shouldn't be too hard to implement (famous last words) and would 
> > > be
> > > the cleaner thing to use here once I manage to switch over to
> > > FOLL_WRITE|FOLL_FORCE to break COW.
> > 
> > Yes, my patch seems to be working. The hugetlb code is pretty simple.
> > And it allows ptrace and the proc pid mem file to work on the executable
> > private hugetlb mappings.
> > 
> > There is one thing I am unclear about though. hugetlb enforces that
> > huge_pte_write() is true on FOLL_WRITE in both the fault and
> > follow_page_mask paths. I am not sure if we can simply assume in the
> > hugetlb code that if the pte is not writable and this is a write fault
> > then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
> > checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?
> > 
> > The latter is more complicated in the fault path because there is no
> > FAULT_FLAG_FORCE flag.
> > 
> 
> I just pushed something to
>   https://github.com/davidhildenbrand/linux/tree/uprobes_cow
> 
> Only very lightly tested so far. Expect the worst :)


I'll try it out and send you the hugetlb bits

> 
> I still detest having the zapping logic there, but to get it all right I
> don't see a clean way around that.
> 
> 
> For hugetlb, we'd primarily have to implement the
> mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE).

For FOLL_FORCE, heer is my draft. Let me know if this is what you had in
mind. 


diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..ac60e0ae64e8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1056,9 +1056,6 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
-   /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-   if (is_vm_hugetlb_page(vma))
-   return -EFAULT;
/*
 * We used to let the write,force case do COW in a
 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3548eae42cf9..73f86eddf888 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5941,7 +5941,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
   struct folio *pagecache_folio, spinlock_t *ptl,
   struct vm_fault *vmf)
 {
-   const bool unshare = flags & FAULT_FLAG_UNSHARE;
+   const bool make_writable = !(flags & FAULT_FLAG_UNSHARE) &&
+   (vma->vm_flags & VM_WRITE);
pte_t pte = huge_ptep_get(ptep);
struct hstate *h = hstate_vma(vma);
struct folio *old_folio;
@@ -5959,16 +5960,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, 
struct vm_area_struct *vma,
 * can trigger this, because hugetlb_fault() will always resolve
 * uffd-wp bit first.
 */
-   if (!unshare && huge_pte_uffd_wp(pte))
+   if (make_writable && huge_pte_uffd_wp(pte))
return 0;
 
-   /*
-* hugetlb does not support FOLL_FORCE-style write faults that keep the
-* PTE mapped R/O such as maybe_mkwrite() would do.
-*/
-   if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE)))
-   return VM_FAULT_SIGSEGV;
-
/* Let's take out MAP_SHARED mappings first. */
if (vma->vm_flags & VM_MAYSHARE) {
set_huge_ptep_writable(vma, haddr, ptep);
@@ -5989,7 +5983,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
folio_move_anon_rmap(old_folio, vma);
SetPageAnonExclusive(_folio->page);
}
-   if (likely(!unshare))
+   if (likely(make_writable))
set_huge_ptep_writable(vma, haddr, ptep);
 
delayacct_wpcopy_end();
@@ -6094,7 +6088,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
spin_lock(ptl);
ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread Guillaume Morin
On 24 Apr 23:00, David Hildenbrand wrote:
> > One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
> > hugetlb mappings. However this was also on my TODO and I have a draft
> > patch that implements it.
> 
> Yes, I documented it back then and added sanity checks in GUP code to fence
> it off. Shouldn't be too hard to implement (famous last words) and would be
> the cleaner thing to use here once I manage to switch over to
> FOLL_WRITE|FOLL_FORCE to break COW.

Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-24 Thread Guillaume Morin
On 24 Apr 22:09, David Hildenbrand wrote:
> > > Let me try to see if we can get this done cleaner.
> > > 
> > > One ugly part (in general here) is the custom page replacement in the
> > > registration part.
> > > 
> > > We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing 
> > > pages
> > > ourselves (which we likely shouldn't do ...) ... maybe we could use
> > > FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
> > > populated. (like KSM does nowadays)
> > > 
> > > Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
> > > using FOLL_WRITE would not work on many file systems. So maybe we have to
> > > trigger an unsharing fault ourselves.
> 
> ^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to
> break COW.

It was never clear to me why uprobes was not doing FOLL_WRITE in the
first place, I must say.

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.

> 
> > > 
> > > That would do the page replacement for us and we "should" be able to 
> > > lookup
> > > an anonymous folio that we can then just modify, like ptrace would.
> > > 
> > > But then, there is also unregistration part, with weird conditional page
> > > replacement. Zapping the anon page if the content matches the content of 
> > > the
> > > original page is one thing. But why are we placing an existing anonymous
> > > page by a new anonymous page when the content from the original page 
> > > differs
> > > (but matches the one from the just copied page?)?
> > > 
> > > I'll have to further think about that one. It's all a bit nasty.
> > 
> > Sounds good to me. I am willing to help with the code when you have a
> > plan or testing as you see fit. Let me know.
> 
> I'm hacking on a redesign that removes the manual COW breaking logic and
> *might* make it easier to integrate hugetlb. (very likely, but until I have
> the redesign running I cannot promise anything :) )
> 
> I'll let you know once I have something ready so you could integrate the
> hugetlb portion.

Sounds good.

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-22 Thread Guillaume Morin
On 22 Apr 20:59, David Hildenbrand wrote:
> > The benefit - to me - is very clear. People do use hugetlb mappings to
> > run code in production environments. The perf benefits are there for some
> > workloads. Intel has published a whitepaper about it etc.
> > Uprobes are a very good tool to do live tracing. If you can restart the
> > process and reproduce, you should be able to disable hugetlb remapping
> > but if you need to look at a live process, there are not many options.
> > Not being able to use uprobes is crippling.
> 
> Please add all that as motivation to the patch description or cover letter.
>
> > > Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
> > > Nobody cared until now, why care now?
> > 
> > I think you could ask the same question for every new feature patch :)
> 
> I have to, because it usually indicates a lack of motivation in the
> cover-letter/patch description :P

My cover letter was indeed lacking. I will make sure to add this kind of
details next time.
 
> > Since the removal a few releases ago of the __morecore() hook in glibc,
> > the main feature of libhugetlbfs is ELF segments remapping. I think
> > there are definitely a lot of users that simply deal with this
> > unnecessary limitation.
> > 
> > I am certainly not shoving this patch through anyone's throat if there
> > is no interest. But we definitely find it a very useful feature ...
> 
> Let me try to see if we can get this done cleaner.
> 
> One ugly part (in general here) is the custom page replacement in the
> registration part.
> 
> We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages
> ourselves (which we likely shouldn't do ...) ... maybe we could use
> FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
> populated. (like KSM does nowadays)
> 
> Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
> using FOLL_WRITE would not work on many file systems. So maybe we have to
> trigger an unsharing fault ourselves.
> 
> That would do the page replacement for us and we "should" be able to lookup
> an anonymous folio that we can then just modify, like ptrace would.
> 
> But then, there is also unregistration part, with weird conditional page
> replacement. Zapping the anon page if the content matches the content of the
> original page is one thing. But why are we placing an existing anonymous
> page by a new anonymous page when the content from the original page differs
> (but matches the one from the just copied page?)?
> 
> I'll have to further think about that one. It's all a bit nasty.

Sounds good to me. I am willing to help with the code when you have a
plan or testing as you see fit. Let me know.

> One thing to note is that hugetlb folios don't grow on trees. Likely, Many
> setups *don't* reserve extra hugetlb folios and you might just easily be
> running out of free hugetlb folios that you can use to break COW here
> (replace a file hugetlb by a fresh anon hugetlb page). Likely it's easy to
> make register or unregister fail.

Agreed.

-- 
Guillaume Morin 



Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-22 Thread Guillaume Morin
(Dropping Mike Kravetz as CC since he has retired and his email is no
longer valid, adding Muchun since he's the current hugetlb maintainer,
as well as linux-trace-kernel)

On 22 Apr 11:39, David Hildenbrand wrote:
>
> On 19.04.24 20:37, Guillaume Morin wrote:
> > libhugetlbfs, the Intel iodlr code both allow to remap .text onto a
> > hugetlb private mapping. It's also pretty easy to do it manually.
> > One drawback of using this functionality is the lack of support for
> > uprobes (NOTE uprobe ignores shareable vmas)
> > 
> > This patch adds support for private hugetlb mappings.  It does require 
> > exposing
> > some hugetlbfs innards and relies on copy_user_large_folio which is only
> > available when CONFIG_HUGETLBFS is used so I had to use an ugly #ifdef
> > 
> > If there is some interest in applying this patch in some form or
> > another, I am open to any refactoring suggestions (esp getting rid the
> > #ifdef in uprobes.c) . I tried to limit the
> > amount of branching.
> 
> All that hugetlb special casing  oh my. What's the benefit why we should
> be interested in making that code less clean -- to phrase it in a nice way
> ;) ?

I do appreciate the nice phrasing. Believe me, I did try to limit the
special casing to a minimum :-).

Outside of __replace_page, I added only 3-ish branches so I do not think
it's *too* bad. The uprobe code is using PAGE_{SHIFT,MASK} quite liberally so I
had to add calls to retrieve these for the hugetlb vmas.

__replace_page has a lot of special casing. I certainly agree (and
unfortunately for me it's at the beginning of the patch :)).  It's doing
something pretty uncommon outside of the mm code so it has to make a
bunch of specific hugetlb calls. I am not quite sure how to improve it
but if you have suggestions, I'd be happy to refactor.

The benefit - to me - is very clear. People do use hugetlb mappings to
run code in production environments. The perf benefits are there for some
workloads. Intel has published a whitepaper about it etc.
Uprobes are a very good tool to do live tracing. If you can restart the
process and reproduce, you should be able to disable hugetlb remapping
but if you need to look at a live process, there are not many options.
Not being able to use uprobes is crippling.

> Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
> Nobody cared until now, why care now?

I think you could ask the same question for every new feature patch :)

Since the removal a few releases ago of the __morecore() hook in glibc,
the main feature of libhugetlbfs is ELF segments remapping. I think
there are definitely a lot of users that simply deal with this
unnecessary limitation.

I am certainly not shoving this patch through anyone's throat if there
is no interest. But we definitely find it a very useful feature ...

Guillaume.

-- 
Guillaume Morin 



[BUG] incorrect scaling_max_freq with intel_pstate after offline/online

2021-01-28 Thread Guillaume Morin
_freq = 0;
 
if (hwp_active) {
intel_pstate_set_itmt_prio(policy->cpu);
@@ -412,8 +413,23 @@ static void intel_pstate_init_acpi_perf_limits(struct 
cpufreq_policy *policy)
 
cpu = all_cpu_data[policy->cpu];
 
-   ret = acpi_processor_register_performance(>acpi_perf_data,
- policy->cpu);
+   /*
+* The _PSS table doesn't contain whole turbo frequency range.
+* This just contains +1 MHZ above the max non turbo frequency,
+* with control value corresponding to max turbo ratio. But
+* when cpufreq set policy is called, it will call with this
+* max frequency, which will cause a reduced performance as
+* this driver uses real max turbo frequency as the max
+* frequency. So correct this frequency in _PSS table to
+* correct max turbo frequency based on the turbo state.
+* Also need to convert to MHz as _PSS freq is in MHz.
+*/
+   if (!global.turbo_disabled)
+   override_max_freq = policy->cpuinfo.max_freq / 1000;
+
+   ret = __acpi_processor_register_performance(>acpi_perf_data,
+ policy->cpu,
+ override_max_freq);
if (ret)
return;
 
@@ -442,20 +458,6 @@ static void intel_pstate_init_acpi_perf_limits(struct 
cpufreq_policy *policy)
 (u32) cpu->acpi_perf_data.states[i].control);
}
 
-   /*
-* The _PSS table doesn't contain whole turbo frequency range.
-* This just contains +1 MHZ above the max non turbo frequency,
-* with control value corresponding to max turbo ratio. But
-* when cpufreq set policy is called, it will call with this
-* max frequency, which will cause a reduced performance as
-* this driver uses real max turbo frequency as the max
-* frequency. So correct this frequency in _PSS table to
-* correct max turbo frequency based on the turbo state.
-* Also need to convert to MHz as _PSS freq is in MHz.
-*/
-   if (!global.turbo_disabled)
-   cpu->acpi_perf_data.states[0].core_frequency =
-   policy->cpuinfo.max_freq / 1000;
cpu->valid_pss_table = true;
pr_debug("_PPC limits will be enforced\n");
 
diff --git a/include/acpi/processor.h b/include/acpi/processor.h
index 683e124ad517..4b2ce80ffbec 100644
--- a/include/acpi/processor.h
+++ b/include/acpi/processor.h
@@ -250,8 +250,15 @@ extern int acpi_processor_preregister_performance(struct
  acpi_processor_performance
  __percpu *performance);
 
-extern int acpi_processor_register_performance(struct 
acpi_processor_performance
-  *performance, unsigned int cpu);
+extern int __acpi_processor_register_performance(
+   struct acpi_processor_performance
+   *performance, unsigned int cpu,
+   u64 override_pss0_freq);
+static inline int acpi_processor_register_performance(
+   struct acpi_processor_performance
+  *performance, unsigned int cpu) {
+   return __acpi_processor_register_performance(performance, cpu, 0);
+}
 extern void acpi_processor_unregister_performance(unsigned int cpu);
 
 int acpi_processor_pstate_control(void);


-- 
Guillaume Morin 


Re: kernel panics with 4.14.X versions

2018-04-16 Thread Guillaume Morin
On 16 Apr 16:40, Jan Kara wrote:
> Can you please run RIP through ./scripts/faddr2line to see where exactly
> are we looping? I expect the loop iterating over marks to notify but better
> be sure.
> 
> How easily can you hit this? Are you able to run debug kernels / inspect
> crash dumps when the issue occurs? Also testing with the latest mainline
> kernel (4.16) would be welcome whether this isn't just an issue with the
> backport of fsnotify fixes from Miklos.

I do have one proper kernel crash dump for one of the lockups we saw

PID: 30407  TASK: 9584913b2180  CPU: 8   COMMAND: "python"
 #0 [959cb7883d80] machine_kexec at 890561ff
 #1 [959cb7883dd8] __crash_kexec at 890f6dde
 #2 [959cb7883e90] panic at 89074f03
 #3 [959cb7883f10] watchdog_timer_fn at 89117388
 #4 [959cb7883f40] __hrtimer_run_queues at 890dc65c
 #5 [959cb7883f88] hrtimer_interrupt at 890dcb76
 #6 [959cb7883fd8] smp_apic_timer_interrupt at 89802f6a
 #7 [959cb7883ff0] apic_timer_interrupt at 8980227d
---  ---
 #8 [afa5c894f880] apic_timer_interrupt at 8980227d
[exception RIP: unknown or invalid address]
RIP:   RSP: 8a696820  RFLAGS: 0002
RAX: 95908f520c20  RBX:   RCX: 
RDX: 959c83c4d000  RSI:   RDI: afa5c894f9f8
RBP: 53411000   R8:    R9: 95908f520c48
R10:   R11:   R12: 1000
R13: 1000  R14: 1000  R15: 5341
ORIG_RAX:   CS:   SS: ff10
bt: WARNING: possibly bogus exception frame
 #9 [afa5c894f928] fsnotify at 892293e7
#10 [afa5c894f9e8] __fsnotify_parent at 89229686
#11 [afa5c894fa48] __kernel_write at 891e9962
#12 [afa5c894fa70] dump_emit at 892445af
#13 [afa5c894faa8] elf_core_dump at 8923f546
#14 [afa5c894fc60] do_coredump at 89244c3f
#15 [afa5c894fda0] get_signal at 89083ed0
#16 [afa5c894fe18] do_signal at 89028323
#17 [afa5c894ff10] exit_to_usermode_loop at 8900308c
#18 [afa5c894ff38] prepare_exit_to_usermode at 89003753
RIP: 7f69706935c3  RSP: 7ffeb8c1b4a8  RFLAGS: 00010206
RAX: 7f686d200034  RBX: 5591f24f0170  RCX: 7f68cb80
RDX: 7f696d20  RSI: 0061  RDI: 7f686d200034
RBP: 7f686d200010   R8:    R9: 00ff
R10: e0a9a400  R11: 0246  R12: 0001
R13: 0001  R14:   R15: 0083
ORIG_RAX:   CS: 0033  SS: 002b

faddr2line gives "fsnotify at fs/notify/fsnotify.c:368" (it's a 4.14.22).  So
it does seem that you were right about the location.

This happens with systemd handling coredumps.  It's using fsnotify to learn
about new dumps.

Note that on this machine, the dumps are on a loop mount:
/dev/loop0 /usr/cores ext4 rw,relatime,data=ordered 0 0

-- 
Guillaume Morin <guilla...@morinfr.org>


Re: kernel panics with 4.14.X versions

2018-04-16 Thread Guillaume Morin
On 16 Apr 16:40, Jan Kara wrote:
> Can you please run RIP through ./scripts/faddr2line to see where exactly
> are we looping? I expect the loop iterating over marks to notify but better
> be sure.
> 
> How easily can you hit this? Are you able to run debug kernels / inspect
> crash dumps when the issue occurs? Also testing with the latest mainline
> kernel (4.16) would be welcome whether this isn't just an issue with the
> backport of fsnotify fixes from Miklos.

I do have one proper kernel crash dump for one of the lockups we saw

PID: 30407  TASK: 9584913b2180  CPU: 8   COMMAND: "python"
 #0 [959cb7883d80] machine_kexec at 890561ff
 #1 [959cb7883dd8] __crash_kexec at 890f6dde
 #2 [959cb7883e90] panic at 89074f03
 #3 [959cb7883f10] watchdog_timer_fn at 89117388
 #4 [959cb7883f40] __hrtimer_run_queues at 890dc65c
 #5 [959cb7883f88] hrtimer_interrupt at 890dcb76
 #6 [959cb7883fd8] smp_apic_timer_interrupt at 89802f6a
 #7 [959cb7883ff0] apic_timer_interrupt at 8980227d
---  ---
 #8 [afa5c894f880] apic_timer_interrupt at 8980227d
[exception RIP: unknown or invalid address]
RIP:   RSP: 8a696820  RFLAGS: 0002
RAX: 95908f520c20  RBX:   RCX: 
RDX: 959c83c4d000  RSI:   RDI: afa5c894f9f8
RBP: 53411000   R8:    R9: 95908f520c48
R10:   R11:   R12: 1000
R13: 1000  R14: 1000  R15: 5341
ORIG_RAX:   CS:   SS: ff10
bt: WARNING: possibly bogus exception frame
 #9 [afa5c894f928] fsnotify at 892293e7
#10 [afa5c894f9e8] __fsnotify_parent at 89229686
#11 [afa5c894fa48] __kernel_write at 891e9962
#12 [afa5c894fa70] dump_emit at 892445af
#13 [afa5c894faa8] elf_core_dump at 8923f546
#14 [afa5c894fc60] do_coredump at 89244c3f
#15 [afa5c894fda0] get_signal at 89083ed0
#16 [afa5c894fe18] do_signal at 89028323
#17 [afa5c894ff10] exit_to_usermode_loop at 8900308c
#18 [afa5c894ff38] prepare_exit_to_usermode at 89003753
RIP: 7f69706935c3  RSP: 7ffeb8c1b4a8  RFLAGS: 00010206
RAX: 7f686d200034  RBX: 5591f24f0170  RCX: 7f68cb80
RDX: 7f696d20  RSI: 0061  RDI: 7f686d200034
RBP: 7f686d200010   R8:    R9: 00ff
R10: e0a9a400  R11: 0246  R12: 0001
R13: 0001  R14:   R15: 0083
ORIG_RAX:   CS: 0033  SS: 002b

faddr2line gives "fsnotify at fs/notify/fsnotify.c:368" (it's a 4.14.22).  So
it does seem that you were right about the location.

This happens with systemd handling coredumps.  It's using fsnotify to learn
about new dumps.

Note that on this machine, the dumps are on a loop mount:
/dev/loop0 /usr/cores ext4 rw,relatime,data=ordered 0 0

-- 
Guillaume Morin 


Re: 3.14 stable regression don't remove from shrink list in select_collect()

2016-01-25 Thread Guillaume Morin
On 25 Jan 11:25, Jason Baron wrote:
>
> On 01/25/2016 11:05 AM, Shawn Bohrer wrote:
> > I recently updated some machines to 3.14.58 and they reliably get soft
> > lockups.  Sometimes the soft lockup recovers and some times it does
> > not.  I've bisected this on the 3.14 stable branch and arrived at:
> > 
> > c214cb82cdc744225d85899fc138251527f75fff don't remove from shrink
> > list in select_collect()
> > 
> > Reverting this commit plus adding back dentry_lru_del() which was
> > removed later on top of 3.14.58 resolves the issue for me.  I've
> > included a patch at the bottom with the revert.
> 
> I ran into this too on 3.14. You can revert these patches as you've done
> or backport the upstream fixes mentioned here:
> 
> http://www.spinics.net/lists/stable/msg111544.html

I actually messed up the commit ids in that message, the fixed list is
here:
http://www.spinics.net/lists/stable/msg111553.html

-- 
Guillaume Morin 


Re: 3.14 stable regression don't remove from shrink list in select_collect()

2016-01-25 Thread Guillaume Morin
On 25 Jan 11:25, Jason Baron wrote:
>
> On 01/25/2016 11:05 AM, Shawn Bohrer wrote:
> > I recently updated some machines to 3.14.58 and they reliably get soft
> > lockups.  Sometimes the soft lockup recovers and some times it does
> > not.  I've bisected this on the 3.14 stable branch and arrived at:
> > 
> > c214cb82cdc744225d85899fc138251527f75fff don't remove from shrink
> > list in select_collect()
> > 
> > Reverting this commit plus adding back dentry_lru_del() which was
> > removed later on top of 3.14.58 resolves the issue for me.  I've
> > included a patch at the bottom with the revert.
> 
> I ran into this too on 3.14. You can revert these patches as you've done
> or backport the upstream fixes mentioned here:
> 
> http://www.spinics.net/lists/stable/msg111544.html

I actually messed up the commit ids in that message, the fixed list is
here:
http://www.spinics.net/lists/stable/msg111553.html

-- 
Guillaume Morin <guilla...@morinfr.org>


Re: Linux 3.14.58

2015-12-22 Thread Guillaume Morin
Jason,

On 22 Dec 11:02, Jason Baron wrote:
> We've noticed livelocks in shrink_dentry_list() very similar to this
> report on stable 3.14.56:
> http://lkml.iu.edu/hypermail/linux/kernel/1405.3/00470.html
> 
> It appears that the patches that introduced this issue came into 3.14.51
> in August with the same set of patches mentioned in the referenced
> report. We unfortunately haven't narrowed down a reproducer, so I can't
> bisect the issue, but 3.14.50 did not appear to have the issue.
> 
> Upstream merge commit - 6f6111e4a73d0f6370eb8be4f8e4523210b6a67d (Pull
> vfs dcache livelock fix from Al Viro) brought in a series to address
> this issue. So that may be needed here as well...

FWIW I noticed the same thing.  The commits listed in
http://www.spinics.net/lists/stable/msg111553.html fixed the problem for me.

I assume Greg will pick them up soon...

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.14.58

2015-12-22 Thread Guillaume Morin
Jason,

On 22 Dec 11:02, Jason Baron wrote:
> We've noticed livelocks in shrink_dentry_list() very similar to this
> report on stable 3.14.56:
> http://lkml.iu.edu/hypermail/linux/kernel/1405.3/00470.html
> 
> It appears that the patches that introduced this issue came into 3.14.51
> in August with the same set of patches mentioned in the referenced
> report. We unfortunately haven't narrowed down a reproducer, so I can't
> bisect the issue, but 3.14.50 did not appear to have the issue.
> 
> Upstream merge commit - 6f6111e4a73d0f6370eb8be4f8e4523210b6a67d (Pull
> vfs dcache livelock fix from Al Viro) brought in a series to address
> this issue. So that may be needed here as well...

FWIW I noticed the same thing.  The commits listed in
http://www.spinics.net/lists/stable/msg111553.html fixed the problem for me.

I assume Greg will pick them up soon...

-- 
Guillaume Morin <guilla...@morinfr.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging: llite: rw.c: use gfp_t to store GFP flags

2014-07-21 Thread Guillaume Morin
On 21 Jul 19:08, Greg KH wrote:
> Odd thing is, that function doesn't seem to even use that variable
> anywhere...  Can you just remove it entirely?

Oh that's right.  Resubmitted.

Thanks

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] staging: llite: rw.c: remove gfp_mask

2014-07-21 Thread Guillaume Morin
sparse reported that gfp_mask was of the wrong type to store gfp flags.  The
variable is not used so it can be removed.

Signed-off-by: Guillaume Morin 
Suggested-by: gre...@linuxfoundation.org
---
v2: remove the variable instead of just fixing the type since it is not used

 drivers/staging/lustre/lustre/llite/rw.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c 
b/drivers/staging/lustre/lustre/llite/rw.c
index 5616210..1b5416f 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -496,14 +496,9 @@ static int ll_read_ahead_page(const struct lu_env *env, 
struct cl_io *io,
struct cl_object *clob  = ll_i2info(mapping->host)->lli_clob;
struct cl_page   *page;
enum ra_stat  which = _NR_RA_STAT; /* keep gcc happy */
-   unsigned int  gfp_mask;
intrc= 0;
const char   *msg   = NULL;
 
-   gfp_mask = GFP_HIGHUSER & ~__GFP_WAIT;
-#ifdef __GFP_NOWARN
-   gfp_mask |= __GFP_NOWARN;
-#endif
vmpage = grab_cache_page_nowait(mapping, index);
if (vmpage != NULL) {
/* Check if vmpage was truncated or reclaimed */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging: llite: rw.c: use gfp_t to store GFP flags

2014-07-21 Thread Guillaume Morin
Use gfp_t to store GPF_* flags instead of unsigned int. 
This was reported by sparse.

Signed-off-by: Guillaume Morin 
---
 drivers/staging/lustre/lustre/llite/rw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c 
b/drivers/staging/lustre/lustre/llite/rw.c
index 5616210..3554272 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -496,7 +496,7 @@ static int ll_read_ahead_page(const struct lu_env *env, 
struct cl_io *io,
struct cl_object *clob  = ll_i2info(mapping->host)->lli_clob;
struct cl_page   *page;
enum ra_stat  which = _NR_RA_STAT; /* keep gcc happy */
-   unsigned int  gfp_mask;
+   gfp_t gfp_mask;
intrc= 0;
const char   *msg   = NULL;
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging: llite: rw.c: use gfp_t to store GFP flags

2014-07-21 Thread Guillaume Morin
Use gfp_t to store GPF_* flags instead of unsigned int. 
This was reported by sparse.

Signed-off-by: Guillaume Morin guilla...@morinfr.org
---
 drivers/staging/lustre/lustre/llite/rw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c 
b/drivers/staging/lustre/lustre/llite/rw.c
index 5616210..3554272 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -496,7 +496,7 @@ static int ll_read_ahead_page(const struct lu_env *env, 
struct cl_io *io,
struct cl_object *clob  = ll_i2info(mapping-host)-lli_clob;
struct cl_page   *page;
enum ra_stat  which = _NR_RA_STAT; /* keep gcc happy */
-   unsigned int  gfp_mask;
+   gfp_t gfp_mask;
intrc= 0;
const char   *msg   = NULL;
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] staging: llite: rw.c: remove gfp_mask

2014-07-21 Thread Guillaume Morin
sparse reported that gfp_mask was of the wrong type to store gfp flags.  The
variable is not used so it can be removed.

Signed-off-by: Guillaume Morin guilla...@morinfr.org
Suggested-by: gre...@linuxfoundation.org
---
v2: remove the variable instead of just fixing the type since it is not used

 drivers/staging/lustre/lustre/llite/rw.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c 
b/drivers/staging/lustre/lustre/llite/rw.c
index 5616210..1b5416f 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -496,14 +496,9 @@ static int ll_read_ahead_page(const struct lu_env *env, 
struct cl_io *io,
struct cl_object *clob  = ll_i2info(mapping-host)-lli_clob;
struct cl_page   *page;
enum ra_stat  which = _NR_RA_STAT; /* keep gcc happy */
-   unsigned int  gfp_mask;
intrc= 0;
const char   *msg   = NULL;
 
-   gfp_mask = GFP_HIGHUSER  ~__GFP_WAIT;
-#ifdef __GFP_NOWARN
-   gfp_mask |= __GFP_NOWARN;
-#endif
vmpage = grab_cache_page_nowait(mapping, index);
if (vmpage != NULL) {
/* Check if vmpage was truncated or reclaimed */
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging: llite: rw.c: use gfp_t to store GFP flags

2014-07-21 Thread Guillaume Morin
On 21 Jul 19:08, Greg KH wrote:
 Odd thing is, that function doesn't seem to even use that variable
 anywhere...  Can you just remove it entirely?

Oh that's right.  Resubmitted.

Thanks

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
On 17 Jul 17:33, Naoya Horiguchi wrote:
> In my environment (kernel-3.14.12, libhugetlbfs-utils-2.16-2.fc20.x86_64),
> the crash looks like this:
> 
> [root@test_140717-1333 hugetlbfs_test]# $ export HUGETLB_MORECORE=yes ; 
> export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
> bash: $: command not found...
> p 0x2200010
> pid 2809
> *** Error in `./heap': break adjusted to free malloc space: 
> 0x02501000 ***
> === Backtrace: =
> /lib64/libc.so.6[0x3940e75cff]
> /lib64/libc.so.6[0x3940e7f121]
> /lib64/libc.so.6(__libc_malloc+0x5c)[0x3940e7ff6c]
> ./heap[0x400767]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x3940e21d65]
> ./heap[0x400619]
> === Memory map: 
> 0040-00401000 r-xp  fd:01 272411 
> /root/hugetlbfs_test/heap
> 0060-00601000 r--p  fd:01 272411 
> /root/hugetlbfs_test/heap
> 00601000-00602000 rw-p 1000 fd:01 272411 
> /root/hugetlbfs_test/heap
> 0220-0260 rw-p  00:0c 23209  
> /anon_hugepage (deleted)
> 0260-0280 rw-p 0040 00:0c 25663  
> /anon_hugepage (deleted)
> 3940a0-3940a2 r-xp  fd:01 175094 
> /usr/lib64/ld-2.18.so
> 3940c1f000-3940c2 r--p 0001f000 fd:01 175094 
> /usr/lib64/ld-2.18.so
> 3940c2-3940c21000 rw-p 0002 fd:01 175094 
> /usr/lib64/ld-2.18.so
> 3940c21000-3940c22000 rw-p  00:00 0
> 3940e0-3940fb4000 r-xp  fd:01 175095 
> /usr/lib64/libc-2.18.so
> 3940fb4000-39411b4000 ---p 001b4000 fd:01 175095 
> /usr/lib64/libc-2.18.so
> 39411b4000-39411b8000 r--p 001b4000 fd:01 175095 
> /usr/lib64/libc-2.18.so
> 39411b8000-39411ba000 rw-p 001b8000 fd:01 175095 
> /usr/lib64/libc-2.18.so
> 39411ba000-39411bf000 rw-p  00:00 0
> 394120-3941203000 r-xp  fd:01 175098 
> /usr/lib64/libdl-2.18.so
> 3941203000-3941402000 ---p 3000 fd:01 175098 
> /usr/lib64/libdl-2.18.so
> 3941402000-3941403000 r--p 2000 fd:01 175098 
> /usr/lib64/libdl-2.18.so
> 3941403000-3941404000 rw-p 3000 fd:01 175098 
> /usr/lib64/libdl-2.18.so
> 7f3860277000-7f386028c000 r-xp  fd:01 184953 
> /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386028c000-7f386048b000 ---p 00015000 fd:01 184953 
> /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386048b000-7f386048c000 r--p 00014000 fd:01 184953 
> /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f386048c000-7f386048d000 rw-p 00015000 fd:01 184953 
> /usr/lib64/libgcc_s-4.8.3-20140624.so.1
> 7f38604a2000-7f38604a6000 rw-p  00:00 0
> 7f38604a6000-7f38604b6000 r-xp  fd:01 177014 
> /usr/lib64/libhugetlbfs.so
> 7f38604b6000-7f38606b5000 ---p 0001 fd:01 177014 
> /usr/lib64/libhugetlbfs.so
> 7f38606b5000-7f38606b6000 r--p f000 fd:01 177014 
> /usr/lib64/libhugetlbfs.so
> 7f38606b6000-7f38606b7000 rw-p 0001 fd:01 177014 
> /usr/lib64/libhugetlbfs.so
> 7f38606b7000-7f38606c2000 rw-p  00:00 0
> 7f38606d5000-7f38606d6000 rw-p  00:00 0
> 7f38606d6000-7f38606d8000 rw-p  00:00 0
> 7fff07c44000-7fff07c65000 rw-p  00:00 0  
> [stack]
> 7fff07d52000-7fff07d54000 r-xp  00:00 0  
> [vdso]
> ff60-ff601000 r-xp  00:00 0  
> [vsyscall]
> 
> Is this crash the same as yours?

It's very similar in the sense that's an assert in malloc.  My error is
slightly different but I guess the exact error depends on a few runtime
parameters

> And it seems that this also happens on v3.16-rc5.
> So it might be an upstream bug, not a stable-specific matter.

That's my understanding as well. I just reported it for 3.4 and 3.14
since these were the kernels I could easily try my original test with.

> It looks strange to me that the problem is gone by removing the commit
> 4a705fef98 (although I confirmed it is,) because the kernel's behavior
> shouldn't change unless (is_hugetlb_entry_migration(entry) ||
> is_hugetlb_entry_hwpoisoned(entry)) is true. And I checked with systemtap
> that both these check returned false in the above test program.
> So I'm wondering why the commit makes difference for this test program.

I don't know why I am just thinking about that now.  Isn't this the

Re: [BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
Nayoa,

Thanks for your answer.

On 17 Jul 14:35, Horiguchi, Naoya wrote:
> I tried some simple operation (below) on 3.14.12, but not reproduced the 
> crash,
> so some non-trivial condition seemed to trigger this.
> Could you elaborate about how you reproduced the crash?

Well, It's just a bunch of fairly random unit tests and I was not able
to make it happen on a small program.  However, I modified your program
a bit and managed to reproduce the problem:

$ cat heap.c 
#include 
#include 
#include 
#include 

int main() {
int i;
char *p = malloc(4096*512);
for (i = 0; i < 512; i++)
p[i*4096] = '1';
printf("p %p\n", p);
for (i = 0 ; i < 10; i++)
if (!fork()) {
memset(p, '2', 4096*512);
p = malloc(4096*512);
printf("pid %d\n", getpid());
memset(p, '3', 4096*512);
free(p);

return 0;
}
pause();
}

This is what happens on my 3.14.12 machine:
$ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./h
p 0x800010
pid 7974
pid 7975
h: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) 
&((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd 
&& old_size == 0) || ((unsigned long) (old_size) >= (unsigned 
long)__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * 
(sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 
0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.

Sometimes the process gets stuck instead asserting out.  But I could not
make it SIGSEGV

Same result with the 3.4.98 kernel.

It works fine when I remove your patch though

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
Naoya, Hugh,

I am seeing lots of crashes with the new copy_hugetlb_page_range() code
added by 4a705fef986231a3e7a6b1a6d3c37025f021f49f for some set of
programs.

Specifically, I am running some test programs which use huge pages for
malloc (through libhugetlbfs with HUGETLB_MORECORE=yes and
HUGETLB_NO_PREFAULT= set) that also do fork() a lot.  The crashes are
very diverse: assertion failures in malloc() or the python GC code, some
SIGSEGV as well.

I started observing these crashes with 3.4.98 and 3.14.12 which just got
a backport of the patch above (as
2bcdd4933ff4dc46445dcae93cb37c648283b782 in the stable branch).  The 3.4
and 3.14 patches are identical to the upstream commit so that's not a patch
backport issue.

If I revert only 2bcdd4933ff4dc46445dcae93cb37c648283b782 in my 3.4
tree, the crashes disappear right away and everything is stable. 

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
Naoya, Hugh,

I am seeing lots of crashes with the new copy_hugetlb_page_range() code
added by 4a705fef986231a3e7a6b1a6d3c37025f021f49f for some set of
programs.

Specifically, I am running some test programs which use huge pages for
malloc (through libhugetlbfs with HUGETLB_MORECORE=yes and
HUGETLB_NO_PREFAULT= set) that also do fork() a lot.  The crashes are
very diverse: assertion failures in malloc() or the python GC code, some
SIGSEGV as well.

I started observing these crashes with 3.4.98 and 3.14.12 which just got
a backport of the patch above (as
2bcdd4933ff4dc46445dcae93cb37c648283b782 in the stable branch).  The 3.4
and 3.14 patches are identical to the upstream commit so that's not a patch
backport issue.

If I revert only 2bcdd4933ff4dc46445dcae93cb37c648283b782 in my 3.4
tree, the crashes disappear right away and everything is stable. 

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
Nayoa,

Thanks for your answer.

On 17 Jul 14:35, Horiguchi, Naoya wrote:
 I tried some simple operation (below) on 3.14.12, but not reproduced the 
 crash,
 so some non-trivial condition seemed to trigger this.
 Could you elaborate about how you reproduced the crash?

Well, It's just a bunch of fairly random unit tests and I was not able
to make it happen on a small program.  However, I modified your program
a bit and managed to reproduce the problem:

$ cat heap.c 
#include stdio.h
#include unistd.h
#include stdlib.h
#include string.h

int main() {
int i;
char *p = malloc(4096*512);
for (i = 0; i  512; i++)
p[i*4096] = '1';
printf(p %p\n, p);
for (i = 0 ; i  10; i++)
if (!fork()) {
memset(p, '2', 4096*512);
p = malloc(4096*512);
printf(pid %d\n, getpid());
memset(p, '3', 4096*512);
free(p);

return 0;
}
pause();
}

This is what happens on my 3.14.12 machine:
$ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./h
p 0x800010
pid 7974
pid 7975
h: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) 
((av)-bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd 
 old_size == 0) || ((unsigned long) (old_size) = (unsigned 
long)__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * 
(sizeof(size_t))) - 1))  ~((2 * (sizeof(size_t))) - 1)))  ((old_top)-size  
0x1)  ((unsigned long)old_end  pagemask) == 0)' failed.

Sometimes the process gets stuck instead asserting out.  But I could not
make it SIGSEGV

Same result with the 3.4.98 kernel.

It works fine when I remove your patch though

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] new copy_hugetlb_page_range() causing crashes

2014-07-17 Thread Guillaume Morin
On 17 Jul 17:33, Naoya Horiguchi wrote:
 In my environment (kernel-3.14.12, libhugetlbfs-utils-2.16-2.fc20.x86_64),
 the crash looks like this:
 
 [root@test_140717-1333 hugetlbfs_test]# $ export HUGETLB_MORECORE=yes ; 
 export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
 bash: $: command not found...
 p 0x2200010
 pid 2809
 *** Error in `./heap': break adjusted to free malloc space: 
 0x02501000 ***
 === Backtrace: =
 /lib64/libc.so.6[0x3940e75cff]
 /lib64/libc.so.6[0x3940e7f121]
 /lib64/libc.so.6(__libc_malloc+0x5c)[0x3940e7ff6c]
 ./heap[0x400767]
 /lib64/libc.so.6(__libc_start_main+0xf5)[0x3940e21d65]
 ./heap[0x400619]
 === Memory map: 
 0040-00401000 r-xp  fd:01 272411 
 /root/hugetlbfs_test/heap
 0060-00601000 r--p  fd:01 272411 
 /root/hugetlbfs_test/heap
 00601000-00602000 rw-p 1000 fd:01 272411 
 /root/hugetlbfs_test/heap
 0220-0260 rw-p  00:0c 23209  
 /anon_hugepage (deleted)
 0260-0280 rw-p 0040 00:0c 25663  
 /anon_hugepage (deleted)
 3940a0-3940a2 r-xp  fd:01 175094 
 /usr/lib64/ld-2.18.so
 3940c1f000-3940c2 r--p 0001f000 fd:01 175094 
 /usr/lib64/ld-2.18.so
 3940c2-3940c21000 rw-p 0002 fd:01 175094 
 /usr/lib64/ld-2.18.so
 3940c21000-3940c22000 rw-p  00:00 0
 3940e0-3940fb4000 r-xp  fd:01 175095 
 /usr/lib64/libc-2.18.so
 3940fb4000-39411b4000 ---p 001b4000 fd:01 175095 
 /usr/lib64/libc-2.18.so
 39411b4000-39411b8000 r--p 001b4000 fd:01 175095 
 /usr/lib64/libc-2.18.so
 39411b8000-39411ba000 rw-p 001b8000 fd:01 175095 
 /usr/lib64/libc-2.18.so
 39411ba000-39411bf000 rw-p  00:00 0
 394120-3941203000 r-xp  fd:01 175098 
 /usr/lib64/libdl-2.18.so
 3941203000-3941402000 ---p 3000 fd:01 175098 
 /usr/lib64/libdl-2.18.so
 3941402000-3941403000 r--p 2000 fd:01 175098 
 /usr/lib64/libdl-2.18.so
 3941403000-3941404000 rw-p 3000 fd:01 175098 
 /usr/lib64/libdl-2.18.so
 7f3860277000-7f386028c000 r-xp  fd:01 184953 
 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
 7f386028c000-7f386048b000 ---p 00015000 fd:01 184953 
 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
 7f386048b000-7f386048c000 r--p 00014000 fd:01 184953 
 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
 7f386048c000-7f386048d000 rw-p 00015000 fd:01 184953 
 /usr/lib64/libgcc_s-4.8.3-20140624.so.1
 7f38604a2000-7f38604a6000 rw-p  00:00 0
 7f38604a6000-7f38604b6000 r-xp  fd:01 177014 
 /usr/lib64/libhugetlbfs.so
 7f38604b6000-7f38606b5000 ---p 0001 fd:01 177014 
 /usr/lib64/libhugetlbfs.so
 7f38606b5000-7f38606b6000 r--p f000 fd:01 177014 
 /usr/lib64/libhugetlbfs.so
 7f38606b6000-7f38606b7000 rw-p 0001 fd:01 177014 
 /usr/lib64/libhugetlbfs.so
 7f38606b7000-7f38606c2000 rw-p  00:00 0
 7f38606d5000-7f38606d6000 rw-p  00:00 0
 7f38606d6000-7f38606d8000 rw-p  00:00 0
 7fff07c44000-7fff07c65000 rw-p  00:00 0  
 [stack]
 7fff07d52000-7fff07d54000 r-xp  00:00 0  
 [vdso]
 ff60-ff601000 r-xp  00:00 0  
 [vsyscall]
 
 Is this crash the same as yours?

It's very similar in the sense that's an assert in malloc.  My error is
slightly different but I guess the exact error depends on a few runtime
parameters

 And it seems that this also happens on v3.16-rc5.
 So it might be an upstream bug, not a stable-specific matter.

That's my understanding as well. I just reported it for 3.4 and 3.14
since these were the kernels I could easily try my original test with.

 It looks strange to me that the problem is gone by removing the commit
 4a705fef98 (although I confirmed it is,) because the kernel's behavior
 shouldn't change unless (is_hugetlb_entry_migration(entry) ||
 is_hugetlb_entry_hwpoisoned(entry)) is true. And I checked with systemtap
 that both these check returned false in the above test program.
 So I'm wondering why the commit makes difference for this test program.

I don't know why I am just thinking about that now.  Isn't this the fact
that your patch moved the huge_ptep_get() before
huge_ptep_set_wrprotect() in the pte_present() cow case?

Actually, I've just tried to re-add the huge_ptep_get call for that
case and it's fixing the problem for me...

Hmm, want a patch?

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux

[PATCH v2] staging: dgnc_driver.c: code style fixes

2014-06-28 Thread Guillaume Morin
From: Guillaume Morin 

Simple code style fixes:
 - "if(" -> "if ("
 - "switch(" -> "switch ("
 - move one open brace to the line of the declaration instead of
   its own line
 - remove trailing whitespace

Signed-off-by: Guillaume Morin 
---
Changes since v1:
 - added explicit description of the fixes

 drivers/staging/dgnc/dgnc_driver.c |   11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_driver.c 
b/drivers/staging/dgnc/dgnc_driver.c
index d52a9e8..68460af 100644
--- a/drivers/staging/dgnc/dgnc_driver.c
+++ b/drivers/staging/dgnc/dgnc_driver.c
@@ -88,8 +88,7 @@ module_exit(dgnc_cleanup_module);
 /*
  * File operations permitted on Control/Management major.
  */
-static const struct file_operations dgnc_BoardFops =
-{
+static const struct file_operations dgnc_BoardFops = {
.owner  =   THIS_MODULE,
.unlocked_ioctl =   dgnc_mgmt_ioctl,
.open   =   dgnc_mgmt_open,
@@ -407,7 +406,7 @@ static void dgnc_cleanup_board(struct dgnc_board *brd)
 {
int i = 0;
 
-   if(!brd || brd->magic != DGNC_BOARD_MAGIC)
+   if (!brd || brd->magic != DGNC_BOARD_MAGIC)
return;
 
switch (brd->device) {
@@ -480,7 +479,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
/* get the board structure and prep it */
brd = dgnc_Board[dgnc_NumBoards] =
kzalloc(sizeof(*brd), GFP_KERNEL);
-   if (!brd) 
+   if (!brd)
return -ENOMEM;
 
/* make a temporary message buffer for the boot messages */
@@ -523,7 +522,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
brd->irq = pci_irq;
 
 
-   switch(brd->device) {
+   switch (brd->device) {
 
case PCI_DEVICE_CLASSIC_4_DID:
case PCI_DEVICE_CLASSIC_8_DID:
@@ -887,7 +886,7 @@ int dgnc_ms_sleep(ulong ms)
  */
 char *dgnc_ioctl_name(int cmd)
 {
-   switch(cmd) {
+   switch (cmd) {
 
case TCGETA:return "TCGETA";
case TCGETS:return "TCGETS";
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v4 1/1] staging: iio: ad9850.c: code cleanup

2014-06-28 Thread Guillaume Morin
checkpath.pl was complaining about value_mask:
ERROR: Macros with complex values should be enclosed in parenthesis

I fixed this by simply removing it since it's not used (as well as another
macro).  Got rid of the un-necessary error_ret label as well.

Signed-off-by: Guillaume Morin 
Reported-by: Michael Welling 
---
Changes since v3:
- Added Reported-by: Michael Welling 

 drivers/staging/iio/frequency/ad9850.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..8727933 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,9 +21,6 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
-#define addr_shift 12
-
 /* Register format: 4 bits addr + 12 bits value */
 struct ad9850_config {
u8 control[5];
@@ -50,9 +47,6 @@ static ssize_t ad9850_set_parameter(struct device *dev,
mutex_lock(>lock);
 
ret = spi_sync_transfer(st->sdev, , 1);
-   if (ret)
-   goto error_ret;
-error_ret:
mutex_unlock(>lock);
 
return ret ? ret : len;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/1] staging: iio: ad9850.c: code cleanup

2014-06-28 Thread Guillaume Morin
checkpath.pl was complaining about value_mask:
ERROR: Macros with complex values should be enclosed in parenthesis

I fixed this by simply removing it since it's not used (as well as another 
macro).
Got rid of the un-necessary error_ret label as well.  Both changes were 
suggested
by Michael Welling.

Signed-off-by: Guillaume Morin 
---
Changes since v2:
- Got rid of the macro altogether since it's unused
- removed unused addr_shit
- remove error_ret since it's not needed

 drivers/staging/iio/frequency/ad9850.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..8727933 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,9 +21,6 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
-#define addr_shift 12
-
 /* Register format: 4 bits addr + 12 bits value */
 struct ad9850_config {
u8 control[5];
@@ -50,9 +47,6 @@ static ssize_t ad9850_set_parameter(struct device *dev,
mutex_lock(>lock);
 
ret = spi_sync_transfer(st->sdev, , 1);
-   if (ret)
-   goto error_ret;
-error_ret:
mutex_unlock(>lock);
 
return ret ? ret : len;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/1] staging: iio: ad9850.c: code cleanup

2014-06-28 Thread Guillaume Morin
checkpath.pl was complaining about value_mask:
ERROR: Macros with complex values should be enclosed in parenthesis

I fixed this by simply removing it since it's not used (as well as another 
macro).
Got rid of the un-necessary error_ret label as well.  Both changes were 
suggested
by Michael Welling.

Signed-off-by: Guillaume Morin guilla...@morinfr.org
---
Changes since v2:
- Got rid of the macro altogether since it's unused
- removed unused addr_shit
- remove error_ret since it's not needed

 drivers/staging/iio/frequency/ad9850.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..8727933 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,9 +21,6 @@
 
 #define DRV_NAME ad9850
 
-#define value_mask (u16)0xf000
-#define addr_shift 12
-
 /* Register format: 4 bits addr + 12 bits value */
 struct ad9850_config {
u8 control[5];
@@ -50,9 +47,6 @@ static ssize_t ad9850_set_parameter(struct device *dev,
mutex_lock(st-lock);
 
ret = spi_sync_transfer(st-sdev, xfer, 1);
-   if (ret)
-   goto error_ret;
-error_ret:
mutex_unlock(st-lock);
 
return ret ? ret : len;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v4 1/1] staging: iio: ad9850.c: code cleanup

2014-06-28 Thread Guillaume Morin
checkpath.pl was complaining about value_mask:
ERROR: Macros with complex values should be enclosed in parenthesis

I fixed this by simply removing it since it's not used (as well as another
macro).  Got rid of the un-necessary error_ret label as well.

Signed-off-by: Guillaume Morin guilla...@morinfr.org
Reported-by: Michael Welling mwell...@ieee.org
---
Changes since v3:
- Added Reported-by: Michael Welling mwell...@ieee.org

 drivers/staging/iio/frequency/ad9850.c |6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..8727933 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,9 +21,6 @@
 
 #define DRV_NAME ad9850
 
-#define value_mask (u16)0xf000
-#define addr_shift 12
-
 /* Register format: 4 bits addr + 12 bits value */
 struct ad9850_config {
u8 control[5];
@@ -50,9 +47,6 @@ static ssize_t ad9850_set_parameter(struct device *dev,
mutex_lock(st-lock);
 
ret = spi_sync_transfer(st-sdev, xfer, 1);
-   if (ret)
-   goto error_ret;
-error_ret:
mutex_unlock(st-lock);
 
return ret ? ret : len;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] staging: dgnc_driver.c: code style fixes

2014-06-28 Thread Guillaume Morin
From: Guillaume Morin guilla...@morinfr.org

Simple code style fixes:
 - if( - if (
 - switch( - switch (
 - move one open brace to the line of the declaration instead of
   its own line
 - remove trailing whitespace

Signed-off-by: Guillaume Morin guilla...@morinfr.org
---
Changes since v1:
 - added explicit description of the fixes

 drivers/staging/dgnc/dgnc_driver.c |   11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_driver.c 
b/drivers/staging/dgnc/dgnc_driver.c
index d52a9e8..68460af 100644
--- a/drivers/staging/dgnc/dgnc_driver.c
+++ b/drivers/staging/dgnc/dgnc_driver.c
@@ -88,8 +88,7 @@ module_exit(dgnc_cleanup_module);
 /*
  * File operations permitted on Control/Management major.
  */
-static const struct file_operations dgnc_BoardFops =
-{
+static const struct file_operations dgnc_BoardFops = {
.owner  =   THIS_MODULE,
.unlocked_ioctl =   dgnc_mgmt_ioctl,
.open   =   dgnc_mgmt_open,
@@ -407,7 +406,7 @@ static void dgnc_cleanup_board(struct dgnc_board *brd)
 {
int i = 0;
 
-   if(!brd || brd-magic != DGNC_BOARD_MAGIC)
+   if (!brd || brd-magic != DGNC_BOARD_MAGIC)
return;
 
switch (brd-device) {
@@ -480,7 +479,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
/* get the board structure and prep it */
brd = dgnc_Board[dgnc_NumBoards] =
kzalloc(sizeof(*brd), GFP_KERNEL);
-   if (!brd) 
+   if (!brd)
return -ENOMEM;
 
/* make a temporary message buffer for the boot messages */
@@ -523,7 +522,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
brd-irq = pci_irq;
 
 
-   switch(brd-device) {
+   switch (brd-device) {
 
case PCI_DEVICE_CLASSIC_4_DID:
case PCI_DEVICE_CLASSIC_8_DID:
@@ -887,7 +886,7 @@ int dgnc_ms_sleep(ulong ms)
  */
 char *dgnc_ioctl_name(int cmd)
 {
-   switch(cmd) {
+   switch (cmd) {
 
case TCGETA:return TCGETA;
case TCGETS:return TCGETS;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging: dgnc_driver.c: code style fixes

2014-06-27 Thread Guillaume Morin
From: Guillaume Morin 

Simple code style fixes

Signed-off-by: Guillaume Morin 
---
 drivers/staging/dgnc/dgnc_driver.c |   11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_driver.c 
b/drivers/staging/dgnc/dgnc_driver.c
index d52a9e8..68460af 100644
--- a/drivers/staging/dgnc/dgnc_driver.c
+++ b/drivers/staging/dgnc/dgnc_driver.c
@@ -88,8 +88,7 @@ module_exit(dgnc_cleanup_module);
 /*
  * File operations permitted on Control/Management major.
  */
-static const struct file_operations dgnc_BoardFops =
-{
+static const struct file_operations dgnc_BoardFops = {
.owner  =   THIS_MODULE,
.unlocked_ioctl =   dgnc_mgmt_ioctl,
.open   =   dgnc_mgmt_open,
@@ -407,7 +406,7 @@ static void dgnc_cleanup_board(struct dgnc_board *brd)
 {
int i = 0;
 
-   if(!brd || brd->magic != DGNC_BOARD_MAGIC)
+   if (!brd || brd->magic != DGNC_BOARD_MAGIC)
return;
 
switch (brd->device) {
@@ -480,7 +479,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
/* get the board structure and prep it */
brd = dgnc_Board[dgnc_NumBoards] =
kzalloc(sizeof(*brd), GFP_KERNEL);
-   if (!brd) 
+   if (!brd)
return -ENOMEM;
 
/* make a temporary message buffer for the boot messages */
@@ -523,7 +522,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
brd->irq = pci_irq;
 
 
-   switch(brd->device) {
+   switch (brd->device) {
 
case PCI_DEVICE_CLASSIC_4_DID:
case PCI_DEVICE_CLASSIC_8_DID:
@@ -887,7 +886,7 @@ int dgnc_ms_sleep(ulong ms)
  */
 char *dgnc_ioctl_name(int cmd)
 {
-   switch(cmd) {
+   switch (cmd) {
 
case TCGETA:return "TCGETA";
case TCGETS:return "TCGETS";
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
On 27 Jun 22:37, Greg Kroah-Hartman wrote:
> Put that below the --- line.

Will do.

> > > And what checkpatch error did you fix?  And are you sure it needs to be
> > > fixed?
> > 
> > That's what I changed:
> > 
> > $ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
> > ERROR: Macros with complex values should be enclosed in parenthesis
> 
> Then why didn't you say that :)

Well it was not totally clear to me if that was obvious or not.  Anyway,
I'll mention it in the future.

> 
> > I assumed that if it was reported as an error, it needed to be fixed...
> 
> Use your judgement, checkpatch is a tool, it isn't always correct.

Right, I guess it's borderline.  Should I resend the patch or just drop
it?

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
On 27 Jun 19:09, Greg Kroah-Hartman wrote:
> > v2: add missing Signed-off-by 
> 
> That doesn't go here.

I guess I am struggling to get git send-email do what I want

> And what checkpatch error did you fix?  And are you sure it needs to be
> fixed?

That's what I changed:

$ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
ERROR: Macros with complex values should be enclosed in parenthesis
#24: FILE: drivers/staging/iio/frequency/ad9850.c:24:
+#define value_mask (u16)0xf000

I assumed that if it was reported as an error, it needed to be fixed...

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
v2: add missing Signed-off-by 

Signed-off-by: Guillaume Morin 

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME "ad9850"
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME ad9850
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
v2: add missing Signed-off-by 

Signed-off-by: Guillaume Morin guilla...@morinfr.org

diff --git a/drivers/staging/iio/frequency/ad9850.c 
b/drivers/staging/iio/frequency/ad9850.c
index af877ff..6183670 100644
--- a/drivers/staging/iio/frequency/ad9850.c
+++ b/drivers/staging/iio/frequency/ad9850.c
@@ -21,7 +21,7 @@
 
 #define DRV_NAME ad9850
 
-#define value_mask (u16)0xf000
+#define value_mask ((u16)0xf000)
 #define addr_shift 12
 
 /* Register format: 4 bits addr + 12 bits value */
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
On 27 Jun 19:09, Greg Kroah-Hartman wrote:
  v2: add missing Signed-off-by 
 
 That doesn't go here.

I guess I am struggling to get git send-email do what I want

 And what checkpatch error did you fix?  And are you sure it needs to be
 fixed?

That's what I changed:

$ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
ERROR: Macros with complex values should be enclosed in parenthesis
#24: FILE: drivers/staging/iio/frequency/ad9850.c:24:
+#define value_mask (u16)0xf000

I assumed that if it was reported as an error, it needed to be fixed...

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] staging: iio: ad9850.c: fix checkpatch.pl error

2014-06-27 Thread Guillaume Morin
On 27 Jun 22:37, Greg Kroah-Hartman wrote:
 Put that below the --- line.

Will do.

   And what checkpatch error did you fix?  And are you sure it needs to be
   fixed?
  
  That's what I changed:
  
  $ scripts/checkpatch.pl -f drivers/staging/iio/frequency/ad9850.c
  ERROR: Macros with complex values should be enclosed in parenthesis
 
 Then why didn't you say that :)

Well it was not totally clear to me if that was obvious or not.  Anyway,
I'll mention it in the future.

 
  I assumed that if it was reported as an error, it needed to be fixed...
 
 Use your judgement, checkpatch is a tool, it isn't always correct.

Right, I guess it's borderline.  Should I resend the patch or just drop
it?

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging: dgnc_driver.c: code style fixes

2014-06-27 Thread Guillaume Morin
From: Guillaume Morin guilla...@morinfr.org

Simple code style fixes

Signed-off-by: Guillaume Morin guilla...@morinfr.org
---
 drivers/staging/dgnc/dgnc_driver.c |   11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/dgnc/dgnc_driver.c 
b/drivers/staging/dgnc/dgnc_driver.c
index d52a9e8..68460af 100644
--- a/drivers/staging/dgnc/dgnc_driver.c
+++ b/drivers/staging/dgnc/dgnc_driver.c
@@ -88,8 +88,7 @@ module_exit(dgnc_cleanup_module);
 /*
  * File operations permitted on Control/Management major.
  */
-static const struct file_operations dgnc_BoardFops =
-{
+static const struct file_operations dgnc_BoardFops = {
.owner  =   THIS_MODULE,
.unlocked_ioctl =   dgnc_mgmt_ioctl,
.open   =   dgnc_mgmt_open,
@@ -407,7 +406,7 @@ static void dgnc_cleanup_board(struct dgnc_board *brd)
 {
int i = 0;
 
-   if(!brd || brd-magic != DGNC_BOARD_MAGIC)
+   if (!brd || brd-magic != DGNC_BOARD_MAGIC)
return;
 
switch (brd-device) {
@@ -480,7 +479,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
/* get the board structure and prep it */
brd = dgnc_Board[dgnc_NumBoards] =
kzalloc(sizeof(*brd), GFP_KERNEL);
-   if (!brd) 
+   if (!brd)
return -ENOMEM;
 
/* make a temporary message buffer for the boot messages */
@@ -523,7 +522,7 @@ static int dgnc_found_board(struct pci_dev *pdev, int id)
brd-irq = pci_irq;
 
 
-   switch(brd-device) {
+   switch (brd-device) {
 
case PCI_DEVICE_CLASSIC_4_DID:
case PCI_DEVICE_CLASSIC_8_DID:
@@ -887,7 +886,7 @@ int dgnc_ms_sleep(ulong ms)
  */
 char *dgnc_ioctl_name(int cmd)
 {
-   switch(cmd) {
+   switch (cmd) {
 
case TCGETA:return TCGETA;
case TCGETS:return TCGETS;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] selftests: ipc: handle msgget failure return correctly

2014-03-05 Thread Guillaume Morin
On 05 Mar 17:51, Colin King wrote:
> From: Colin Ian King 
> 
> A failed msgget causes the test to return an uninitialised value
> in ret. Assign ret to -errno on error exit.
> 
> Signed-off-by: Colin Ian King 
> ---
>  tools/testing/selftests/ipc/msgque.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/ipc/msgque.c 
> b/tools/testing/selftests/ipc/msgque.c
> index d664182..be8f294 100644
> --- a/tools/testing/selftests/ipc/msgque.c
> +++ b/tools/testing/selftests/ipc/msgque.c
> @@ -202,6 +202,7 @@ int main(int argc, char **argv)
>   msgque.msq_id = msgget(msgque.key, IPC_CREAT | IPC_EXCL | 0666);
>   if (msgque.msq_id == -1) {
>   printf("Can't create queue\n");
> + err = -errno;
>   goto err_out;
>   }
>  
> -- 

Maybe I am nitpicking here but printf() could modify errno, so you might
as well save it before printf() is called.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] selftests: ipc: handle msgget failure return correctly

2014-03-05 Thread Guillaume Morin
On 05 Mar 17:51, Colin King wrote:
 From: Colin Ian King colin.k...@canonical.com
 
 A failed msgget causes the test to return an uninitialised value
 in ret. Assign ret to -errno on error exit.
 
 Signed-off-by: Colin Ian King colin.k...@canonical.com
 ---
  tools/testing/selftests/ipc/msgque.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/tools/testing/selftests/ipc/msgque.c 
 b/tools/testing/selftests/ipc/msgque.c
 index d664182..be8f294 100644
 --- a/tools/testing/selftests/ipc/msgque.c
 +++ b/tools/testing/selftests/ipc/msgque.c
 @@ -202,6 +202,7 @@ int main(int argc, char **argv)
   msgque.msq_id = msgget(msgque.key, IPC_CREAT | IPC_EXCL | 0666);
   if (msgque.msq_id == -1) {
   printf(Can't create queue\n);
 + err = -errno;
   goto err_out;
   }
  
 -- 

Maybe I am nitpicking here but printf() could modify errno, so you might
as well save it before printf() is called.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + exitc-call-proc_exit_connector-after-exit_state-is-set.patch added to -mm tree

2014-02-27 Thread Guillaume Morin
On 27 Feb 17:47, Oleg Nesterov wrote:
> Anyway, I only tried to say that "a small window between when the event
> is delivered and the child become wait()-able." is not closed by this
> patch. Sorry for not being clear enough.

Ok, thanks for it making it clear.

> Nevermind. Please consider this trivial example:
> 
>   tfunc(void *)
>   {
>   for (;;)
>   pause();
>   }
> 
>   int main(void)
>   {
>   pthread_create(tfunc);
>   pthread_exit();
>   }
> 
> The main thread can exit and call proc_exit_connector() before
> register_interest_for_pid(), but WNOHANG obviously can't succeed.
 
What matters is not the exit message of the main thread but the exit
message from the last threaded dying.  In your example, it's fine that
waitpid fails since the process is still around.  If you kill it, the
connector will get a connector message for the thread you created in
main().

> But let me repeat just in case: I am not arguing with this change.

That was my understanding from my message.  But I wanted to respond to
make sure I understood your points correctly.

Thanks for your help.  I appreciate it.

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + exitc-call-proc_exit_connector-after-exit_state-is-set.patch added to -mm tree

2014-02-27 Thread Guillaume Morin
On 25 Feb 16:10, Oleg Nesterov wrote:
> > pid_t pid = fork();
> > if (pid > 0) {
> > register_interest_for_pid(pid);
> > if (waitpid(pid, NULL, WNOHANG) > 0)
> > {
> >   /* We might have raced with exit() */
> > }
> 
> Just in case... Even with this patch the code above is still "racy" if the
> child is multi-threaded. Plus it should obviously filter-out subthreads.
> And afaics there is no way to make it reliable, even if you change the
> code above so that waitpid() is called only after the last thread exits
> WNOHANG still can fail.
> Not that I am not arguing with this change. Although I hope that someone
> can confirm that netlink_broadcast() is safe even if release_task(current)
> was already called, so that the caller has no pids, sighand, is not visible
> via /proc/, etc.

I was too succinct, I think.  What I am trying to do is to close a race
when a short-lived *process* dies before register_interest_for_pid()
interprets the connector message correctly, (i.e realizes this is an
exit message for a pid that the parent created).

For example, let's say that the parent has an independent thread that
just reads from the netlink socket or uses a BPF filter to see only the
events it cares about.  In that case, it's possible that the exit
connector message will be discarded (either by a reader thread or the
BPF filter) before the parent realizes it should care about messages
about a new pid (the child pid)

You clarified for me that a ptraced process is a case where this race
could still happen.  That's a good point.  Fortunately, in the case of a
short-lived process, this is not a common scenario.

If we ignore the ptrace() case, I am not sure I see the problem with
multithreaded processes.  Even if the main thread exits right away, what is
important is that:
- *either* the exit connector message of the last thread that dies is be
  seen after register_interest_for_pid completes
- *or* that waitpid(WNOHANG) succeeds right after
  register_interest_for_pid()

You seem to say it's possible for all threads to have completed
exit_notify() and sent their exit message to the connector before
register_interest_for_pid() does its job and still have waitpid(WNOHANG)
fails.  Is it correct?  If so, could you give a bit more details on how
this could happen?

My understanding is that if all threads exited before waitpid() is
called, exit->state will be set to EXIT_ZOMBIE for the pid and that
delay_group_leader() will be false (because all sub-threads have
exited), so that waitpid(WNOHANG) will successfully reap the process.
What am I missing?

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + exitc-call-proc_exit_connector-after-exit_state-is-set.patch added to -mm tree

2014-02-27 Thread Guillaume Morin
On 25 Feb 16:10, Oleg Nesterov wrote:
  pid_t pid = fork();
  if (pid  0) {
  register_interest_for_pid(pid);
  if (waitpid(pid, NULL, WNOHANG)  0)
  {
/* We might have raced with exit() */
  }
 
 Just in case... Even with this patch the code above is still racy if the
 child is multi-threaded. Plus it should obviously filter-out subthreads.
 And afaics there is no way to make it reliable, even if you change the
 code above so that waitpid() is called only after the last thread exits
 WNOHANG still can fail.
 Not that I am not arguing with this change. Although I hope that someone
 can confirm that netlink_broadcast() is safe even if release_task(current)
 was already called, so that the caller has no pids, sighand, is not visible
 via /proc/, etc.

I was too succinct, I think.  What I am trying to do is to close a race
when a short-lived *process* dies before register_interest_for_pid()
interprets the connector message correctly, (i.e realizes this is an
exit message for a pid that the parent created).

For example, let's say that the parent has an independent thread that
just reads from the netlink socket or uses a BPF filter to see only the
events it cares about.  In that case, it's possible that the exit
connector message will be discarded (either by a reader thread or the
BPF filter) before the parent realizes it should care about messages
about a new pid (the child pid)

You clarified for me that a ptraced process is a case where this race
could still happen.  That's a good point.  Fortunately, in the case of a
short-lived process, this is not a common scenario.

If we ignore the ptrace() case, I am not sure I see the problem with
multithreaded processes.  Even if the main thread exits right away, what is
important is that:
- *either* the exit connector message of the last thread that dies is be
  seen after register_interest_for_pid completes
- *or* that waitpid(WNOHANG) succeeds right after
  register_interest_for_pid()

You seem to say it's possible for all threads to have completed
exit_notify() and sent their exit message to the connector before
register_interest_for_pid() does its job and still have waitpid(WNOHANG)
fails.  Is it correct?  If so, could you give a bit more details on how
this could happen?

My understanding is that if all threads exited before waitpid() is
called, exit-state will be set to EXIT_ZOMBIE for the pid and that
delay_group_leader() will be false (because all sub-threads have
exited), so that waitpid(WNOHANG) will successfully reap the process.
What am I missing?

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + exitc-call-proc_exit_connector-after-exit_state-is-set.patch added to -mm tree

2014-02-27 Thread Guillaume Morin
On 27 Feb 17:47, Oleg Nesterov wrote:
 Anyway, I only tried to say that a small window between when the event
 is delivered and the child become wait()-able. is not closed by this
 patch. Sorry for not being clear enough.

Ok, thanks for it making it clear.

 Nevermind. Please consider this trivial example:
 
   tfunc(void *)
   {
   for (;;)
   pause();
   }
 
   int main(void)
   {
   pthread_create(tfunc);
   pthread_exit();
   }
 
 The main thread can exit and call proc_exit_connector() before
 register_interest_for_pid(), but WNOHANG obviously can't succeed.
 
What matters is not the exit message of the main thread but the exit
message from the last threaded dying.  In your example, it's fine that
waitpid fails since the process is still around.  If you kill it, the
connector will get a connector message for the thread you created in
main().

 But let me repeat just in case: I am not arguing with this change.

That was my understanding from my message.  But I wanted to respond to
make sure I understood your points correctly.

Thanks for your help.  I appreciate it.

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] exit.c: call proc_exit_connector() after exit_state is set

2014-02-24 Thread Guillaume Morin
From: Guillaume Morin 

The process events connector delivers a notification when a process
exits.  This is really convenient for a process that spawns and wants
to monitor its children through an epoll-able() interface.

Unfortunately, there is a small window between when the event is
delivered and the child become wait()-able.

This is creates a race if the parent wants to make sure that it knows
about the exit, e.g

pid_t pid = fork();
if (pid > 0) {
register_interest_for_pid(pid);
if (waitpid(pid, NULL, WNOHANG) > 0)
{
  /* We might have raced with exit() */
}
return;
}

/* Child */
execve(...)

register_interest_for_pid() would be telling the the connector socket
reader to pay attention to events related to pid.

Though this is not a bug, I think it would make the connector a bit
more usable if this race was closed by simply moving the call to
proc_exit_connector() from just before exit_notify() to right after.

Signed-off-by: Guillaume Morin 
Cc: matt.hels...@gmail.com
---
diff --git a/kernel/exit.c b/kernel/exit.c
index 1e77fc6..9b0ac8c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -804,7 +804,6 @@ void do_exit(long code)
 
module_put(task_thread_info(tsk)->exec_domain->module);
 
-   proc_exit_connector(tsk);
 
/*
 * FIXME: do that only when needed, using sched_exit tracepoint
@@ -812,6 +811,7 @@ void do_exit(long code)
flush_ptrace_hw_breakpoint(tsk);
 
exit_notify(tsk, group_dead);
+   proc_exit_connector(tsk);
 #ifdef CONFIG_NUMA
task_lock(tsk);
mpol_put(tsk->mempolicy);



-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: Bad page state in process with linux 3.4.76

2014-02-24 Thread Guillaume Morin
On 24 Feb 12:39, Jan Kara wrote:
>   I'm going through some old emails... Did this get resolved with later 3.4
> stable kernels? If not, I guess you should ping Greg / Khalid to either
> revert that commit (I guess preferable given the nature of the change) or
> merge some additional fixup...

Yes, it did get resolved.  Khalid backported
27c73ae759774e63313c1fbfeb17ba076cea64c5 which fixed the problem in the
mainline kernel and it was released in 3.4.79 as
50d8f1b5c57bb29f02ab5834be334b4f7922b856 (and included the other stable
branches as well).

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: Bad page state in process with linux 3.4.76

2014-02-24 Thread Guillaume Morin
On 24 Feb 12:39, Jan Kara wrote:
   I'm going through some old emails... Did this get resolved with later 3.4
 stable kernels? If not, I guess you should ping Greg / Khalid to either
 revert that commit (I guess preferable given the nature of the change) or
 merge some additional fixup...

Yes, it did get resolved.  Khalid backported
27c73ae759774e63313c1fbfeb17ba076cea64c5 which fixed the problem in the
mainline kernel and it was released in 3.4.79 as
50d8f1b5c57bb29f02ab5834be334b4f7922b856 (and included the other stable
branches as well).

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] exit.c: call proc_exit_connector() after exit_state is set

2014-02-24 Thread Guillaume Morin
From: Guillaume Morin guilla...@morinfr.org

The process events connector delivers a notification when a process
exits.  This is really convenient for a process that spawns and wants
to monitor its children through an epoll-able() interface.

Unfortunately, there is a small window between when the event is
delivered and the child become wait()-able.

This is creates a race if the parent wants to make sure that it knows
about the exit, e.g

pid_t pid = fork();
if (pid  0) {
register_interest_for_pid(pid);
if (waitpid(pid, NULL, WNOHANG)  0)
{
  /* We might have raced with exit() */
}
return;
}

/* Child */
execve(...)

register_interest_for_pid() would be telling the the connector socket
reader to pay attention to events related to pid.

Though this is not a bug, I think it would make the connector a bit
more usable if this race was closed by simply moving the call to
proc_exit_connector() from just before exit_notify() to right after.

Signed-off-by: Guillaume Morin guilla...@morinfr.org
Cc: matt.hels...@gmail.com
---
diff --git a/kernel/exit.c b/kernel/exit.c
index 1e77fc6..9b0ac8c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -804,7 +804,6 @@ void do_exit(long code)
 
module_put(task_thread_info(tsk)-exec_domain-module);
 
-   proc_exit_connector(tsk);
 
/*
 * FIXME: do that only when needed, using sched_exit tracepoint
@@ -812,6 +811,7 @@ void do_exit(long code)
flush_ptrace_hw_breakpoint(tsk);
 
exit_notify(tsk, group_dead);
+   proc_exit_connector(tsk);
 #ifdef CONFIG_NUMA
task_lock(tsk);
mpol_put(tsk-mempolicy);



-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3.4 00/37] 3.4.79-stable review

2014-02-04 Thread Guillaume Morin
On 04 Feb 14:11, Greg Kroah-Hartman wrote:
> > > Borislav Petkov 
> > > rtc-cmos: Add an alarm disable quirk
> > 
> > This is just one of the 3 patches posted by John Stultz on
> > http://git.linaro.org/people/john.stultz/linux.git/shortlog/refs/heads/stable/3.4/timefix
> 
> I have no idea what that tree is, nor what I am supposed to do with it.

Ok, sorry. This is all coming from 
http://article.gmane.org/gmane.linux.kernel.stable/77575/

I assumed you were aware of it because you participated in the thread
and one of the patch mentioned in this message (Borislav's change) is
applied to the 3.4 branch

> > The other 2 are backports of 6fdda9a9c5db367130cf32df5d6618d08b89f46a
> > (timekeeping: Avoid possible deadlock from clock_was_set_delayed) and
> > ec145babe754f9ea1079034a108104b6001e001c (timekeeping: fix 32-bit
> > overflow in get_monotonic_boottime)
> > 
> > Could you explain why there were not included?
> 
> Can you explain why I would include them?

:) they seem to fix important issues and you seemed to be ok with
including them in stable trees at some point:
http://article.gmane.org/gmane.linux.kernel.stable/77261/

Note that I am not pushing for anything.  I was just under the
(incorrect) impression that you would apply them to the 3.4 branch and
was surprised when it did not happen

I hope the situation is a bit clearer.

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3.4 00/37] 3.4.79-stable review

2014-02-04 Thread Guillaume Morin
On 04 Feb 13:00, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.4.79 release.
> There are 37 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> (snip)
> Borislav Petkov 
> rtc-cmos: Add an alarm disable quirk

This is just one of the 3 patches posted by John Stultz on
http://git.linaro.org/people/john.stultz/linux.git/shortlog/refs/heads/stable/3.4/timefix

The other 2 are backports of 6fdda9a9c5db367130cf32df5d6618d08b89f46a
(timekeeping: Avoid possible deadlock from clock_was_set_delayed) and
ec145babe754f9ea1079034a108104b6001e001c (timekeeping: fix 32-bit
overflow in get_monotonic_boottime)

Could you explain why there were not included?  Are you simply planning
to include these in 3.4.80?

Sorry if this was discussed somewhere but the original thread does not
say when/if they are going to be included in 3.4.  If this information
is somewhere else, could you just point me to the discussion?

Thanks in advance for your help,

Guillaume.

-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3.4 00/37] 3.4.79-stable review

2014-02-04 Thread Guillaume Morin
On 04 Feb 13:00, Greg Kroah-Hartman wrote:
 This is the start of the stable review cycle for the 3.4.79 release.
 There are 37 patches in this series, all will be posted as a response
 to this one.  If anyone has any issues with these being applied, please
 let me know.
 (snip)
 Borislav Petkov b...@alien8.de
 rtc-cmos: Add an alarm disable quirk

This is just one of the 3 patches posted by John Stultz on
http://git.linaro.org/people/john.stultz/linux.git/shortlog/refs/heads/stable/3.4/timefix

The other 2 are backports of 6fdda9a9c5db367130cf32df5d6618d08b89f46a
(timekeeping: Avoid possible deadlock from clock_was_set_delayed) and
ec145babe754f9ea1079034a108104b6001e001c (timekeeping: fix 32-bit
overflow in get_monotonic_boottime)

Could you explain why there were not included?  Are you simply planning
to include these in 3.4.80?

Sorry if this was discussed somewhere but the original thread does not
say when/if they are going to be included in 3.4.  If this information
is somewhere else, could you just point me to the discussion?

Thanks in advance for your help,

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3.4 00/37] 3.4.79-stable review

2014-02-04 Thread Guillaume Morin
On 04 Feb 14:11, Greg Kroah-Hartman wrote:
   Borislav Petkov b...@alien8.de
   rtc-cmos: Add an alarm disable quirk
  
  This is just one of the 3 patches posted by John Stultz on
  http://git.linaro.org/people/john.stultz/linux.git/shortlog/refs/heads/stable/3.4/timefix
 
 I have no idea what that tree is, nor what I am supposed to do with it.

Ok, sorry. This is all coming from 
http://article.gmane.org/gmane.linux.kernel.stable/77575/

I assumed you were aware of it because you participated in the thread
and one of the patch mentioned in this message (Borislav's change) is
applied to the 3.4 branch

  The other 2 are backports of 6fdda9a9c5db367130cf32df5d6618d08b89f46a
  (timekeeping: Avoid possible deadlock from clock_was_set_delayed) and
  ec145babe754f9ea1079034a108104b6001e001c (timekeeping: fix 32-bit
  overflow in get_monotonic_boottime)
  
  Could you explain why there were not included?
 
 Can you explain why I would include them?

:) they seem to fix important issues and you seemed to be ok with
including them in stable trees at some point:
http://article.gmane.org/gmane.linux.kernel.stable/77261/

Note that I am not pushing for anything.  I was just under the
(incorrect) impression that you would apply them to the 3.4 branch and
was surprised when it did not happen

I hope the situation is a bit clearer.

Guillaume.

-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: BUG: Bad page state in process with linux 3.4.76

2014-01-14 Thread Guillaume Morin
Greg,

I am going to do more testing but it seems that reverting this patch
from 3.4.69 fixes the BUG
commit b07ef016454ff46f98e633b5a6247ca7e343fb67
Author: Khalid Aziz 

I also verified that I cannot reproduce this problem with 3.13-rc8

Guillaume.

On 14 Jan 21:34, Guillaume Morin wrote:
>
> Hi,
> 
> I wrote this simple program (attached) to play around with kernel AIO.
> It simply does kernel AIO with O_DIRECT on a small temp file stored on
> an ext4 filesystem.
> 
> When I run it with "HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so", it
> triggers the kernel bug on exit every time.
> 
> Removing HUGETLB_MORECORE from the command line fixes the problem.  Note
> that my kernel does not use THP, it is NOT compiled with
> CONFIG_TRANSPARENT_HUGEPAGE.
> 
> I've tried it only with this 3.4.76 but I've been able to reproduce it without
> any issue on multiple machines running the same kernel.
> 
> BUG: Bad page state in process aio_test  pfn:1b7a01
> page:ea0006de8040 count:0 mapcount:1 mapping:  (null) index:0x0
> page flags: 0x208000(tail)
> Modules linked in: nfsd exportfs nfs nfs_acl auth_rpcgss fscache lockd sunrpc
> rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_cm ib_sa ib_mad ib_core ipmi_si
> ipmi_devintf coretemp pcspkr microcode serio_raw i2c_i801 ioatdma i2c_core dca
> dm_mod sg sr_mod cdrom crc32c_intel ahci libahci [last unloaded: 
> scsi_wait_scan]
> Pid: 4441, comm: aio_test Not tainted 3.4.76bug #1
> Call Trace:
> [] ? is_free_buddy_page+0xa0/0xd0
> [] bad_page+0xe6/0xfc
> [] free_pages_prepare+0xfc/0x110
> [] __free_pages_ok+0x2f/0xd0
> [] __free_pages+0x20/0x40
> [] update_and_free_page+0x77/0x80
> [] free_huge_page+0x16e/0x180
> [] __put_compound_page+0x20/0x50
> [] put_compound_page+0x78/0x140
> [] put_page+0x36/0x40
> [] __unmap_hugepage_range+0x1ce/0x230
> [] unmap_hugepage_range+0x51/0x90
> [] unmap_single_vma+0x730/0x740
> [] unmap_vmas+0x5f/0x80
> [] exit_mmap+0xbc/0x130
> [] ? kmem_cache_free+0x20/0xe0
> [] mmput+0x35/0xf0
> [] exit_mm+0xfd/0x120
> [] do_exit+0x16c/0x8b0
> [] ? mntput+0x24/0x40
> [] ? fput+0x192/0x250
> [] do_group_exit+0x3f/0xa0
> [] sys_exit_group+0x17/0x20
> [] system_call_fastpath+0x16/0x1b
> 


-- 
Guillaume Morin 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


BUG: Bad page state in process with linux 3.4.76

2014-01-14 Thread Guillaume Morin
Hi,

I wrote this simple program (attached) to play around with kernel AIO.
It simply does kernel AIO with O_DIRECT on a small temp file stored on
an ext4 filesystem.

When I run it with "HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so", it
triggers the kernel bug on exit every time.

Removing HUGETLB_MORECORE from the command line fixes the problem.  Note
that my kernel does not use THP, it is NOT compiled with
CONFIG_TRANSPARENT_HUGEPAGE.

I've tried it only with this 3.4.76 but I've been able to reproduce it without
any issue on multiple machines running the same kernel.

BUG: Bad page state in process aio_test  pfn:1b7a01
page:ea0006de8040 count:0 mapcount:1 mapping:  (null) index:0x0
page flags: 0x208000(tail)
Modules linked in: nfsd exportfs nfs nfs_acl auth_rpcgss fscache lockd sunrpc
rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_cm ib_sa ib_mad ib_core ipmi_si
ipmi_devintf coretemp pcspkr microcode serio_raw i2c_i801 ioatdma i2c_core dca
dm_mod sg sr_mod cdrom crc32c_intel ahci libahci [last unloaded: scsi_wait_scan]
Pid: 4441, comm: aio_test Not tainted 3.4.76bug #1
Call Trace:
[] ? is_free_buddy_page+0xa0/0xd0
[] bad_page+0xe6/0xfc
[] free_pages_prepare+0xfc/0x110
[] __free_pages_ok+0x2f/0xd0
[] __free_pages+0x20/0x40
[] update_and_free_page+0x77/0x80
[] free_huge_page+0x16e/0x180
[] __put_compound_page+0x20/0x50
[] put_compound_page+0x78/0x140
[] put_page+0x36/0x40
[] __unmap_hugepage_range+0x1ce/0x230
[] unmap_hugepage_range+0x51/0x90
[] unmap_single_vma+0x730/0x740
[] unmap_vmas+0x5f/0x80
[] exit_mmap+0xbc/0x130
[] ? kmem_cache_free+0x20/0xe0
[] mmput+0x35/0xf0
[] exit_mm+0xfd/0x120
[] do_exit+0x16c/0x8b0
[] ? mntput+0x24/0x40
[] ? fput+0x192/0x250
[] do_group_exit+0x3f/0xa0
[] sys_exit_group+0x17/0x20
[] system_call_fastpath+0x16/0x1b


-- 
Guillaume Morin 
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define FILE_SIZE 4096

int main(void)
{
io_context_t ctx;
int fd,fd_odirect,i,event_fd,epoll_fd;
struct epoll_event ev;
void *buf;
size_t offset = 0;
struct iocb cb;
struct iocb * cbs[1] = {  };

fd = open("/tmp/foo",O_RDWR|O_CREAT);
if (fd == -1) {
perror("open");
return 1;
}
for (i = 0; i < FILE_SIZE; ++i) {
char c = rand() % 255;
write(fd, , 1);
}
close(fd);

fd_odirect = open("/tmp/foo",O_RDONLY|O_DIRECT);
if (fd_odirect == -1) {
perror("open");
return 1;
}
memset(, 0, sizeof(ctx));
if (0 != io_queue_init(1, )) {
perror("ctx");
return 1;
}
event_fd = eventfd(0, EFD_CLOEXEC);
if (event_fd == -1) {
perror("eventfd");
return -1;
}

epoll_fd = epoll_create(1);
if (epoll_fd == -1) {
perror("epoll_fd");
return 1;
}

ev.events = EPOLLIN;

if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, ) == -1) {
perror("epoll_ctl");
return 1;
}

posix_memalign(, 512, 32768);

while (1) {
struct timespec ts = { 0, 0 };
struct io_event ioev;
int ret;
long v;
io_prep_pread(, fd_odirect, buf + offset, 512, offset);
io_set_eventfd(, event_fd);
if (1 != io_submit(ctx, 1, cbs)) {
perror("io_submit");
return 1;
}

ret = epoll_wait(epoll_fd, , 1, -1);
if (ret != 1) {
perror("epoll_wait");
}

read(event_fd, , 8);

printf("event_fd returned %ld\n", v);


if (io_getevents(ctx, 1, 1, , ) != 1) {
perror("io_getevents");
return 1;
}

printf("Read 1 res %ld res2 %ld\n", ioev.res, ioev.res2);
offset += ioev.res;

if (ioev.res == 0) {
break;
} 
if ((offset + 512) > 32768) {
puts("ERROR - reading past buffer");
return 1;
}
}

free(buf);
io_destroy(ctx);
close(event_fd);
close(epoll_fd);
close(fd_odirect);

return 0;
}


BUG: Bad page state in process with linux 3.4.76

2014-01-14 Thread Guillaume Morin
Hi,

I wrote this simple program (attached) to play around with kernel AIO.
It simply does kernel AIO with O_DIRECT on a small temp file stored on
an ext4 filesystem.

When I run it with HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so, it
triggers the kernel bug on exit every time.

Removing HUGETLB_MORECORE from the command line fixes the problem.  Note
that my kernel does not use THP, it is NOT compiled with
CONFIG_TRANSPARENT_HUGEPAGE.

I've tried it only with this 3.4.76 but I've been able to reproduce it without
any issue on multiple machines running the same kernel.

BUG: Bad page state in process aio_test  pfn:1b7a01
page:ea0006de8040 count:0 mapcount:1 mapping:  (null) index:0x0
page flags: 0x208000(tail)
Modules linked in: nfsd exportfs nfs nfs_acl auth_rpcgss fscache lockd sunrpc
rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_cm ib_sa ib_mad ib_core ipmi_si
ipmi_devintf coretemp pcspkr microcode serio_raw i2c_i801 ioatdma i2c_core dca
dm_mod sg sr_mod cdrom crc32c_intel ahci libahci [last unloaded: scsi_wait_scan]
Pid: 4441, comm: aio_test Not tainted 3.4.76bug #1
Call Trace:
[810f3300] ? is_free_buddy_page+0xa0/0xd0
[814c0791] bad_page+0xe6/0xfc
[810f3dbc] free_pages_prepare+0xfc/0x110
[810f3dff] __free_pages_ok+0x2f/0xd0
[810f4080] __free_pages+0x20/0x40
[81124737] update_and_free_page+0x77/0x80
[8112633e] free_huge_page+0x16e/0x180
[810f8030] __put_compound_page+0x20/0x50
[810f8108] put_compound_page+0x78/0x140
[810f8546] put_page+0x36/0x40
[81126ede] __unmap_hugepage_range+0x1ce/0x230
[81127331] unmap_hugepage_range+0x51/0x90
[8110e880] unmap_single_vma+0x730/0x740
[8110f05f] unmap_vmas+0x5f/0x80
[8111672c] exit_mmap+0xbc/0x130
[8112e170] ? kmem_cache_free+0x20/0xe0
[81035155] mmput+0x35/0xf0
[8103a58d] exit_mm+0xfd/0x120
[8103bb6c] do_exit+0x16c/0x8b0
[811540c4] ? mntput+0x24/0x40
[81138962] ? fput+0x192/0x250
[8103c5ff] do_group_exit+0x3f/0xa0
[8103c677] sys_exit_group+0x17/0x20
[814d03d2] system_call_fastpath+0x16/0x1b


-- 
Guillaume Morin guilla...@morinfr.org
#define _GNU_SOURCE
#include libaio.h
#include errno.h
#include unistd.h
#include sys/types.h
#include sys/stat.h
#include fcntl.h
#include sys/eventfd.h
#include sys/epoll.h
#include sys/mman.h
#include stdio.h
#include stdlib.h

#define FILE_SIZE 4096

int main(void)
{
io_context_t ctx;
int fd,fd_odirect,i,event_fd,epoll_fd;
struct epoll_event ev;
void *buf;
size_t offset = 0;
struct iocb cb;
struct iocb * cbs[1] = { cb };

fd = open(/tmp/foo,O_RDWR|O_CREAT);
if (fd == -1) {
perror(open);
return 1;
}
for (i = 0; i  FILE_SIZE; ++i) {
char c = rand() % 255;
write(fd, c, 1);
}
close(fd);

fd_odirect = open(/tmp/foo,O_RDONLY|O_DIRECT);
if (fd_odirect == -1) {
perror(open);
return 1;
}
memset(ctx, 0, sizeof(ctx));
if (0 != io_queue_init(1, ctx)) {
perror(ctx);
return 1;
}
event_fd = eventfd(0, EFD_CLOEXEC);
if (event_fd == -1) {
perror(eventfd);
return -1;
}

epoll_fd = epoll_create(1);
if (epoll_fd == -1) {
perror(epoll_fd);
return 1;
}

ev.events = EPOLLIN;

if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, ev) == -1) {
perror(epoll_ctl);
return 1;
}

posix_memalign(buf, 512, 32768);

while (1) {
struct timespec ts = { 0, 0 };
struct io_event ioev;
int ret;
long v;
io_prep_pread(cb, fd_odirect, buf + offset, 512, offset);
io_set_eventfd(cb, event_fd);
if (1 != io_submit(ctx, 1, cbs)) {
perror(io_submit);
return 1;
}

ret = epoll_wait(epoll_fd, ev, 1, -1);
if (ret != 1) {
perror(epoll_wait);
}

read(event_fd, v, 8);

printf(event_fd returned %ld\n, v);


if (io_getevents(ctx, 1, 1, ioev, ts) != 1) {
perror(io_getevents);
return 1;
}

printf(Read 1 res %ld res2 %ld\n, ioev.res, ioev.res2);
offset += ioev.res;

if (ioev.res == 0) {
break;
} 
if ((offset + 512)  32768) {
puts(ERROR - reading past buffer);
return 1;
}
}

free(buf);
io_destroy(ctx);
close(event_fd);
close(epoll_fd);
close(fd_odirect);

return 0;
}


Re: BUG: Bad page state in process with linux 3.4.76

2014-01-14 Thread Guillaume Morin
Greg,

I am going to do more testing but it seems that reverting this patch
from 3.4.69 fixes the BUG
commit b07ef016454ff46f98e633b5a6247ca7e343fb67
Author: Khalid Aziz khalid.a...@oracle.com

I also verified that I cannot reproduce this problem with 3.13-rc8

Guillaume.

On 14 Jan 21:34, Guillaume Morin wrote:

 Hi,
 
 I wrote this simple program (attached) to play around with kernel AIO.
 It simply does kernel AIO with O_DIRECT on a small temp file stored on
 an ext4 filesystem.
 
 When I run it with HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so, it
 triggers the kernel bug on exit every time.
 
 Removing HUGETLB_MORECORE from the command line fixes the problem.  Note
 that my kernel does not use THP, it is NOT compiled with
 CONFIG_TRANSPARENT_HUGEPAGE.
 
 I've tried it only with this 3.4.76 but I've been able to reproduce it without
 any issue on multiple machines running the same kernel.
 
 BUG: Bad page state in process aio_test  pfn:1b7a01
 page:ea0006de8040 count:0 mapcount:1 mapping:  (null) index:0x0
 page flags: 0x208000(tail)
 Modules linked in: nfsd exportfs nfs nfs_acl auth_rpcgss fscache lockd sunrpc
 rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_cm ib_sa ib_mad ib_core ipmi_si
 ipmi_devintf coretemp pcspkr microcode serio_raw i2c_i801 ioatdma i2c_core dca
 dm_mod sg sr_mod cdrom crc32c_intel ahci libahci [last unloaded: 
 scsi_wait_scan]
 Pid: 4441, comm: aio_test Not tainted 3.4.76bug #1
 Call Trace:
 [810f3300] ? is_free_buddy_page+0xa0/0xd0
 [814c0791] bad_page+0xe6/0xfc
 [810f3dbc] free_pages_prepare+0xfc/0x110
 [810f3dff] __free_pages_ok+0x2f/0xd0
 [810f4080] __free_pages+0x20/0x40
 [81124737] update_and_free_page+0x77/0x80
 [8112633e] free_huge_page+0x16e/0x180
 [810f8030] __put_compound_page+0x20/0x50
 [810f8108] put_compound_page+0x78/0x140
 [810f8546] put_page+0x36/0x40
 [81126ede] __unmap_hugepage_range+0x1ce/0x230
 [81127331] unmap_hugepage_range+0x51/0x90
 [8110e880] unmap_single_vma+0x730/0x740
 [8110f05f] unmap_vmas+0x5f/0x80
 [8111672c] exit_mmap+0xbc/0x130
 [8112e170] ? kmem_cache_free+0x20/0xe0
 [81035155] mmput+0x35/0xf0
 [8103a58d] exit_mm+0xfd/0x120
 [8103bb6c] do_exit+0x16c/0x8b0
 [811540c4] ? mntput+0x24/0x40
 [81138962] ? fput+0x192/0x250
 [8103c5ff] do_group_exit+0x3f/0xa0
 [8103c677] sys_exit_group+0x17/0x20
 [814d03d2] system_call_fastpath+0x16/0x1b
 


-- 
Guillaume Morin guilla...@morinfr.org
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/