Re: [RFC PATCH 1/4] ftrace: Make recordmcount.c handle __fentry__
On 08/07/2012 12:38 PM, Steven Rostedt wrote:
> From: Steven Rostedt
>
> With gcc 4.6.0 the -mfentry feature places the function profiling call at the
> start of the function. When this is used, the call is to __fentry__ and not
> mcount.
>
> Change recordmcount.c to record both callers to __fentry__ and mcount.

[snip]

> -	    (altmcount && strcmp(altmcount, symname) == 0))
> +	    (altmcount && strcmp(altmcount, symname) == 0) ||
> +	    (strcmp(fentry, symname) == 0))

The proposed change will work as long as all the *.o use the same name.
Only one of {"__fentry__", "mcount", "_mcount", altmcount} is allowed
for all the *.o given as input to a particular run.  [Modulo the hack of
ignoring a leading '.' for 64-bit PowerPC, of course.]

If the user changes compilers (or changes CFLAGS by inserting or removing
"-mfentry") without doing a "make clean", then recordmcount will omit
some calls.  Those restrictions are easy to guess, and recovery is easy.

Therefore,
Acked-by: John Reiser
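For readers without the rest of the patch at hand, here is a hedged,
condensed sketch of the symbol-name test once the hunk above is applied.
It is a paraphrase for illustration, not the literal recordmcount code,
and the string values shown are assumptions based on the names discussed
above:

	#include <string.h>

	/* Accepted spellings of the profiling call target (illustrative values). */
	static const char *mcount = "mcount";		/* or "_mcount" on some arches */
	static const char *altmcount;			/* optional alternate; may be NULL */
	static const char *fentry = "__fentry__";	/* new with -mfentry */

	/* Does this relocation's symbol name denote a profiling call? */
	static int is_profiling_call(const char *symname)
	{
		if (symname[0] == '.')
			++symname;	/* ignore leading '.' (64-bit PowerPC hack) */
		return strcmp(mcount, symname) == 0
		    || (altmcount && strcmp(altmcount, symname) == 0)
		    || strcmp(fentry, symname) == 0;
	}

Any one of the accepted spellings matches per relocation, so the only
trouble is the stale-object case described above, where different *.o
were compiled with different spellings.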
Re: "STT_FUNC for assembler checksum and semaphore ops" in git-x86
Andi Kleen wrote:
> On Thu, Jan 10, 2008 at 04:59:52PM -0800, John Reiser wrote:
>
>> Andi Kleen wrote:
>>
>>> But actually checking the default implementation in linkage.h already
>>> implements size:
[snip]
>>
>>> Are you sure it doesn't work? Your patch should be not needed. If it's
>>> still wrong then just ENDPROCs() need to be added.
>>
>> The ENDPROCs() were not used everywhere.  Some code used just END()
>> instead, while other code used nothing.  um/sys-i386/checksum.S didn't
>> #include
>
> END() is fine too since it contains .size too:
>
> #ifndef END
> #define END(name) \
>   .size name, .-name
> #endif
>
>> diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
>> index 444fba4..e2c6e0d 100644
>> --- a/arch/x86/lib/semaphore_32.S
>> +++ b/arch/x86/lib/semaphore_32.S
>> @@ -49,7 +49,7 @@ ENTRY(__down_failed)
>>  	ENDFRAME
>>  	ret
>>  	CFI_ENDPROC
>> -	END(__down_failed)
>> +	ENDPROC(__down_failed)
>
> I don't think these changes make sense given the definition of END()
> shown above.
>
> The only change that would make sense is adding END() (or ENDPROC())
> to a function that doesn't have either of them yet.

No.  The pseudo-op ".type name, @function" appears only in ENDPROC; it
does not appear in END.  So changing END to ENDPROC *does* alter the
Elf32_Sym for 'name'.  Plain END produces STT_NOTYPE; ENDPROC produces
STT_FUNC.

A static analysis tool can get the info it wants much more easily if all
subroutines are marked as STT_FUNC.  In theory the tool could sort the
symbols, notice the disjoint coverage of the address space by the .size
intervals of consecutive symbols that are the targets of a CALL
instruction, and thus deduce that ".type foo, @function" *should* have
been specified.  But this is a heuristic, and it fails at boundaries
where assembly code is invoked via trap, interrupt, or exception
(anything other than CALL).  Instead, specify STT_FUNC for each
subroutine in the first place.  That requires .type, which means ENDPROC
(not END) from linux/linkage.h.

--
John Reiser, [EMAIL PROTECTED]
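For reference, the two macros being contrasted, as they appear in
include/linux/linkage.h at the time (END() is quoted above, and the
ENDPROC() definition also appears in the patch later in this thread):

	#ifndef END
	#define END(name) \
	  .size name, .-name
	#endif

	#ifndef ENDPROC
	#define ENDPROC(name) \
	  .type name, @function; \
	  END(name)
	#endif

Only ENDPROC() emits the .type directive, so only ENDPROC() yields an
STT_FUNC symbol; END() alone leaves the symbol as STT_NOTYPE.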
Re: "STT_FUNC for assembler checksum and semaphore ops" in git-x86
Andi Kleen wrote:
> But actually checking the default implementation in linkage.h already
> implements size:
[snip]
> Are you sure it doesn't work? Your patch should be not needed. If it's
> still wrong then just ENDPROCs() need to be added.

The ENDPROCs() were not used everywhere.  Some code used just END()
instead, while other code used nothing.  um/sys-i386/checksum.S didn't
#include .  I also got confused because gcc puts the .type near the
ENTRY, while ENDPROC puts it on the opposite end.

Here is a revised patch against linux-2.6-x86.git, including a comment
in linkage.h which explains one motivation.  PowerPC is different
("_GLOBAL" instead of "ENTRY") and foreign to me, so I left it alone.

Signed-off-by: John Reiser <[EMAIL PROTECTED]>

diff --git a/include/linux/linkage.h b/include/linux/linkage.h
index ff203dd..024cfab 100644
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -53,6 +53,10 @@
   .size name, .-name
 #endif
 
+/* If symbol 'name' is treated as a subroutine (gets called, and returns)
+ * then please use ENDPROC to mark 'name' as STT_FUNC for the benefit of
+ * static analysis tools such as stack depth analyzer.
+ */
 #ifndef ENDPROC
 #define ENDPROC(name) \
   .type name, @function; \
diff --git a/arch/x86/lib/semaphore_32.S b/arch/x86/lib/semaphore_32.S
index 444fba4..e2c6e0d 100644
--- a/arch/x86/lib/semaphore_32.S
+++ b/arch/x86/lib/semaphore_32.S
@@ -49,7 +49,7 @@ ENTRY(__down_failed)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__down_failed)
+	ENDPROC(__down_failed)
 
 ENTRY(__down_failed_interruptible)
 	CFI_STARTPROC
@@ -70,7 +70,7 @@ ENTRY(__down_failed_interruptible)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__down_failed_interruptible)
+	ENDPROC(__down_failed_interruptible)
 
 ENTRY(__down_failed_trylock)
 	CFI_STARTPROC
@@ -91,7 +91,7 @@ ENTRY(__down_failed_trylock)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__down_failed_trylock)
+	ENDPROC(__down_failed_trylock)
 
 ENTRY(__up_wakeup)
 	CFI_STARTPROC
@@ -112,7 +112,7 @@ ENTRY(__up_wakeup)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__up_wakeup)
+	ENDPROC(__up_wakeup)
 
 /*
  * rw spinlock fallbacks
@@ -132,7 +132,7 @@ ENTRY(__write_lock_failed)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__write_lock_failed)
+	ENDPROC(__write_lock_failed)
 
 ENTRY(__read_lock_failed)
 	CFI_STARTPROC
@@ -148,7 +148,7 @@ ENTRY(__read_lock_failed)
 	ENDFRAME
 	ret
 	CFI_ENDPROC
-	END(__read_lock_failed)
+	ENDPROC(__read_lock_failed)
 
 #endif
 
@@ -170,7 +170,7 @@ ENTRY(call_rwsem_down_read_failed)
 	CFI_ADJUST_CFA_OFFSET -4
 	ret
 	CFI_ENDPROC
-	END(call_rwsem_down_read_failed)
+	ENDPROC(call_rwsem_down_read_failed)
 
 ENTRY(call_rwsem_down_write_failed)
 	CFI_STARTPROC
@@ -182,7 +182,7 @@ ENTRY(call_rwsem_down_write_failed)
 	CFI_ADJUST_CFA_OFFSET -4
 	ret
 	CFI_ENDPROC
-	END(call_rwsem_down_write_failed)
+	ENDPROC(call_rwsem_down_write_failed)
 
 ENTRY(call_rwsem_wake)
 	CFI_STARTPROC
@@ -196,7 +196,7 @@ ENTRY(call_rwsem_wake)
 	CFI_ADJUST_CFA_OFFSET -4
 1:	ret
 	CFI_ENDPROC
-	END(call_rwsem_wake)
+	ENDPROC(call_rwsem_wake)
 
 /* Fix up special calling conventions */
 ENTRY(call_rwsem_downgrade_wake)
@@ -214,6 +214,6 @@ ENTRY(call_rwsem_downgrade_wake)
 	CFI_ADJUST_CFA_OFFSET -4
 	ret
 	CFI_ENDPROC
-	END(call_rwsem_downgrade_wake)
+	ENDPROC(call_rwsem_downgrade_wake)
 
 #endif
diff --git a/arch/um/sys-i386/checksum.S b/arch/um/sys-i386/checksum.S
index 62c7e56..4f3f62b 100644
--- a/arch/um/sys-i386/checksum.S
+++ b/arch/um/sys-i386/checksum.S
@@ -26,6 +26,7 @@
  */
 
 #include
+#include
 
 /*
  * computes a partial checksum, e.g. for TCP/UDP fragments
@@ -48,7 +49,7 @@ unsigned int csum_partial(const unsigned char * buff, int len, unsigned int sum)
  * Fortunately, it is easy to convert 2-byte alignment to 4-byte
  * alignment for the unrolled loop.
  */
-csum_partial:
+ENTRY(csum_partial)
 	pushl %esi
 	pushl %ebx
 	movl 20(%esp),%eax	# Function arg: unsigned int sum
@@ -113,12 +114,13 @@ csum_partial:
 	popl %ebx
 	popl %esi
 	ret
+	ENDPROC(csum_partial)
 
 #else
 
 /* Version for PentiumII/PPro */
 
-csum_partial:
+ENTRY(csum_partial)
 	pushl %esi
 	pushl %ebx
 	movl 20(%esp),%eax	# Function arg: unsigned int sum
@@ -211,6 +213,7 @@ csum_partial:
 	popl %ebx
 	popl %esi
 	ret
+	ENDPROC(csum_partial)
 
 #endif
 
@@ -250,7 +253,7 @@ unsigned int csum_partial_copy_generic (const char *src, char *dst,
 
 #define ARGBASE 16
 #define FP	12
Re: "STT_FUNC for assembler checksum and semaphore ops" in git-x86
>> +	.type csum_partial, @function
>>  ENTRY(csum_partial)
>> +	.type csum_partial, @function
>>  ENTRY(csum_partial)
>>  	CFI_STARTPROC
>>  	pushl %esi
>> @@ -141,11 +142,13 @@ ENTRY(csum_partial)
>>  	ret
>>  	CFI_ENDPROC
>>  ENDPROC(csum_partial)
>> +	.size csum_partial, . - csum_partial
>>
>> AK [Andi Kleen]:
>> Better option would be to just add to ENTRY/ENDPROC ...
>
> John got more or less the same comment from me [Sam Ravnborg] - but
> I did not hear further.
> As it stands now it is not nice.

It does look ugly.  But .size and .type are oriented towards static
description, while the DWARF2 Call Frame Information (CFI) annotations
are oriented towards dynamic traceback and unwinding.  Forcing a
coupling between the static and dynamic annotations might lead to
trouble when the fast path to 'ret' is in the middle and the code for
the slow path appears after the 'ret', or when a recursive tail "falls
through" into the entry point.

A quick test shows that multiple .size pseudo-ops for the same symbol
are accepted (the last one wins), so a default coupling can be
overridden.  I'll test revised macros and report shortly.

--
John Reiser, [EMAIL PROTECTED]
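To illustrate the layout being worried about, here is a hedged sketch
with hypothetical function and label names (not kernel code): if the
.size end-marker were emitted wherever the unwind annotations happen to
end, for instance near an early 'ret', it could stop short of the slow
path; keeping ENDPROC at the true end lets .type and .size cover
everything.

	ENTRY(foo)
		CFI_STARTPROC
		testl	%eax, %eax
		jnz	.Lslow			# uncommon case
		ret				# fast path returns from the middle
	.Lslow:
		call	do_slow_work		# hypothetical slow path after the ret
		ret
		CFI_ENDPROC
	ENDPROC(foo)				# .type/.size still cover .Lslow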
STT_FUNC for assembler checksum and semaphore ops
 	;; r10 - src
@@ -121,4 +122,4 @@
 	addu.b	[$r10],$r12
 	ret
 	move.d	$r12, $r10
-
+	.size	csum_partial, . - csum_partial
--- ./arch/cris/arch-v32/lib/checksum.S.orig	2007-11-02 07:03:45.0 -0700
+++ ./arch/cris/arch-v32/lib/checksum.S	2008-01-07 08:45:39.0 -0800
@@ -6,6 +6,7 @@
  */
 
 	.globl	csum_partial
+	.type	csum_partial, @function
 csum_partial:
 
 	;; r10 - src
@@ -109,3 +110,4 @@
 	addu.b	[$r10],$r12
 	ret
 	move.d	$r12,$r10
+	.size	csum_partial, . - csum_partial

--
John Reiser, [EMAIL PROTECTED]
Re: slab quirks in DEBUG, ctor, and initialization
Hi Pekka,

>> In mm/slab.c, the DEBUG variant of cache_alloc_debugcheck_after
>> might call cachep->ctor(objp, cachep, 0); but the non-DEBUG
>> variant does absolutely nothing.  idr_pre_get is a routine
>> which notices the difference.
>
> How does idr_pre_get notice this?

idr_pre_get calls kmem_cache_alloc, which constructs 'struct idr_layer'
via the cachep->ctor() call from cache_alloc_debugcheck_after to
idr_cache_ctor, and not via cache_init_objs.  So if DEBUG is off, then
idr_cache_ctor does not get its chance to call memset, which leaves the
struct idr_layer potentially undefined.

>> Even when cache_alloc_debugcheck_after does invoke the ctor,
>> then it is conditional upon cachep->flags & SLAB_POISON.  This
>> assumes that the only two states are poisoned and all-zero
>> (from .bss static, or via a cleared new page frame.)
>> So if SLAB_POISON is not specified, then a ctor which
>> does anything other than memset(,0,) is out of luck.
>> Instead: if a ctor is specified then it should be called
>> for each successful allocation.
>
> Sorry, I don't understand at all what the problem is here.  For the
> common (non-poison) case, we initialize all objects *once* whenever a
> cache is grown (see cache_grow calling cache_init_objs) which is the
> whole point of having constructors.  When poisoning is enabled, we
> obviously cannot do this, which is why we call the constructor for
> every allocation.

There is a comment near the beginning of mm/slab.c:

 * This means, that your constructor is used only for newly allocated
 * slabs and you must pass objects with the same initializations to
 * kmem_cache_free.

Note that the comment should be updated to account for the constructor
also being called if SLAB_POISON is specified.

A significant implication is that calling the constructor always
establishes good state, even though the call might not be necessary,
and although it might take more time than not calling it.  In order to
avoid the allocator having to call the constructor, the caller of
kmem_cache_free must call the constructor just before freeing, or
establish the same values.  In contrast to calling the constructor
every time, the existing convention increases complexity and risk of
bugs, in exchange for faster performance when actions equivalent to the
ctor are performed in the dtor instead, or when state that is
equivalent to the ctor occurs as a natural byproduct of being done with
the object.

The real gripe is that the current code makes it cumbersome for a
dynamic checking program such as valgrind(memcheck) to record correctly
the state of the object.  Is the object initialized (and which parts of
it, and were the initializations performed into allocated space), or is
it merely allocated and uninitialized?

--
John Reiser, [EMAIL PROTECTED]
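A minimal sketch of the convention that comment describes, with
hypothetical names (this is not code from mm/slab.c or lib/idr.c; only
the three-argument ctor signature matches the cachep->ctor(objp, cachep,
0) call quoted above):

	#include <linux/list.h>
	#include <linux/slab.h>

	/* Constructed state: list initialized, refcount zero. */
	struct widget {
		struct list_head list;
		int refcount;
	};

	static void widget_ctor(void *objp, struct kmem_cache *cachep,
				unsigned long flags)
	{
		struct widget *w = objp;

		INIT_LIST_HEAD(&w->list);
		w->refcount = 0;
	}

	/* Under the convention quoted above, the ctor runs only when a new
	 * slab page is populated (or, with SLAB_POISON, on each allocation),
	 * so the user must hand each object back to kmem_cache_free()
	 * already restored to this constructed state. */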
Re: /dev/urandom uses uninit bytes, leaks user data
Theodore Tso wrote:
> On Fri, Dec 14, 2007 at 04:30:08PM -0800, John Reiser wrote:
>
>> There is a path that goes from user data into the pool.

Note particularly that the path includes data from other users.  Under
the current implementation, anyone who accesses /dev/urandom is subject
to having some bytes from their address space being captured and mixed
into the pool.

>> This path
>> is subject to manipulation by an attacker, for both reading and
>> writing.  Are you going to guarantee that in five years nobody
>> will discover a way to take advantage of it?
>
> Yep, I'm confident about making such a guarantee.

Very confident.

A direct attack (determining the specific values or partial values of
some captured bytes) is not the only way to steal secrets.  An indirect
attack, such as traffic analysis, also may be effective.  Here is one
idea.  Use output from /dev/urandom to generate a random permutation
group.  Analyze the group: determine all its subgroups, etc.  If the
structure of those groups has different properties depending on the
captured bytes, even after SHA1 and folding and twisting, then that may
be enough to help steal secrets.

Indirect attacks may be subject to "exponent doubling."  The state
modulo 2**(2n) may correspond to a system of 2**n congruences in 2**n
variables.  So a property modulo 2**n might be hoisted to a related
property modulo 2**(2n).  This might make 2**1024 seem not so big.

Also, "getting lucky" is allowed, both via initial conditions and via
other coincidences.  Running on a newly-booted, newly-installed system
might be especially advantageous.  A completely formal Goedel-numbering
proof often has a formal checker that is logarithmic in the length of
the proof.  If such a logarithmic property applies every once in a
while to /dev/urandom, then that might be enough.

The bottom line: at a cost of at most three unpredictable branches
(whether to clear the bytes in the last word with indices congruent to
1, 2, or 3 modulo 4), the code can reduce the risk from something small
but positive to zero.  This is very inexpensive insurance.

--
John Reiser, [EMAIL PROTECTED]
slab quirks in DEBUG, ctor, and initialization
In mm/slab.c, the DEBUG variant of cache_alloc_debugcheck_after might
call cachep->ctor(objp, cachep, 0); but the non-DEBUG variant does
absolutely nothing.  idr_pre_get is a routine which notices the
difference.

Even when cache_alloc_debugcheck_after does invoke the ctor, then it is
conditional upon cachep->flags & SLAB_POISON.  This assumes that the
only two states are poisoned and all-zero (from .bss static, or via a
cleared new page frame.)  So if SLAB_POISON is not specified, then a
ctor which does anything other than memset(,0,) is out of luck.
Instead: if a ctor is specified then it should be called for each
successful allocation.

Invoking the ctor from within cache_alloc_debugcheck_after makes it
awkward for a dynamic checker of allocations and initializations.
Valgrind would be happier if allocation and initialization were invoked
at the same subroutine nesting level, such as in __cache_alloc:

	cache_alloc_debugcheck_before(cachep, flags);
	local_irq_save(save_flags);
	objp = __do_cache_alloc(cachep, flags);
	/* checker marks the space as allocated */
	local_irq_restore(save_flags);
	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
	if (objp && cachep->ctor)
		cachep->ctor(objp, cachep, 0);
	/* checker notes initializations during ctor [above] */

--
John Reiser, [EMAIL PROTECTED]
Re: /dev/urandom uses uninit bytes, leaks user data
Theodore Tso wrote:
> On Fri, Dec 14, 2007 at 12:45:23PM -0800, John Reiser wrote:
>
>>> It's getting folded into the random number pool, where it will be
>>> impossible to recover it unless you already know what was in the
>>> pool.  And if you know what's in the pool, you've already broken into
>>> the kernel.
>>
>> The combination of capturing data from other users, plus seeding
>> the pool with your own data, just might be powerful enough to help
>> steal secrets, sometime in the next five years, from data that is
>> recorded today.
>
> Um, no.  Just seeding the pool with your own data won't help, since
> that still won't tell you the initial contents of the pool.  And if
> you know the initial contents of the pool, then you've broken root.
> And being able to steal from the pool also assumes that you've broken
> into the system; it is never, ever exported to userspace, even if
> you're root (unless you use something like /dev/kmem).  Furthermore,
> if you don't know the previous contents of the pool, you'll never be
> able to recover the information, either now or five years in the
> future, since information is XOR'ed into the pool.

There is a path that goes from user data into the pool.  This path is
subject to manipulation by an attacker, for both reading and writing.
Are you going to guarantee that in five years nobody will discover a
way to take advantage of it?  Five years ago there were no public
attacks against MD5 except brute force; now MD5 is on the "weak" list.

>>> But I'm sympathetic to making Valgrind happy. ...
>
> How about wrapping it in a #ifdef CONFIG_UML, which is the only way
> you can use Valgrind?  The memset will slow down things unnecessarily,
> and mixing in the unknown previous contents of the stack (a) doesn't
> hurt, and (b) could make life more complicated for an attacker.

If speed matters that much, then please recoup 33 cycles on x86 by
using shifts instead of three divides, such as (gcc 4.1.2):

	add_entropy_words(r, tmp, (bytes + 3) / 4);
   0x8140689:	lea    0x3(%esi),%eax
   0x814068c:	mov    $0x4,%dl
   0x814068e:	mov    %edx,%edi
   0x8140690:	cltd
   0x8140691:	idiv   %edi

--
John Reiser, [EMAIL PROTECTED]
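For concreteness, the shift form being suggested is simply the
following (equivalent to the divide only for non-negative byte counts;
the earlier patch in this thread uses the same bytes>>2 idea):

	/* round bytes up to whole 32-bit words without a divide */
	add_entropy_words(r, tmp, (bytes + 3) >> 2);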
Re: /dev/urandom uses uninit bytes, leaks user data
Matt Mackall wrote:
> Yes, we use uninitialized data.  But it's not a leak in any useful
> sense.  To the extent the previous data is secret, this actually
> improves our entropy.
>
> It's getting folded into the random number pool, where it will be
> impossible to recover it unless you already know what was in the
> pool.  And if you know what's in the pool, you've already broken into
> the kernel.

The combination of capturing data from other users, plus seeding the
pool with your own data, just might be powerful enough to help steal
secrets, sometime in the next five years, from data that is recorded
today.

> But I'm sympathetic to making Valgrind happy. ...
> [The code in John's patch is] hideous.  How about a memset instead ...
> And [John's] change is broken.. We have to add precisely the
> number of bytes returned by extract_entropy to keep the books
> balanced.

Matt is correct.  The rolled-up result follows.

Signed-off-by: [EMAIL PROTECTED]

--- ./drivers/char/random.c.orig	2007-12-14 11:06:03.0 -0800
+++ ./drivers/char/random.c	2007-12-14 12:27:23.0 -0800
@@ -708,6 +708,8 @@
 
 	bytes=extract_entropy(r->pull, tmp, bytes,
 				random_read_wakeup_thresh / 8, rsvd);
+	/* clear uninitialized bytes at the end to make valgrind happy */
+	memset((char *)tmp + bytes, 0, -bytes & 3);
 	add_entropy_words(r, tmp, (bytes + 3) / 4);
 	credit_entropy_store(r, bytes*8);
 }

--
John Reiser, [EMAIL PROTECTED]
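As a sanity check of the arithmetic in that memset (a standalone,
hypothetical test, not kernel code): -bytes & 3 is exactly the number of
pad bytes needed to reach the next multiple of 4, so the memset clears
precisely the tail bytes of the final word that extract_entropy did not
write, and (bytes + 3) / 4 whole words are then mixed in.

	#include <assert.h>

	int main(void)
	{
		/* -bytes & 3 == pad needed to round bytes up to a multiple of 4
		 * (relies on two's complement, as the kernel does) */
		for (int bytes = 0; bytes <= 64; ++bytes)
			assert((-bytes & 3) < 4 &&
			       (bytes + (-bytes & 3)) % 4 == 0);
		return 0;
	}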
/dev/urandom uses uninit bytes, leaks user data
xfer_secondary_pool() in drivers/char/random.c tells add_entropy_words()
to use uninitialized tmp[] whenever bytes is not a multiple of 4.
Besides being unfriendly to automated dynamic checkers, this is a
potential leak of user data into the output stream.  When called from
extract_entropy_user, the uninitialized tmp[] can capture leftover data
from a previous copy_from_user().

Signed-off-by: [EMAIL PROTECTED]

--- ./drivers/char/random.c.orig	2007-12-14 11:06:03.0 -0800
+++ ./drivers/char/random.c	2007-12-14 11:06:57.0 -0800
@@ -708,7 +708,19 @@
 
 	bytes=extract_entropy(r->pull, tmp, bytes,
 				random_read_wakeup_thresh / 8, rsvd);
-	add_entropy_words(r, tmp, (bytes + 3) / 4);
+	/*
+	 * 2007-12-13 (valgrind/memcheck) Do not use undefined bytes.
+	 * Avoid info leak when called from extract_entropy_user:
+	 * uninit tmp[] can have data from previous copy_from_user().
+	 * Instead: fill last word using first bytes.
+	 */
+	{
+		__u8 *src = (__u8 *)&tmp[0];
+		__u8 *dst = bytes + src;
+		for (; 0!=(3 & bytes); ++bytes)
+			*dst++ = *src++;
+	}
+	add_entropy_words(r, tmp, bytes>>2);
 	credit_entropy_store(r, bytes*8);
 }
 

--
John Reiser, [EMAIL PROTECTED]
Re: + fully-honor-vdso_enabled.patch added to -mm tree
Chuck Ebbert wrote:
> John Reiser wrote:
>
>> The value of ->sysenter_return is interpreted in user space by the
>> sysexit instruction; nobody else cares what the value is.  The kernel
>> is not required to provide a good value when vdso_enabled is zero,
>> because the kernel has not told the process that sysenter is valid
>> (by setting AT_SYSINFO.)
>
> Doesn't matter because a malicious user can still execute sysenter.
> We do have to deal with that somehow, so we have to put something
> safe in there.

All values are safe as far as the kernel is concerned.  The sysexit
instruction is the only consumer of the value.  The sysexit instruction
interprets the value as a _user_ virtual address, and jumps to it, in
_user_ mode.  If user code does not like jumping to a random address
when vdso_enabled is zero, then user code should not use sysenter when
vdso_enabled is zero.  But execution of kernel code is not affected in
any way, regardless of the value.

I'd be happy to set ->sysenter_return to 0 (or any other suggested
value) when vdso_enabled is 0, but this would be a politeness only.
There is no added security risk to the kernel by leaving it unset.

>> Correct.  Changing vdso_enabled from 0 to non-zero must be prepared
>> to lose this race if it is not prevented.  Ordinarily it won't matter
>> because the administrator will perform such changes at a "quiet" time.
>
> We have to deal with all the possibilities here, too.

Documentation is one method of dealing with it.

--
John Reiser, [EMAIL PROTECTED]
Re: + fully-honor-vdso_enabled.patch added to -mm tree
Oleg Nesterov wrote:
> Still, I don't understand why we don't pass NEW_AUX_ENT(AT_SYSINFO) when
> vdso_enabled == 0.  We don't need linux-gate.so to use __kernel_vsyscall,
> we have FIX_VDSO.  In that case we should s/PAGE_KERNEL_RO/PAGE_READONLY/
> of course.  I guess the reason is some magic in glibc.

In the past there were problems where user code could not single-step
(or even read the code of) such system calls.  Can these two operations
be performed today with what you propose, particularly if FIX_VDSO is
near 0xf000?

--
John Reiser, [EMAIL PROTECTED]
Re: + fully-honor-vdso_enabled.patch added to -mm tree
Oleg Nesterov wrote:
> John Reiser wrote:
>
>> +	switch (vdso_enabled) {
>> +	case 0: /* none */
>> +		return 0;
>
> This means we don't initialize mm->context.vdso and ->sysenter_return.
>
> Is it ok?  For example, setup_rt_frame() uses VDSO_SYM(&__kernel_rt_sigreturn),
> sysenter_past_esp pushes ->sysenter_return on stack.

Paul Mundt has commented on setup_rt_frame() and provided a patch which
bullet-proofs that area.  I will include that patch in the next
revision.

The value of ->sysenter_return is interpreted in user space by the
sysexit instruction; nobody else cares what the value is.  The kernel
is not required to provide a good value when vdso_enabled is zero,
because the kernel has not told the process that sysenter is valid (by
setting AT_SYSINFO.)  The kernel requires specific register values for
sysenter+sysexit, and these values may change at the whim of the
kernel, so correct code must follow the kernel's protocol.  glibc uses
sysenter only when AT_SYSINFO is present.  User code can screw up even
when vdso_enabled is non-zero, by overwriting or re-mapping the vdso
page (clobbering memory at the destination of sysexit.)

Both context.vdso and sysenter_return could be set to zero whenever
vdso_enabled is zero; those two values might even be defaulted.  I'll
add such a change to the next revision of the patch, if you'll defend
it against claims of "unnecessary code."

> Note also that load_elf_binary does
>
>	arch_setup_additional_pages()
>	create_elf_tables()
>
> , looks like application can crash after exec if vdso_enabled changes from 0
> to 1 in between.

Correct.  Changing vdso_enabled from 0 to non-zero must be prepared to
lose this race if it is not prevented.  Ordinarily it won't matter
because the administrator will perform such changes at a "quiet" time.

--
John Reiser, [EMAIL PROTECTED]
fully honor vdso_enabled [i386, sh; x86_64?]
Architectures such as i386, sh, x86_64 have a flag
/proc/sys/vm/vdso_enabled to choose whether the kernel should set up a
process to use vdso after execve().  Informing the user code via
AT_SYSINFO* is controlled by macro ARCH_DLINFO in fs/binfmt_elf.c and
include/asm-$ARCH/elf.h, but the vdso page is always established via
arch_setup_additional_pages() called from load_elf_binary().  If
vdso_enabled is off, then the current code wastes kernel time during
execve() and fragments the address space unnecessarily.

This patch changes arch_setup_additional_pages() to honor vdso_enabled.
For i386 it also allows the option of a fixed address to avoid
fragmenting the address space.  Compiles and runs on i386.  x86_64
[IA32 support] and sh maintainers also please comment.

For some related history, including interaction with exec-shield, see:
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229304
and also 207020 and 162797.

Signed-off-by: John Reiser <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/sysenter.c b/arch/i386/kernel/sysenter.c
index 13ca54a..f8c4d76 100644
--- a/arch/i386/kernel/sysenter.c
+++ b/arch/i386/kernel/sysenter.c
@@ -22,6 +22,8 @@
 #include
 #include
 #include
+#include
+#include
 
 /*
  * Should the kernel map a VDSO page into processes and pass its
@@ -105,10 +107,25 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int exstack)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
+	unsigned long flags;
 	int ret;
 
+	switch (vdso_enabled) {
+	case 0: /* none */
+		return 0;
+	default:
+	case 1: /* vdso in random available page */
+		addr = 0ul;
+		flags = 0ul;
+		break;
+	case 2: /* out of user's way */
+		addr = STACK_TOP;
+		flags = MAP_FIXED;
+		break;
+	}
+
 	down_write(&mm->mmap_sem);
-	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, flags);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 7b0f66f..2b789fb 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -64,6 +64,9 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 	unsigned long addr;
 	int ret;
 
+	if (!vdso_enabled)
+		return 0;
+
 	down_write(&mm->mmap_sem);
 	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
diff --git a/include/asm-i386/a.out.h b/include/asm-i386/a.out.h
index ab17bb8..9894f73 100644
--- a/include/asm-i386/a.out.h
+++ b/include/asm-i386/a.out.h
@@ -19,7 +19,7 @@ struct exec
 
 #ifdef __KERNEL__
 
-#define STACK_TOP	TASK_SIZE
+#define STACK_TOP	(TASK_SIZE - PAGE_SIZE)	/* 1 page optional for vdso */
 
 #endif
 
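For completeness, the knob this patch honors is the existing sysctl
named above; with the patch applied, an administrator would pick the
behavior roughly like this (values taken from the switch statement in
the patch):

	# echo 0 > /proc/sys/vm/vdso_enabled	# no vdso page, no AT_SYSINFO
	# echo 1 > /proc/sys/vm/vdso_enabled	# vdso at a randomly available page (default)
	# echo 2 > /proc/sys/vm/vdso_enabled	# vdso at the fixed address STACK_TOP, out of the way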
bug in load_elf_binary [PATCH]
In kernel 2.4.3 and 2.2.18, there is a bug in fs/binfmt_elf.c function
load_elf_binary().  An ET_DYN file that asks for a PT_INTERP, and whose
first PT_LOAD is not at 0, gets an AT_PHDR that is
(load_bias + 2 * p_vaddr), instead of the correct (load_bias + p_vaddr).

The patch for 2.4.3 is:

--- fs/OLDbinfmt_elf.c	Mon Mar 19 17:05:16 2001
+++ fs/binfmt_elf.c	Sat Jun  2 16:20:54 2001
@@ -632,5 +632,5 @@
 			load_bias += error -
 				     ELF_PAGESTART(load_bias + vaddr);
-			load_addr += error;
+			load_addr += load_bias;
 		}
 	}

Please cc: me if appropriate; I'm not subscribed.

To demonstrate the problem on i386:

	cat >foo.s </maps
	# I see pid_of_foo.so = 2 + pid_of_nice
	# output:
	# 88048000-88049000 r-xp 03:07 2207 foo.so	# load_bias is 0x80000000
	# 88049000-8804a000 rw-p 03:07 2207 foo.so

	gdb foo.so
	# Attach to process, and look at its memory.
	x/64x $esp
	# and continue until seeing the AT_* on the stack.  I see
	# 0xba90:	0xbfe8	0x	0x0003	0x90090034
	# 0xbaa0:	0x0004	0x0020	0x0005	0x0005
	# 0xbab0:	0x0006	0x1000	0x0007	0x4000
	# 0xbac0:	0x0008	0x	0x0009	0x88048114
	# which shows AT_PHDR at 0xba9c of 0x90090034
	# which is too big by 0x08048000 [p_vaddr of first PT_LOAD]
	# because the correct value is 0x88048034.
	# Note that AT_ENTRY at 0xbacc is 0x88048114 which is correct.

--
John Reiser, [EMAIL PROTECTED]
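Checking the arithmetic against the demonstration above (assuming the
first PT_LOAD has p_vaddr 0x08048000, the load_bias is 0x80000000, and
the program header table sits 0x34 bytes into that segment):

	buggy:    load_bias + 2*p_vaddr + 0x34 = 0x80000000 + 0x10090000 + 0x34 = 0x90090034
	correct:  load_bias +   p_vaddr + 0x34 = 0x80000000 + 0x08048000 + 0x34 = 0x88048034

which matches the wrong and right AT_PHDR values shown in the stack dump.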
Re: tighter compression for x86 kernels
> > Both source (GPLv2) and pre-compiled binary for x86 are available.
>             ^
> That's not true.  Read
> http://wildsau.idv.uni-linz.ac.at/mfx/upx-license.html

The UPX team owns all copyright in all of UPX and in each part of UPX.
Therefore, the UPX team may choose which license(s) apply, and has
chosen two.  One of them is GPLv2.  The UPX team understands, and fully
intends to abide by, its obligations under GPLv2 when any software that
is subject to GPLv2 is contributed to UPX and re-distributed by the UPX
team.

The other license is detailed in the LICENSE file, but may be
summarized as: free to use if unmodified, and if used only to invoke
the program, and sublicensable only under the same terms.  This permits
using UPX to pack a non-GPL executable.

Users of UPX (as distributed by the UPX team) may choose whether to use
UPX according to GPLv2, or according to the other license.

[I am not subscribed to this mailing list, so CC: or mail me if
appropriate.]

--
John Reiser, [EMAIL PROTECTED]
tighter compression for x86 kernels
Beta release v1.11 of the UPX executable compressor http://upx.tsx.org
offers new, tighter re-compression of compressed Linux kernels for x86.
Additional space savings of about 15% have been seen using
"upx --best vmlinuz" (example: 617431 ==> 525099, saving 92332 bytes).

Both source (GPLv2) and pre-compiled binary for x86 are available.

[I'm not subscribed to this mailing list, so CC: or mail me if
appropriate.]

--
John Reiser, [EMAIL PROTECTED]