Re: Controversy over dynamic linking -- how to end the panic

2001-06-21 Thread Andrea Arcangeli

> 1. Userland programs which request kernel services via normal system
                                                         ^^^^^^
> calls *are not* to be considered derivative works of the kernel.

Please, at least don't say "normal", or it will not be obvious that it is
OK for the vsyscalls too (which aren't *that* normal as system calls). I'd
rather use "via any kind of official system call (vsyscalls included)".
Otherwise I guess a malicious party could try to argue that the vsyscalls are
basically dynamically linking userspace with the kernel (dynamically
linking GPL code in the kernel to whatever non-GPL userspace).

vsyscalls cannot give any advantage to the dark side (satellite is
flooding me with the Star Wars movies, sorry ;): anything you can do with
a vsyscall, you can do with a real syscall too, just slower.  They can
only improve performance when it is possible to provide the same
functionality without entering/exiting the kernel. So nobody sane could ever
complain about the vsyscalls, but since you're writing that stuff it's
worth making it explicit, I think ;).
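
As a rough user-space illustration of "same service, just slower", the
sketch below reads the time once through the libc wrapper and once through
an explicit kernel entry. It assumes a Linux system where SYS_gettimeofday
is defined; whether the libc gettimeofday() really avoids entering the
kernel depends on the platform and the libc, and the helper name
time_both_ways is made up for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: fetch the time through the libc wrapper (which may
 * take a fast path without a kernel entry) and through a real system call.
 * Returns 0 if both requests succeed, -1 otherwise.
 */
int time_both_ways(struct timeval *fast, struct timeval *slow)
{
	if (gettimeofday(fast, NULL) != 0)
		return -1;
	/* Same kernel service, requested with an explicit syscall. */
	if (syscall(SYS_gettimeofday, slow, NULL) != 0)
		return -1;
	return 0;
}
```

Both calls fill in the same struct timeval, so a caller cannot get anything
from the fast path that the explicit syscall would not also provide.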

Thanks,

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: softirq in pre3 and all linux ports

2001-06-19 Thread Andrea Arcangeli

On Wed, Jun 20, 2001 at 01:33:19PM +1000, Paul Mackerras wrote:
> Well, I object to the "without thinking" bit. [..]

agreed, apologies.

> BHs disabled is buggy - why would you want to do that?  And if we do

tasklet_schedule

> want to allow that, shouldn't we put the check in raise_softirq or the
> equivalent, to get the minimum latency?

We should release the stack before running the softirq (some places use
softirqs to release the stack and avoid overflows).

> Soft irqs should definitely not be much heavier than an irq handler,
> if they are then we have implemented them wrongly somehow.

ip + tcp are more intensive than just queueing a packet in a backlog.
That's why they're not done in irq context in the first place.

> ksoftirqd seems like the wrong solution to the problem to me, if we
> really getting starved by softirqs then we need to look at whether
> whatever is doing it should be a kernel thread itself rather than
> doing it in softirqs.  Do you have a concrete example of the
> starvation/live lockup that you can describe to us?

I don't have gigabit ethernet so I cannot flood my boxes to death.
But I think it's real, and a softirq marking itself runnable again is
another case to handle without live lockups or starvation.

Andrea



Re: all processes waiting in TASK_UNINTERRUPTIBLE state

2001-06-26 Thread Andrea Arcangeli

On Tue, Jun 26, 2001 at 10:47:12AM -0400, Bulent Abali wrote:
> Andrea,
> I would like try your patch but so far I can trigger the bug only when
> running TUX 2.0-B6 which runs on 2.4.5-ac4.  /bulent
> 

to run TUX you can apply these patches in `ls` order to 2.4.6pre5aa1:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.6pre5aa1/30_tux/*

Andrea



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-29 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 05:27:10PM -0600, Eric W. Biederman wrote:
> 
> Do you know if anyone has fixed the lazy vmalloc code?  I know of
> as of early 2.4 it was broken on alpha.  At the time I noticed it I didn't
> have time to persue it, but before I forget to even put in a bug
> report I thought I'd ask if you know anything about it?

On alpha it's racy if you set CONFIG_ALPHA_LARGE_VMALLOC y (so don't do
that as you don't need it). As long as you use only 1 entry of the pgd
for the whole vmalloc space (CONFIG_ALPHA_LARGE_VMALLOC n) alpha is
safe.

OTOH x86 is racy and there's no workaround available at the moment.

Andrea



Re: X15 alpha release: as fast as TUX but in user space (fwd)

2001-04-29 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 09:38:04PM +0200, Jamie Lokier wrote:
> Fwiw, modern x86 has global TLB entries too.

my x86-64 implementation is marking the tlb entry global of course (so
it's not flushed during context switch):

#define __PAGE_KERNEL_VSYSCALL \
	(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
#define MAKE_GLOBAL(x) __pgprot((x) | _PAGE_GLOBAL)
#define PAGE_KERNEL_VSYSCALL MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL)

static void __init map_vsyscall(void)
{
	extern char __vsyscall_0;
	unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 -
		__START_KERNEL_map;

	__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
}

Andrea



Re: X15 alpha release: as fast as TUX but in user space (fwd)

2001-04-29 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 04:18:27PM -0400, Gregory Maxwell wrote:
> having both the code and a comprehensive jump-table might become tough in a

In the x86-64 implementation there's no jump table. The original design
had a jump table but Peter raised the issue that indirect jumps are very
costly and he suggested jumping to a fixed virtual address instead; I
agreed with his suggestion. So this is what I implemented for x86-64
with regard to the userspace vsyscall API (which will be used by glibc):

enum vsyscall_num {
	__NR_vgettimeofday,
	__NR_vtime,
};

#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

the linker can prelink the vsyscall virtual address into the binary as a
weak symbol and the dynamic linker will need to patch it only if
somebody is overriding the weak symbol with a LD_PRELOAD.

Virtual address space is relatively cheap. Currently the 64bit
vgettimeofday bytecode + data is nearly 200 bytes, and the first two
slots are 512 bytes each. So with 1024 bytes we do the whole thing,
and we still have space for a further 6 vsyscalls without paying for any
additional tlb entry.

(the implementation of the above #define will change shortly but the
VSYSCALL_ADDR() API for glibc will remain the same)
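
The slot arithmetic above can be checked with a small stand-alone sketch.
The message does not give the values of VSYSCALL_START or the page size, so
the base address and the 4096-byte page below are assumptions made only for
this illustration; the 512-byte slot size is the one quoted in the text, and
the helper names slots_per_page and free_slots are hypothetical.

```c
#include <stdint.h>

/* Assumed values, for illustration only. */
#define VSYSCALL_START    UINT64_C(0xffffffffff600000) /* hypothetical base */
#define VSYSCALL_SIZE     UINT64_C(512)                /* slot size from the text */
#define ASSUMED_PAGE_SIZE UINT64_C(4096)               /* assumed page size */

enum vsyscall_num {
	__NR_vgettimeofday,
	__NR_vtime,
};

#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

/* Number of fixed-size slots that fit in one page (one tlb entry). */
uint64_t slots_per_page(void)
{
	return ASSUMED_PAGE_SIZE / VSYSCALL_SIZE;
}

/* Slots still unused after vgettimeofday and vtime take the first two. */
uint64_t free_slots(void)
{
	return slots_per_page() - 2;
}
```

Under these assumptions a page holds 8 slots, so 6 remain free, and each
vsyscall address is a compile-time constant that needs no indirect jump.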

Andrea



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-30 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 09:55:06PM -0600, Eric W. Biederman wrote:
> Hmm. I was having problems reproducible with
> CONFIG_ALPHA_LARGE_VMALLOC n.
> 
> Enabling the large vmalloc was my work around, because the large
> vmalloc whet back to the prelazy allocation code.

I don't have a clue about your problems but certainly the
CONFIG_ALPHA_LARGE_VMALLOC n is not racy while the
CONFIG_ALPHA_LARGE_VMALLOC y is racy.

> problem I had was entries failed to propagate across different tasks.

With CONFIG_ALPHA_LARGE_VMALLOC n the entry is propagated before
we start using the new pgd so it cannot race. There's no special page
fault case for that because you will never get a page fault because of
an unmapped pgd entry in the vmalloc space in the first place.

Andrea



Re: 2.4.4: Kernel crash, possibly tcp related

2001-04-30 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 11:58:20PM -0700, David S. Miller wrote:
> 
> Andrew Morton writes:
>  > "David S. Miller" wrote:
>  > > 
>  > > I'm having a devil of a time finding the tcpblast sources on the
>  > > net, can you point me to where I can get them?
>  > 
>  > I seem to have a copy. 
>  > 
>  > http://www.zip.com.au/~akpm/tcpblast-19990504.tar.gz
> 
> Thanks to everyone who pointed me at this and the debian copy :-)
> 
> Anyways, I just tried to reproduce Ralf's problem on two of my
> machines.  One was an SMP sparc64 system, and the other was my
> uniprocessor Athlon.
> 
> What kind of machine are you reproducing this on Ralf?  I'm not

JFYI: I reproduced it too on my UP Athlon. I ran:

tcpblast -d0 -s 40481 another_host 9000

two times and after the second it locked hard. I didn't have any fork
bomb at the same time but there was a high computing load in the
background.

the nic is:

Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet LANCE] (rev 36)

Andrea



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-30 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 05:56:41PM +0100, Alan Cox wrote:
> > On alpha it's racy if you set CONFIG_ALPHA_LARGE_VMALLOC y (so don't do
> > that as you don't need it). As long as you use only 1 entry of the pgd
> > for the whole vmalloc space (CONFIG_ALPHA_LARGE_VMALLOC n) alpha is
> > safe.
> 
> Its racy for all cases on the Alpha because the exception table fixes are
> not done.

you're talking about the module races; I was talking only about
vmalloc lazy pgd mapping. They're different things even if they are
both related to the page fault handler.

I don't use modules on the alpha so...

> > OTOH x86 is racy and there's no workaround available at the moment.
> 
> -ac fixes all known problems there 

I will check that shortly, thanks. (so far all the fixes I've seen floating
around for such races were wrong)

Andrea



Re: Alpha compile problem solved by Andrea (pte_alloc)

2001-04-30 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 07:07:47PM +0200, Andrea Arcangeli wrote:
> On Mon, Apr 30, 2001 at 05:56:41PM +0100, Alan Cox wrote:
> > > On alpha it's racy if you set CONFIG_ALPHA_LARGE_VMALLOC y (so don't do
> > > that as you don't need it). As long as you use only 1 entry of the pgd
> > > for the whole vmalloc space (CONFIG_ALPHA_LARGE_VMALLOC n) alpha is
> > > safe.
> > 
> > Its racy for all cases on the Alpha because the exception table fixes are
> > not done.
> 
> you're talking about the module races, I was only talking only about

here is the fix for your module race (still untested though):

diff -urN 2.4.4/arch/alpha/mm/extable.c alpha-modrace/arch/alpha/mm/extable.c
--- 2.4.4/arch/alpha/mm/extable.c	Thu Nov 16 15:37:26 2000
+++ alpha-modrace/arch/alpha/mm/extable.c	Mon Apr 30 19:28:21 2001
@@ -45,20 +45,25 @@
 	/* There is only the kernel to search.  */
 	ret = search_one_table(__start___ex_table, __stop___ex_table - 1,
 			       addr - gp);
-	if (ret) return ret;
 #else
+	unsigned long flags;
 	/* The kernel is the last "module" -- no need to treat it special. */
 	struct module *mp;
+
+	ret = 0;
+	spin_lock_irqsave(&modlist_lock, flags);
 	for (mp = module_list; mp ; mp = mp->next) {
-		if (!mp->ex_table_start)
+		if (!mp->ex_table_start || !(mp->flags&(MOD_RUNNING|MOD_INITIALIZING)))
 			continue;
 		ret = search_one_table(mp->ex_table_start,
 				       mp->ex_table_end - 1, addr - mp->gp);
-		if (ret) return ret;
+		if (ret)
+			break;
 	}
+	spin_unlock_irqrestore(&modlist_lock, flags);
 #endif
 
-	return 0;
+	return ret;
 }
 
 unsigned

For the large-vmalloc races I'd take a very lazy approach:

--- alpha-modrace/arch/alpha/config.in.~1~  Sat Apr 28 05:24:29 2001
+++ alpha-modrace/arch/alpha/config.in  Mon Apr 30 19:31:24 2001
@@ -211,13 +211,15 @@
 
 # The machine must be able to support more than 8GB physical memory
 # before large vmalloc might even pretend to be an issue.
-if [ "$CONFIG_ALPHA_GENERIC" = "y" -o "$CONFIG_ALPHA_DP264" = "y" \
-   -o "$CONFIG_ALPHA_WILDFIRE" = "y" -o "$CONFIG_ALPHA_TITAN" = "y" ]
-then
-   bool 'Large VMALLOC support' CONFIG_ALPHA_LARGE_VMALLOC
-else
-   define_bool CONFIG_ALPHA_LARGE_VMALLOC n
-fi
+#if [ "$CONFIG_ALPHA_GENERIC" = "y" -o "$CONFIG_ALPHA_DP264" = "y" \
+#  -o "$CONFIG_ALPHA_WILDFIRE" = "y" -o "$CONFIG_ALPHA_TITAN" = "y" ]
+#then
+#  bool 'Large VMALLOC support' CONFIG_ALPHA_LARGE_VMALLOC
+#else
+#  define_bool CONFIG_ALPHA_LARGE_VMALLOC n
+#fi
+# LARGE_VMALLOC is racy, if you *really* need it then fix it first
+define_bool CONFIG_ALPHA_LARGE_VMALLOC n
 
 source drivers/pci/Config.in
 

I mean: I certainly don't need it, not even on the 256G boxes. The non
LARGE_VMALLOC case is simpler and _faster_ (it drops a branch from the page
fault handler fast path), so I'd prefer to spend my time on other
things than fixing LARGE_VMALLOC races, but the above will still prevent
people from getting bitten by such a race until somebody fixes it.  If anybody
has a reasonable example for which I may need more than 8 GB of kernel
vmalloc memory then I can change my mind of course.

Andrea



Re: 2.4.4 sluggish under fork load

2001-04-30 Thread Andrea Arcangeli

On Sun, Apr 29, 2001 at 10:26:57AM +0200, Peter Osterlund wrote:
> On Sat, 28 Apr 2001, Linus Torvalds wrote:
> 
> > > could we leave it at half, but set the parent to SCHED_YIELD?
> >
> > Sounds like a good idea. Peter, how does that feel to you? I bet that I'v
> > enever seen it simply because all my machines are (a) much too powerful
> > for any reasonable use and (b) SMP.
> 
> That seems to work. The scheduling delays are back to 20ms and the
> sluggishness feeling is gone. I wrote a simple test program to verify that
> the child is still scheduled before the parent, so the performance
> advantage should still be there. The only annoying thing is that it hides
> the bash bug ;)
> 
> Patch below:
> 
> --- linux-2.4.4.orig/kernel/fork.cSat Apr 28 10:17:00 2001
> +++ linux-2.4.4/kernel/fork.c Sun Apr 29 10:06:42 2001
> @@ -666,16 +666,18 @@
>   p->pdeath_signal = 0;
> 
>   /*
> -  * Give the parent's dynamic priority entirely to the child.  The
> -  * total amount of dynamic priorities in the system doesn't change
> -  * (more scheduling fairness), but the child will run first, which
> -  * is especially useful in avoiding a lot of copy-on-write faults
> -  * if the child for a fork() just wants to do a few simple things
> -  * and then exec(). This is only important in the first timeslice.
> -  * In the long run, the scheduling behavior is unchanged.
> +  * "share" dynamic priority between parent and child, thus the
> +  * total amount of dynamic priorities in the system doesn't change,
> +  * more scheduling fairness. The parent yields to let the child run
> +  * first, which is especially useful in avoiding a lot of
> +  * copy-on-write faults if the child for a fork() just wants to do a
> +  * few simple things and then exec(). This is only important in the
> +  * first timeslice. In the long run, the scheduling behavior is
> +  * unchanged.
>*/
> - p->counter = current->counter;
> - current->counter = 0;
> + p->counter = (current->counter + 1) >> 1;
> + current->counter >>= 1;
> + current->policy |= SCHED_YIELD;
>   current->need_resched = 1;
> 
>   /*

please try to reproduce the bad behaviour with 2.4.4aa2. There's a bug
in the parent-timeslice patch in 2.4 that I fixed while backporting it
to 2.2aa, and I have now forward ported the fix to 2.4aa. The fact that 2.4.4
gives the whole timeslice to the child just sheds more light on that
bug. Unfortunately the fix doesn't apply cleanly to 2.4.4 (it's
incremental with the numa-scheduler patch) and I need to finish a few
more things before I can backport it myself.

Andrea



Re: 2.2.19 locks up on SMP

2001-04-30 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 06:55:54PM +0100, Alan Cox wrote:
> A couple. It looks lik the VM changes may have upset something (based on
> reports saying it began at that point). Can you see if 2.2.19pre stuff is
> stable ?

I also have reports, but related to the network driver updates. So I
suggest trying again with 2.2.19 but with the drivers/net/* of 2.2.18.

Andrea



Re: 2.2.19 locks up on SMP

2001-04-30 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 08:15:47PM +0200, Andrea Arcangeli wrote:
> suggest to try again with 2.2.19 but with the drivers/net/* of 2.2.18.

even better try vanilla 2.2.19aa2 and if it crashes too, try 2.2.19aa2
plus the drivers/net/* of 2.2.18.

Andrea



Re: 2.4.4 sluggish under fork load

2001-04-30 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 11:38:23PM -0300, Rik van Riel wrote:
> On Mon, 30 Apr 2001, Andrea Arcangeli wrote:
> > On Sun, Apr 29, 2001 at 10:26:57AM +0200, Peter Osterlund wrote:
> 
> > > - p->counter = current->counter;
> > > - current->counter = 0;
> > > + p->counter = (current->counter + 1) >> 1;
> > > + current->counter >>= 1;
> > > + current->policy |= SCHED_YIELD;
> > >   current->need_resched = 1;
> > 
> > please try to reproduce the bad behaviour with 2.4.4aa2. There's a bug
> > in the parent-timeslice patch in 2.4 that I fixed while backporting it
> > to 2.2aa and that I now forward ported the fix to 2.4aa. The fact
> > 2.4.4 gives the whole timeslice to the child just gives more light to
> > such bug.
> 
> The fact that 2.4.4 gives the whole timeslice to the child
> is just bogus to begin with.
> 
> The problem people tried to solve was "make sure the kernel
> runs the child first after a fork", this has just about
> NOTHING to do with how the timeslice is distributed.
> 
> Now, since we are in a supposedly stable branch of the kernel,
> why mess with the timeslice distribution between parent and
> child?  The timeslice distribution that has worked very well
> for the last YEARS...

I have been running with the patch below applied for some time (I didn't
submit it because for some reason, unless I do p->policy &=
~SCHED_YIELD, ksoftirqd deadlocks at boot and I haven't yet investigated
why, and I'd like to have the whole picture on it first):

diff -urN z/include/linux/sched.h z1/include/linux/sched.h
--- z/include/linux/sched.h	Mon Apr 30 04:22:25 2001
+++ z1/include/linux/sched.h	Mon Apr 30 02:45:07 2001
@@ -301,7 +301,7 @@
  * all fields in a single cacheline that are needed for
  * the goodness() loop in schedule().
  */
-	int counter;
+	volatile int counter;
 	int nice;
 	unsigned int policy;
 	struct mm_struct *mm;
diff -urN z/kernel/fork.c z1/kernel/fork.c
--- z/kernel/fork.c	Mon Apr 30 04:22:25 2001
+++ z1/kernel/fork.c	Mon Apr 30 03:49:26 2001
@@ -666,17 +666,17 @@
 	p->pdeath_signal = 0;
 
 	/*
-	 * Give the parent's dynamic priority entirely to the child.  The
-	 * total amount of dynamic priorities in the system doesn't change
-	 * (more scheduling fairness), but the child will run first, which
-	 * is especially useful in avoiding a lot of copy-on-write faults
-	 * if the child for a fork() just wants to do a few simple things
-	 * and then exec(). This is only important in the first timeslice.
-	 * In the long run, the scheduling behavior is unchanged.
+	 * Scheduling the child first is especially useful in avoiding a
+	 * lot of copy-on-write faults if the child for a fork() just wants
+	 * to do a few simple things and then exec().
 	 */
-	p->counter = current->counter;
-	current->counter = 0;
-	current->need_resched = 1;
+	{
+		int counter = current->counter >> 1;
+		current->counter = p->counter = counter;
+		p->policy &= ~SCHED_YIELD;
+		current->policy |= SCHED_YIELD;
+		current->need_resched = 1;
+	}
 	/* Tell the parent if it can get back its timeslice when child exits */
 	p->get_child_timeslice = 1;
 

The only point of my previous email is that if a fork loop has a very
invasive effect on the rest of the system, that more probably indicates
people got bitten by the bug in the parent-timeslice logic. Furthermore, I
never noticed any sluggish behaviour on my systems, and before posting my
previous email I had one definitive report that the bad behaviour
observed on vanilla 2.4.4 with parallel compiles in the background got
cured *completely* by my tree (which in the tested revision didn't
include the above inlined change yet). So I thought it was worth
mentioning the effect of the parent-timeslice bugfix here too.
This doesn't mean I don't want something like the above inlined patch
integrated.

Andrea



Re: 2.4.4: Kernel crash, possibly tcp related

2001-05-01 Thread Andrea Arcangeli

On Mon, Apr 30, 2001 at 09:00:09PM +0400, [EMAIL PROTECTED] wrote:
> Hello!
> 
> > My current theory is that tcpblast does something erratic when the
> > error occurs.
> 
> It has buffer size of 32K, so that it faults at enough large chunk sizes.
> 
> Erratic errno is because this applet prints errno on partial write.
> 
> Oops is apparently because I did something wrong in do_fault yet.
> Seems, you were right telling that this place looks dubious. 8)

this is the strict fix:

diff -urN z/net/ipv4/tcp.c z1/net/ipv4/tcp.c
--- z/net/ipv4/tcp.c	Tue May  1 12:14:14 2001
+++ z1/net/ipv4/tcp.c	Tue May  1 12:12:35 2001
@@ -1184,7 +1184,7 @@
 do_fault:
 	if (skb->len==0) {
 		if (tp->send_head == skb) {
-			tp->send_head = skb->prev;
+			tp->send_head = skb->next;
 			if (tp->send_head == (struct sk_buff*)&sk->write_queue)
 				tp->send_head = NULL;
 		}


really the logic can be implemented more efficiently this way:

--- 2.4.4aa3/net/ipv4/tcp.c.~1~	Tue May  1 10:44:57 2001
+++ 2.4.4aa3/net/ipv4/tcp.c	Tue May  1 12:00:25 2001
@@ -1183,11 +1183,8 @@
 
 do_fault:
 	if (skb->len==0) {
-		if (tp->send_head == skb) {
-			tp->send_head = skb->next;
-			if (tp->send_head == (struct sk_buff*)&sk->write_queue)
-				tp->send_head = NULL;
-		}
+		if (tp->send_head == skb)
+			tp->send_head = NULL;
 		__skb_unlink(skb, skb->list);
 		tcp_free_skb(sk, skb);
 	}

Andrea



Re: 2.4.4 sluggish under fork load

2001-05-01 Thread Andrea Arcangeli

On Tue, May 01, 2001 at 07:18:49AM +0200, Andrea Arcangeli wrote:
> I'm running with this below patch applied since a some time (I didn't
> submitted it because for some reason unless I do p->policy &=
> ~SCHED_YIELD ksoftirqd deadlocks at boot and I didn't yet investigated
> why, and I'd like to have the whole picture on it first):

OK, I found the explanation now. The reason ksoftirqd was deadlocking on
me without the explicit clear of SCHED_YIELD in p->policy is that a
softirq event was pending at the time of the first kernel_thread(), so
while returning from the syscall it was taking the ret_from_irq
path, which skips the reschedule [which was supposed to clear the
SCHED_YIELD and to reschedule the child] because CS was pointing to the
kernel descriptor. So init then runs with SCHED_YIELD set, and when it
executes kernel_thread(ksoftirqd), ksoftirqd inherits SCHED_YIELD too
(copied at the top of do_fork) and it never gets scheduled -> deadlock.

Basically there's no guarantee that any kernel_thread will return with
SCHED_YIELD cleared.

And if you fork off a child with its p->policy SCHED_YIELD set it will
never get scheduled in.

Only "just" running tasks can have SCHED_YIELD set.
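
The inheritance problem can be modelled in a few lines of plain C. This is
only a toy, not kernel code: the struct, the flag value 0x10 and the toy_*
function names are assumptions made for illustration, and the "scheduler"
is reduced to a single check that refuses any task with SCHED_YIELD set.

```c
#define SCHED_YIELD 0x10	/* assumed flag value, for this model only */

struct toy_task {
	unsigned int policy;
	int counter;
};

/* Toy scheduler check: a task with SCHED_YIELD set is never picked. */
int toy_can_be_picked(const struct toy_task *t)
{
	return !(t->policy & SCHED_YIELD);
}

/* Fork without the fix: the child is a plain copy of the parent. */
struct toy_task toy_fork_copy(const struct toy_task *parent)
{
	return *parent;
}

/* Fork with the fix: the inherited flag is cleared in the child. */
struct toy_task toy_fork_clear(const struct toy_task *parent)
{
	struct toy_task child = *parent;

	child.policy &= ~SCHED_YIELD;
	return child;
}
```

If the parent happens to carry SCHED_YIELD at fork time, the plain copy is
never eligible in this model, while the child with the flag cleared is.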

So the lines below are the *right* and most robust approach as far as I can
tell. (Plus counter needs to be volatile, like every variable that can
change under the C code, even if it's probably not required by the
code involved with current->counter.)

> + {
> + int counter = current->counter >> 1;
> + current->counter = p->counter = counter;
> + p->policy &= ~SCHED_YIELD;
> + current->policy |= SCHED_YIELD;
> + current->need_resched = 1;
> + }

Alan, the patch you merged in 2.4.4ac2 can fail like mine, but it may fail in
a much more subtle way: while I notice if ksoftirqd never gets scheduled,
because I synchronize on it and I deadlock, your kupdate/bdflush/kswapd
may be forked off correctly but they can all have SCHED_YIELD set and
they will *never* get scheduled. You know what can happen if kupdate
never gets scheduled... I recommend being careful with 2.4.4ac2.

My patch (part of it quoted above) is the right replacement for the code
in 2.4.4ac2 (you may want to do `counter = current->counter + 1 >> 1'
tricks in addition to that; I will change it a bit too for that minor
part).
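
For the rounding detail, here is a small stand-alone sketch of the two ways
to halve the timeslice. The struct and function names are hypothetical, and
counter is assumed to be non-negative. In C, + binds tighter than >>, so
`current->counter + 1 >> 1' means (current->counter + 1) >> 1; the code
spells the parentheses out.

```c
struct split {
	int parent;
	int child;
};

/* Both halves rounded down: with an odd counter one tick is dropped. */
struct split split_round_down(int counter)
{
	struct split s = { counter >> 1, counter >> 1 };

	return s;
}

/* Child rounded up, parent rounded down: the total is conserved. */
struct split split_conserving(int counter)
{
	struct split s = { counter >> 1, (counter + 1) >> 1 };

	return s;
}
```

With counter = 7 the first variant hands out 3 + 3 ticks, the second 3 + 4,
so only the second keeps the sum of the timeslices unchanged.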

Andrea



Re: Both 2.4.4aa2 and 2.4.4aa3 fail to compile

2001-05-02 Thread Andrea Arcangeli

On Wed, May 02, 2001 at 10:24:18AM +0900, Maintaniner on duty wrote:
> 
> With gcc-2.95.2 provided by SuSE-7.0 for Alpha on UP2000 SMP with 2GB memory
> 
> 
> gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes -O2 
>-fomit-frame-pointer -fno-strict-aliasing -pipe -mno-fp-regs -ffixed-8 -mcpu=ev6 
>-Wa,-mev6-c -o extable.o extable.c
> extable.c: In function `search_exception_table_without_gp':
> extable.c:54: `modlist_lock' undeclared (first use in this function)
> extable.c:54: (Each undeclared identifier is reported only once
> extable.c:54: for each function it appears in.)
> make[2]: *** [extable.o] Error 1
> make[2]: Leaving directory `/usr/src/linux/arch/alpha/mm'
> make[1]: *** [first_rule] Error 2
> make[1]: Leaving directory `/usr/src/linux/arch/alpha/mm'
> make: *** [_dir_arch/alpha/mm] Error 2

Sorry for that, please try this incremental patch (also for Alan) [it
didn't trigger for me because, as you know, I don't use modules on my 2G alpha]:

--- 2.4.4aa3/arch/alpha/mm/extable.c.~1~	Tue May  1 13:30:02 2001
+++ 2.4.4aa3/arch/alpha/mm/extable.c	Wed May  2 21:40:49 2001
@@ -46,6 +46,7 @@
 	ret = search_one_table(__start___ex_table, __stop___ex_table - 1,
 			       addr - gp);
 #else
+	extern spinlock_t modlist_lock;
 	unsigned long flags;
 	/* The kernel is the last "module" -- no need to treat it special. */
 	struct module *mp;
@@ -76,15 +77,23 @@
 			       addr - exc_gp);
 	if (ret) return ret;
 #else
+	extern spinlock_t modlist_lock;
+	unsigned long flags;
 	/* The kernel is the last "module" -- no need to treat it special. */
 	struct module *mp;
+
+	ret = 0;
+	spin_lock_irqsave(&modlist_lock, flags);
 	for (mp = module_list; mp ; mp = mp->next) {
-		if (!mp->ex_table_start)
+		if (!mp->ex_table_start || !(mp->flags&(MOD_RUNNING|MOD_INITIALIZING)))
 			continue;
 		ret = search_one_table(mp->ex_table_start,
 				       mp->ex_table_end - 1, addr - exc_gp);
-		if (ret) return ret;
+		if (ret)
+			break;
 	}
+	spin_unlock_irqrestore(&modlist_lock, flags);
+	if (ret) return ret;
 #endif
 
 	/*

Also note that no 2.4 kernel will ever run stable on an alpha if
compiled with 2.95.*; you _must_ use egcs 1.1.2 or the very latest 2.96
with two hundred patches if you want to have a chance to run a 2.4
kernel stable on an alpha (NOTE: only on the alpha; x86 and other
architectures are a completely different matter). So I have to reject any
(runtime) bug report of 2.4 alpha kernels compiled with any 2.95.*, sorry.

Andrea



Re: 2.4.4: Kernel crash, possibly tcp related

2001-05-01 Thread Andrea Arcangeli

On Tue, May 01, 2001 at 09:25:43PM +0400, [EMAIL PROTECTED] wrote:
> Hello!
> 
> > zero and we are running in such slow path, it is obvious the send_head
> > _was_ NULL when we entered the critical section, so it's perfectly fine
> 
> It is not only not obvious, it is not true almost always.
> On normally working tcp send_head is almost never NULL,
> it is NULL only when application is so slow that is not able
> to saturate pipe. If you do not believe my word, add printk checking this. 8)

Note: I said: ".. if send_head points to skb and skb->len is
                  ^^
zero and we are running in such slow path ..".

If send_head doesn't point to skb then it is before it (and it cannot
advance under us of course because we hold the sock lock), so in that
case we didn't clobber send_head at all in skb_entail, and we
don't need to touch send_head in order to undo (we only need to unlink).

See?
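
The argument can be written down as a toy model. This is a sketch of the
reasoning as stated here, not of the real tcp code: the queue is reduced to
a length plus an index, -1 stands for a NULL send_head, all toy_* names are
invented, and the model assumes that entailing a new segment sets send_head
only when it was NULL.

```c
/*
 * Toy write queue: segments are numbered 0..len-1, send_head is the index
 * of the first unsent segment, or -1 when everything has been sent.
 */
struct toy_queue {
	int len;
	int send_head;
};

/* Append a new, still empty segment at the tail; returns its index. */
int toy_entail(struct toy_queue *q)
{
	int idx = q->len++;

	if (q->send_head < 0)	/* nothing pending: new segment is the head */
		q->send_head = idx;
	return idx;
}

/* Undo the entail for a segment that remained empty. */
void toy_undo(struct toy_queue *q, int idx)
{
	if (q->send_head == idx)	/* only possible if it was "NULL" before */
		q->send_head = -1;
	q->len--;			/* unlink the tail segment */
}
```

In both cases (send_head "NULL" before the entail, or pointing at an earlier
segment) the pair toy_entail/toy_undo restores the state it started from.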

Andrea



Re: 2.4.4: Kernel crash, possibly tcp related

2001-05-01 Thread Andrea Arcangeli

On Tue, May 01, 2001 at 08:44:52PM +0400, [EMAIL PROTECTED] wrote:
> Hello!
> 
> > this is the strict fix:
> 
> Andrea, you caught the problem!
> 
> The fix is not right though (it is equivalent to straight
> tp->send_head=NULL, as you noticed. It also corrupts queue in
> an opposite manner.) Right fix is appended.
> 
> Explanation: in do_fault we must undo effect of enqueueing new segment
> in the case the segment remained empty. tp->send_head points to
> the first unsent skb in queue and it is NULL when and only when
> all the skbs are already sent. (Invariant is: tp->send_head==NULL ||
> tp->send_head->seq == tp->snd_nxt)
> I crapped this case except for the case when queue is completely empty,
> so that the last sent skb was accounted in packets_out twice...

I understand the explanation but I don't think my patch is wrong; I
think it's simpler and faster instead.

My argument is very simple, if send_head points to skb and skb->len is
zero and we are running in such slow path, it is obvious the send_head
_was_ NULL when we entered the critical section, so it's perfectly fine
to set send_head back to null and to unlink the skb as the only actions
to undo the skb_entail. That's all. I don't see how my patch can fail.
If I'm missing something I'd love a further explanation indeed. Thanks!

> 
> Damn, what a silly mistake was it... shame.
> 
> Alexey
> 
> 
> --- ../vger3-010426/linux/net/ipv4/tcp.c  Wed Apr 25 21:02:18 2001
> +++ linux/net/ipv4/tcp.c  Tue May  1 20:38:44 2001
> @@ -1185,7 +1187,7 @@
>   if (skb->len==0) {
>   if (tp->send_head == skb) {
>   tp->send_head = skb->prev;
> - if (tp->send_head == (struct sk_buff*)&sk->write_queue)
> + if (TCP_SKB_CB(skb)->seq == tp->snd_nxt)
>   tp->send_head = NULL;
>   }
>   __skb_unlink(skb, skb->list);


Andrea



Re: [patch] 2.4.4 alpha semaphores optimization

2001-05-03 Thread Andrea Arcangeli

On Thu, May 03, 2001 at 07:47:47PM +0400, Ivan Kokshaysky wrote:
> Initially I tried to use __builtin_expect in the rwsem.h, but found
> that it doesn't help at all in the small inline functions - it works
> as expected only in a reasonably large block of code. Converting these
> functions into the macros won't help, as callers are inline
> functions also.
> On the other hand, gcc 3.0 generates quite a good code for
> conditional branches (comparisons like value < 0, value == 0
> predicted as false etc.). In the cases where expected value is 0,
> we can use cmpeq instruction.
> Other changes:
>  - added atomic_add_return_prev() for __down_write()
>  - removed some mb's for non-SMP
>  - removed non-inline up()/down_xx() when semaphore/waitqueue debugging
>isn't enabled.

I'd love it if you could port it on top of this one and fix it so that
it can handle up to 2^32 sleepers, and not only 2^16 as we have to do
on the 32bit archs to get good performance:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.4aa3/00_rwsem-11

I just wrote the prototype, it only needs to be implemented; see
linux/include/asm-alpha/rwsem_xchgadd.h:

--
#ifndef _ALPHA_RWSEM_XCHGADD_H
#define _ALPHA_RWSEM_XCHGADD_H

/* WRITEME */

static inline void __down_read(struct rw_semaphore *sem)
{
}

static inline void __down_write(struct rw_semaphore *sem)
{
}

static inline void __up_read(struct rw_semaphore *sem)
{
}

static inline void __up_write(struct rw_semaphore *sem)
{
}

static inline long rwsem_xchgadd(long value, long * count)
{
	return value;
}

#endif
--

You only need to fill in the above 5 inlined fast paths to make it work,
and that's the only thing in the whole alpha tree about the rwsem.
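
As a rough idea of what the last stub is expected to do, here is a userspace
sketch. It assumes rwsem_xchgadd() should behave like the x86 "lock xadd",
that is, atomically add value to *count and return the previous content of
*count. The __atomic_fetch_add builtin of GCC/Clang is used only for
illustration; a real asm-alpha version would be written by hand.

```c
/*
 * Userspace sketch, not the alpha implementation: atomically add `value`
 * to `*count` and return what `*count` contained before the addition
 * (assumed semantics, modelled on x86 xadd).
 */
static inline long rwsem_xchgadd(long value, long *count)
{
	return __atomic_fetch_add(count, value, __ATOMIC_SEQ_CST);
}
```

With count starting at 5, rwsem_xchgadd(3, &count) would return 5 and leave
count at 8.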

The above patch also provides the fastest write fast path for x86 archs
and the fastest spinlock-based rwsem. I haven't re-benchmarked the
whole thing yet, but my up_write definitely has to be faster than
the one in 2.4.4 vanilla and the other fast paths have to be the same
speed.

Andrea



Re: 2.4.4+fork patch still sluggish

2001-05-03 Thread Andrea Arcangeli

On Thu, May 03, 2001 at 06:38:27PM -0700, Jeffrey Kuskin wrote:
> This is basically a followup to the "2.4.4 sluggish under fork load"
> thread.
> 
> I am using Redhat 7.1 on a 128MB 400 MHz PII system.  I have a
> locally-built 2.4.4 kernel to which I manually applied the patch that backs
> out the child-before-parent behavior on a fork.  Namely, this patch:
> 
>   
> 
> However, even with this patch applied, I still see extremley jerky mouse
> pointer behavior when I run any kind of job that does lots of forking.  For
> example, a kernel compile or even just the "configure" in preparation for
> compiling XEmacs.
> 
> The same behavior, on exactly the same machine, did _not_ occur with Redhat
> 6.2/kernel 2.2.19.
> 
> I see that this patch has recently been merged into 2.4.5-pre1, but I am
> concerned that it does actually fix the underlying problem.
> 
> Do others continue to see "jerky mouse pointer" behavior even with this
> patch installed, or should I look for other causes?  For instance, are
> there known problems with jerky mouse pointer behavior under heavy swapping
> load?

That's a bug in the get-child-timeslice logic that I mentioned a few
days ago.

The interesting strict fixes for this issue are here (they won't apply
cleanly to 2.4.5pre1 but fixing the rejects is trivial):


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa1/10_parent-timeslice-6

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa1/20_share-timeslice-2

If you can reproduce on 2.4.5pre1aa1 let us know. Thanks!

Andrea



Re: [patch] 2.4.4 alpha semaphores optimization

2001-05-04 Thread Andrea Arcangeli

On Fri, May 04, 2001 at 01:15:28PM +0400, Ivan Kokshaysky wrote:
> However, there are 3 reasons why I prefer 16-bit counters:

I assume you mean a 32bit counter (that gives max 2^16 sleepers).

> a. "max user processes" ulimit is much lower than 64K anyway;

The 2^16 limit is not a per-user limit, it is a global one, so the max
user processes ulimit is irrelevant.

Only the number of pids and the max number of tasks supported by the
architecture are relevant limits for this.

> b. "long" count would cost extra 8 bytes in the struct rw_semaphore;

Correct, but that's the "feature" needed to support 2^32 concurrent
sleepers at no relevant runtime cost 8).
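
The arithmetic behind the two limits can be sketched like this. The helper
name is hypothetical, and it assumes (as the numbers in this thread suggest)
that the counter is split in two equal halves with one half left for
counting sleepers.

```c
#include <stdint.h>

/*
 * Assumption for illustration: half of the counter bits count sleepers,
 * so an N-bit counter can track up to 2^(N/2) of them.
 */
static uint64_t max_sleepers(unsigned int counter_bits)
{
	return UINT64_C(1) << (counter_bits / 2);
}
```

So a 32bit int counter gives 2^16 sleepers, and a 64bit long on alpha gives
2^32 at the price of the extra bytes in struct rw_semaphore.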

> c. I can use existing atomic routines which deal with ints.

I was thinking of a dedicated routine that implements the slow path by
hand as well, like x86 already does. Then using ldq instead of ldl isn't
really a big deal programmer-wise.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-04 Thread Andrea Arcangeli

On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> Or you can rewrite block_read/write to use the page cache, in which case
> you'd have more luck doing the above.

Once block_dev is in pagecache there will obviously be no way to share
cache between the block device and the filesystem, because all the
caches will be in completely different address spaces.

Andrea



Re: [patch] 2.4.4 alpha semaphores optimization

2001-05-04 Thread Andrea Arcangeli

On Fri, May 04, 2001 at 09:02:33PM +0400, Ivan Kokshaysky wrote:
> But I can't imagine how this "feature" could be useful in a real life :-)

It will be required by the time we can fork more than 2^16 tasks (I'm
wondering whether that could already be the case if you use CLONE_PID as
root; I haven't checked the code yet to be sure).

> You meant "the fast path", I guess? Then it's true. However with those

Yes, I guess the slow path is quite painful to maintain, however I'd add
at least the __builtin_expect() so it gets optimized by 2.96 and 3.[01].
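
For reference, this is the kind of annotation meant here, as a small
userspace sketch. __builtin_expect is the real GCC builtin; everything named
toy_* is made up for the example and the counting scheme is deliberately
simplified, it is not the rwsem algorithm.

```c
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Counts how often the (hypothetical) contended path was taken. */
static long toy_slow_calls;

/* Out of line, so the inlined fast path stays as small as possible. */
static __attribute__((noinline)) void toy_down_read_failed(long *count)
{
	(void)count;
	toy_slow_calls++;
}

/* Toy reader fast path: a negative old value stands for "writer around". */
static inline void toy_down_read(long *count)
{
	long old = __atomic_fetch_add(count, 1, __ATOMIC_SEQ_CST);

	if (unlikely(old < 0))
		toy_down_read_failed(count);
}
```

The hint only tells the compiler which side of the branch to lay out as the
fall-through path; the logic is the same without it.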

> atomic functions code is much more readable.

Your attached code is nice enough IMHO ;).

> Anyway, I've attached asm-alpha/rwsem_xchgadd.h for your implementation.

Sweet, thanks.

> However I got processes in D state early on boot with it -- maybe
> I've made a typo somewhere...

It has to be a bug in a non-contention case then, or maybe you run some
threaded app during boot?  Note that my version is a bit different from
David's: my fast path has fewer requirements in up_write and so it
can be implemented with fewer instructions. I will check and integrate
your code into my patch soon, thanks. If you find the bug in the meantime let
me know (to beat on it hard you can use my userspace threaded app that
faults and does mmap/munmap in a loop from dozens of threads).

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sat, May 05, 2001 at 03:18:08PM +1200, Chris Wedgwood wrote:
> On Fri, May 04, 2001 at 05:29:40PM +0200, Andrea Arcangeli wrote:
> 
> once block_dev is in pagecache there will obviously be no-way to
> share cache between the block device and the filesystem, because
> all the caches will be in completly different address spaces.
> 
> Once we are at this point... will there be any use in having block
> devices? FreeBSD appears to have done without them completely about a

Moving block_dev into the pagecache won't change anything from the userspace
point of view, it's a transparent change (if we ignore the total loss of
cache coherency between block_dev and fs metadata that it implies, but
as Linus said such loss of coherency will happen anyway eventually
because metadata will go into its own address space too). Basically there
will still be a use for the block devices as long as there are fsck and
other userspace applications that want to use them.

Andrea SYNAPSE (very amusing movie ;)



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 02:14:37PM +1200, Chris Wedgwood wrote:
> You don't need block device for fsck, in fact some OS require you use
> character devices (e.g. Solaris).

Moving e2fsck into the kernel is a completely different matter from
caching the blockdevice accesses with pagecache instead of buffercache.

And even if you move e2fsck or reiserfsck into the kernel (you could
technically do that right now regardless of where the block_dev cache
lives) there will still be parted that wants to mmap the blockdevice to
get rid of part of the fat32 partition (right now it uses read/write of
course because buffer cache cannot be mapped in userspace), there will
still be mtools, non self-caching dbms, od [...]

> I'm not saying we don't need block devices, but I really don't see
> much of a use for them once everything in in the page cache... I
> assume this is why others have got rid of them completely.

I have no idea why/if others got rid of them completely, but the fact that block_dev
is useful has nothing to do with whether it's in pagecache or in buffercache,
really. It's just that by doing it in pagecache you can mmap it as well,
it will provide better overall performance and it's probably a cleaner
design. The only visible change is that you will be able to mmap a
blockdevice as well.

About a kernel based fsck: Alexander told me he likes it. I personally
don't care about it that much because I believe there's not that much to
share at the source level. fsck and a real fs are quite different
problems, and what can be shared can be copied. By not sharing we get
the flexibility of not breaking fsck every time we change the kernel and,
more in general, the flexibility of doing it in userspace; sharing such
bytecode at runtime definitely doesn't matter.  It also partly depends
on the fs, but the current ext2 situation is really fine with me and I
wouldn't consider it a worthwhile project to move e2fsck into the kernel.

Andrea



Re: [PATCH] SMP race in ext2 - metadata corruption.

2001-05-05 Thread Andrea Arcangeli

On Sun, May 06, 2001 at 03:00:58PM +1200, Chris Wedgwood wrote:
> On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote:
> 
> Moving e2fsck into the kernel is a completly different matter
> than caching the blockdevice accesses with pagecache instead of
> buffercache.
> 
> No, I was takling about user space fsck using character devices.

I misread your previous email, sorry. I think you meant to fsck using
rawio (not to move fsck into the kernel). You can do that right now, but
to get decent performance fsck would then have to do self caching, and changing
fsck to do self caching doesn't sound worthwhile either. Note also that
rawio has nothing to do with the pagecache.  In fact both rawio and
O_DIRECT bypass the pagecache entirely, along with its smp locks for example.

> I'm not claiming it is... what I'm asking is _why_ do we need block
> devices once 'everything' lives in the page cache?

Where the cache of the blockdevice lives is a problem completely orthogonal
to "why cached blockdevices are useful", which I addressed in
the previous email.

> It's just that by doing it in pagecache you can mmap it as well
> and it will provide overall better performance and it's probably
> cleaner design. The only visible change is that you will be able
> to mmap a blockdevice as well.
> 
> Why? What needs to mmap a block device? Since these are typically
> larger than that you can mmap into a 32-bit address space (yes, I'm
> ignoring the 5% or so of cases where this isn't true) I'm not aware
> on many applications that do it.

Last time I talked with the parted maintainer he was asking for that
feature so that parted won't need to do its own anti-oom management in
userspace, so it can simply mmap(MAP_SHARED) a quite large region of
metadata of the blockdevice, read/write to the mmapped region, and the
kernel will take care of paging when it runs low on memory. Right
now it allocates the metadata in anonymous memory and loads it via
read(). This memory will need to be swapped out if the working set
doesn't fit in ram (and swap may not be available ;).
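
A minimal sketch of that usage pattern, with standard POSIX calls only.
patch_byte_shared() is a made-up helper, and the example is meant to be run
against a regular file; using it on a block device assumes the blkdev cache
lives in the pagecache so that it can be mapped at all.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map the first `len` bytes of `path` with MAP_SHARED, set the byte at
 * `pos` and flush the change. Returns 0 on success, -1 on error.
 */
static int patch_byte_shared(const char *path, size_t len, size_t pos,
			     unsigned char value)
{
	unsigned char *map;
	int fd, ret;

	if (pos >= len)
		return -1;
	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping stays valid after the descriptor is closed */
	if (map == MAP_FAILED)
		return -1;
	map[pos] = value; /* dirties a file-backed page, no anonymous memory */
	ret = msync(map, len, MS_SYNC);
	munmap(map, len);
	return ret;
}
```

The point of the design is that dirty pages of a shared file mapping can be
written back to the file itself under memory pressure, so no swap is needed
for them.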

> As I said, I'm not takling about kernel based fsck, although for
> _VERY_ large filesystems even with journalling I suspect it will be
> required one day (so it can run in the background and do consistency
> checking when the machine is idle).

Being able to fsck a live filesystem is yet another exotic feature and
yes for that you would certainly need some additional kernel support.

Andrea



Re: [BUG] freeze Alpha ES40 SMP 2.4.4.ac3, another TCP/IP Problem ? ( was 2.4.4 kernel crash , possibly tcp related )

2001-05-03 Thread Andrea Arcangeli

On Thu, May 03, 2001 at 06:16:02PM +0200, Cabaniols, Sebastien wrote:
> The only thing that does not work under load is the network TCP/IP ?

My alpha is running 2.4.4aa3 under very high load (apache beaten from ab
in loop via 100mbit switched network [tulip on the alpha] plus cerberus)
and I haven't had any problem so far (it only deadlocked with OOM after
one day of tux [instead of apache] + cerberus regression testing,
but that's only because of a memleak in tux that I reproduced on x86 too,
it seems).

I'm going to release soon a 2.4.5pre1aa1 that will compile with modules
as well. The only annoying thing is that UP kernel compiles seems not to
boot but I hope that will be fixed soon too.

So I doubt the problem is the tcp stack, it may not be the driver but it
shouldn't be a generic bug in vanilla 2.4.4 at least.

Andrea



Re: [BUG] freeze Alpha ES40 SMP 2.4.4.ac3, another TCP/IP Problem ? ( was 2.4.4 kernel crash , possibly tcp related )

2001-05-03 Thread Andrea Arcangeli

On Thu, May 03, 2001 at 06:46:10PM +0200, Andrea Arcangeli wrote:
> as well. The only annoying thing is that UP kernel compiles seems not to
> boot but I hope that will be fixed soon too.

Ok, I spotted and fixed the bug that prevented my tree from booting with UP
compiles on alpha. The bug is that the SCHED_YIELD handling was broken
on alpha UP; this is the fix:

--- 2.4.5pre1aa1/arch/alpha/kernel/entry.S.~1~  Thu May  3 18:22:13 2001
+++ 2.4.5pre1aa1/arch/alpha/kernel/entry.S  Thu May  3 19:18:16 2001
@@ -709,16 +709,14 @@
br  restore_all
 .end entSys
 
-#ifdef CONFIG_SMP
-.globl  ret_from_smp_fork
+.globl  ret_from_fork
 .align 3
-.ent ret_from_smp_fork
-ret_from_smp_fork:
+.ent ret_from_fork
+ret_from_fork:
lda $26,ret_from_sys_call
mov $17,$16
jsr $31,schedule_tail
-.end ret_from_smp_fork
-#endif /* CONFIG_SMP */
+.end ret_from_fork
 
 .align 3
 .ent reschedule
--- 2.4.5pre1aa1/arch/alpha/kernel/process.c.~1~Thu May  3 18:22:09 2001
+++ 2.4.5pre1aa1/arch/alpha/kernel/process.cThu May  3 19:15:41 2001
@@ -306,7 +306,7 @@
struct task_struct * p, struct pt_regs * regs)
 {
extern void ret_from_sys_call(void);
-   extern void ret_from_smp_fork(void);
+   extern void ret_from_fork(void);
 
struct pt_regs * childregs;
struct switch_stack * childstack, *stack;
@@ -325,11 +325,7 @@
stack = ((struct switch_stack *) regs) - 1;
childstack = ((struct switch_stack *) childregs) - 1;
*childstack = *stack;
-#ifdef CONFIG_SMP
-   childstack->r26 = (unsigned long) ret_from_smp_fork;
-#else
-   childstack->r26 = (unsigned long) ret_from_sys_call;
-#endif
+   childstack->r26 = (unsigned long) ret_from_fork;
p->thread.usp = usp;
p->thread.ksp = (unsigned long) childstack;
p->thread.pal_flags = 1;/* set FEN, clear everything else */


(SCHED_YIELD of the previous task is cleared by __schedule_tail; on UP it
wasn't being cleared after a fork, so a non-running task was left with
SCHED_YIELD set and it was deadlocking. This can explain many malfunctions
of UP alpha kernels.) I never noticed so far because I always compiled it SMP.

Andrea



nfs MAP_SHARED corruption fix

2001-05-08 Thread Andrea Arcangeli

This fixes corruption with MAP_SHARED on top of nfs filesystem in 2.4:

--- 2.4.5pre1aa2/fs/nfs/write.c.~1~ Tue May  1 19:35:29 2001
+++ 2.4.5pre1aa2/fs/nfs/write.c Tue May  8 02:04:15 2001
@@ -1533,6 +1533,7 @@
if (!inode && file)
inode = file->f_dentry->d_inode;
 
+   filemap_fdatasync(inode->i_mapping);
do {
error = 0;
if (wait)

Andrea



blkdev in pagecache

2001-05-08 Thread Andrea Arcangeli

Last night I moved the blkdev layer into the pagecache with this patch:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/blkdev-pagecache-1

It is incremental and depends on the o_direct functionality, latest
o_direct patch against 2.4.5pre1 is here:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/o_direct-5

The main reason I moved the blkdev into the pagecache is that the current
blkdev provides horrible performance with fast I/O subsystems capable of
over 50mbyte/sec, which I had already doubled with a simple hack that you
can see here if you're curious:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa2/00_4k_block_dev-1

(btw, also the current rawio uses a 512byte bh->b_size granularity that is even
worse than the 1024byte b_size of the blkdev, O_DIRECT is much smarter
on this side as it uses the softblocksize of the fs that can be as well
4k if you created the fs with -b 4096)

However, after running this 4k_block_dev-1 hack on some more machines I
noticed the blkdev layer wasn't able anymore to update the superblock of
1k ext2 filesystems, and to make it "usable" in real life I needed to fix
it. But I didn't want to invest any further time in such a hack and I
preferred to move the blkdev into the pagecache and to fix the problem on top
of the new better design (moving blkdev into the pagecache of course
introduces that same problem too, as I also mention in one of the
points below).

I'll describe here some of the details of the blkdev-pagecache-1 patch:

- /dev/raw* and drivers/char/raw.c get obsoleted and replaced by
  opening the blkdevice with O_DIRECT. It looks much saner and I
  basically get it for free by just implementing the 10 lines of the
  blkdev_direct_IO callback. Of course I didn't remove the /dev/raw*
  API, for compatibility.

  While testing O_DIRECT I destroyed the first 50mbyte of the root
  partition so I will need to wait for the test box to come back alive before I
  can do further testing ;). But I already fixed the bug that caused the
  corruption before uploading the patch so I don't expect further
  problems (it was only a s/i_dev/i_rdev thing) because the regression
  testing was working well even if it was writing to the wrong disk ;).

- I force the virtual blocksize for all the blkdev I/O
  (buffered and direct) to work with a 4096 bytes granularity instead of
  the current 1024 softblocksize because we need that for getting higher
  performance; 1024 is too low because it wastes too much ram and too
  much cpu. So a DBMS won't be able anymore to write 512 bytes to the
  disk using rawio while being sure it will be a single atomic block update.
  If you use /dev/raw nothing changes of course, only opening the blkdev
  with O_DIRECT enforces a minimal granularity of 4096 bytes in the I/O.
  I don't think this is a problem, and O_DIRECT through the fs was
  already using the fs softblocksize instead of the hardblocksize as the unit
  of the minimal direct-IO granularity.

- writes to the blockdevice won't end up in the buffer cache, so it will
  be impossible to update the superblock of an ext2 partition mounted ro
  for example, it must not be mounted at all to update the superblock. I
  will need to invent a hack to fix this problem or it will get too
  annoying. One way could be simply to change ext2 and have it check that
  the buffer is uptodate before marking it dirty again, but maybe
  we could also do it in a generic manner that fixes all the fs at once
  (OTOH probably not that many fs need to be fscked online...).

- mmap should be functional but it's totally untested.

- currently the last `harddisk_size & 4095' bytes (if any) won't be
  accessible via the blkdev, to avoid sending to the hardware requests
  beyond the end of the device. Not sure how/if to solve this. But this is
  definitely not a new issue, the same thing happens today in 2.2 and
  2.4 after you mount a 4k filesystem on a blockdevice. OTOH I'm scared
  a mke2fs -b 1024 could get confused. But I really don't want to
  decrease the b_size of the buffer header even if we fix this.

- to share all the filemap.c code and not to change too much stuff in
  the first patch I added some ISBLK checks in fast paths, basically
  only to check against blk_size instead of inode->i_size. I also
  considered changing the i_size semantics for the blkdev inodes but
  I didn't want to break all the fs yet so I took the localized
  slower way for now (I doubt it is noticeable in the benchmarks
  but nevertheless it would be nice to optimize away those branches).

- once the blkdev is closed in the block_close callback I
  filemap_fdatasync;fsync_dev;filemap_fdatawait;invalidate_inode_pages2
  (fdatawait seems not necessary but it won't hurt). I'm not calling
  truncate_inode_pages because those pages could be still mapped
  (->release is called when f_count goes down to zero, not when
  i_count reaches zero). I'd like to defer the invalidate_inode

Re: nfs MAP_SHARED corruption fix

2001-05-08 Thread Andrea Arcangeli

On Tue, May 08, 2001 at 05:21:02PM +0200, Trond Myklebust wrote:
> AFAICs this fix will clearly deadlock...

Yeah, it didn't trigger because it probably needs the same page to be
writepaged and in the dirty list at the same time. I hooked it very deep
into the writeback logic to keep it generic (it wasn't going to add a
significant overhead) but it didn't need to be _that_ deep.

Even worse, I think it was partly wrong because it was only in the
close(2) path but not in the fput path, which is the one walked by munmap.

This looks better to me, what do you think?

diff -urN ref/fs/nfs/file.c nfs-corruption/fs/nfs/file.c
--- ref/fs/nfs/file.c   Thu Feb 22 03:45:10 2001
+++ nfs-corruption/fs/nfs/file.cTue May  8 19:11:57 2001
@@ -39,6 +39,7 @@
 static ssize_t nfs_file_write(struct file *, const char *, size_t, loff_t *);
 static int  nfs_file_flush(struct file *);
 static int  nfs_fsync(struct file *, struct dentry *dentry, int datasync);
+static void nfs_file_close_vma(struct vm_area_struct *);
 
 struct file_operations nfs_file_operations = {
read:   nfs_file_read,
@@ -57,6 +58,11 @@
setattr:nfs_notify_change,
 };
 
+static struct vm_operations_struct nfs_file_vm_ops = {
+   nopage: filemap_nopage,
+   close:  nfs_file_close_vma,
+};
+
 /* Hack for future NFS swap support */
 #ifndef IS_SWAPFILE
 # define IS_SWAPFILE(inode)(0)
@@ -104,6 +110,20 @@
return result;
 }
 
+static void nfs_file_close_vma(struct vm_area_struct * vma)
+{
+   struct inode * inode;
+
+   inode = vma->vm_file->f_dentry->d_inode;
+
+   if (inode->i_state & I_DIRTY_PAGES) {
+   filemap_fdatasync(inode->i_mapping);
+   lock_kernel();
+   nfs_wb_file(inode, vma->vm_file);
+   unlock_kernel();
+   }
+}
+
 static int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -115,8 +135,11 @@
dentry->d_parent->d_name.name, dentry->d_name.name);
 
status = nfs_revalidate_inode(NFS_SERVER(inode), inode);
-   if (!status)
+   if (!status) {
status = generic_file_mmap(file, vma);
+   if (!status)
+   vma->vm_ops = &nfs_file_vm_ops;
+   }
return status;
 }
 

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> >   (buffered and direct) to work with a 4096 bytes granularity instead of
> 
> You mean PAGE_SIZE :-).

In my first patch it is really 4096 bytes, but yes I agree we should
change that to PAGE_CACHE_SIZE. The _only_ reason it's fixed at 4096 bytes is that
I wasn't sure all the device drivers out there can digest a bh->b_size of
8k/32k/64k (for the non x86 archs) and I checked that the minimal PAGE_SIZE
supported by linux is 4k. If Jens says I can submit a 64k b_size without
any problem for all the relevant blkdevices then I will change that in a
jiffy ;). Anyway changing that is truly easy, just define
BUFFERED_BLOCKSIZE to PAGE_CACHE_SIZE instead of 4096 (plus the .._BITS as
well) and it should do the trick automatically. So for now I only cared
to make it easy to change that.

> Exactly, please see my former explanation... BTW.> If you are gogin into
> the range of PAGE_SIZE, it may be very well possible to remove the
> whole page assoociated mechanisms of a buffer_head?

It wouldn't be that trivial to drop it, not much different from dropping
it when a fs has a 4k blocksize. I think the dynamic allocation of the
bh is not that bad a thing, or at least it's an orthogonal problem to
moving the blkdev into the pagecache ;).

> Basically this is something which should come down to the strategy
> routine
> of the corresponding device and be fixed there... And then we have this

So you mean the device driver should make sure blk_size is PAGE_CACHE_SIZE
aligned and take care of writing zeros in the pagecache beyond the end
of the device? That would be fine by me but I'm not yet sure
that's the cleanest way to handle that.

> Some notes about the code:
> 
>   kdev_t dev = inode->i_rdev;
> - struct buffer_head * bh, *bufferlist[NBUF];
> - register char * p;
> + int err;
>  
> - if (is_read_only(dev))
> - return -EPERM;
> + err = -EIO;
> + if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
> (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
>^
> 
> blk_size[MAJOR(dev)] can very well be equal NULL! In this case one is
> supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
> Are you shure it's guaranteed here to be already preset?
> 
> Same question goes for calc_end_index and calc_rsize.

That's a bug indeed (a minor one at least, because all the relevant
blkdevices initialize such array and if it's not initialized you notice
before you can do any damage ;), thanks for pointing it out!
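
A userspace sketch of the missing check, for illustration. The toy_* names
and the table size are made up; the two _BITS values follow the numbers used
in this thread (blk_size in 1k units, 4096 bytes virtual blocksize), and an
unset blk_size[major] is treated as INT_MAX as Martin describes.

```c
#include <limits.h>
#include <stddef.h>

#define BLOCK_SIZE_BITS         10 /* blk_size[][] counts 1k units */
#define BUFFERED_BLOCKSIZE_BITS 12 /* 4096 bytes virtual blocksize */
#define TOY_MAX_MAJOR           256

/* Toy table standing in for the kernel's blk_size[major][minor]. */
static int *toy_blk_size[TOY_MAX_MAJOR];

/* Number of whole 4k blocks of the device, with the NULL case handled. */
static int toy_max_block(unsigned int major, unsigned int minor)
{
	int size_kb = INT_MAX; /* unknown size: assume "as large as possible" */

	if (major < TOY_MAX_MAJOR && toy_blk_size[major] != NULL)
		size_kb = toy_blk_size[major][minor]; /* minor is trusted here */
	return size_kb >> (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS);
}

/* Bounds check in the style of the quoted hunk: 0 if ok, -1 where it
 * would fail with -EIO. */
static int toy_check_block(unsigned int major, unsigned int minor, long iblock)
{
	return iblock >= toy_max_block(major, minor) ? -1 : 0;
}
```

Note how a device of 10k ends up with only 2 addressable blocks: the shift
drops the trailing 2k, which is the `harddisk_size & 4095' issue mentioned in
the patch description.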

Andrea



Re: nfs MAP_SHARED corruption fix

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 09:30:18AM +0200, Trond Myklebust wrote:
> Here therefore is Andrea's patch with the changes I propose. Opinions?

Flushing the dirty pages before locks is probably not required: a dirty
page in the dirty_pages list is no different from a mapped page that is not in
the dirty_pages list and only has the pte marked dirty, and we cannot flush
the pages with only the pte marked dirty before the locks. But flushing
the dirty_pages list won't hurt, so overall it looks ok to me.

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 04:14:52PM +0200, Jens Axboe wrote:
> better to stay with PAGE_CACHE_SIZE and not get into dark country :-)

My whole point for not using PAGE_CACHE_SIZE as virtual blocksize is
that many architectures have a PAGE_CACHE_SIZE > 4k, up to 64k; on
x86-64 we may even hack a 2M PAGE_SIZE/PAGE_CACHE_SIZE mode for the
multi giga boxes. I think you agreed I'd better stay with a fixed virtual
blocksize of 4k for now.

Andrea



Re: [BUG] memory mngt > 2 Gbytes and DMA for the Alpha? pci_iommu.c

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 04:12:41PM +0200, Cabaniols, Sebastien wrote:
> Hi lkml,
> 
> There is likely a bug in the management of memory above two Gigabytes and
> DMA in kernel 2.4.4
> (up to ac-6) with the alpha. :-(

Remember why last year I was advocating a CONFIG_HIGHMEM option also in
2.4 and not only in 2.2? If we had that now I would tell you "set
HIGHMEM to y until we fix it" and you could use more, up to terabytes of
ram, in the meantime.

> When I boot the system with mem=2048M, everything is back... network,
> storage...

Can you try to set DEBUG_NODIRECT to 1 in pci_iommu.c and then boot
with mem=2048M? If you can reproduce, I should be able to reproduce too.

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 05:03:06PM +0200, Reto Baettig wrote:
> Jeff Garzik schrieb:
> > 
> > Martin Dalecki wrote:
> > > > - I force the virtual blocksize for all the blkdev I/O
> > > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> > >
> > > You mean PAGE_SIZE :-).
> 
> Or maybe 8192 bytes on alphas ?!? ;-)

Again, see my argument with Jens: if we make it 8k we risk triggering
lowlevel driver assumptions about b_size being <= 4k. At least on my
alpha the fs has a 4k blocksize and I think I have never tested a
b_size of 8k myself yet, so I didn't want to put too many unknown
variables into the first equation ;).

Andrea



Re: nfs MAP_SHARED corruption fix

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 07:02:16PM -0300, Marcelo Tosatti wrote:
> Why don't you clean I_DIRTY_PAGES ? 

we don't have visibility on the inode_lock from there, we could make a
function in fs/inode.c or export the inode_lock to do that, but the flag
will be collected when the inode is released anyway, and it's only a
hint to make the common case fast (the common case is when nobody ever
did a MAP_SHARED on the inode). Other places like msync/fsync don't even
check for such a bit, they fdatasync/fdatawait unconditionally. And along
the same lines sys_fsync and sys_msync could also clear
I_DIRTY_PAGES but they don't for simplicity (it will be cleared by
kupdate later).

So in short we can clear it but it's not required and it won't make much
difference. If you really care you can clear it before calling fdatasync
though.

Andrea



Re: nfs MAP_SHARED corruption fix

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 07:38:01PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Thu, 10 May 2001, Andrea Arcangeli wrote:
> 
> > On Wed, May 09, 2001 at 07:02:16PM -0300, Marcelo Tosatti wrote:
> > > Why don't you clean I_DIRTY_PAGES ? 
> > 
> > we don't have visibilty on the inode_lock from there, we could make a
> > function in fs/inode.c or export the inode_lock to do that, but the flag
> > will be collected when the inode is released anyways, and it's only an
> > hint to make the common case fast (the common case is when nobody ever
> > did a MAP_SHARED on the inode). Other places msync/fsync doesn't even
> > check for such bit but they fdatasync/fdatawait unconditionally. 
> 
> Actually msync/fsync _can't_ rely on this bit because there is no
> guarantee that data is fully synced on disk even if the bit is cleared.
> (__sync_one (fs/inode.c) clears the bit _before_ starting the writeout,
> and thats it).

correct sorry, fsync/msync cannot check that bit of course.

> You have the same problem with your code, so I guess its better to just
> remove the I_DIRTY_PAGES check. 

The point you have to clarify before claiming we should remove the check
is whether the munmap flush needs to be synchronous or not. In general
munmap doesn't need to be synchronous. If you want to commit the writes, an
explicit msync(MS_SYNC) or fsync on the file is required.  Otherwise the
updates will hit the platter asynchronously in a reasonable, finite amount
of time. If somebody has just initiated the fdatasync, he will have to
finish before we can collect away the inode and in turn drop all its
cache, so those dirty pages cannot get lost in iput if somebody started
doing the flush under us either, and the guy doing the fdatasync under
us will have to wait synchronously for the stuff to be committed before
it can return.
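
From the userspace side, the rule above (munmap alone is asynchronous,
an explicit msync(MS_SYNC) or fsync commits the writes) could look like
the sketch below. The helper name and the path argument are made up for
illustration; only the POSIX calls themselves are real.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes of `data` to `path` through a MAP_SHARED mapping and
 * commit them explicitly before unmapping. Returns 0 on success, -1 on
 * error. The function name is illustrative, not from the thread.
 */
static int write_shared_and_commit(const char *path, const char *data,
                                   size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file must be at least as large as the range we dirty. */
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, data, len);   /* dirties the shared pages */

    /*
     * Do not rely on munmap() to wait for the writeback: MS_SYNC
     * returns only after the dirty pages have been written out.
     */
    int ret = msync(map, len, MS_SYNC);

    if (munmap(map, len) < 0)
        ret = -1;
    if (close(fd) < 0)
        ret = -1;
    return ret;
}
```

A caller that doesn't need the data committed at a known point can drop
the msync() and let the asynchronous writeback do the work.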

If some page wasn't yet visible in the dirty_pages list by the time
__sync_one started, we'll find I_DIRTY_PAGES set. This is enforced by
the locking order (sync_one first clears I_DIRTY_PAGES and then
starts browsing the dirty_pages list, while set_page_dirty first makes the
page visible and then marks the inode dirty).
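
The ordering argument can be checked with a toy model. This is only a
sketch: it treats each of the four steps as atomic, ignores the real
locks, and all the names in it are invented for the model, they are not
kernel code. It tries every interleaving of the two-step writer with the
two-step syncer and counts the schedules where the page ends up neither
flushed nor flagged.

```c
#include <stdbool.h>

struct inode_model {
    bool dirty_flag;    /* stands in for I_DIRTY_PAGES */
    bool page_on_list;  /* page visible on the dirty_pages list */
    bool page_flushed;  /* page was picked up by the syncer */
};

enum step { W_ADD_PAGE, W_SET_FLAG, S_CLEAR_FLAG, S_SCAN_LIST };

static void run_step(struct inode_model *ino, enum step s)
{
    switch (s) {
    case W_ADD_PAGE:
        ino->page_on_list = true;
        break;
    case W_SET_FLAG:
        ino->dirty_flag = true;
        break;
    case S_CLEAR_FLAG:
        ino->dirty_flag = false;
        break;
    case S_SCAN_LIST:
        if (ino->page_on_list) {
            ino->page_on_list = false;
            ino->page_flushed = true;
        }
        break;
    }
}

/* Count the interleavings in which the dirty page is lost. */
static int count_lost(const enum step writer[2], const enum step syncer[2])
{
    int lost = 0;

    /* Bit i of mask set: slot i of the 4-step schedule is a writer step. */
    for (unsigned mask = 0; mask < 16; mask++) {
        int bits = 0;
        for (int i = 0; i < 4; i++)
            bits += (mask >> i) & 1;
        if (bits != 2)
            continue;

        struct inode_model ino = { false, false, false };
        int w = 0, s = 0;
        for (int i = 0; i < 4; i++) {
            if ((mask >> i) & 1)
                run_step(&ino, writer[w++]);
            else
                run_step(&ino, syncer[s++]);
        }
        if (!ino.page_flushed && !ino.dirty_flag)
            lost++;
    }
    return lost;
}

/* Order described above: clear the flag, then scan the list. */
static int lost_with_described_order(void)
{
    const enum step writer[2] = { W_ADD_PAGE, W_SET_FLAG };
    const enum step syncer[2] = { S_CLEAR_FLAG, S_SCAN_LIST };
    return count_lost(writer, syncer);
}

/* Hypothetical broken syncer: scan the list first, clear the flag after. */
static int lost_with_reversed_syncer(void)
{
    const enum step writer[2] = { W_ADD_PAGE, W_SET_FLAG };
    const enum step syncer[2] = { S_SCAN_LIST, S_CLEAR_FLAG };
    return count_lost(writer, syncer);
}
```

In this model the described order never loses the page, while the
reversed syncer loses it in the schedule where the scan runs before the
page is added and the clear runs after the flag is set.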

So the I_DIRTY_PAGES check guarantees that those dirty pages cannot be
lost in iput; that was the _only_ objective of the patch and that is
certainly enough to fix the nfs data corruption reported.

Now if you claim that munmap needs to be synchronous for nfs that's a
completely different matter. I didn't even try to make it synchronous.
It is possible it has to be synchronous, even write(2) (in theory ;) has
to behave like O_SYNC with nfs, but I'm not sure.


Another thing (completely unrelated to the above issues) that I noticed
while looking over this nfs code is that the __sync_one() called for example
by generic_file_write(O_SYNC) will call fdatasync but no nfs_wb_all
is put before the fdatawait, and I'm not sure that the nfs_sync_page
called by the fdatawait is enough to rapidly flush the writepaged stuff
to the nfs server. nfs_sync_page apparently only cares about speculative
reads, not at all about committing writebacks. It would look much saner
to me if nfs_sync_page also did a nfs_wb_all() on the inode, so that
the ->sync_page callback gets the same semantics it has for the real
filesystems.

Comments?

Andrea



Re: Deadlock in 2.2 sock_alloc_send_skb?

2001-05-10 Thread Andrea Arcangeli

On Thu, May 10, 2001 at 07:30:47PM +0200, Andi Kleen wrote:
> On Thu, May 10, 2001 at 01:57:49PM +0100, Alan Cox wrote:
> > > If that happens, and the socket uses GFP_ATOMIC allocation, the while (1)
> > > loop in sock_alloc_send_skb() will endlessly spin, without ever calling
> > > schedule(), and all the time holding the kernel lock ...
> > 
> > If the socket is using GFP_ATOMIC allocation it should never loop. That is
> > -not-allowed-. 
> 
> It is just not clear why any socket should use GFP_ATOMIC. I can understand
> it using GFP_BUFFER e.g. for nbd, but GFP_ATOMIC seems to be rather radical
> and fragile.

side note, the only legal use of GFP_ATOMIC in sock_alloc_send_skb is
with noblock set and fallback zero. Remember that GFP_BUFFER will sleep: it
may not sleep in vanilla 2.2.19 but the necessary lowlatency hooks in
the memory balancing, which for example I have in my 2.2 tree, will make
it sleep.

The patch taken by itself looks perfect (I didn't check that the
callers are happy with a -ENOMEM error code though): if alloc_skb()
failed that's a plain -ENOMEM. The callers must _not_ try again, they
must take a recovery fail path instead.

Andrea



Re: Deadlock in 2.2 sock_alloc_send_skb?

2001-05-10 Thread Andrea Arcangeli

On Thu, May 10, 2001 at 11:17:17PM +0200, Andi Kleen wrote:
> On Thu, May 10, 2001 at 11:13:00PM +0200, Andrea Arcangeli wrote:
> > On Thu, May 10, 2001 at 07:30:47PM +0200, Andi Kleen wrote:
> > > On Thu, May 10, 2001 at 01:57:49PM +0100, Alan Cox wrote:
> > > > > If that happens, and the socket uses GFP_ATOMIC allocation, the while (1)
> > > > > loop in sock_alloc_send_skb() will endlessly spin, without ever calling
> > > > > schedule(), and all the time holding the kernel lock ...
> > > > 
> > > > If the socket is using GFP_ATOMIC allocation it should never loop. That is
> > > > -not-allowed-. 
> > > 
> > > It is just not clear why any socket should use GFP_ATOMIC. I can understand
> > > it using GFP_BUFFER e.g. for nbd, but GFP_ATOMIC seems to be rather radical
> > > and fragile.
> > 
> > side note, the only legal use of GFP_ATOMIC in sock_alloc_send_skb is
> > with noblock set and fallback zero, remeber GFP_BUFFER will sleep, it
> > may not sleep in vanilla 2.2.19 but the necessary lowlatency hooks in
> > the memory balancing that for example I have on my 2.2 tree will make it
> > to sleep.
> 
> Even with nonblock set the socket code will sleep in some circumstances
> (e.g. when aquiring the socket lock) so interrupt operation is out anyways.
> 
> 
> > The patch self contained looks perfect (I didn't checked that the
> > callers are happy with a -ENOMEM errorcode though), if alloc_skb()
> > failed that's a plain -ENOMEM. The caller must _not_ try again, they
> > must take a recovery fail path instead.
> 
> I think the callers are likely broken.
> The patch is still good of course, but not for GFP_ATOMIC. 

you said interrupts won't call that function so I don't see the
GFP_ATOMIC issue.

I also don't see what the issue with GFP_ATOMIC is, regardless of whether
somebody uses it or not: the patch itself has nothing to do with
GFP_ATOMIC. All gfpmasks can fail, and alloc_skb can fail regardless of the
gfpmask, not only with GFP_ATOMIC. Of course GFP_ATOMIC can fail even if
the machine is not totally out of memory, but you never know and you cannot
assume anything: when alloc_skb fails you must assume the machine is
totally out of memory or you will deadlock. So if alloc_skb fails we
must return -ENOMEM and take the fail path, and the patch does the right
thing in such a case as well.
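
As a userspace illustration of the rule (one attempt, then -ENOMEM and
the fail path, never a retry loop), something like the following could
do. It is not the 2.2 patch: the function names and the alloc_fn hook
are invented here so that the failure path can be exercised.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook, so a failing allocator can be plugged in. */
typedef void *(*alloc_fn)(size_t size);

/* Stand-in for an allocator under memory pressure. */
static void *always_fail(size_t size)
{
    (void)size;
    return NULL;
}

/*
 * One allocation attempt and no retry: on failure report -ENOMEM and
 * leave it to the caller to take its recovery path.
 */
static int alloc_send_buffer(alloc_fn alloc, size_t size, void **out)
{
    void *buf = alloc(size);

    if (!buf) {
        *out = NULL;
        return -ENOMEM;
    }
    *out = buf;
    return 0;
}
```

Passing malloc as the hook gives the success case, passing always_fail
gives the -ENOMEM case, whatever gfpmask-like policy the real allocator
would have used.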

Andrea



Re: LVM 1.0 release decision

2001-05-11 Thread Andrea Arcangeli

On Fri, May 11, 2001 at 03:32:46PM +0100, Alan Cox wrote:
> Please fix the binary incompatibility in the on disk format between the current
> code and your new release _before_ you do that. The last patches I was sent
> would have screwed every 64bit LVM user.

I just switched to the >=beta4 lvm IOP for all 64bit archs. The previous
one (the 2.4 mainline one) isn't feasible on the archs with 32bit
userspace and 64bit kernel (it uses long). The IOP didn't change btw,
only the structures changed silently.

> A new format is fine but import old ones properly. And if you do a new format

It's not a matter of the on-disk format, the on-disk format didn't
change of course; it's the ioctl format between userspace and kernel
that changed, and the userspace only knows about 1 format. Once the IOP
changes (or the IOP breaks silently as in this case) you have to upgrade
userspace with the current design.
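
To see why a `long` in an ioctl structure breaks with 32bit userspace
on a 64bit kernel, here is a small sketch. The structures are
hypothetical, they are not the real LVM ones; fixed-width types are used
to mimic what `long` becomes on each side.

```c
#include <stddef.h>
#include <stdint.h>

/* The argument as a 32bit userland lays it out: `long` is 4 bytes. */
struct req_long_ilp32 {
    int32_t id;
    int32_t size;   /* declared `long` in the shared header */
};

/* The same declaration as a 64bit kernel sees it: `long` is 8 bytes. */
struct req_long_lp64 {
    int32_t id;
    int64_t size;   /* declared `long` in the shared header */
};

/* Explicit widths and explicit padding: same layout on both sides. */
struct req_fixed {
    uint32_t id;
    uint32_t pad;
    uint64_t size;
};

/* Non-zero when the two views of the `long` version disagree in size. */
static int long_layouts_disagree(void)
{
    return sizeof(struct req_long_ilp32) != sizeof(struct req_long_lp64);
}
```

With a layout like req_fixed neither side needs a conversion wrapper;
with the `long` version either the kernel translates the structure or
userspace has to be rebuilt to match the kernel.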

Andrea



x86 bootmem corruption

2001-05-11 Thread Andrea Arcangeli

Bootmem allocations are executed before all the reserved memory has been
reserved.  This is the fix against 2.4.5pre1. This might explain weird
crashes and "reserved twice" error messages at boot on highmem systems.
I haven't yet had confirmation that this patch helps, but it is certainly
a necessary fix for correctness.

--- initmem/arch/i386/kernel/setup.c.~1~	Tue May  1 19:35:18 2001
+++ initmem/arch/i386/kernel/setup.c	Fri May 11 01:59:19 2001
@@ -934,7 +934,6 @@
 * trampoline before removing it. (see the GDT stuff)
 */
reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
-   smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
 #endif
 
 #ifdef CONFIG_X86_IO_APIC
@@ -943,18 +942,6 @@
 */
find_smp_config();
 #endif
-   paging_init();
-#ifdef CONFIG_X86_IO_APIC
-   /*
-* get boot-time SMP configuration:
-*/
-   if (smp_found_config)
-   get_smp_config();
-#endif
-#ifdef CONFIG_X86_LOCAL_APIC
-   init_apic_mappings();
-#endif
-
 #ifdef CONFIG_BLK_DEV_INITRD
if (LOADER_TYPE && INITRD_START) {
if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
@@ -971,6 +958,26 @@
initrd_start = 0;
}
}
+#endif
+
+   /*
+* NOTE: before this point _nobody_ is allowed to allocate
+* any memory using the bootmem allocator.
+*/
+
+#ifdef CONFIG_SMP
+   smp_alloc_memory(); /* AP processor realmode stacks in low memory*/
+#endif
+   paging_init();
+#ifdef CONFIG_X86_IO_APIC
+   /*
+* get boot-time SMP configuration:
+*/
+   if (smp_found_config)
+   get_smp_config();
+#endif
+#ifdef CONFIG_X86_LOCAL_APIC
+   init_apic_mappings();
 #endif
 
/*

Andrea



Re: x86 bootmem corruption

2001-05-11 Thread Andrea Arcangeli

On Fri, May 11, 2001 at 05:18:35PM +0100, Alan Cox wrote:
> > reserved.  This is the fix against 2.4.5pre1. This might explain weird
> > crashes and "reserved twice" error messages at boot on highmem systems.
> 
> Reserved twice occurs for two known reasons
> 
> BIOS reporting the same region twice or overlaps (fixed in -ac sent to Linus)
> find_smp_config blindly reserves pages that may already be marked as ROM and
> thus reserved anyway

when it happens because of a double reserve that's fine I know, it _can_
be harmless, I'm not trying to hide those messages. What I'm saying is
that it can _also_ indicate somebody allocated the page before we reserved
it and currently x86 allocates from the bootmem allocator before
reserving all its pages, that's a bug and I provided the fix.

Andrea



Re: correctable ECC error

2001-05-12 Thread Andrea Arcangeli

On Sun, May 13, 2001 at 12:44:45AM +0900, root wrote:
> 
> On UP2000 SMP with two 21264 CPU's running 2.4.5pre1aa1 and 2.2.19aa1,
> I am getting the following message:
> 
> ===
> 
> May 12 07:02:09 norma kernel: TSUNAMI machine check: vector=0x630 pc=0x20001170070 
>code=0x10086
> May 12 07:02:09 norma kernel: machine check type: correctable ECC error (retryable)
> May 12 07:02:16 norma init: PANIC: segmentation violation! sleeping for 30 seconds.
> May 12 07:02:46 norma init: PANIC: segmentation violation! sleeping for 30 seconds.

almost certainly it's due to buggy ram, the ECC checks trapped it.

> Is one of my memory modules failiing?  BTW, it did not sleep when 

yes.

Andrea



Re: LVM 1.0 release decision

2001-05-12 Thread Andrea Arcangeli

On Fri, May 11, 2001 at 10:19:13PM -0700, David S. Miller wrote:
> 
> Andrea Arcangeli writes:
>  > you _must_ know very well what the mainteinance of that code means ;).
> 
> Which is why I added the facility by which such ioctl conversions can
> be registered at runtime by the subsystem/driver itself.

Which not one single piece of common code is using yet in 2.4.5pre1, so
right now (2.4.5pre1) you must be doing the maintenance yourself the
old way.

But I certainly agree that it is promising and that in the future
de-localizing the 32bit wrappers is a good thing so at least people will
see this code when they break it while changing the common code ;).

> I'm already planning on doing this, but it is a 2.5.x project.
> Dave Mosberger agrees with this as has anyone else I've mentioned
> the idea to, so consider it basically done in 2.5.x sometime.

Nice to hear that, when you do that please keep [EMAIL PROTECTED] in
CC so we follow it.

After we change the wrapper mechanism to avoid the maintenance work by
de-localizing the wrappers, and after we share the wrapper logic as well, it
will be a _real_ pleasure to support the lvm ioctl from 32bit userland on
x86-64 too, and then it will be a worthwhile effort to support
those ioctls.

Thanks,
Andrea



Re: LVM 1.0 release decision

2001-05-11 Thread Andrea Arcangeli

On Fri, May 11, 2001 at 06:29:27PM -0700, David S. Miller wrote:
> I think that's a bad decision, but it is your's.

Let me put it this way: after I get the first real life request from a
user with a useful case where a 32bit app needs to run the lvm ioctl it
will be truly easy to change my mind about that, I just don't expect to
get that request anytime soon because the only thing that runs the lvm
ioctl are the lvmtools, and I assume Andi thought the same when he
originally proposed not to support the lvm ioctl from the 32bit
userland. If you already have in mind something useful that needs that
please let us know and we will definitely listen to you.

Of course not implementing the 32bit lvm ioctl emulation now won't
in any case forbid us to implement it in the next 5 minutes, we can do
that anytime if/when we find the first useful case that needs that, it's
just a matter of cutting and pasting from the ioctl32.c of sparc64 to the
x86-64 tree and then spending one day of hacking and testing through
those pointer conversions, being aware that this work will break in the
next two weeks when a new lvm ioctl is added in the next lvm release, or
something like that, you _must_ know very well what the maintenance of
that code means ;).

BTW, it would be nice if somebody would take care of unifying the
large sharable parts of the emulation code between
x86-64/sparc64/ia64/mips64, this was mentioned by Andi several times but
nothing has been done in that direction yet, they for the large part do the
same things and somehow we duplicate efforts across all those ports (if
we exclude the regs manipulation in the ELF_PLAT_DATA and friends that
can be localized easily). If we do that kind of sharing all the other
ports would probably get the 32bit emulation for the lvm ioctl for free
from the sparc64 effort for example.

> To me, either you support fully the 32-bit execution
> environment or you do not.  After all the work that
> myself and others have done for other platforms, there
> really is no need to cut corners in this area.
>
> My userland on sparc64 is %100 32-bit and everything works
> quite beautifully.

The sparc platform is a completely different matter, you cannot compare
sparc64 to x86-64, on sparc64 the 64bit userspace is a very recent thing
(just to make an example my sparc64 still runs only with the 32bit
userspace and I use the 64bit compiler only for the kernel, I don't know
if you have a total 64bit userspace either).  For sparc64 a 64bit-only
lvmtools would have been a very nasty dependency and so for sparc64 it is
almost mandatory to support completely all the ioctls from the 32bit
userland.

On x86-64 the only reason for having a 32bit program is either that it's
binary only, or that you need to reduce the memory footprint and
you don't need more than 4G of address space [btw all the binaries
running in compatibility mode on the x86-64 kernel will get 4G of
address space, 1G more than with an ia32 kernel].  lvmtools are GPL'd and
the memory footprint of the lvmtools is absolutely not worth
considering. So there's no point in compiling the lvmtools 32bit, period.

And I think your "everything works quite beautifully" on sparc64 isn't
really the case if you upgrade to a recent lvm revision unless you fixup
all the 32bit ioctl emulation first, the reason we want "everything
works beautifully and always" is the main reason we try to avoid the lvm
ioctl 32bit emulation ;), at least with the current lvm user<->kernel
design.

Furthermore if somebody posts a patch that implements the ioctl wrappers
even if there isn't a useful case that needs them, I will be glad to
check in that code after adding a fat warning in the source that says it
can break anytime. The lvm ioctl can be run only by the administrator so
it won't be a security breach if they break when the drivers/md/lvm*
code gets updated, and what I will do on my box will be to compile the
lvmtools with a plain `make` anyway, so my lvmtools will run 64bit
and I will never test those wrappers myself (and after they have
remained broken for some time I will end up putting an #if 0 /* FIXME */
around the wrappers to avoid somebody getting bitten by that code).

So in short to me it sounds a good decision and also a no brainer one
that we can change anytime if we need to.

Andrea



msync over reserved mem

2001-05-14 Thread Andrea Arcangeli

This patch fixes the troubles generated by msync on /dev/fb0 or any
other device driver that exports reserved memory to userspace via shared
mapping.

--- 2.4.5pre1aa3/mm/filemap.c.~1~   Fri May 11 02:08:28 2001
+++ 2.4.5pre1aa3/mm/filemap.c   Mon May 14 18:48:59 2001
@@ -1808,10 +1808,12 @@
 {
pte_t pte = *ptep;
 
-   if (pte_present(pte) && ptep_test_and_clear_dirty(ptep)) {
+   if (pte_present(pte)) {
struct page *page = pte_page(pte);
-   flush_tlb_page(vma, address);
-   set_page_dirty(page);
+   if (VALID_PAGE(page) && !PageReserved(page) && 
+ptep_test_and_clear_dirty(ptep)) {
+   flush_tlb_page(vma, address);
+   set_page_dirty(page);
+   }
}
return 0;
 }

Andrea



2.4.5pre2aa1

2001-05-15 Thread Andrea Arcangeli

Detailed description of 2.4.5pre2aa1 follows.

---
00_alpha-illegal-irq-1

Be verbose for MAX_ILLEGAL_IRQS times if an invalid irq number
is getting run.

(debugging)

00_alpha-ksyms-1

Export a few alpha-arch symbols needed by modules.

(recommended to avoid compilation troubles)

00_alpha-large-vmalloc-1

Drop the CONFIG_LARGE_VMALLOC selection from the
arch/alpha/config.in, the large-vmalloc feature is racy and it can
destabilize the machine, fixing it isn't worthwhile because
nobody needs more than 8Gigabytes of ram in vmalloc memory,
not even tux on the 256G boxes will ever need that.

(recommended)

00_alpha-modrace-1

Fix alpha races between module insmod/rmmod and the page fault
fixmap lookup.

(recommended)

00_alpha-numa-10

Fully support wildfire machines with all kinds of NUMA memory
configuration, plus it optimizes the allocation on a per node
basis to boost the performance on the NUMA boxes. Right now
CONFIG_WILDFIRE needs to be selected to take advantage of this
feature. (CONFIG_GENERIC + CONFIG_DISCONTIGMEM=y and
CONFIG_NUMA=y will work fine as well but it won't take
advantage of the new feature.) It also fixes many memory
management bits in the core linux allocator in the common code,
mostly to avoid wasting static memory.

(recommended)

00_alpha-sched-yield-1

Fixes SCHED_YIELD on the alpha arch for UP compiles.

(recommended)

00_alpha-show_stack-1

Implements the show_stack() call used often by some common
code, mostly to allow compilation, things like tux need it.

(nice to have)

00_alpha-tlb-page-sym-1

Drops an unnecessary export on the alpha port.

(recommended)

00_buffer-2

Reschedule during oom while allocating buffers, still getblk
can deadlock with oom but this will hide it pretty well as
it won't loop in a tight loop anymore.

(recommended)

00_cachelinealigned-in-smp-1

Moves the pagecache_lock and the VM pagemap_lru_lock into two
different L1 cachelines to avoid contention, mostly useful on
the alpha where the spinlocks use load locked store
conditional loops (and we don't want to loop).

(nice to have)

00_copy-user-lat-2

Put the reschedule points into the copy-user calls; with lots of
cache, large read/writes could otherwise _never_ reschedule
even once until they return to userspace.

(recommended)

00_cpus_allowed-1

Fixes a bug in the cpu affinity in-kernel API, bug was fatal
for ksoftirqd.

(recommended)

00_double-buffer-pass-1

Avoids looping two times for no good reason into the lru lists
of the buffer cache (the double loop was an unreliable hack
from prehistory that survived until today).

(nice to have)

00_exception-table-1

Avoids a compilation warning when compiling without modules.

(very minor thing)

00_highmem-deadlock-3

Fixes a highmem deadlock using a reserved pool for the bounce
buffers.

(recommended)

00_highmem-debug-1

Allows people with x86 machines with less than 1G of ram to
test the highmem code.

(debugging)

00_ia32-bootmem-corruption-1

Fixes the x86 boot stage to finish initializing all the
reserved memory before starting allocating memory.

(recommended)

00_ipv6-null-oops-1

Fixes null pointer oops.

(recommended)

00_jens-loop-noop-nobounce-1

Skips the bounces with the null transfer function.

(nice to have)

00_ksoftirqd-4

Avoids 1/HZ latency for the softirq if the softirq is marked
pending again when do_softirq() finishes and the machine is
otherwise idle; it also fixes the case of a softirq re-marking
itself runnable by delegating the balancing of the softirq load
to the scheduler, as if it were a normal task.

(nice to have)

00_kupdate-large-interval-1

Allows setting a large interval for the kupdate runs, this is
useful on laptops: instead of sigstopping ksoftirqd it's
nicer to set a large interval, for example on the order of one
hour (do that at your own risk of course, doing that is not
recommended unless you know what you're doing).

(nice to have)

00_lvm-0.9.1_beta7-4

Updates to lvm beta7 with fixes for the lv hardsectsize
estimation based on the max hardsectsize of the underlying
pv, plus it has tons of other fixes and it is a must have
for the 64bit archs as the IOP silently changed for those
platforms.

(recommended)

00_max_readahead-1

Increases the max_readahead to allow t

2.2.20pre2aa1

2001-05-15 Thread Andrea Arcangeli

The main features of 2.2.20pre2aa1 are:

o   Support for 4Gigabyte of RAM on IA32 (me and Gerhard Wichert)
o   Support for 2T of RAM on alpha (me)
o   RAW-IO (doable with bigmem enabled too). Improvements have also been
backported from 2.4.
o   SMP scheduler improvements. (me and partly from 2.3.x contributed by
Ingo Molnar)
o   LFS (>2G file on 32bit architectures) also NFSv3 works over 2G
(nfsv3-lfs work from me, Andi and fix from Jay Weber)
o   fixed race in wake-one LIFO in accept(2). Apache must be compiled with
-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT to take advantage of that.
o   lowlatency and SMP scalability in all copy-user and tcp_sendmsg
checksum.
o   GFS support.
o   various fixes

Detailed description of 2.2.20pre2aa1 follows.

---
00_4_min_percent-1

Increase the min percent of the buffer cache and page cache to 4%.
(it wouldn't matter if the VM algorithms were better). (me)

00_IO-wait-3

Avoid spurious unplug of the I/O queue. (me)

00_K7_P4-cachelines-2

Allows the kernel to be compiled for K7 (AMD Athlon) or Pentium4.

These compilation options _only_ make the kernel assume respectively
64byte cachelines or 128byte cachelines.

Those assumptions are critical mostly on SMP systems, but even UP will
take advantage of them because they will make most performance
critical slab allocations start on a cacheline boundary.

Since the only difference between Ppro/K7/P4 compilations
is the cacheline size assumed by the kernel you can safely boot
a K7/P4 compiled kernel on a Ppro, it obviously won't generate
cacheline ping-pongs in SMP. There are only two downsides in
running a K7/P4 compiled kernel on a Ppro (PII/PIII included):

1)  some bytes of memory wasted due to the larger paddings (really not
a big deal)
2)  potential waste of cacheline sets. On PII and PIII and PPro
the L1 dcache is 8kbyte or 16kbytes 2-way set associative (so
there are 128 or 256 sets of cachelines) and by stressing the
first 32bytes of the 128 byte aligned data structures (for
example if the kernel is compiled for P4) you would take
advantage of only 1/4 of the available L1 cache (you would
stress sets 0, 4, 8, ... only). This is probably quite a serious
issue in terms of performance. So for Ppro/PII/PIII you are
still advised to use the Ppro compilation option (if you care to
optimize the L1 cache usage).

(me and Andi)
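
The set arithmetic in point 2) of 00_K7_P4-cachelines-2 can be
reproduced with a few lines of C. This is a sketch that simply takes the
figures from the note (32byte lines, 2-way, 8k or 16k of L1 dcache) as
given and assumes the usual mapping of an address to a set, it doesn't
query any real hardware; the function names are made up.

```c
#include <stddef.h>

enum { LINE_SIZE = 32, WAYS = 2, MAX_SETS = 1024 };

/* Number of sets in a cache of `cache_bytes` bytes. */
static size_t num_sets(size_t cache_bytes)
{
    return cache_bytes / (LINE_SIZE * WAYS);
}

/* Set selected by an address: line number modulo the number of sets. */
static size_t set_index(size_t addr, size_t sets)
{
    return (addr / LINE_SIZE) % sets;
}

/*
 * Count the distinct sets hit when only the first cacheline of objects
 * aligned to `align` bytes is touched. Returns 0 for geometries the
 * sketch does not handle.
 */
static size_t sets_touched(size_t cache_bytes, size_t align)
{
    size_t sets = num_sets(cache_bytes);
    unsigned char seen[MAX_SETS] = { 0 };
    size_t count = 0;

    if (sets == 0 || sets > MAX_SETS)
        return 0;

    /* 4096 objects are enough to wrap around the cache several times. */
    for (size_t k = 0; k < 4096; k++) {
        size_t idx = set_index(k * align, sets);

        if (!seen[idx]) {
            seen[idx] = 1;
            count++;
        }
    }
    return count;
}
```

With these figures a 16k cache has 256 sets and 128byte aligned objects
reach only 64 of them (sets 0, 4, 8, ...), which is the 1/4 mentioned
above; 32byte alignment reaches all 256.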

00_P4-local-APIC-1

Fix a local APIC initialization ordering bug that triggers on the PIV.

00_PIII-10.bz2

SSE/SSE2 support (unmasked exception via mxcsr included).
(mix of Doug Ledford, Ingo Molnar, Gabriel Paubert PIII patch for
2.2.x and 2.4.x PIII support from Gareth Hughes, audited, fixed and
changed by me to be dynamic. At the end it's very similar to the
2.4.x support)

Kernel now understands the `nofxsr' boot time parameter and it
doesn't enable fxsr in that case (if there's any CPU that
crashes at boot because it's buggy, nofxsr will work around
the hardware bug; it's also useful for asymmetric multiprocessing
where the boot cpu can have fxsr capabilities and the other cpus don't).

00_SIGIO-reason-2.bz2

Pass the reason for the sigio in the si_code (and a duplicate info
in si_band) with the same API of 2.4.x. This avoids people
having to poll a set of fd during the sigio handler. (current
2.4.x has two bugs in that area but fixes are in Linus's mailbox)

00_SMP-scheduler-2.2.18pre21-H.bz2

Better SMP reschedule_idle. (partly backported from 2.3.x, 2.3.x
version was contributed by Ingo Molnar)

Fixes the wmb() in schedule_tail that should really be a mb(), in
theory one of the last reads in reschedule_idle could return garbage (in
practice I think it can't trigger... at least on x86 :) (me)

00_VM-locked-1

Wait for I/O completion as well while doing the Wait_IO second pass on the
dirty cache (should fix the last VM problem reported by VA while creating
very large ext2 fs on lowmem machines).

00_VM_RESERVED-1

Allows device drivers to set the VMA as reserved to prevent swap_out from
trying to unmap stuff from them (this saves device drivers that use ->nopage
for lazily mapping scatter gather dma areas in userspace from implementing
a noop swapout, and it also avoids useless page faults). This is much
more efficient than setting the physical pages as reserved.

00_alpha-epoch-2

Fixes the RTC parsing done by the kernel at boot. (backported from
2.4.x)


Re: rwsem, gcc3 again

2001-05-16 Thread Andrea Arcangeli

On Wed, May 16, 2001 at 11:03:27AM +0200, [EMAIL PROTECTED] wrote:
> David,
> I am using the gcc-3.0 snapshot of 14.5.2001 from codesourcery (i686 binary).
> I have now tried to mimic CPU=386 behaviour (patch posted yesterday night)
> and it compiles (just sound fails), by exchanging y and n in
> CONFIG_RWSEM_GENERIC_SPINLOCK and CONFIG_RWSEM_XCHGADD_ALGORITHM.
> 
> Thanks for your patience, all listening...

can you check if the alternate rwsem compiles with gcc 3.0? I had a
report that they don't compile but I checked and that had to be a gcc
3.0 bug, and so I was waiting to hear they start to compile with latest
CVS of gcc 3.0.


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre2aa1/00_rwsem-11

Andrea



Re: 2.4.5pre2aa1

2001-05-16 Thread Andrea Arcangeli

On Tue, May 15, 2001 at 08:42:03PM -0300, Rik van Riel wrote:
> On Tue, 15 May 2001, Andrea Arcangeli wrote:
> 
> > Detailed description of 2.4.5pre2aa1 follows.
> 
> > 00_buffer-2
> > 
> > Reschedule during oom while allocating buffers, still getblk
> > can deadlock with oom but this will hide it pretty well as
> > it won't loop in a tight loop anymore.
> 
> These descriptions are very helpful. Are they available somewhere

I'm happy to hear that.

> for all your (recent) patches?

Almost everything was briefly described in my last emails. Those
descriptions are also available as .log files in my ftp area.

Andrea



Re: 2.2.20pre2aa1

2001-05-16 Thread Andrea Arcangeli

On Tue, May 15, 2001 at 08:33:05PM -0700, dean gaudet wrote:
> On Tue, 15 May 2001, Andrea Arcangeli wrote:
> 
> > o   fixed race in wake-one LIFO in accept(2). Apache must be compiled with
> > -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT to take advantage of that.
> >
> > 00_wake-one-4
> >
> > Backport 2.4 waitqueues and in turn fixes an hanging condition in accept(2).
> >
> > (me)
> 
> apache since 1.3.15 has defined SINGLE_LISTEN_UNSERIALIZED_ACCEPT ...

That's definitely a good thing.

> 'cause that's what you guys asked me to do :)  does this mean there are
> known hangs on linux 2.2.x without your fix?

I never heard of anybody reproducing that, but accept() in 2.2
can _definitely_ miss events without the above 00_wake-one-4 patch,
because it wrongly counts waking up the same exclusive task twice as
progress.

Furthermore, the exclusive wakeup logic, which keeps the exclusive
information per-task and not per wait_queue_t, will screw up if a task
registers itself as a wakeall waiter after it was already registered as
wakeone somewhere else (however this second thing is more of a
theoretical issue that shouldn't trigger in 2.2).
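
To make the missed-event problem concrete, here is a small userspace
model (not the kernel code; struct waiter, try_wake and wake_up_model
are names invented for this sketch). It assumes the fix amounts to
counting a wakeup as progress only when the task was actually asleep,
so a second event is passed on to the next exclusive waiter instead of
being spent on a task that is already running.

```c
#include <assert.h>
#include <stddef.h>

enum task_state { TASK_AWAKE, TASK_ASLEEP };

struct task_model { enum task_state state; };

/* The exclusive flag lives in the wait queue entry, not in the task. */
struct waiter {
    struct task_model *task;
    int exclusive;
};

/* Returns 1 only if the task really went from asleep to awake. */
static int try_wake(struct task_model *t)
{
    if (t->state != TASK_ASLEEP)
        return 0;
    t->state = TASK_AWAKE;
    return 1;
}

/* Wake every non-exclusive waiter and stop after the first exclusive
   waiter that was actually woken. Returns the number of tasks woken. */
static int wake_up_model(struct waiter *q, size_t n)
{
    int woken = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!try_wake(q[i].task))
            continue;   /* already running: does not count as progress */
        woken++;
        if (q[i].exclusive)
            break;
    }
    return woken;
}
```

With two exclusive waiters queued, two back-to-back events wake two
different tasks in this model, which is the behaviour accept() needs.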

Andrea



Re: rwsem, gcc3 again

2001-05-16 Thread Andrea Arcangeli

On Wed, May 16, 2001 at 02:52:04PM +0100, David Howells wrote:
> 
> Hi Andrea:
> 
> Here you go:
> 
> /usr/local/bin/gcc -D__KERNEL__ -I/inst-kernels/linux-2.4.5-pre2-aa/include -Wall 
>-Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe 
>-mpreferred-stack-boundary=2 -march=i686-DEXPORT_SYMTAB -c sys.c
> sys.c: In function `sys_gethostname':
> /inst-kernels/linux-2.4.5-pre2-aa/include/asm/rwsem_xchgadd.h:51: inconsistent
> operand constraints in an `asm'
> 
> I've lit fires underneath some of our gcc people, and they say it's definitely
> a bug in gcc. They're currently looking at it.

Ok, I hope it will be fixed soon ;), thanks for checking.

Andrea



Re: 2.2.20pre2aa1

2001-05-16 Thread Andrea Arcangeli

On Wed, May 16, 2001 at 10:25:32AM -0700, dean gaudet wrote:
> On Wed, 16 May 2001, Andrea Arcangeli wrote:
> 
> > On Tue, May 15, 2001 at 08:33:05PM -0700, dean gaudet wrote:
> > > apache since 1.3.15 has defined SINGLE_LISTEN_UNSERIALIZED_ACCEPT ...
> >
> > That's definitely a good thing.
> 
> hmm, i'm not so sure -- 1.3.x is our stable release, and it sounds like
> this change has added an instability.

Not if you use my 2.2 tree or any recent 2.4 out there. I mean that's
not an apache mistake; you shouldn't back out that change because of a
kernel race condition.

> i'm guessing from your description that the missed event will be noticed
> when the next socket arrives.  i.e. if the server is pretty busy then the

yes, it will handle the missed connect only when the next connect
request arrives.

> missed events are not important.  but if it's not a busy server, like a
> hit every hour, then the missed event may be noticeable to browsers (as a
> timeout waiting for server activity).
> 
> does that pretty much sum it up?

I'm not sure what apache does exactly while handling new connections but
your above description of the symptoms sounds ok.

> > Furthermore the exclusive wakeup logic with the exclusive information
> > per-task and not per wait_queue_t will screw up if the task registers
> > itself like a wakeall after it was just registered as wakeone somewhere
> > else (however this second thing is more a theoretical issue that shouldn't
> > trigger in 2.2).
> 
> i.e. if the socket was used both in accept() and in select() at the same
> time?  (which apache doesn't do)

No, because the same task cannot run accept() and select() at the same
time. That's a per-task vs per-wait_queue_t issue (not per-socket):
imagine it like accept() calling select() internally in the kernel,
which doesn't happen of course, and that's why it cannot trigger in real
life. You cannot exploit it from userspace, it's a kernel internal
issue. So don't worry about it ;) My patch has the bonus of fixing it as
well though (like 2.4).

Andrea



Re: alpha iommu fixes

2001-05-19 Thread Andrea Arcangeli

On Fri, May 18, 2001 at 09:46:17PM +0400, Ivan Kokshaysky wrote:
> The most interesting thing here is the pyxis "tbia" fix.
> Whee! I can now copy files from SCSI to bus-master IDE, or
> between two IDE drives on separate channels, or do other nice
> things without hanging lx/sx164. :-)
> The pyxis "tbia" turned out to be broken in a more nastier way
> than one could expect - tech details are commented in the patch.
> 
> Another problem, I think, is that we need extra locking in
> pci_unmap_xx(). It seems to be possible that after the scatter-gather
> table "wraps" and some SG ptes get free, these ptes might be
> immediately allocated and next_entry pointer advanced by pci_map_xx()
> from interrupt or another CPU *before* the test for mv_pci_tbi().
> In this case we'd have stale TLB entries.
> 
> Also small compile fix for 2.4.5-pre3.
> 

I fixed the same race condition in the unmap (pte not flushed after
next_entry was visible) two days ago and it's obviously correct, but it
was not nearly enough here: there was another very nasty race condition
that triggers at least on all ds10 ds20 es40 tsunami/clipper based
boards, and it has to be fixed too to make the machine stable (fixed
yesterday and getting tested today).

Reading the tsunami specs I learnt that 1 tlb entry caches 8 pagetable
entries (not 1), so the tlb flush will be invalidated immediately by any
PCI DMA run after the flush on any of the other 7 mappings cached in the
same tlb entry.


This is the fix:

diff -urN alpha-ref/arch/alpha/kernel/pci_iommu.c alpha-works/arch/alpha/kernel/pci_iommu.c
--- alpha-ref/arch/alpha/kernel/pci_iommu.c	Sun Apr  1 01:17:07 2001
+++ alpha-works/arch/alpha/kernel/pci_iommu.c	Fri May 18 18:07:40 2001
@@ -69,7 +69,7 @@
 
 	/* Align allocations to a multiple of a page size.  Not needed
 	   unless there are chip bugs.  */
-	arena->align_entry = 1;
+	arena->align_entry = 8;
 
 	return arena;
 }
@@

However this is just the production fix; the real fix will only change
that for the tsunami chipset.
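
A toy userspace allocator can show what the alignment buys, at least as
the explanation above reads: with align_entry = 8 every mapping starts on
an 8-entry boundary, so two mappings never share the group of 8 entries
that one tlb entry caches. Everything here (struct arena_model,
arena_alloc, the 64-entry arena) is hypothetical and only models the
idea, it is not the pci_iommu.c code.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_ENTRIES 64

struct arena_model {
    unsigned char used[ARENA_ENTRIES];  /* 1 = entry in use */
    size_t align_entry;                 /* alignment, in entries */
};

/* Find n free consecutive entries starting at a multiple of
   align_entry. Returns the first index, or -1 if nothing fits. */
static long arena_alloc(struct arena_model *a, size_t n)
{
    assert(a->align_entry > 0 && n > 0);
    for (size_t base = 0; base + n <= ARENA_ENTRIES;
         base += a->align_entry) {
        size_t i = 0;

        while (i < n && !a->used[base + i])
            i++;
        if (i == n) {
            for (i = 0; i < n; i++)
                a->used[base + i] = 1;
            return (long)base;
        }
    }
    return -1;  /* arena exhausted: the caller must handle this */
}
```

The price is visible in the model too: a one-page mapping now consumes a
whole group of 8 entries, so the arena runs out sooner.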

Since I didn't want to deal with the optimizations yet, I also disabled
them (I will audit them shortly). Then I fixed at least the eepro100
driver to check whether it runs out of pci map entries (all drivers out
there are broken, they don't check the retval from pci_map* etc...).

Then I also enlarged the pci SG space to 1G because running out of
entries right now breaks the whole world:

@@ -358,7 +360,7 @@
 * address range.
 */
hose->sg_isa = iommu_arena_new(hose, 0x0080, 0x0080, 0);
-   hose->sg_pci = iommu_arena_new(hose, 0xc000, 0x0800, 0);
+   hose->sg_pci = iommu_arena_new(hose, 0xc000, 0x4000, 0);
__direct_map_base = 0x4000;
__direct_map_size = 0x8000;
 
diff

With all this stuff plus the same fix you posted, the es40 8g runs rock
solid on top of 2.4.5pre3aa1.

I was going to wait to clean up all those fixes, but I'm posting this
half corrupted email here now just so we don't duplicate further efforts ;)

Andrea



Re: alpha iommu fixes

2001-05-19 Thread Andrea Arcangeli

On Sat, May 19, 2001 at 11:11:31PM +0400, Ivan Kokshaysky wrote:
> On Sat, May 19, 2001 at 03:55:02PM +0200, Andrea Arcangeli wrote:
> > Reading the tsunami specs I learnt 1 tlb entry caches 8 pagetables (not 1)
> > so the tlb flush will be invalidated immediately by any PCI DMA run after
> > the flush on any of the other 7 mappings cached in the same tlb entry.
> 
> I have neither tsunami docs nor the tsunami box to play with :-(
> so my guesses might be totally wrong...
> But -- assuming that tsunami is similar to cia/pyxis, that is incorrect.
> We're invalidating not the cached ptes, but the TLB tags, with all 4 (on
> pyxis, and 8 on tsunami, I guess) associated ptes. The reason why we

exactly.

> align new entries at 4*PAGE_SIZE on cia/pyxis is a hardware bug -- if cached
> pte is invalid, it doesn't cause TLB miss. I wouldn't be surprised at all if
> tsunami has the same bug; in this case your fix is urgently needed, of course.

It only depends on the specs if it has to be called a bug or a feature.

> BTW, look at Richard's code in core_cia.c/verify_tb_operation() for
> "valid tag invalid pte reload" test, it could be easily ported to tsunami.

I didn't check very closely what this code is doing, but it seems it's
not triggering any DMA transaction from a DMA bus master, so it
shouldn't be able to trigger the race, and as far as I can tell as soon
as you do DMA on an invalid pagetable entry cached in a tlb the machine
will lock hard. So I expect that if you try to probe at runtime whether
you need the 8 alignment, you won't be able to finish the probe ;).

> > then I also enlarged the pci SG space to 1G because running out of entries
> > right now breaks the whole world:
> 
> It would just delay the painful death, I think ;-)

I _have_ to completely hide the painful death, because as soon as I run
out of entries the machine crashes immediately since lots of drivers
aren't checking for running out of ptes.

Fixing that takes real thought: it may require partly redesigning the
driver so you can take a fail path where you couldn't previously. In
some places you may need to re-issue a softirq later to try again, in
others you must run_task_queue(&tq_disk) and sched_yield and try again
later (you have the guarantee those entries will become available again,
so doing that would not be deadlock prone), in other places you can just
drop the skb.  Each driver has to be fixed in its own way. It seems
that in a lot of cases people just replaced virt_to_bus with
pci_map_single and didn't think that pci_map_single may even return
0, which _doesn't_ mean bus address 0 ;)
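
The simplest of those fail paths, dropping the packet, can be sketched in
userspace like this. It is only a model under the convention described in
this thread (a return value of 0 means the mapping failed);
map_single_model, start_xmit_model, the two-entry pool and the bus
address values are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t dma_addr_model_t;

/* Stand-in for the mapping call: hands out bus addresses until a tiny
   pool runs dry, then returns 0 to signal failure. */
static unsigned int free_entries = 2;

static dma_addr_model_t map_single_model(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    if (free_entries == 0)
        return 0;
    free_entries--;
    return 0xc0000000u + free_entries * 0x2000u;  /* arbitrary values */
}

enum tx_result { TX_QUEUED, TX_DROPPED };

static enum tx_result start_xmit_model(const void *pkt, size_t len)
{
    dma_addr_model_t bus;

    assert(pkt != NULL);
    bus = map_single_model(pkt, len);
    if (bus == 0)
        return TX_DROPPED;  /* mapping failed: never start DMA on 0 */
    /* a real driver would hand `bus` to the hardware here */
    return TX_QUEUED;
}
```

A storage driver cannot drop the request like this; it would need one of
the retry paths described above instead.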

The reason it's not too bad to hide it is that you can usually calculate
an upper bound on how many pci mappings a given machine may need at the
same time at runtime, so I can guarantee that you won't be able to
reproduce any of those driver bugs on that machine. This is the only
point of the change: to get this guarantee on a larger subset of
machines until all those bugs are fixed.

> I'm almost sure that all these "pci_map_sg failed" reports are caused
> by some buggy driver[s], which calls pci_map_xx() without proper
> pci_unmap_xx(). This is harmless on i386, and on alpha if all IO is going

I'm not talking about that kind of leak bug. The fact that some driver
is leaking ptes is a completely different kind of bug.

I was only talking about when you get the "pci_map_sg failed" because
you have not 3 but 300 scsi disks connected to your system and you are
writing to all of them at the same time, allocating zillions of ptes, and
one of your drivers (possibly not even a storage driver) is actually not
checking the retval of the pci_map_* functions. You don't need a pte
memleak to trigger it, even regardless of the fact that I grew the
dynamic window to 1G, which makes it 8 times harder to trigger than in
mainline.

For now, with a couple of disks, a few nics and 1G of dynamic window
size it doesn't trigger, and the 1G thing gives a fairly large margin
for most machines out there. I couldn't care less about the 2M-128k
memory wastage at this point in time, but as I said I wanted at least
to optimize the 2M pte arena allocation away completely if the machine
has less than 2G; that would be a very worthwhile optimization.

> I've got some debugging code checking for this (perhaps it worth
> posting or even porting to i386 ;-)
> For now I can confirm that all drivers I'm currently using are fine
> wrt pci_map/unmap:
> 3c59x, tulip, sym53c8xx, IDE.

all the drivers I'm using are definitely not leaking pte entries either,
but if you give me not 10 but 1000 scsi disks be sure I will trigger the
missing checks for pci_map_* retval without the need of any driver
leaking ptes.

The bug you are talking about is even w

Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Sun, May 20, 2001 at 04:12:34PM +0400, Ivan Kokshaysky wrote:
> On Sun, May 20, 2001 at 04:40:13AM +0200, Andrea Arcangeli wrote:
> > I was only talking about when you get the "pci_map_sg failed" because
> > you have not 3 but 300 scsi disks connected to your system and you are
> > writing to all them at the same time allocating zillions of pte, and one
> > of your drivers (possibly not even a storage driver) is actually not
> > checking the retval of the pci_map_* functions. You don't need a pte
> > memleak to trigger it, even regardless of the fact I grew the dynamic
> > window to 1G which makes it 8 times harder to trigger than in mainline.
> 
> I think you're too pessimistic. Don't mix "disks" and "controllers" --

I'm not pessimistic, I'm fairly relaxed even with a 512Mbyte dynamic
window (that's why I did the change in the first place) and I agree that
it should take care of hiding all those bugs on 99% of hardware
configurations, but OTOH I don't want things to work by luck and I'd
prefer it if the real bugs get fixed as well eventually.

> SCSI adapter with 10 drives attached is a single DMA agent, not 10 agents.

you can do simultaneous I/O to all the disks, so you will keep those dma
entries for the SG for each disk in-use at the same time.

> If you're so concerned about Big Iron, go ahead and implement 64-bit PCI
> support, it would be right long-term solution. I'm pretty sure that
> high-end servers use mostly this kind of hardware.

Certainly 64bit pci is supported but that doesn't change the fact you
can as well have 32bit devices on those boxes. 

> Oh, well. This doesn't mean that I'm disagreed with what you said. :-)
> Driver writers must realize that pci mappings are limited resources.

Exactly.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

[ cc'ed to l-k ]

> DMA-mapping.txt assumes that it cannot fail.

DMA-mapping.txt is wrong. Both pci_map_sg and pci_map_single have failed
if they return zero. You either have to drop the skb or try again later
if they return zero.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 12:05:20AM +1000, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > [ cc'ed to l-k ]
> > 
> > > DMA-mapping.txt assumes that it cannot fail.
> > 
> > DMA-mapping.txt is wrong. Both pci_map_sg and pci_map_single have failed
> > if they return zero. You either have to drop the skb or try again later
> > if they return zero.
> > 
> 
> Well this is news to me.  No drivers understand this.

Yes, almost all drivers are buggy.

> How long has this been the case?  What platforms?

Always and all platforms.

Just think about this: you have 2^32 of bus address space, and you
theoretically can start I/O for more than 2^32 of phys memory, see?
Whatever platform it is, it will never be able to guarantee that all
mappings succeed.

> For netdevices at least, the pci_map_single() call is always close
> to the site of the skb allocation.  So what we can do is to roll
> them together and use the existing oom-handling associated with alloc_skb(),
> assuming the driver has it...

Fine.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Sun, May 20, 2001 at 03:49:58PM +0200, Andrea Arcangeli wrote:
> they return zero. You either have to drop the skb or try again later
> if they return zero.

BTW, pci_map_single is not a nice interface, it cannot return bus
address 0, so once we start the fixage it is probably better to change
the interface as well to get either the error or the bus address via a
pointer passed to the function.
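
One possible shape for such an interface, as a userspace sketch only:
the status comes back as the return value and the bus address through a
pointer, so bus address 0 becomes an ordinary result. The name
map_single_v2, the -1 error value and the identity mapping limited by a
window are assumptions made for the example, not a proposal for the
actual signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t bus_addr_t;

/* Returns 0 on success and stores the bus address in *out.
   Returns -1 on failure and leaves *out untouched. */
static int map_single_v2(uint64_t phys, uint64_t len, uint64_t window,
                         bus_addr_t *out)
{
    assert(out != NULL);
    if (len == 0 || phys >= window || len > window - phys)
        return -1;      /* does not fit: report it explicitly */
    *out = phys;        /* identity mapping, like a box with no iommu */
    return 0;
}
```

A caller then tests the return value and never has to guess whether an
address of 0 is real.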

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 02:21:18AM +1000, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> Would it not be sufficient to define a machine-specific
> macro which queries it for error?  On x86 it would be:
> 
> #define BUS_ADDR_IS_ERR(addr) ((addr) == 0)

that would be more flexible at least, however not mixing the error with
a potential bus address still looks cleaner to me.

> I can't find *any* pci_map_single() in the 2.4.4-ac9 tree
> which can fail, BTW.

I assume you mean that not a single caller of pci_map_single is
checking whether it failed or not (because all pci_map_single can fail).

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 02:54:16AM +1000, Andrew Morton wrote:
> No.  Most of the pci_map_single() implementations just
> use virt_to_bus()/virt_to_phys(). [..]

then you are saying that on the platforms without an iommu the pci_map_*
cannot fail, of course; furthermore even a missing pci_unmap cannot
trigger an iommu address space leak on those platforms. That has nothing
to do with whether pci_map_single can fail or not: the device drivers
are not architecture specific.

> [..]  Even sparc64's fancy
> iommu-based pci_map_single() always succeeds.

Whatever sparc64 does to hide the driver bugs, you can break it if you
pci_map 4G+1 bytes of physical memory.  Otherwise it means it's sleeping
or looping inside the pci_map functions, which would break things in
another manner.

Andrea



Re: 2.4.5pre2aa1 panic during boot

2001-05-20 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 01:59:25AM +0900, root wrote:
> Andrea told us that he will not care for anything
> compiled with gcc-2.95 or version lower than that.

I said I don't care about bug reports of alpha kernel crashes if the
_alpha_ kernel was compiled with gcc 2.95.*. 2.95 is fine on the x86,
but it's broken on the alpha. In short:

x86 2.4 kernels    ->  use 2.95.[34] or egcs 1.1.2 (I use 2.95.4 from
                       the gcc_2_95_branch of CVS)
alpha 2.4 kernels  ->  use egcs 1.1.2 or 2.96 with some hundreds of
                       patches (I personally still use egcs 1.1.2)

> However, it seems that this kernel panic has anything
> to do with gcc-2.95.

Please try to reproduce with egcs 1.1.2 to be sure.

> Anyway, gcc-2.95 is still the official release of gcc.
> Even SuSE-7.1 has this version only.  I wish SuSE puts

x86 and alpha are completely different issues with regard to the
compiler. I never heard of problems with 2.95.4 on x86 and I would never
replace 2.95.4 from the gcc_2_95_branch with the latest 2.96 on my x86
boxes; I'd instead try gcc 3.0 again after the inline asm fixes for "+="
constraints on local variables are done.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Sun, May 20, 2001 at 01:16:25PM -0400, Jeff Garzik wrote:
> Andrea Arcangeli wrote:
> > 
> > On Sun, May 20, 2001 at 03:49:58PM +0200, Andrea Arcangeli wrote:
> > > they return zero. You either have to drop the skb or try again later
> > > if they return zero.
> > 
> > BTW, pci_map_single is not a nice interface, it cannot return bus
> > address 0, 
> 
> who says?
> 
> A value of zero for the mapping is certainly an acceptable value, and it
> should be handled by drivers.

this is exactly why I'm saying pci_map_single currently is ugly in
declaring a retval of 0 as an error: as you also explicitly said above,
bus address 0 is a perfectly valid bus address. So my whole point is
that I'd prefer to change the API of pci_map_single to notify of failure
not by returning 0, like it does right now in 2.4.5pre3 and all previous
2.4 kernels, but via a parameter, so that bus address zero can be
returned as the valid bus address it should be (but isn't right now).

> In fact its an open bug in a couple net drivers that they check the
> mapping to see if it is non-zero...

if a driver is catching the failure of pci_map_single by checking whether
the bus address returned is zero, that driver is one of the few correct
drivers (or the only one) out there.

As it stands right now a bus address of 0 means pci_map_single failed.

For pci_map_sg if it returns zero it means it failed too.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Sun, May 20, 2001 at 06:07:17PM -0700, David S. Miller wrote:
> 
> Andrea Arcangeli writes:
>  > > [..]  Even sparc64's fancy
>  > > iommu-based pci_map_single() always succeeds.
>  > 
>  > Whatever sparc64 does to hide the driver bugs you can break it if you
>  > pci_map 4G+1 bytes of physical memory.
> 
> Which is an utterly stupid thing to do.
> 
> Please construct a plausable situation where this would occur legally
> and not be a driver bug, given the maximum number of PCI busses and
> slots found on sparc64 and the maximum _concurrent_ usage of PCI dma
> space for any given driver (which isn't doing something stupid).

Assume I have a dozen PCI cards that do DMA using SG tables that can
map up to some hundred mbytes of ram each, so I can just program the
cards to start the dma on those hundred mbytes of ram. Most of the time
the I/O is not simultaneous, but very rarely it happens to be
simultaneous and in turn it tries to pci_map_sg more than 4G of physical
ram. After that the sparc64 iommu code will say "bye bye" and the machine
will crash because the nic driver is not checking for pci_map_single
failures.

I don't see why the above scenario should be classified as stupid. The
pci_map_* API and the device drivers have to be generic.

It's as if you told me that there's no need to check for
alloc_pages(GFP_ATOMIC) failures in the device drivers because all
machines you are using have 256G of ram and you never use all the
physical ram with your workloads. I would never buy such an argument.

Furthermore, currently (2.4.5pre3) on alpha you only need to ask the
iommu to map more than 128mbyte of ram to crash. I increased it to
512mbyte at least: Jay said my first patch that increased it to 1G is
risky because some devices get confused by bus addresses at around -1M
and we keep the dynamic window above 3G. 512M should still be enough to
cover 99% of hardware configurations, I agree on that, but this is not a
good excuse to leave all device drivers buggy just because those bugs
don't trigger in all the hardware configurations out there.

Andrea



Re: alpha iommu fixes

2001-05-20 Thread Andrea Arcangeli

On Sun, May 20, 2001 at 06:01:40PM -0700, David S. Miller wrote:
> 
> Andrea Arcangeli writes:
>  > > Well this is news to me.  No drivers understand this.
>  > 
>  > Yes, almost all drivers are buggy.
> 
> No, the interface says that the DMA routines may not return failure.

The alpha has returned a failure since day zero of iommu support, and
sparc64 has to as well, otherwise it's even more buggy than alpha when
the machine runs out of pci virtual address space.

> If you want to change the DMA api to act some other way, then fine
> please propose it, but do not act as if today they have to act this
> way because that is simply not true.

About the pci_map_single API, I'd like bus address 0 not to be the
indication of failure; mainly on platforms without an iommu that's not
nice, and x86 happens to get it right only because physical page zero is
reserved for completely different reasons. So we either add an err
parameter to pci_map_single, or we define a per-arch bus address to
indicate an error; either way is ok with me.

Andrea



Re: LVM 1.0 release decision

2001-05-11 Thread Andrea Arcangeli

On Fri, May 11, 2001 at 01:12:55PM -0700, David S. Miller wrote:
> They can be converted, [..]

of course, and part of that code will still be necessary with the
lvm interface of >=beta4 too, to convert the pointers of the userspace
data structures, but my point was that we probably want to avoid that
complexity where it's not necessary (feasible was too strong an
adjective, sorry).

Related side note: for the x86-64 kernel we won't support emulation of
the lvm ioctls from 32bit executables, to avoid the pointer conversion
and maintenance pain entirely; at least in the early stage the x86-64
lvmtools will have to be compiled elf64.

Andrea



Re: alpha iommu fixes

2001-05-21 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 03:59:58AM -0700, David S. Miller wrote:
> This still leaves around 800MB IOMMU space free on that sparc64 PCI
> controller.

if it was 400mbyte you would be screwed too; the point here is that the
margin is way too small to allow ignoring the issue completely.
Furthermore there can be fragmentation effects in the pagetables, at
least in the way alpha manages them, which is to find contiguous virtual
pci bus addresses for each sg. Alpha in mainline is just screwed if a
single pci bus tries to dynamically map more than 128mbyte. Changing it
to 512mbyte is trivial; growing it more has performance implications as
it needs to reduce the direct windows, which I don't like to do as it
would also increase the number of machines that will get bitten by
drivers that still use virt_to_bus, and also increase the pressure on
the iommu ptes.

Now I'm not asking to break the API for 2.4 to take care of that; you
seem to be convinced about fixing this for 2.5 and I'm ok with that.
I just changed the printk for running out of entries to be KERN_ERR at
least, so we will know, and if somebody has real life troubles with 2.4
I will go HIGHMEM, which is a matter of 2 hours for me to implement.

The only thing I suggest is to change the API before starting to fix the
drivers, I mean: don't start checking for bus address 0 before changing
the API to return failure in another way. It's true x86 is reserving the
zero page anyway because it's a magic bios thing, but for example on
the alpha such a 0 bus address that we cannot use wastes 8 mbyte of DMA
virtual bus addresses that we reserve for the ISA cards (of course we
almost never need 16mbyte of ram all under isa dma but since it's so
low cost to allow that I think we will just in case).

Andrea



Re: alpha iommu fixes

2001-05-21 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 10:53:39AM -0700, Richard Henderson wrote:
> should probably just go ahead and allocate the 512M or 1G
> scatter-gather arena.

I just got a bug report in my mailbox about pci_map failures even after
I enlarged the window to 1G, argghh (at first it looked stable after
growing the window), so I'm stuck again. It seems I was right not to be
careless about the pci_map_* bugs today, even if the 1G window looked
like it offered a reasonable margin at first.

The pci_map_* failure triggers during a benchmark with a certain driver
that does massive DMA (similar to the examples I gave previously). The
developers of the driver simply told me the hardware wants to do massive
zerocopy dma to userspace, and they apparently excluded that it could be
a memleak in the driver missing some pci_unmap_* after I told them to
check for that. Even enabling HIGHMEM would not be enough, because they
do dma on userspace but on the network side, so it won't be taken care
of by create_bounces(); I would at least need to put another bounce
buffer layer in the driver to make highmem work.

Other more efficient ways to go, besides highmem plus an additional
bounce buffer layer, are:

2) fixing all buggy drivers now (would be a great pain, as it seems to me
   I would have to do that alone since apparently nobody else cares
   about those bugs for 2.4)
3) let the "massive DMA" hardware use DAC

Theoretically I could also cheat again and take a way 4), that is to try
to enlarge the window beyond 1G and see if the bugs get hidden also
during the benchmark that way, but I would take this as a last resort, as
this would again not be a definitive solution and I'd risk getting stuck
again tomorrow like I am right now.

I think I will prefer to take a dirty way 3) just for those drivers to
solve this production problem even if it won't be implemented in a
generic manner at first (I got the idea from the quadrics folks that do
this just now with their nics if I understood well).

If I understand correctly, enabling DAC on the tsunami simply means
setting the MWIN (monster window) bit, pchip->pctl |= MWIN, during the
boot stage on both pchips.

Then the device driver of the "massive DMA" hardware should simply
program the registers of the nic to use DAC with bus addresses that
are the phys address of the destination/source memory of the DMA,
only changed to have bit 40 set to 1. Those should be all the
changes necessary to make pci64 work on tsunami at the same time as the
pci32 direct/dynamic windows. It would be very efficient, and it
sounds like the best way to work around the broken pci_map_* in 2.4,
given that fixing the pci_map_* the right way is a pain.
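
A minimal sketch of the address translation described above, assuming
only what the message states (the bus address is the physical address
with bit 40 set). The macro and helper names are hypothetical and are
not taken from any kernel tree:

```c
#include <stdint.h>

/* Assumption from the text: on tsunami a DAC (monster window) bus
 * address is the physical address with bit 40 set to 1. */
#define TOY_TSUNAMI_DAC_BIT (UINT64_C(1) << 40)

/* Hypothetical helper: physical address -> 64-bit PCI bus address. */
static inline uint64_t dac_bus_addr(uint64_t phys)
{
	return phys | TOY_TSUNAMI_DAC_BIT;
}

/* Hypothetical inverse: 64-bit PCI bus address -> physical address. */
static inline uint64_t dac_phys_addr(uint64_t bus)
{
	return bus & ~TOY_TSUNAMI_DAC_BIT;
}
```

With this scheme no scatter-gather arena entry is consumed per mapping,
which is why the approach sidesteps the pci_map_* exhaustion.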

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: alpha iommu fixes

2001-05-22 Thread Andrea Arcangeli

On Mon, May 21, 2001 at 10:53:39AM -0700, Richard Henderson wrote:
> diff -ruNp linux/arch/alpha/kernel/pci_iommu.c linux-new/arch/alpha/kernel/pci_iommu.c
> --- linux/arch/alpha/kernel/pci_iommu.c   Fri Mar  2 11:12:07 2001
> +++ linux-new/arch/alpha/kernel/pci_iommu.c   Mon May 21 01:25:25 2001
> @@ -402,8 +402,20 @@ sg_fill(struct scatterlist *leader, stru
>   paddr &= ~PAGE_MASK;
>   npages = calc_npages(paddr + size);
>   dma_ofs = iommu_arena_alloc(arena, npages);
> - if (dma_ofs < 0)
> - return -1;
> + if (dma_ofs < 0) {
> + /* If we attempted a direct map above but failed, die.  */
> + if (leader->dma_address == 0)
> + return -1;
> +
> + /* Otherwise, break up the remaining virtually contiguous
> +hunks into individual direct maps.  */
> + for (sg = leader; sg < end; ++sg)
> + if (sg->dma_address == 2 || sg->dma_address == -2)
 should be == 1

> + sg->dma_address = 0;
> +
> + /* Retry.  */
> + return sg_fill(leader, end, out, arena, max_dma);
> + }
>  
>   out->dma_address = arena->dma_base + dma_ofs*PAGE_SIZE + paddr;
>   out->dma_length = size;

I am going to merge this one (however it won't help on the big memory
machines, it will only try to hide the problem on the machines with not
much memory above 2G).

Andrea



Re: alpha iommu fixes

2001-05-22 Thread Andrea Arcangeli

On Tue, May 22, 2001 at 06:44:09PM +0400, Ivan Kokshaysky wrote:
> On Tue, May 22, 2001 at 04:29:16PM +0200, Andrea Arcangeli wrote:
> > Ivan could you test the above fix on the platforms that needs the
> > align_entry hack?
> 
> That was one of the first things I noticed, and I've tried exactly
> that (2 instead of ~1UL).

Just in case (I guess it wouldn't matter much): are you sure you
tried it with the locking fixes applied too?

Andrea



Re: Kernel diff_small_2.4.5pre4_2.4.5pre5

2001-05-22 Thread Andrea Arcangeli

On Tue, May 22, 2001 at 10:04:39PM +0200, Andrea Arcangeli wrote:
> diff -urN 2.4.5pre4/arch/alpha/kernel/pci_iommu.c 2.4.5pre5/arch/alpha/kernel/pci_iommu.c
> --- 2.4.5pre4/arch/alpha/kernel/pci_iommu.c   Sun Apr  1 01:17:07 2001
> +++ 2.4.5pre5/arch/alpha/kernel/pci_iommu.c   Tue May 22 22:04:07 2001
> @@ -402,8 +402,20 @@
>   paddr &= ~PAGE_MASK;
>   npages = calc_npages(paddr + size);
>   dma_ofs = iommu_arena_alloc(arena, npages);
> - if (dma_ofs < 0)
> - return -1;
> + if (dma_ofs < 0) {
> + /* If we attempted a direct map above but failed, die.  */
> + if (leader->dma_address == 0)
> + return -1;
> +
> + /* Otherwise, break up the remaining virtually contiguous
> +hunks into individual direct maps.  */
> + for (sg = leader; sg < end; ++sg)
> + if (sg->dma_address == 2 || sg->dma_address == -2)
> + sg->dma_address = 0;
> +
> + /* Retry.  */
> + return sg_fill(leader, end, out, arena, max_dma);
> + }
>  
>   out->dma_address = arena->dma_base + dma_ofs*PAGE_SIZE + paddr;
>   out->dma_length = size;

This is just broken, as I said a few hours ago on l-k. Please replace
== 2 with == 1 as described in the earlier email. However it's not a
showstopper because it will trigger only after running out of pci
mappings (and by that time things are going to break pretty soon anyway
on the boxes with much more than 2G, where the 2G direct window has a
low probability of saving you). I assume the fact that I found this
patch in means you agree that the pci mapping bugs are an issue for 2.4
too, good.

I couldn't hack all day today, I will finish the alpha updates
before tomorrow though.

Andrea



Re: Swap strangeness using 2.4.5pre2aa1

2001-05-23 Thread Andrea Arcangeli

On Thu, May 24, 2001 at 03:16:48AM +0900, G. Hugh Song wrote:
> The following is the output from "free"
> =
>              total       used       free     shared    buffers     cached
> Mem:       1023128    1015640       7488          0        544     948976
> -/+ buffers/cache:      66120     957008
> Swap:      1021936    1021936          0
> ==

I get the same with egcs. To me it sounds like a broken VM (I shouldn't
have changed anything that can confuse the VM, so this should be
reproducible with 2.4.5pre5 vanilla, and in fact you also said you
reproduced it previously in 2.4.4).

Is it possible you booted with 'mem=something'? It seems to me that when
I boot with 'mem=something' the bad VM behaviour becomes more visible.

> I think I should back down to Kernel 2.2.20pre2aa1.

definitely a good idea until somebody fixes the VM in mainline.

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
> [..] I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have

exactly (softblocksize is 4k fixed, regardless of the page cache size to
avoid confusing device drivers).

> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Correct, the partial write logic is kind of disabled on x86 because the
artificial softblocksize of the blkdev pagecache matches the
pagecachesize but it should just work on the other archs.

Now I can try to make the bh more granular for partial writes in a
dynamic manner (so we don't pay the overhead of the 512byte bh in the
common case), but I think this would need its own additional logic and I
prefer to think about it after I've solved the coherency issues between
pinned buffer cache and filesystem, i.e. after the showstoppers are
solved and the patch is usable in real life (possibly with the overhead
of read-modify-write for some workloads doing small random write I/O).
An easy short term fix for removing the read-modify-write would be to use the
hardblocksize of the underlying device as the softblocksize, but again
that would cause us to pay for the 512byte bhs, which I'd rather not... ;)
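
A toy illustration of the read-modify-write cost mentioned above,
assuming a 4k softblocksize over 512-byte sectors. The in-memory
"device" and all names here are hypothetical; this is not the blkdev
pagecache code, only a sketch of why one small write touches a whole
block twice:

```c
#include <stdint.h>
#include <string.h>

#define SOFT_BLOCK_SIZE 4096u /* softblocksize from the text */
#define SECTOR_SIZE      512u /* assumed hardblocksize of the device */

/* Hypothetical in-memory "device", used only for illustration. */
struct toy_dev {
	uint8_t *data;        /* backing store */
	unsigned long reads;  /* whole-block reads issued */
	unsigned long writes; /* whole-block writes issued */
};

/* Write one 512-byte sector through a 4k-granular block. */
static void toy_write_sector(struct toy_dev *dev, uint64_t sector,
			     const uint8_t *buf)
{
	uint8_t block[SOFT_BLOCK_SIZE];
	uint64_t byte_off = sector * SECTOR_SIZE;
	uint64_t blk_start = byte_off - (byte_off % SOFT_BLOCK_SIZE);
	size_t in_blk = (size_t)(byte_off - blk_start);

	/* read: the whole soft block must be brought uptodate first */
	memcpy(block, dev->data + blk_start, SOFT_BLOCK_SIZE);
	dev->reads++;

	/* modify: only one sector of the block actually changes */
	memcpy(block + in_blk, buf, SECTOR_SIZE);

	/* write: the whole soft block goes back to the device */
	memcpy(dev->data + blk_start, block, SOFT_BLOCK_SIZE);
	dev->writes++;
}
```

With a 512-byte softblocksize the read step would disappear, at the
price of eight times as many buffer heads per page.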

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 06:13:13PM -0400, Alexander Viro wrote:
> Uh-oh... After you solved what?

The superblock is pinned by the kernel in buffercache while you fsck a
ro mounted ext2, so I must somehow update this superblock in the
buffercache before collecting away the pagecache containing more recent
info from fsck. It's all done lazily. I just thought not to break the
assumption that an idle buffercache will never become not uptodate
under you, because that seems not too painful to implement compared
to changing the fs: it puts the check in a slow path and it doesn't
break the API with the buffercache (so I don't need to change all the fs
to check if the superblock is still uptodate before marking it dirty).

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 04:40:14PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > Now, it may be that the preliminary patches from Andrea do not work this
> > way. I didn't look at them too closely, and I assume that Andrea basically
> > made the block-size be the same as the page size. That's how I would have
> > done it (and then waited for people to find real life cases where we want
> > to allow sector writes).
> 
> Due to limitations in low-level drivers, Andrea was forced to hardcode
> 4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

Yes, actually to trigger the read-modify-write logic not more than with
the current buffercache I could simply decrease the softblocksize of the
blkdev pagecache to 1k, like the default granularity of the current
buffercache before any filesystem is mounted, but that would impose a
_very_ significant performance hit to the non-cached case which is quite
important as well mainly for a blkdev I think.

I measured on high end disks that reading (out of cache) with a 4k
buffercache blocksize instead of a 1k buffercache blocksize is an exact
2x improvement, because at that speed the bottleneck becomes the work
that has to be done by the cpu.

In fact rawio on /dev/raw* is also 2 times slower than the 2.4 4k
buffercache on blkdev in those environments (of course with rawio the
cpu is not used much compared to buffered I/O), and that's one of the
reasons I also imposed a 4k granularity on direct I/O from
open("/dev/hda", O_DIRECT|O_RDWR). I haven't benchmarked it yet, but I
suspect that doing rawio with forced 4k bh (as opposed to the 512-byte
bh of /dev/raw*) will make O_DIRECT on the blkdev much faster than
buffered I/O on the blkdev through pagecache, just like O_DIRECT
recently scored 170MByte/sec of very scalable I/O, I think also because
it was done through ext2, which imposed a 4k softblocksize:

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/1175.html

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/att-1175/01-directio.png

(boudicca.tux.org is not online at the moment but I assume it will
return online soon)

However this is still flexible. Right now my first objective is to solve
the showstoppers (so for example I can run my machine with that patch
applied) and then we can think about how to solve the 4k/1k/512byte
softblocksize issues, possibly automatically or selectable from
userspace. I will try to work on the blkdev patch tomorrow to bring it
into a usable state.

Andrea



Re: rwsems and asm-constraint gcc bug

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 01:27:19PM +0100, David Howells wrote:
> 
> The bug in gcc 3.0 that stopped the inline asm constraints being interpreted
> properly, and thus prevented linux from compiling is now fixed.

I'm writing this on top of 2.4.5pre5aa3 compiled with gcc-3_0-branch and
binutils cvs mainline of this evening. No problem so far. Thanks!

Andrea



Re: writeback-highmem

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 10:26:30PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> >
> > This needed highmem fix from Rik is still missing too, so please apply
> >  along the other 5 (it's orthogonal so you can apply this one in any
> >  order you want).
> > 
> >  From: Rik van Riel <[EMAIL PROTECTED]>
> >  Subject: [PATCH][1/2] adjust dirty threshold for lowmem-only mappings
> 
> I've held off on this one because the recent throttling fix should have
> helped this problem.  Has anyone confirmed that this patch still actually
> fixes something?  If so, what was the scenario?

Without this fix write throttling is completely broken for a blkdev and
it won't start _at_all_ and it'll just keep hanging in the allocation
routines. I agree it won't explain oom (with the other fixes the VM
should writeback synchronously instead of running oom) but it may make
the box completely unusable under a cp /dev/zero /dev/somedevice.

There is a reason why we start write throttling before 100% of ram is
being locked by dirty pages in the pagecache path.

The beauty of this fix is that Rik allowed the pagecache not to have the
limit (in 2.4 pagecache had the limit too). Probably async writeback
won't start but at least the write throttling will and that's all we
need to keep the box running other apps at the same time of the write.

If the system goes unresponsive for 10 minutes and swaps during backups
or workloads working on the blkdev, users will file bug reports and
they'd be correct.

In short I agree this shouldn't be applied for oom, but it's still
definitely a correct and needed fix (and I rate it a bit more than just
an optimization).
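
A simplified model of why the threshold has to be adjusted for
lowmem-only mappings such as a blkdev. This is only a sketch under
stated assumptions: the struct, the function name and the plain
percentage formula are hypothetical, not the code of Rik's patch:

```c
/* Hypothetical, simplified memory description (units: pages). */
struct toy_mem_info {
	unsigned long total_pages;   /* lowmem + highmem */
	unsigned long highmem_pages; /* pages unusable by lowmem-only mappings */
};

/*
 * Dirty threshold in pages: a percentage of the memory the mapping can
 * actually use. For a lowmem-only mapping highmem is excluded first.
 */
static unsigned long toy_dirty_threshold(const struct toy_mem_info *mi,
					 int mapping_can_use_highmem,
					 unsigned int dirty_ratio_percent)
{
	unsigned long available = mi->total_pages;

	if (!mapping_can_use_highmem)
		available -= mi->highmem_pages;

	return (available / 100) * dirty_ratio_percent;
}
```

For example, with 4000000 total pages of which 3800000 are highmem and
a 40% ratio, the unadjusted threshold is 1600000 pages, far more than
the 200000 lowmem pages that exist, so a writer to a blkdev would never
be throttled. The adjusted threshold is 80000 pages.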


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 05:36:14PM +1100, Nick Piggin wrote:
> I think it should be turned on by default. I can't recall what

I think so too, since the number of people that can be bitten by this is
certainly higher than the number of people who know the VM internals
and know for what kind of workloads they need to enable this by hand to
avoid risking lockups (notably boxes without swap, or with heavy
pagetable allocations all the time, which is not uncommon with db usage).

This is needed on x86-64 too, to prevent pagetables from locking up the
dma zone. And it's needed on x86 as well, for the dma zone on <1G boxes.

Anyway if you leave it off by default I don't mind. With my new code,
forward ported straight from 2.4 mainline, it's possible for the first
time to set it from userspace without having to embed knowledge of the
kernel min_kbytes settings at boot time. So if you want it off by
default it simply means we'll guarantee it on our distro with userland.
Setting a sysctl at boot time is no big deal for us (of course leaving
it enabled by default in kernel space would help older distros where
userland isn't yet aware of it). So it's pretty much up to you; as long
as we can easily fix it up in userland it's fine with me, and I already
tried a dozen times to push mainline in what I believe to be the right
direction (like I already did in 2.4 mainline, since that same code is
enabled by default in 2.4).

The sysctl name had to change to lowmem_reserve_ratio because its
semantics are completely different now.
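
A rough sketch of ratio-based reserve semantics, modelled on how later
mainline kernels document lowmem_reserve_ratio. The exact formula of
the patch discussed here is not shown in this message, so treat the
formula, the three-zone layout and all names below as assumptions:

```c
#define NR_ZONES 3 /* toy model: 0 = DMA, 1 = NORMAL, 2 = HIGHMEM */

/*
 * reserve[i][j] = pages of zone i kept away from allocations that
 * could also have been satisfied from the higher zone j (j > i).
 * It is the sum of the pages in zones i+1..j divided by ratio[i].
 */
static void compute_lowmem_reserve(const unsigned long present[NR_ZONES],
				   const unsigned long ratio[NR_ZONES],
				   unsigned long reserve[NR_ZONES][NR_ZONES])
{
	int i, j;

	for (i = 0; i < NR_ZONES; i++)
		for (j = 0; j < NR_ZONES; j++)
			reserve[i][j] = 0;

	for (j = 1; j < NR_ZONES; j++) {
		unsigned long higher = present[j];

		for (i = j - 1; i >= 0; i--) {
			/* a ratio of 0 means "no reserve" in this sketch */
			reserve[i][j] = ratio[i] ? higher / ratio[i] : 0;
			higher += present[i];
		}
	}
}
```

A smaller ratio therefore reserves more of the lower zone, which is the
opposite direction from a tunable that counts pages directly.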


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small.  Why change the setting on
> all machines for the benefit of the tiny few?  Seems weird.  Especially
> when this problem could be solved with a few-line initscript.  Ho hum.

It's up to you. IMHO you're making a mistake, but I don't mind as long as
our customers aren't at risk of early oom kills (or worse, kernel
crashes) with some db load (especially without swap the risk is huge for
all users, since all anonymous memory will be pinned like ptes, but with
~3G of pagetables they're at risk even with swap).  At least you *must*
admit that without my patch applied as I posted it, there's a >0
probability of running out of normal zone, which will lead to an oom-kill
or a deadlock even though 10G of highmem might still be freeable (like
with clean cache). And my patch obviously cannot make it impossible to
run out of normal zone, since there's only 800m of normal zone and one
can open more files than what fits in normal zone, but at least it gives
the user the security that a certain workload can run reliably. Without
this patch there's no guarantee at all that any workload will run when
more than 1G of ptes is allocated.

The fix below is needed as well, and you won't find reports of people
reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I
know you were working on it (you said not in the weekend IIRC), but I've
upgraded to the latest bk so I had to fix it up quickly or I would have
had to run the racy code on my smp systems to test new kernels.

From: Andrea Arcangeli <[EMAIL PROTECTED]>
Subject: fixup smp race introduced in 2.6.11-rc1

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/mm/memory.c.~1~   2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c   2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_
 
spin_lock(&mapping->i_mmap_lock);
 
+   /* serialize i_size write against truncate_count write */
+   smp_wmb(); 
/* Protect against page faults, and endless unmapping loops */
mapping->truncate_count++;
+   /*
+* For archs where spin_lock has inclusive semantics like ia64
+* this smp_mb() will prevent to read pagetable contents
+* before the truncate_count increment is visible to
+* other cpus.
+*/
+   smp_mb();
if (unlikely(is_restart_addr(mapping->truncate_count))) {
if (mapping->truncate_count == 0)
reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct 
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
sequence = mapping->truncate_count;
+   smp_rmb(); /* serializes i_size against truncate_count */
}
 retry:
cond_resched();
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+   /*
+* No smp_rmb is needed here as long as there's a full
+* spin_lock/unlock sequence inside the ->nopage callback
+* (for the pagecache lookup) that acts as an implicit
+* smp_mb() and prevents the i_size read to happen
+* after the next truncate_count read.
+*/
 
/* no page was available -- either SIGBUS or OOM */
if (new_page == NOPAGE_SIGBUS)
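
The ordering that the patch above relies on can be modelled in userspace
with C11 atomics. This is only an analogy under stated assumptions: the
struct and function names are hypothetical, the C11 fences are only
roughly equivalent to smp_wmb()/smp_rmb()/smp_mb(), and the page table
handling is left out entirely:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fields of the mapping. */
struct toy_mapping {
	atomic_ulong i_size;         /* file size, shrunk by truncate */
	atomic_uint  truncate_count; /* bumped by every truncate */
};

/* Truncate side: publish the new size, then bump the counter. */
static void toy_truncate(struct toy_mapping *m, unsigned long new_size)
{
	atomic_store_explicit(&m->i_size, new_size, memory_order_relaxed);
	/* roughly smp_wmb(): i_size store ordered before the counter store */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add_explicit(&m->truncate_count, 1, memory_order_relaxed);
	/* roughly smp_mb(): increment ordered before the unmapping work */
	atomic_thread_fence(memory_order_seq_cst);
	/* ... the real code walks and unmaps the page tables here ... */
}

/* Fault side: returns true if the offset is still inside the file. */
static bool toy_fault(struct toy_mapping *m, unsigned long offset)
{
	unsigned int seq;
	unsigned long size;

	for (;;) {
		seq = atomic_load_explicit(&m->truncate_count,
					   memory_order_relaxed);
		/* roughly smp_rmb(): counter read before the i_size read */
		atomic_thread_fence(memory_order_acquire);
		size = atomic_load_explicit(&m->i_size, memory_order_relaxed);
		/* stands in for the lock/unlock pair inside ->nopage */
		atomic_thread_fence(memory_order_seq_cst);
		if (seq == atomic_load_explicit(&m->truncate_count,
						memory_order_relaxed))
			break;
		/* a truncate ran in the meantime: retry with fresh values */
	}
	return offset < size;
}
```

A reader that observes the incremented counter is guaranteed to also
observe the new i_size on its retry. A reader that still sees the old
counter relies on the unmapping pass, which runs after the increment, to
clean up after it.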



Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Thu, Jan 20, 2005 at 11:00:16PM -0800, Andrew Morton wrote:
> Last time we discussed this you pointed out that reserving more lowmem from
> highmem-capable allocations may actually *help* things.  (Tries to remember
> why) By reducing inode/dentry eviction rates?  I asked Martin Bligh if he
> could test that on a big NUMA box but iirc the results were inconclusive.

This is correct. By guaranteeing more memory to be freeable in lowmem
(ptes aren't freeable without a sigkill for example) the icache/dcache
will at least have a margin where it can grow independently from highmem
allocations.

> Maybe it just won't make much difference.  Hard to say.

I don't know myself if it makes a performance difference, all old
benchmarks have been run with this applied. This was applied for
correctness (i.e. to avoid sigkills or lockups), it wasn't applied for
performance. But I don't see how it could hurt performance (especially
given current code already does the check at runtime, which is
practically the only fast-path cost ;).

> >  The sysctl name had to change to lowmem_reserve_ratio because its
> >  semantics are completely different now.
> 
> That reminds me.  Documentation/filesystems/proc.txt ;)

Woops, forgotten about it ;)

> I'll cook something up for that.

Thanks. If you prefer I can write it too to relieve you from this load,
it's up to you. If you want to fix it yourself go ahead of course ;)


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 06:04:25PM +1100, Nick Piggin wrote:
> OK this is a fairly lame example... but the current code is more or
> less just lucky that ZONE_DMA doesn't usually fill up with pinned mem
> on machines that need explicit ZONE_DMA allocations.

Yep. For the DMA zone all slab cache will be a memory pin (like ptes for
highmem, but not that many people run with 3G of ram in ptes, and I
guess the ones doing it aren't normally using a mainline kernel in the
first place, so they're likely not running into it either). Slab
cache pinning the normal zone, on the other hand, has a higher
probability of being reproduced on l-k in random usage.


Re: OOM fixes 2/5

2005-01-20 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 08:08:21AM +0100, Andi Kleen wrote:
> So at least for GFP_DMA it seems to be definitely needed.

Indeed. Plus if you add a pci32 zone, it'll be needed for it too on
x86-64, like for the normal zone on x86, since ptes will go in highmem
while pci32 allocations will not. So while floppy might be fixed, this
issue would arise for the brand new pci32 zone needed by some devices
(i.e. nvidia, so not such an unlikely corner case).


Re: oom killer gone nuts

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 08:42:08AM +0100, Jens Axboe wrote:
> And especially not with 500MB of zone normal free, thanks :)

;) Are you sure you had 500m free even before the _first_ oom killing?

I assumed what you posted was not the first of the oom killing
messages. If it was the first, then there was a regression. But if OTOH
I didn't misunderstand your message and it wasn't the first, then what
you've seen is just the brokenness of 2.6 w.r.t. oom killing, that's
what made Thomas drive a few hours too, and you only have to apply the 5
patches I just posted, and everything will then work correctly
in terms of _not_ killing right and left anymore, even despite the 500m
free ;). I tested the code before posting and my regression test passed
at least, so it looked like there was no other regression. The several
rejects I got while porting the code all looked due to noop-cleanups. So
I doubt there was a regression and I'm optimistic you've just seen the
old bugs.


Re: oom killer gone nuts

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 09:09:41AM +0100, Jens Axboe wrote:
> Jan 20 13:22:15 wiggum kernel: oom-killer: gfp_mask=0xd1

This was a GFP_KERNEL|GFP_DMA allocation triggering this. However it
didn't look so much out of DMA zone, there's 4M of ram free. Could be
the ram was released by another CPU in the meantime if this was SMP (or
even by an interrupt in UP too).
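
For reference, a small sketch of how gfp_mask=0xd1 decodes to
GFP_KERNEL|GFP_DMA. The bit values are the ones I recall from 2.6-era
include/linux/gfp.h and should be treated as an assumption to check
against your own tree. The TOY_ names and the decoder are hypothetical:

```c
#include <string.h>

/* Assumed 2.6-era flag values; verify against include/linux/gfp.h. */
#define TOY_GFP_DMA     0x01u
#define TOY_GFP_HIGHMEM 0x02u
#define TOY_GFP_WAIT    0x10u
#define TOY_GFP_HIGH    0x20u
#define TOY_GFP_IO      0x40u
#define TOY_GFP_FS      0x80u
#define TOY_GFP_KERNEL  (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)

/* Write the names of the set flags into out, separated by '|'. */
static void decode_gfp(unsigned int mask, char *out, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} flags[] = {
		{ TOY_GFP_DMA, "DMA" },   { TOY_GFP_HIGHMEM, "HIGHMEM" },
		{ TOY_GFP_WAIT, "WAIT" }, { TOY_GFP_HIGH, "HIGH" },
		{ TOY_GFP_IO, "IO" },     { TOY_GFP_FS, "FS" },
	};
	size_t i;

	if (len == 0)
		return;
	out[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(mask & flags[i].bit))
			continue;
		if (out[0] != '\0')
			strncat(out, "|", len - strlen(out) - 1);
		strncat(out, flags[i].name, len - strlen(out) - 1);
	}
}
```

Under those assumed values, decode_gfp(0xd1, ...) yields
"DMA|WAIT|IO|FS", i.e. the three GFP_KERNEL bits plus the DMA zone bit.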

Could very well be you'll get things fixed by the lowmem_reserve patch,
that will reserve part of the dma zone, so with it you're sure it
couldn't have gone below 4M due to slab allocs like skb.

I recommend trying again with the patches applied, the oom stuff is so
buggy right now that it's better you apply the fixes and try again, and
if it still happens we know it's a regression.

Thanks!


seccomp for 2.6.11-rc1-bk8

2005-01-21 Thread Andrea Arcangeli
Hello,

This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
Cpushare (until trusted computing will hit the hardware market). This is
against 2.6.11-rc1-bk8. The progress is on schedule so far, so it might
not be a bad idea to merge this into the kernel sooner rather than later, so that
there will be some significant userbase capable of running the Cpushare client
as soon as it becomes available (plus I won't have to forward port the patch
all the time ;). Getting this merged anytime before the end of 2005 is going to
be fundamental for my project if my forecasts will turn out to be correct
(which is not guaranteed, but if I'm wrong that could also mean I need it
sooner ;), but anyway there is no short term urgency, so even 2.6.12/13 will be
ok, but if you can merge it now it's even better and it'll certainly save me
some time.

I remember you asked for syscalls, I can add them but I wouldn't mind to
be able to get/set the value still from the /proc API. I don't really
feel the need of syscalls, this is all but a fast path. The overhead of
creating the pipes and forking would be significant too. Ideally I could
add syscalls to make it easier to use in chroot environment (just in
case someone feels the need to stack seccomp on top of chroot), that's
the only reason why syscalls might ever be useful. But this is still
nice to have in /proc at least in readonly mode, so I see the current
patch as a good starting point and as valid code for the long term (not
overlapping with syscalls since `cat/echo` cannot be used with the syscalls).

As usual this is theoretically useful to run any kind of untrusted
bytecode on the computer. This means also code that might have bugs.
Like to decompress a mpeg stream securely regardless of the decoder lib,
or stuff like that. I've no idea if somebody is going to use it for that
though. I only know I'm going to use it with Cpushare 8).

Works for me:

[EMAIL PROTECTED]:~/cpushare/client/cpushare> python seccomp_test.py 
gcc -march=i686 -Os -Wall -fomit-frame-pointer -fno-common seccomp-loader.c -o seccomp-loader
gcc -c -march=i686 -Os -Wall -fomit-frame-pointer -fno-common bytecode.c -o bytecode.o
cpp bytecode.lds.S -o bytecode.lds.s
grep -A1 SECTION bytecode.lds.s > bytecode.lds
ld -T bytecode.lds bytecode.o /usr/lib/gcc-lib/i586-suse-linux/3.3.4/libgcc.a /usr/lib/libc.a /usr/lib/libm.a -N -o bytecode
objcopy -O binary bytecode -j .text bytecode.text.bin
objcopy -O binary bytecode -j .data bytecode.data.bin
gcc -c -march=i686 -Os -Wall -fomit-frame-pointer -fno-common -DMALICIOUS bytecode.c -o bytecode-malicious.o
ld -T bytecode.lds bytecode-malicious.o /usr/lib/gcc-lib/i586-suse-linux/3.3.4/libgcc.a /usr/lib/libc.a /usr/lib/libm.a -N -o bytecode-malicious
objcopy -O binary bytecode-malicious -j .text bytecode-malicious.text.bin
objcopy -O binary bytecode-malicious -j .data bytecode-malicious.data.bin
Starting computing some malicious bytecode
init
load
start
stop
receive_data failure
kill
exit_code 0 signal 9
The malicious bytecode has been killed successfully by seccomp
Starting computing some safe bytecode
init
load
start
stop
1509 counts
kill
exit_code 0 signal 0
The seccomp_test.py completed successfully, thank you for testing.
[EMAIL PROTECTED]:~/cpushare/client/cpushare> 

Thanks.
 
--- xxx/arch/i386/Kconfig   2005-01-21 09:14:54.0 +0100
+++ xx/arch/i386/Kconfig2005-01-21 09:07:57.0 +0100
@@ -33,6 +33,10 @@ config GENERIC_IOMAP
bool
default y
 
+config SECCOMP
+   bool
+   default y
+
 source "init/Kconfig"
 
 menu "Processor type and features"
--- xxx/arch/i386/kernel/entry.S2005-01-15 20:44:49.0 +0100
+++ xx/arch/i386/kernel/entry.S 2005-01-21 09:07:57.0 +0100
@@ -221,7 +221,8 @@ sysenter_past_esp:
SAVE_ALL
GET_THREAD_INFO(%ebp)
 
-   testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
+   /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
+   testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax
jae syscall_badsys
@@ -245,7 +246,8 @@ ENTRY(system_call)
SAVE_ALL
GET_THREAD_INFO(%ebp)
# system call tracing in operation
-   testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
+   /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
+   testw $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax
jae syscall_badsys
--- xxx/arch/i386/kernel/ptrace.c   2005-01-15 20:44:49.0 +0100
+++ xx/arch/i386/kernel/ptrace.c2005-01-21 09:07:57.0 +0100
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -678,6 +679,10 @@ void send_sigtrap(struct task_struct *ts
 __attribute_
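
The testb to testw change in the entry.S hunks above can be illustrated
in C. This is a sketch, not kernel code: bit 8 for the seccomp flag
comes from the comment in the patch, while the positions chosen for the
other two flags and all TOY_ names are assumptions for illustration:

```c
#include <stdint.h>

/* Bit 8 is stated in the patch comment; bits 0 and 7 are assumed. */
#define TOY_TIF_SYSCALL_TRACE (1u << 0)
#define TOY_TIF_SYSCALL_AUDIT (1u << 7)
#define TOY_TIF_SECCOMP       (1u << 8)

#define TOY_WORK_MASK \
	(TOY_TIF_SYSCALL_TRACE | TOY_TIF_SYSCALL_AUDIT | TOY_TIF_SECCOMP)

/* What a byte-wide test sees: only bits 0..7 of flags and mask. */
static int test_byte(uint32_t flags, uint32_t mask)
{
	return ((uint8_t)flags & (uint8_t)mask) != 0;
}

/* What a word-wide test sees: bits 0..15 of flags and mask. */
static int test_word(uint32_t flags, uint32_t mask)
{
	return ((uint16_t)flags & (uint16_t)mask) != 0;
}
```

A task with only the seccomp flag set gives test_byte() == 0 and
test_word() == 1, so a byte-wide test would let its syscalls skip the
slow path where the seccomp check lives.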

Re: seccomp for 2.6.11-rc1-bk8

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 01:47:01PM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > > This is the seccomp patch ported to 2.6.11-rc1-bk8, that I need for
> > > Cpushare (until trusted computing will hit the hardware market). 
> > > [...]
> > 
> > why do you need any kernel code for this? This seems to be a limited
> > ptrace implementation: restricting untrusted userspace code to only be
> > able to exec read/write/sigreturn.
> > 
> > So this patch, unless i'm missing something, duplicates in essence what
> > ptrace can do [...]
> 
> there's one thing ptrace wont do: if the ptrace parent dies unexpectedly
> and the child was 'running' (there is a small window where the child

You got it, I couldn't use ptrace right now. Pavel already suggested it
and I told him the problem with the parent being killed by oom.

> might not be stopped and where this may happen) then the child can get
> runaway. While i think this is theoretical (UML doesnt suffer from this
> problem), it is simple to fix - find below a proof-of-concept patch that
> introduces PTRACE_ATTACH_JAIL - ptraced children can never escape out of
> such a jail. (barely tested - but you get the idea.)

IMHO the complexity of ptrace makes it by definition less secure than
seccomp. Seccomp is extremely simple and self contained. This is why I
still prefer seccomp to fixing ptrace w.r.t. security.

Fixing ptrace w.r.t. security-tracing it'd be still nice, but I'd prefer
not to relay on ptrace when something as simple and robust as seccomp
can be implemented instead.

However if the kernel folks want me to use a "fixed version of
ptrace", I could use it too (performance isn't the issue). In _theory_
you're right it'd be completely equivalent after fixing the problem with
the parent dying unexpectedly. But from my part in practice I prefer to
rely _only_ on the much simpler seccomp patch (and on trusted computing
as soon as the hardware is available).

Even trusted computing will be less secure than seccomp from the point
of view of the seller (because it's a lot more complicated than
seccomp), but unlike with ptrace, the buyer will get both privacy
guarantees and guarantees about reliable results too (only against
software attacks). Having those two guarantees for the buyer will be
fundamental, so it will be worth decreasing the seller security a bit to
give these guarantees to the buyer (I'll most certainly leave an
exchange for seccomp at the same time I start the trusted computing
exchange, so if some seller doesn't trust the trusted computing code,
they can stick with the very secure seccomp approach), but right now,
seccomp seems the most secure solution from the seller standpoint, and
the buyer won't notice the difference between ptrace and seccomp.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: seccomp for 2.6.11-rc1-bk8

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 08:55:22PM +0100, Ingo Molnar wrote:
> 
> * Chris Wright <[EMAIL PROTECTED]> wrote:
> 
> > * Rik van Riel ([EMAIL PROTECTED]) wrote:
> > > Yes, but do you care about the performance of syscalls
> > > which the program isn't allowed to call at all ? ;)
> > 
> > Heh, no, but it's for every syscall not just denied ones.  Point is
> > simply that ptrace (complexity aside) doesn't scale the same.
> 
> seccomp is about CPU-intense calculation jobs - the only syscalls
> allowed are read/write (and sigreturn). UML implements a full kernel
> via ptrace and CPU-intense applications run at native speed.

Indeed. Performance is not an issue (in the short term at least, since
those syscalls will be probably network bound).

The only reason I couldn't use ptrace is what you found, that is the oom
killing of the parent (or the CPU seller killing it by hand by
mistake, I must prevent him from screwing himself ;). Even after
fixing ptrace, I have a hard time preferring ptrace, when a simple,
localized and self contained solution like seccomp is available.

The reason I called it seccomp and not restricted syscalls, is that I'm
not allowing Chris to choose which syscall to restrict. I restricted
only the ones that are required to be able to compute securely, hence
the name "seccomp" and not "restricted syscalls". Obviously I'm
restricting a certain number of syscalls to create this seccomp mode.

I'm open to different solutions, I can even live with you forcing me to
use the fixed version of ptrace, but you must be comfortable to take the
blame if it breaks ;). Personally I'm comfortable to take the blame only
if seccomp breaks, it's so simple that it can't break. And with break I
don't mean 0xf00f, that's a minor issue that will be autodetected by the
system. I mean breaking like killing the ptrace parent right now... That
can be fixed up reasonably securely too, but it _can't_ be autodetected
easily (I keep cross logs for everything so I can trace it, but it
won't be an immediate/automated task like the 0xf00f or fnclex).


Re: seccomp for 2.6.11-rc1-bk8

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 09:54:16PM +0100, Ingo Molnar wrote:
> - the second barrier is the 'jail' of the ptraced task. Especially with
>   PTRACE_SYSCALL, the things a child ptraced process can do are
>   extremely limited, everything it tries to do will trap, the task will
>   suspend and the parent runs. The task is completely passive and ptrace
>   on that end is a pretty small engine that stops/traps/restarts user
>   processing without alot of frills.
> 
> historically there has been alot less problems with the second barrier. 
> (in fact i cannot remember even one security issue in that area.)

I agree there are fewer problems in that area.  But there's still a great
deal of complexity in ptrace that I preferred to keep out of the
security equation.

uml can't run with seccomp, uml is forced to ptrace, it has to trap the
arguments and everything.

Once kernel CVS returns up, I'll get an email as soon as somebody
touches kernel/seccomp.c or the other files involved, and I can keep the
eye on the code and verify all modifications very quickly (plus there
will be very few modifications on those files, unlike for the ptrace
code that is much more under development). Keeping ptrace under control
would be more costly on my side.

> i'm not forcing anyone to do anything, but i think the most logical
> solution is to use ptrace. It's there on every Linux box so your client
> can run even on 'older' Linux boxes. (You might want to detect in the
> client whether the OOM race is fixed in a kernel, but it should not be a
> truly big issue.) Waiting for any extra API to get significant userbase
> takes at least 1-2 years - while ptrace is here and available on every

Note that I'm not ready for production myself yet, I'm suggesting to
include this now, exactly to get some real userbase ready in 1-2 years.
And after that with trusted computing it'll take another few years
before the trusted Cpushare exchange can start in parallel to the
seccomp one.  My schedule is planned for a much longer timeframe, I
doubt anything significant could happen this year regardless of ptrace
or seccomp.

Plus I would never depend on the users to do the right thing (i.e. not
to run oom etc..). So I'm forced to wait the 1-2 years anyways either to
get seccomp merged, or to get your ptrace extension merged. If I use
ptrace, the current kernels can't prevent the Cpushare users from hurting
themselves, so I won't allow current unpatched kernels to run.

I have no hurry, my first prio is to do everything safely, I don't care
to grow the userbase fast if I have to add some risk to the users to
do that.

Note also that all Cpushare client software that runs on the user
computers is GPL, in turn without pending patents and completely free
software, so you're very free to take it, rewrite it with ptrace, and
ship it to your users now. Even Microsoft can write its own Cpushare
client and ship it in Windows just fine.  You can fake the kernel
version to tell the server 2.6.11+seccomp is running, even though 2.6.9 with
the insecure ptrace might be running instead (the Cpushare protocol does
most checks on the server side btw).  I have no control on that and as
long as I have no liability I'm fine (and I write in capital letters no
liability and no warranty in the account creation procedure of course).
But the client I will ship myself on cpushare.com will have security as
priority number 1 in mind, and in turn I can't allow it to run with the
current ptrace kernel code.

(however if you want to write your own client for your own OS, please
let me know privately, instead of faking the kernel version, that's
going to be more secure should you need me to shut down just your clients
because you found a security issue in your code)

If you noticed, I also made sure that after seccomp is enabled, it is
impossible to disable it:

	/* can set it only once to be even more secure */
	if (unlikely(tsk->seccomp_mode))
		return -EPERM;

This is a *major* feature. I'm sure we can hack ptrace for that too with
yet another patch, but isn't it so much simpler to merge seccomp to get
the highest degree of security? The only way a user can screw himself
with seccomp is to write the right bit in /dev/mem at the right bit
offset. And I exclude that can happen by mistake. I mean, it has a
lower probability than a ram bitflip ;).
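
To make the set-once rule concrete, here is a minimal userspace sketch of
the same idea. It is not the kernel code: struct sk_task and
sk_set_seccomp() are made-up names for illustration, and the only mode
this sketch knows about is 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task field checked in the patch. */
struct sk_task {
	int seccomp_mode;	/* 0 = disabled, nonzero = enabled */
};

/*
 * Enable the restricted mode on a task.
 * Returns 0 on success, -EPERM if a mode is already set (it can never
 * be changed or cleared again), -EINVAL for an unknown mode.
 */
static int sk_set_seccomp(struct sk_task *tsk, int mode)
{
	assert(tsk != NULL);

	/* can set it only once */
	if (tsk->seccomp_mode)
		return -EPERM;
	if (mode != 1)
		return -EINVAL;
	tsk->seccomp_mode = mode;
	return 0;
}
```

Once the first call succeeds, every later call fails with -EPERM
whatever mode is requested, so there is no code path left that clears
the flag.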

> Linux box. If you require 'users' to go with a new (or worse: patched)
> kernel then you are creating a pretty significant artificial market
> penetration barrier for your application.

This is fine. It's a long term project, I don't care about the short
term, I only care that the users are as safe as possible.

> also, with more applications relying on ptrace it will become more
> tested, more robust and people will do speedups. I think the fact that
> UML uses ptrace is already a very good sign that it's robust for such
> purposes. (_Also_, if there's a security problem in the ptrace barrier,
> you'd like to know about it

Re: User space out of memory approach

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 05:27:11PM -0400, Mauricio Lin wrote:
> Hi Andrea,
> 
> I applied your patch and I am checking your code. It is really a very
> interesting work. I have a question about the function
> __set_current_state(TASK_INTERRUPTIBLE) you put in out_of_memory
> function. Do not you think it would be better put set_current_state
> instead of __set_current_state function? AFAIK the set_current_state
> function is more feasible for SMP systems, right?

set_current_state is needed only when you need to place a memory barrier
after __set_current_state. So it's needed in the usual wait_event loop,
right after registering in the waitqueue. Example:

	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	/*
	 * don't alter the task state if this is just going to
	 * queue an async wait queue callback
	 */
	if (is_sync_wait(wait))
		set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);

and even in the above it is needed only because spin_unlock has inclusive
semantics in ia64. In 2.4 there was no unlock at all after
set_current_state and it was like this:


		set_current_state(TASK_UNINTERRUPTIBLE);	\
		if (condition)					\
			break;					\
		schedule();					\

The rule of thumb is that if there's nothing between set_current_state
and schedule() then __set_current_state is more efficient and equally
safe to use. And the oom killer path I posted falls in this category,
nothing in between set_current_state and schedule, so no reason to place
memory barriers in there.
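
A rough userspace analogy of the two variants, using C11 atomics. This
is only a sketch of the ordering argument, not kernel code: all the sk_*
names are invented here, and a seq_cst fence stands in for the kernel
memory barrier.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel task states. */
enum { SK_RUNNING = 0, SK_INTERRUPTIBLE = 1, SK_UNINTERRUPTIBLE = 2 };

static atomic_long sk_state;	/* models current->state, starts as SK_RUNNING */

/* Analog of __set_current_state(): plain store, no barrier. */
static void sk_set_state_nobarrier(long state)
{
	atomic_store_explicit(&sk_state, state, memory_order_relaxed);
}

/* Analog of set_current_state(): store followed by a full barrier. */
static void sk_set_state(long state)
{
	atomic_store_explicit(&sk_state, state, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/*
 * One pass of a wait loop. Returns 1 if the caller would go on to
 * schedule(), 0 if the condition was already true.
 */
static int sk_wait_once(atomic_int *condition)
{
	/* the state store must be ordered before the condition is read */
	sk_set_state(SK_UNINTERRUPTIBLE);
	if (atomic_load(condition)) {
		sk_set_state_nobarrier(SK_RUNNING);
		return 0;
	}
	return 1;
}
```

The barrier matters in sk_wait_once() because a condition is tested
between the state change and the sleep. When nothing is tested in
between, as in the oom killer path, the nobarrier variant is enough.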

Hope this helps ;)


Re: seccomp for 2.6.11-rc1-bk8

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 01:31:46PM -0800, Roland McGrath wrote:
> When gdb has a bug, people want to be able to kill it and get on with using
> their program, not have their program always be killed too.

What I need is that the program is killed right away synchronously as
soon as the "debugger" detaches (to me that's a needed feature). No
matter why the debugger detached.  This is the opposite of what
ptrace/strace does right now.

Just try to attach to a task with strace -p, then kill strace with -9,
the task will keep going as if nothing had happened. I need the child
killed too instead (before the parent unptraces the child).

Probably the reason why the app gets killed is that gdb, the ptracing
task, is the process leader of the process group like Ingo suggested. But
I'd rather not depend on leaders/groups/pids/signals, when I can do it
with do_exit and a check on the syscall number.

Ptrace does a lot more than what I need, I don't care about parameters or
anything more than the syscall number, I don't need to change the
retvals during syscall return or to check registers or to stop a task.
Even the auditing subsystem could be implemented by putting all tasks
under strace and by having the ptracers communicating with each other
with pipes to generate a global info. But it wouldn't be as reliable and
as simple as having kernel code doing it.

I'm still open to do it with ptrace if there's a consensus on l-k to do
it in that direction, it's probably going to work fine too but if I
didn't feel safer with seccomp I would be doing ptrace in the first
place, it's not like I've forgotten I could do it with ptrace too (like
Pavel already reminded me some months ago).


Re: User space out of memory approach

2005-01-21 Thread Andrea Arcangeli
On Fri, Jan 21, 2005 at 05:45:13PM -0400, Mauricio Lin wrote:
> Hi Andrew,
> 
> I have another question. You included an oom_adj entry in /proc for
> each process. This was the approach you used in order to allow someone
> or something to interfere the ranking algorithm from userland, right?
> So if i have an another ranking algorithm in user space, I can use it
> to complement the kernel decision as necessary. Was it your idea?

Yes, you should use your userspace algorithm to tune the oom killer via
the oom_adj and you can check the effect of your changes with oom_score.
I posted a one liner ugly script to do that a few days ago on l-k.

The oom_adj has this effect on the badness() code:

	/*
	 * Adjust the score by oomkilladj.
	 */
	if (p->oomkilladj) {
		if (p->oomkilladj > 0)
			points <<= p->oomkilladj;
		else
			points >>= -(p->oomkilladj);
	}

The bigger the points become, the more likely the task will be chosen
by the oom killer.
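
To see the numbers, here is the same shift logic as a small standalone
function. adjust_points() is a made-up name for this sketch, and the
[-16, 15] range is the one accepted by the oom_adj write handler in the
protect-pids patch.

```c
#include <assert.h>

/*
 * Userspace copy of the shift logic quoted above.
 * A positive oomkilladj doubles the score once per step, a negative one
 * halves it once per step, zero leaves it alone.
 */
static unsigned long adjust_points(unsigned long points, int oomkilladj)
{
	assert(oomkilladj >= -16 && oomkilladj <= 15);

	if (oomkilladj > 0)
		points <<= oomkilladj;
	else if (oomkilladj < 0)
		points >>= -oomkilladj;
	return points;
}
```

So with a badness of 1000, an oom_adj of 3 gives 8000 and an oom_adj of
-3 gives 125. Small scores combined with -16 end up at 0.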


Re: OOM fixes 1/5

2005-01-21 Thread Andrea Arcangeli
I noticed 1/5 had a glitch, this is an update. It won't alter the
ordering, the other patches will still apply cleanly.

Thanks.

From: [EMAIL PROTECTED]
Subject: protect-pids

This is protect-pids, a patch to allow the admin to tune the oom killer.
The tweak is inherited between parent and child so it's easy to write a
wrapper for complex apps.

I made used_math a char in light of later patches. Current patch
breaks alpha, but future patches will fix it.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

--- x/fs/proc/base.c	2005-01-15 20:44:58.0 +0100
+++ xx/fs/proc/base.c   2005-01-22 07:02:50.0 +0100
@@ -72,6 +72,8 @@ enum pid_directory_inos {
PROC_TGID_ATTR_FSCREATE,
 #endif
PROC_TGID_FD_DIR,
+   PROC_TGID_OOM_SCORE,
+   PROC_TGID_OOM_ADJUST,
PROC_TID_INO,
PROC_TID_STATUS,
PROC_TID_MEM,
@@ -98,6 +100,8 @@ enum pid_directory_inos {
PROC_TID_ATTR_FSCREATE,
 #endif
PROC_TID_FD_DIR = 0x8000,   /* 0x8000-0x */
+   PROC_TID_OOM_SCORE,
+   PROC_TID_OOM_ADJUST,
 };
 
 struct pid_entry {
@@ -133,6 +137,8 @@ static struct pid_entry tgid_base_stuff[
 #ifdef CONFIG_SCHEDSTATS
E(PROC_TGID_SCHEDSTAT, "schedstat", S_IFREG|S_IRUGO),
 #endif
+   E(PROC_TGID_OOM_SCORE, "oom_score",S_IFREG|S_IRUGO),
+   E(PROC_TGID_OOM_ADJUST,"oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
 };
 static struct pid_entry tid_base_stuff[] = {
@@ -158,6 +164,8 @@ static struct pid_entry tid_base_stuff[]
 #ifdef CONFIG_SCHEDSTATS
E(PROC_TID_SCHEDSTAT, "schedstat",S_IFREG|S_IRUGO),
 #endif
+   E(PROC_TID_OOM_SCORE,  "oom_score",S_IFREG|S_IRUGO),
+   E(PROC_TID_OOM_ADJUST, "oom_adj", S_IFREG|S_IRUGO|S_IWUSR),
{0,0,NULL,0}
 };
 
@@ -384,6 +392,18 @@ static int proc_pid_schedstat(struct tas
 }
 #endif
 
+/* The badness from the OOM killer */
+unsigned long badness(struct task_struct *p, unsigned long uptime);
+static int proc_oom_score(struct task_struct *task, char *buffer)
+{
+	unsigned long points;
+	struct timespec uptime;
+
+	do_posix_clock_monotonic_gettime(&uptime);
+	points = badness(task, uptime.tv_sec);
+	return sprintf(buffer, "%lu\n", points);
+}
+
 //
 /*   Here the fs part begins*/
 //
@@ -657,6 +677,56 @@ static struct file_operations proc_mem_o
.open   = mem_open,
 };
 
+static ssize_t oom_adjust_read(struct file * file, char * buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	char buffer[8];
+	size_t len;
+	int oom_adjust = task->oomkilladj;
+	loff_t __ppos = *ppos;
+
+	len = sprintf(buffer, "%i\n", oom_adjust);
+	if (__ppos >= len)
+		return 0;
+	if (count > len-__ppos)
+		count = len-__ppos;
+	if (copy_to_user(buf, buffer + __ppos, count))
+		return -EFAULT;
+	*ppos = __ppos + count;
+	return count;
+}
+
+static ssize_t oom_adjust_write(struct file * file, const char * buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	char buffer[8], *end;
+	int oom_adjust;
+
+	if (!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+	memset(buffer, 0, 8);
+	if (count > 6)
+		count = 6;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	oom_adjust = simple_strtol(buffer, &end, 0);
+	if (oom_adjust < -16 || oom_adjust > 15)
+		return -EINVAL;
+	if (*end == '\n')
+		end++;
+	task->oomkilladj = oom_adjust;
+	if (end - buffer == 0)
+		return -EIO;
+	return end - buffer;
+}
+
+static struct file_operations proc_oom_adjust_operations = {
+	read:		oom_adjust_read,
+	write:		oom_adjust_write,
+};
+
 static struct inode_operations proc_mem_inode_operations = {
.permission = proc_permission,
 };
@@ -1336,6 +1406,15 @@ static struct dentry *proc_pident_lookup
ei->op.proc_read = proc_pid_schedstat;
break;
 #endif
+   case PROC_TID_OOM_SCORE:
+   case PROC_TGID_OOM_SCORE:
+   inode->i_fop = &proc_info_file_operations;
+   ei->op.proc_read = proc_oom_score;
+   break;
+   case PROC_TID_OOM_ADJUST:
+   case PROC_TGID_OOM_ADJUST:
+   inode->

Re: seccomp for 2.6.11-rc1-bk8

2005-01-22 Thread Andrea Arcangeli
On Sat, Jan 22, 2005 at 11:32:42AM +0100, Pavel Machek wrote:
> Well, seccomp is also getting very little testing, when ptrace gets a
> lot of testing; I know that seccomp is simple, but I believe testing
> coverage still make ptrace better choice.

It's not testing that makes code more secure. Testing verifies the code
works in production, but testing almost never helps to find security
issues, and often not even hidden subtle race conditions. Check how many
security bugs have been found with testing.  Just go to bugtraq and count
them. I simply cannot rely on testing for the security part. I will
rely on testing for everything else but not for this.


Re: seccomp for 2.6.11-rc1-bk8

2005-01-22 Thread Andrea Arcangeli
On Sat, Jan 22, 2005 at 08:42:42PM +0100, Pavel Machek wrote:
> Well, then you can help auditing ptrace()... It is probably also true
> that more people audited ptrace() than seccomp :-).

Why should I spend time auditing ptrace when I have a superior solution
that doesn't require any auditing from me at all? I have a huge pile of work,
I'm not doing this for fun, just thinking of wasting time auditing a
single line of ptrace code is insane as far as I'm concerned (if I can
avoid it with a more robust, less likely to break and simpler approach).
If the l-k community forces me to use ptrace, I'll be forced to do that
indeed (and you should be ready to take the blame if something goes
wrong), but be sure I'll try as much as I can to stay away from ptrace
completely.  ptrace is a debugging knob, uml itself is a debugging tool
that depends on a debugging knob and that's fine. I'm not doing a
debugging tool, I'm doing something that requires the maximum level of
security ever, and using ptrace is dead wrong for that IMHO.


Re: seccomp for 2.6.11-rc1-bk8

2005-01-22 Thread Andrea Arcangeli
On Sun, Jan 23, 2005 at 01:07:04AM +0100, Pavel Machek wrote:
> Adding code is easy, but in the long term would lead to maintainance
> nightmare. Adding seccomp code that does subset of ptrace, just
> because ptrace audit is lot of work, seems like a wrong thing to
> do. Sorry.

Even if I do the ptrace audit right now, within 6 months something can
change and the implications of the changes won't be as trivial to
evaluate as if entry.S or seccomp.c have changed.

The userland side will be a lot more complicated to implement too.

Do you want compressed video streams to be played securely and
efficiently? I can't see a better solution than seccomp. ptrace would be
slower and it'd require ugly code to be written in userland. Streams
are going to pump some stuff into the pipes and this will avoid
quite a number of schedules per second (regardless of buffering). The
seccomp API is just tricky enough without having to hardcode into every
userland app the number of the syscalls. Seccomp at least gives a slight
chance to write arch independent code while still providing lowlevel
security from the OS, there's no way to use ptrace_syscall in an arch
independent manner.

In the last patch I sent privately to Andrew I made it a config option,
but I recommend not to disable it, or you won't be able to run the
Cpushare client. Andrew's right seccomp.o would waste precious bytes
(not kbytes) on embedded systems, so it has to be a config option for
that. You can still modify it to use ptrace freely, but then I will have
nothing to do with the problems that may arise over time by using ptrace
within the GPL'd Cpushare client code and I personally do not approve
the use of ptrace there (but it's GPL so you can modify it).  I'm doing
something that I can trust to run on my own desktop system, and
personally seccomp is the only thing I'm comfortable depending on. Plus
the userland gets so much simpler as well. It's not only a problem of
trusting the kernel space of ptrace.


Re: seccomp for 2.6.11-rc1-bk8

2005-01-22 Thread Andrea Arcangeli
On Sat, Jan 22, 2005 at 07:43:26PM -0500, Rik van Riel wrote:
> On Sun, 23 Jan 2005, Andrea Arcangeli wrote:
> 
> >I'm doing something that requires the maximum level of
> >security ever,
> 
> You're kidding, right ?

Why should I be kidding? The client code I'm doing has to be at least as secure
as ssh and the firewall code, what else has to be more secure than that?
Neither ssh nor the firewall code depends on ptrace for their security. The
nice thing is that I can embed all the security in the kernel with
seccomp, and I'd be a fool not to try to get it merged, and to
complicate my life with ptrace instead.

Once seccomp is in, I believe there's a chance that security people use
it for more than Cpushare while I don't think there's a chance you'll
see security people using ptrace_syscall hardcoding the syscall numbers
in every userland app out there that may have to parse untrusted data
with potentially buggy bytecode (i.e. decompression bytecode etc..).


Re: seccomp for 2.6.11-rc1-bk8

2005-01-22 Thread Andrea Arcangeli
On Sat, Jan 22, 2005 at 11:43:06PM -0500, [EMAIL PROTECTED] wrote:
> It's a poor idea to confuse "secure" with "can't break out of the sandbox".

The only point I'm making with seccomp, is that if it can't break out of
the sandbox it's secure. I didn't mean that the only way to make it
secure is to put it in the sandbox of course.

> And they don't even depend on seccomp or ptrace for the security either...

Indeed.

> Security people probably won't be interested, specifically because it's
> way too inflexible.  Very few real-life applications can be made to fit
> into a "open all the files you might need, then shut yourself into a 
> read/write syscalls only" model.

This is exactly correct. Recycled matter is of lower quality. Not
everything is going to be printed on recycled paper, your vacation
photos cannot be printed on recycled paper. But a few may actually
appreciate recycled paper at a much cheaper price for an extremely tiny
niche of apps. It'll be a mess to be able to use it the first time, but
after they start using it they'll get a ton of it very cheap and it
might work as well as first quality paper for them. Perhaps somebody not
buying paper because it was too expensive may also start buying the
recycled paper because it gets affordable (yeah, after the initial
dealing with the recycled matter conversion).

> In fact, a case could be made that the unnatural contortions needed to
> restructure applications into a seccomp model actually *decrease* the
> overall security, because of more complicated setup code being more
> vulnerable to attack.  Also, the fact that you need to keep open() out

All setup code before the execve of the loader (and the loader is few
lines of C only) is not in C/C++, which means first of all no buffer
overflows. It's a quite small piece of code as well. Sure there can
still be a bug there, but clearly some software must exist to start
the seccomp mode. But this software won't be the binfmt_elf.c and it
will not be written in C (which is also why using ptrace is way
annoying, since it'd require more C code), it'll be small, and it will
be written with security in mind. I've already uploaded that software in
the website if you want to check it (ignore the gui part, it's obsolete).

Just the fact it's not in C rules out 90% of possible exploits.

> of the permitted set for seccomp to make any sense means that you need to
> open all the possible files up front.  So now you're handing the program
> *more* access to files than they should

They're not files, they're pipes. There are only two open, fd 0 and fd 1,
and no data emitted and received by those two pipes is being
computed outside seccomp. It's as if you push .mpeg data into fd 0 and
you read from fd 1 and you write it in the framebuffer. Even if
something goes wrong in the library, at worst you'll see garbage on
the screen.
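
As a sketch of what such a worker looks like from the inside, here is a
trivial filter that touches the outside world only through read() and
write() on two descriptors. The function name and the upper-casing
"decoder" are made up for illustration, a real worker would do the
decompression there instead.

```c
#include <assert.h>
#include <ctype.h>
#include <unistd.h>

/*
 * Copy bytes from 'in' to 'out', upper-casing them on the way.
 * The only syscalls issued are read() and write().
 * Returns the number of bytes forwarded, or -1 on error.
 */
static long filter_loop(int in, int out)
{
	unsigned char buf[4096];
	long total = 0;
	ssize_t n;

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		for (ssize_t i = 0; i < n; i++)
			buf[i] = (unsigned char)toupper(buf[i]);

		/* write() may be partial, keep going until the chunk is out */
		while (off < n) {
			ssize_t w = write(out, buf + off, (size_t)(n - off));

			if (w <= 0)
				return -1;
			off += w;
		}
		total += n;
	}
	return n < 0 ? -1 : total;
}
```

Inside the sandbox this would be called as filter_loop(0, 1). Whatever
garbage a buggy transformation produces, it can only come out of fd 1.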

I don't think a model like this can decrease security.

The last YOU update I did, fetched an update of some decoding library,
now if it was running under seccomp it couldn't do any damage. The same
is true for the zlib trouble some time ago.

I'm not suggesting everything should run inside seccomp, and of course
such an update would be happening anyway since not every app will run
under seccomp, but certainly if you've a _special_ critical app that you
don't want to risk to be exploited by a libz bug, then seccomp may help
and it's going to be a lot more handy to use than ptrace.

> Oh, come *ON*, Andrea.  This is a red herring and you *know* it.  The only
> people who will be hardcoding syscall numbers are the same idiots that
> hardcoded capability masks instead of "#include " and
> using the CAP_* defines.

I didn't mean hardcoding in terms of numbers, I mean in terms of
__NR_read. Just read the 32bit emulation code, I had to use ifdef
TIF_IA32, that's the best I could do, and I doubt you would be able to
write much cleaner code in userland either.
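
To show why the numbers are the annoying part, here is a minimal sketch
of a per-ABI whitelist check. It is not the code from the patch: the
table and function names are made up, only read, write and sigreturn are
listed, and the numbers are the ones from the i386 and x86-64 syscall
tables.

```c
#include <assert.h>
#include <stddef.h>

/* The same three calls have different numbers in each ABI. */
static const int allowed_x86_64[] = {
	0,	/* read */
	1,	/* write */
	15,	/* rt_sigreturn */
};

static const int allowed_i386[] = {
	3,	/* read */
	4,	/* write */
	119,	/* sigreturn */
};

/*
 * Returns 1 if syscall 'nr' may run in the restricted mode,
 * 0 if the task would have to be killed instead.
 */
static int syscall_allowed(int nr, int is_32bit)
{
	const int *table = is_32bit ? allowed_i386 : allowed_x86_64;
	size_t n = is_32bit
		? sizeof(allowed_i386) / sizeof(allowed_i386[0])
		: sizeof(allowed_x86_64) / sizeof(allowed_x86_64[0]);

	for (size_t i = 0; i < n; i++)
		if (table[i] == nr)
			return 1;
	return 0;
}
```

Number 3 is read for a 32bit task but close for a 64bit one, so picking
the wrong table silently allows the wrong call. A ptrace based monitor
in userland would need the same kind of table for every arch it runs on.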

> And if a filename has a runtime dependency on the untrusted data (consider
> any sort of web server or browser or mail program or anything else that
> accepts a "suggested filename" as input), things get very difficult very 
> quickly.
> 
> I can pass ptrace a SYSCALL_OPEN, and then call my untrusted code, and then
> look at the filename at runtime and see if there's something hinky going on.
> I can even apply heuristics like "The first file opened should be THIS one,
> then THOSE 4 shared libraries in order, then THIS file, and then the NEXT file
> is dependent on user input, but has to start with $USER/tmp/workdir, and then
> there's two other opens of files X and Y, and then no others should happen".
> Using seccomp, you don't get that choice.  You either have to jump through
> hoops to get all that set up beforehand, or allow open() in all its glory.

I don't get what you mean here. Anyway the filedescriptors inside
seccomp are never going to be files, and there will be only two. I can
add some documentation if it gets merged.

Bu

Re: memory leak in 2.6.11-rc2

2005-01-25 Thread Andrea Arcangeli
On Mon, Jan 24, 2005 at 10:45:47PM -0500, Dave Jones wrote:
> On Tue, Jan 25, 2005 at 02:19:24PM +1100, Andrew Tridgell wrote:
>  > The problem I've hit now is a severe memory leak. I have applied the
>  > patch from Linus for the leak in free_pipe_info(), and still I'm
>  > leaking memory at the rate of about 100Mbyte/minute.
>  > I've tested with both 2.6.11-rc2 and with 2.6.11-rc1-mm2, both with
>  > the pipe leak fix. The setup is:
> 
> That's a little more extreme than what I'm seeing, so it may be
> something else, but my firewall box needs rebooting every
> few days. It leaks around 50MB a day for some reason.
> Given it's not got a lot of ram, after 4-5 days or so, it's
> completely exhausted its swap too.
> 
> It's currently on a 2.6.10-ac kernel, so it's entirely possible that
> we're not looking at the same issue, though it could be something
> thats been there for a while if your workload makes it appear
> quicker than a firewall/ipsec gateway would.
> Do you see the same leaks with an earlier kernel ?
> 
> post OOM (when there was about 2K free after named got oom-killed)
> this is what slabinfo looked like..
> 
> dentry_cache        1502   3775    160   25    1 : tunables  120   60    0 : slabdata    151    151      0
> vm_area_struct      1599   2021     84   47    1 : tunables  120   60    0 : slabdata     43     43      0
> size-128            3431   6262    128   31    1 : tunables  120   60    0 : slabdata    202    202      0
> size-64             4352   4575     64   61    1 : tunables  120   60    0 : slabdata     75     75      0
> avtab_node          7073   7140     32  119    1 : tunables  120   60    0 : slabdata     60     60      0
> size-32             7256   7616     32  119    1 : tunables  120   60    0 : slabdata     64     64      0

What is avtab_node? There's no such thing in my kernel. But the above
can be ok. Can you show meminfo too after oom kill?

Just another datapoint: my firewall runs a kernel based on 2.6.11-rc1-bk8 with
all the needed oom fixes and I've no problems on it yet. I ran it oom
and this is what I get after the oom:

athlon:/home/andrea # free
             total       used       free     shared    buffers     cached
Mem:        511136      50852     460284          0        572      15764
-/+ buffers/cache:      34516     476620
Swap:      1052248          0    1052248
athlon:/home/andrea # 

The above is sane, 34M is very reasonable for what's loaded there
(there's the X server running, named too, and various other non standard
daemons, one even has a virtual size of >100m so it's not a tiny thing),
so I'm quite sure I'm not hitting a memleak, at least not on the
firewall. No ipsec on it btw, and it's a pure IDE without anything
special, just quite a few nics and USB usermode running all the time.

athlon:/home/andrea # uptime
  1:34pm  up 2 days 12:08,  1 user,  load average: 0.98, 1.13, 0.54
athlon:/home/andrea # iptables -L -v |grep -A2 FORWARD
Chain FORWARD (policy ACCEPT 65 packets, 9264 bytes)
 pkts bytes target     prot opt in     out     source               destination
3690K 2321M block      all  --  any    any     anywhere             anywhere

athlon:/home/andrea # 

So if there's a memleak in rc1-bk8, it's probably not in the core of the
kernel, but in some driver or things like ipsec. Either that or it broke
after 2.6.11-rc1-bk8. The kernel I'm running is quite heavily patched
too, but I'm not aware of any memleak fix in the additional patches.

Anyway I'll try again in a few days to verify it goes back down again to
exactly 34M of anonymous/random and 15M of cache.

No apparent problem on my desktop system either, it's running the same
kernel with different config.

If somebody could fix the kernel CVS I could have a look at the
interesting changesets between 2.6.11-rc1-bk8 and 2.6.11-rc2.

