Re: CONFIG_PPC_VAS depends on 64k pages...?

2020-12-01 Thread Bulent Abali
I don't know anything about the VAS page size requirements in the kernel.  I 
checked the user compression library and saw that we do a sysconf to get 
the page size, so the library should be immune to the page size by design. 
But it wouldn't surprise me if a 64KB constant is inadvertently hardcoded 
somewhere else in the library.  Giving a heads-up to Tulio and Raphael, who 
are the owners of the github repo.

https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L922

If we got this wrong in the library, it might manifest itself as an error 
message of the sort "excessive page faults".  The library must touch pages 
ahead of time to make them present in memory; occasional page faults are 
acceptable, and it will retry.
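
For illustration, the page-size-agnostic pattern looks roughly like this (a
minimal sketch; touch_pages and its use here are hypothetical, not the actual
libnxz code):

#include <stddef.h>
#include <unistd.h>

/* Minimal sketch, not the libnxz implementation: query the page size at
 * runtime and touch one byte per page so the buffer is resident before
 * the NX engine works on it.  No 64KB constant is baked in. */
static void touch_pages(volatile char *buf, size_t len)
{
        size_t pgsz = (size_t)sysconf(_SC_PAGESIZE); /* 4K or 64K, found at runtime */
        size_t i;

        for (i = 0; i < len; i += pgsz)
                buf[i];                  /* the read faults the page in */
        if (len)
                buf[len - 1];            /* cover the tail page too */
}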


Bulent




From:   "Sukadev Bhattiprolu" 
To: "Christophe Leroy" 
Cc: "Will Springer" , 
linuxppc-dev@lists.ozlabs.org, dan...@octaforge.org, Bulent 
Abali/Watson/IBM@IBM, ha...@linux.ibm.com
Date:   12/01/2020 12:53 AM
Subject:    Re: CONFIG_PPC_VAS depends on 64k pages...?




Christophe Leroy [christophe.le...@csgroup.eu] wrote:
> Hi,
> 
> Le 19/11/2020 à 11:58, Will Springer a écrit :
> > I learned about the POWER9 gzip accelerator a few months ago when the
> > support hit upstream Linux 5.8. However, for some reason the Kconfig
> > dictates that VAS depends on a 64k page size, which is problematic as 
I
> > run Void Linux, which uses a 4k-page kernel.
> > 
> > Some early poking by others indicated there wasn't an obvious page 
size
> > dependency in the code, and suggested I try modifying the config to 
switch
> > it on. I did so, but was stopped by a minor complaint of an 
"unexpected DT
> > configuration" by the VAS code. I wasn't equipped to figure out 
exactly what
> > this meant, even after finding the offending condition, so after 
writing a
> > very drawn-out forum post asking for help, I dropped the subject.
> > 
> > Fast forward to today, when I was reminded of the whole thing again, 
and
> > decided to debug a bit further. Apparently the VAS platform device
> > (derived from the DT node) has 5 resources on my 4k kernel, instead of 
4
> > (which evidently works for others who have had success on 64k 
kernels). I
> > have no idea what this means in practice (I don't know how to 
introspect
> > it), but after making a tiny patch[1], everything came up smoothly and 
I
> > was doing blazing-fast gzip (de)compression in no time.
> > 
> > Everything seems to work fine on 4k pages. So, what's up? Are there
> > pitfalls lurking around that I've yet to stumble over? More 
reasonably,
> > I'm curious as to why the feature supposedly depends on 64k pages, or 
if
> > there's anything else I should be concerned about.

Will,

The reason I put in that config check is that we were only able to
test 64K pages at that point.

It is interesting that it is working for you.  The following code in skiboot,
https://github.com/open-power/skiboot/blob/master/hw/vas.c, should restrict
it to 64K pages.  IIRC there is also a corresponding setting in some NX
registers that would need to be configured to allow 4K pages.


 static int init_north_ctl(struct proc_chip *chip)
 {
 	uint64_t val = 0ULL;

 	val = SETFIELD(VAS_64K_MODE_MASK, val, true);
 	val = SETFIELD(VAS_ACCEPT_PASTE_MASK, val, true);
 	val = SETFIELD(VAS_ENABLE_WC_MMIO_BAR, val, true);
 	val = SETFIELD(VAS_ENABLE_UWC_MMIO_BAR, val, true);
 	val = SETFIELD(VAS_ENABLE_RMA_MMIO_BAR, val, true);

 	return vas_scom_write(chip, VAS_MISC_N_CTL, val);
 }

I am copying Bulent Abali and Haren Myneni, who have been working with
VAS/NX, for their thoughts/experience.

> > 
> 
> Maybe ask Sukadev, who did the implementation and is maintaining it?
> 
> > I do have to say I'm quite satisfied with the results of the NX
> > accelerator, though. Being able to shuffle data to a RaptorCS box over
> > gigE and get compressed data back faster than most software gzip could
> > ever hope to achieve is no small feat, let alone the instantaneous
> > results locally. :)
> > 
> > Cheers,
> > Will Springer [she/her]
> > 
> > [1]: https://github.com/Skirmisher/void-packages/blob/vas-4k-pages/srcpkgs/linux5.9/patches/ppc-vas-on-4k.patch

> > 
> 
> 
> Christophe






RE: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure

2020-07-09 Thread Bulent Abali
copied verbatim from P9 DD2 Nest Accelerators Workbook Version 3.2

Table 4-36. CSB Non-zero CC Reported Error Types

CC=5, Error Type: Translation, 
Comment: Unused, defined by RFC02130 (footnote: DMA controller uses this CC 
internally in translation fault handling. Do not reuse for other purposes.)

CC=240 through 251, reserved for future firmware use, 
Comment: Error codes 240 - 255 (0xF0 - 0xFF) are reserved for firmware use 
and are not signalled by the hardware. 
These CCs are written in the CSB by the hypervisor to alert the partition to 
error conditions detected by the hypervisor. 
These codes have been used in past processors for this purpose and ought 
not to be relocated.





From:   Haren Myneni/Beaverton/IBM
To: Michael Ellerman 
Cc:     ab...@us.ibm.com, Haren Myneni, linuxppc-dev@lists.ozlabs.org, 
        "Linuxppc-dev", rzin...@linux.ibm.com, tuli...@br.ibm.com, 
        Haren Myneni/Beaverton/IBM@IBMUS
Date:   07/09/2020 04:01 PM
Subject:    Re: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure




"Linuxppc-dev"  
wrote on 07/09/2020 04:22:10 AM:

> From: Michael Ellerman 
> To: Haren Myneni 
> Cc: tuli...@br.ibm.com, ab...@us.ibm.com, linuxppc-d...@lists.ozlabs.org,
> rzin...@linux.ibm.com
> Date: 07/09/2020 04:21 AM
> Subject: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error
> for address translation failure
> Sent by: "Linuxppc-dev"  +hbabu=us.ibm@lists.ozlabs.org>
> 
> Haren Myneni  writes:
> > DMA controller uses CC=5 internally for translation fault handling. So
> > OS should be using CC=250 and should report this error to the user space
> > when NX encounters address translation failure on the request buffer.
> 
> That doesn't really explain *why* the OS must use CC=250.
> 
> Is it documented somewhere that 5 is for hardware use, and 250 is for
> software?

Yes, mentioned in Table 4-36. CSB Non-zero CC Reported Error Types (P9 NX 
DD2 workbook). Also the footnote for CC=5 says "DMA controller uses this CC 
internally in translation fault handling. Do not reuse for other purposes".

I will add a documentation reference for the CC=250 comment. 

> 
> > This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> > CSB.CC with this proper error code for user space.
> 
> We still have:
> 
> #define CSB_CC_TRANSLATION   (5)
> 
> And it's very unclear where one or the other should be used.
> 
> Can one or the other get a name that makes the distinction clear.

CSB_CC_TRANSLATION was added in the 842 driver (nx-common-powernv.c) when NX 
was introduced (P7+). For kernel requests, NX will not see these faults 
(CC=250), or even CC=5. 

Table 4-36: 
For CC=5: says "Translation"
CC=250: says "Address Translation Fault"

So I can say CRB_CC_ADDRESS_TRANSLATION_FAULT or CRB_CC_TRANSLATION_FAULT. 
This code path (also CRBs) should be generic, so it should not use something 
like CRB_CC_NX_FAULT. 

Thanks
Haren

> 
> cheers
> 
> 
> > diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/powerpc/vas-api.rst
> > index 1217c2f..78627cc 100644
> > --- a/Documentation/powerpc/vas-api.rst
> > +++ b/Documentation/powerpc/vas-api.rst
> > @@ -213,7 +213,7 @@ request buffers are not in memory. The operating system handles the fault by
> >  updating CSB with the following data:
> > 
> >     csb.flags = CSB_V;
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> >     csb.ce = CSB_CE_TERMINATION;
> >     csb.address = fault_address;
> > 
> > diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
> > index 965b1f3..b1c9a57 100644
> > --- a/arch/powerpc/include/asm/icswx.h
> > +++ b/arch/powerpc/include/asm/icswx.h
> > @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
> >  #define CSB_CC_CHAIN      (37)
> >  #define CSB_CC_SEQUENCE   (38)
> >  #define CSB_CC_HW         (39)
> > +/* User space address translation failure */
> > +#define CSB_CC_ADDRESS_TRANSLATION (250)
> > 
> >  #define CSB_SIZE          (0x10)
> >  #define CSB_ALIGN         CSB_SIZE
> > diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> > index 266a6ca..33e89d4 100644
> > --- a/arch/powerpc/platforms/powernv/vas-fault.c
> > +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> > @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
> >     csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> > 
> >     memset(&csb, 0, sizeof(csb));
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> >     csb.ce = CSB_CE_TERMINATION;
> >     csb.cs = 0;
> >     csb.count = 0;
> > -- 
> > 1.8.3.1
> 
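
For context, a rough sketch of how a user-space NX client might react to this
code once the patch above is applied (the struct mirrors
coprocessor_completion_block from icswx.h; nx_csb and nx_should_retry are
illustrative names, not the library's):

#include <stdint.h>

#define CSB_V                      0x80   /* valid bit in csb.flags */
#define CSB_CC_ADDRESS_TRANSLATION 250    /* from the patch above */

struct nx_csb {                           /* illustrative mirror of the CSB */
        uint8_t  flags;
        uint8_t  cs;
        uint8_t  cc;
        uint8_t  ce;
        uint32_t count;
        uint64_t address;                 /* faulting address when cc == 250 */
};

/* Returns 1 if the request should be resubmitted after touching the
 * faulting page, 0 otherwise.  A sketch, not the libnxz logic. */
static int nx_should_retry(volatile struct nx_csb *csb)
{
        volatile char *p;

        if (!(csb->flags & CSB_V))
                return 0;                 /* CSB not written yet */
        if (csb->cc != CSB_CC_ADDRESS_TRANSLATION)
                return 0;                 /* some other completion code */
        p = (volatile char *)(uintptr_t)csb->address;
        *p;                               /* fault the page in */
        return 1;                         /* caller resubmits the CRB */
}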






Re:

2009-11-07 Thread Bulent Abali
help



bug in __tcp_inherit_port ?

2001-07-01 Thread Bulent Abali


I get an occasional panic in __tcp_inherit_port(sk, child).  I believe the
reason is that tb = sk->prev is NULL.

sk->prev is set to NULL in only a few places, including __tcp_put_port(sk).
Perhaps there is a serialization problem between __tcp_inherit_port and
__tcp_put_port?   One possibility is that sk->num != child->num.
Therefore the spin_locks in the two routines do not serialize.
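
To make the suspected non-serialization concrete, here is a condensed sketch
of the interleaving as I read it from the source quoted below (an
illustration, not a verified trace):

/*
 * CPU A: __tcp_inherit_port(sk, child)      CPU B: __tcp_put_port(sk)
 *
 * spin_lock(&tcp_bhash[tcp_bhashfn(child->num)].lock);
 *                                           spin_lock(&tcp_bhash[tcp_bhashfn(sk->num)].lock);
 *                                           sk->prev = NULL;    // different bucket lock,
 *                                           spin_unlock(...);   // so no mutual exclusion
 * tb = (struct tcp_bind_bucket *)sk->prev;  // tb == NULL
 * child->bind_next = tb->owners;            // NULL deref at offset 0x8
 */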

This code is out of my league so I couldn't debug any further.  Ingo, this
is the same problem that I posted to linux-kernel a couple of weeks ago for
tcp_v4_syn_recv_sock.

Problem occurs when running TUX-B6, 2.4.5-ac4 with SPECweb99, dual PIII,
and one acenic adapter.   It is difficult to trigger but has occurred a few
times so far.   The oops and objdump output follow.
/bulent.

=

/* Caller must disable local BH processing. */
static __inline__ void __tcp_inherit_port(struct sock *sk, struct sock *child)
{
 struct tcp_bind_hashbucket *head = &tcp_bhash[tcp_bhashfn(child->num)];
 struct tcp_bind_bucket *tb;

 spin_lock(&head->lock);
 tb = (struct tcp_bind_bucket *)sk->prev;       /* <-- line 149 */
 if ((child->bind_next = tb->owners) != NULL)   /* <-- panic here */
  tb->owners->bind_pprev = &child->bind_next;
 tb->owners = child;
 child->bind_pprev = &tb->owners;
 child->prev = (struct sock *) tb;
 spin_unlock(&head->lock);
}


__inline__ void __tcp_put_port(struct sock *sk)
{
 struct tcp_bind_hashbucket *head = &tcp_bhash[tcp_bhashfn(sk->num)];
 struct tcp_bind_bucket *tb;

 spin_lock(&head->lock);
 tb = (struct tcp_bind_bucket *) sk->prev;
 if (sk->bind_next)
  sk->bind_next->bind_pprev = sk->bind_pprev;
 *(sk->bind_pprev) = sk->bind_next;
 sk->prev = NULL;
 sk->num = 0;
 if (tb->owners == NULL) {
  if (tb->next)
   tb->next->pprev = tb->pprev;
  *(tb->pprev) = tb->next;
  kmem_cache_free(tcp_bucket_cachep, tb);
 }
 spin_unlock(&head->lock);
}



oops output

NULL pointer dereference at virtual address 00000008
 printing eip:
c0247a34
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0247a34>]
EFLAGS: 00010246
eax: 00000000   ebx: f74224c0   ecx: 00000000   edx: f74224c0
esi: f7500000   edi: f71e6cf0   ebp: f74225b4   esp: c0313c00
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c0313000)
Stack: f2a55ec4 f2d6bf64 459d1162 f74224c0 c024aff9 f74224c0 f2a55ec4 f2d6bf64
       459d1163 459d1162 459d1163 00000000 00001000 f74225b4 f740f58c
       f7760c00 c022a3c5 f740f58c c0231e76 e11d2a9c f7760cd8 f740083c
Call Trace: [<c024aff9>] [<c022a3c5>] [<c0231e76>] [<c01bff2c>] [<f8805514>]
       [<c02ac6b0>] [<f8805000>] [<c0231e76>] [<c022a3c5>] [<c0231e76>] [<c0224962>]
       [<c0231e76>] [<c0220f5c>] [<c02210a8>] [<c02321e4>] [<c02444ff>] [<c0220f5c>]
       [<c02210a8>] [<c023d85c>] [<c023db31>] [<c0220f5c>] [<c02210a8>] [<c0247b1c>]
       [<c0247e4b>] [<c024833f>] [<c022f5b8>] [<c022f955>] [<c01bf55a>] [<c02251bb>]
       [<c0224eb2>] [<c0108d7e>] [<c0120e5b>] [<c0109189>] [<c0105220>] [<c0105220>]
       [<c0107544>] [<c0105220>] [<c0105220>] [<c010524d>] [<c01052d2>] [<c0105000>]
       [<c01001cf>]

Code: 8b 41 08 89 43 18 85 c0 74 09 8b 51 08 8d 43 18 89 42 1c 89
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing


ksymoops output

Code;  c0247a34 <tcp_v4_syn_recv_sock+284/330>
 <_EIP>:
Code;  c0247a34 <tcp_v4_syn_recv_sock+284/330>
   0:   8b 41 08          mov    0x8(%ecx),%eax   // panics in child->bind_next = tb->owners
Code;  c0247a37 <tcp_v4_syn_recv_sock+287/330>
   3:   89 43 18          mov    %eax,0x18(%ebx)
Code;  c0247a3a <tcp_v4_syn_recv_sock+28a/330>
   6:   85 c0             test   %eax,%eax
Code;  c0247a3c <tcp_v4_syn_recv_sock+28c/330>
   8:   74 09             je     13 <_EIP+0x13> c0247a47 <tcp_v4_syn_recv_sock+297/330>
Code;  c0247a3e <tcp_v4_syn_recv_sock+28e/330>
   a:   8b 51 08          mov    0x8(%ecx),%edx
Code;  c0247a41 <tcp_v4_syn_recv_sock+291/330>
   d:   8d 43 18          lea    0x18(%ebx),%eax
Code;  c0247a44 <tcp_v4_syn_recv_sock+294/330>
  10:   89 42 1c          mov    %eax,0x1c(%edx)
Code;  c0247a47 <tcp_v4_syn_recv_sock+297/330>
  13:   89 00             mov    %eax,(%eax)



objdump -S

/usr/src/linux-2.4.5-ac4/include/asm/spinlock.h:104
c0247a21:   f0 fe 0e             lock decb (%esi)
c0247a24:   0f 88 85 79 03 00    js     c027f3af <stext_lock+0x5c6f>
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:149
c0247a2a:   8b 54 24 14          mov    0x14(%esp,1),%edx
c0247a2e:   8b 8a a4 00 00 00    mov    0xa4(%edx),%ecx   // tb = sk->prev
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:150
c0247a34:   8b 41 08             mov    0x8(%ecx),%eax    // child->bind_next = tb->owners
c0247a37:   89 43 18             mov    %eax,0x18(%ebx)
c0247a3a:   85 c0                test   %eax,%eax
c0247a3c:   74 09                je     c0247a47 <tcp_v4_syn_recv_sock+0x297>
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:151
c0247a3e:   8b 51 08             mov    0x8(%ecx),%edx
c0247a41:   8d 43 18             lea    0x18(%ebx),%eax
c0247a44:   89 42 1c             mov    %eax,0x1c(%edx)
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:152
c0247a47:   89 59 08             mov    %ebx,0x8(%ecx)
/usr/src/linux-2.4.5-ac4/net/ipv4/tcp_ipv4.c:153
c0247a4a:   8d 41 08             lea    0x8(%ecx),%eax
c0247a4d:   89 43 1c             mov    %eax,0x1c(%ebx)




Re: all processes waiting in TASK_UNINTERRUPTIBLE state

2001-06-26 Thread Bulent Abali



>> I am running in to a problem, seemingly a deadlock situation, where
>> almost all the processes end up in the TASK_UNINTERRUPTIBLE state.   All the
>
>could you try to reproduce with this patch applied on top of
>2.4.6pre5aa1 or 2.4.6pre5 vanilla?

Andrea,
I would like to try your patch, but so far I can trigger the bug only when
running TUX 2.0-B6, which runs on 2.4.5-ac4.  /bulent









Re: all processes waiting in TASK_UNINTERRUPTIBLE state

2001-06-25 Thread Bulent Abali



>[EMAIL PROTECTED] said:
>> I am running in to a problem, seemingly a deadlock situation, where
>> almost all the processes end up in the TASK_UNINTERRUPTIBLE state.
>> All the process eventually stop responding, including login shell, no
>> screen updates, keyboard etc.  Can ping and sysrq key works.   I
>> traced the tasks through sysrq-t key.  The processors are in the idle
>> state.  Tasks all seem to get stuck in the __wait_on_page or
>> __lock_page.
>
>I've seen this under UML, Rik van Riel has seen it on a physical box, and we
>suspect that they're the same problem (i.e. mine isn't a UML-specific bug).

Can you give more details?  Was there an aic7xxx scsi driver on the box?
run_task_queue(&tq_disk) should eventually unlock those pages
but they remain locked.  I am trying to narrow it down to the fs/buffer
code or the SCSI driver aic7xxx in my case.  Thanks. /bulent






all processes waiting in TASK_UNINTERRUPTIBLE state

2001-06-25 Thread Bulent Abali


keywords:  tux, aic7xxx, 2.4.5-ac4, specweb99, __wait_on_page, __lock_page

Greetings,

I am running in to a problem, seemingly a deadlock situation, where almost
all the processes end up in the TASK_UNINTERRUPTIBLE state.   All the
processes eventually stop responding, including the login shell: no screen
updates, keyboard, etc.  Ping still works and the sysrq key works.   I traced
the tasks through the sysrq-t key.  The processors are in the idle state.  The
tasks all seem to get stuck in __wait_on_page or __lock_page.  It appears from
the source that they are waiting for pages to be unlocked.   run_task_queue
(&tq_disk) should eventually cause the pages to unlock, but it doesn't happen.
Anybody familiar with this problem or have seen it before?  Thanks for any
comments.
Bulent

Here are the conditions:
Dual PIII, 1GHz, 1GB of memory,  aic7xxx scsi driver, acenic eth.
This occurs while the TUX (2.4.5-B6) webserver is being driven by the
SPECweb99 benchmark at a rate of 800 c/s.  The system is very busy doing disk
and network I/O.  The problem occurs sometimes in an hour and sometimes 10-20
hours into the run.

Bulent


Process: 0, { swapper}
EIP: 0010:[<c010524d>] CPU: 1 EFLAGS: 00000246
EAX: 00000000 EBX: c0105220 ECX: c2afe000 EDX: 00000025
ESI: c2afe000 EDI: c2afe000 EBP: c0105220 DS: 0018 ES: 0018
CR0: 8005003b CR2: 08049df0 CR3: 268e0000 CR4: 000006d0
Call Trace: [<c01052d2>] [<c0119186>] [<c01192fb>]
SysRq : Show Regs

Process: 0, { swapper}
EIP: 0010:[<c010524d>] CPU: 0 EFLAGS: 00000246
EAX: 00000000 EBX: c0105220 ECX: c030a000 EDX: 00000000
ESI: c030a000 EDI: c030a000 EBP: c0105220 DS: 0018 ES: 0018
CR0: 8005003b CR2: 08049f7c CR3: 37a63000 CR4: 000006d0
Call Trace: [<c01052d2>] [<c0105000>] [<c01001cf>]
SysRq : Show Regs

EIP: 0010:[<c010524d>] CPU: 1 EFLAGS: 00000246
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: 00000000 EBX: c0105220 ECX: c2afe000 EDX: 00000025
ESI: c2afe000 EDI: c2afe000 EBP: c0105220 DS: 0018 ES: 0018
CR0: 8005003b CR2: 08049df0 CR3: 268e0000 CR4: 000006d0
Call Trace: [<c01052d2>] [<c0119186>] [<c01192fb>]

EIP: 0010:[<c010524d>] CPU: 0 EFLAGS: 00000246
EAX: 00000000 EBX: c0105220 ECX: c030a000 EDX: 00000000
ESI: c030a000 EDI: c030a000 EBP: c0105220 DS: 0018 ES: 0018
CR0: 8005003b CR2: 08049f7c CR3: 37a63000 CR4: 000006d0
Call Trace: [<c01052d2>] [<c0105000>] [<c01001cf>]

>>EIP; c010524d <default_idle+2d/40>   <=====
Trace; c01052d2 <cpu_idle+52/70>
Trace; c0119186 <__call_console_drivers+46/60>
Trace; c01192fb <call_console_drivers+eb/100>

>>EIP; c010524d <default_idle+2d/40>   <=====
Trace; c01052d2 <cpu_idle+52/70>
Trace; c0105000 <prepare_namespace+0/10>
Trace; c01001cf <L6+0/2>

=

SysRq : Show Memory
Mem-info:
Free pages:        4300kB (   792kB HighMem)
( Active: 200434, inactive_dirty: 26808, inactive_clean: 1472, free: 1075 (574 1148 1722) )
24*4kB 15*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 728kB)
493*4kB 3*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 2780kB)
0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 792kB)
Swap cache: add 2711, delete 643, find 5301/6721
Free swap:       2087996kB
253932 pages of RAM
24556 pages of HIGHMEM
7212 reserved pages
221419 pages shared
2068 pages swap cached
0 pages in page table cache
Buffer memory:    12164kB
    CLEAN: 2322 buffers, 9276 kbyte, 3 used (last=2322), 2 locked, 0 protected, 0 dirty
   LOCKED: 405 buffers, 1608 kbyte, 39 used (last=404), 348 locked, 0 protected, 0 dirty
    DIRTY: 322 buffers, 1288 kbyte, 0 used (last=0), 322 locked, 0 protected, 322 dirty

=

async IO 0/2  D 0013 0  1061   1059  1062   (NOTLB)
Call Trace: [<c012e121>] [<c012f059>] [<c02614d7>] [<c0258c44>] [<c02588c0>]
       [<c025c65a>] [<c0256848>] [<c0258478>] [<c0105636>] [<c02582a0>]

Trace; c012e121 <___wait_on_page+91/c0>
Trace; c012f059 <do_generic_file_read+449/7d0>
Trace; c02614d7 <send_abuf+27/30>
Trace; c0258c44 <generic_send_file+84/100>
Trace; c02588c0 <sock_send_actor+0/1a0>
Trace; c025c65a <http_send_body+6a/100>
Trace; c0256848 <tux_schedule_atom+18/20>
Trace; c0258478 <cachemiss_thread+1d8/350>
Trace; c0105636 <kernel_thread+26/30>
Trace; c02582a0 <cachemiss_thread+0/350>


==

bash  D C2AE541C 0   920    912 (NOTLB)
Call Trace: [<c012e1e1>] [<c012e04d>] [<c016b880>] [<c012fdac>] [<c012a76a>]
       [<c012a8cb>] [<c0110018>] [<c02709c7>] [<c0113ed0>] [<c0114106>] [<c0195494>]
       [<c01417d2>] [<c011e25b>] [<c0113ed0>] [<c01075b8>]

Trace; c012e1e1 <__lock_page+91/c0>
Trace; c012e04d <read_cluster_nonblocking+17d/1c0>
Trace; c016b880 <ext2_get_block+0/5b0>
Trace; c012fdac <filemap_nopage+3fc/5b0>
Trace; c012a49a <do_swap_page+23a/2f0>
Trace; c012a76a <do_no_page+8a/150>
Trace; c012a8cb <handle_mm_fault+9b/150>
Trace; c021814c <sock_sendmsg+6c/90>
Trace; c0113ed0 <do_page_fault+0/550>
Trace; c0114106 <do_page_fault+236/550>
Trace; c0118aa5 <do_syslog+1e5/820>
Trace; c01417d2 <sys_read+c2/d0>
Trace; c011e25b <do_softirq+6b/a0>
Trace; c0113ed0 <do_page_fault+0/550>
Trace; c01075b8 



void ___wait_on_page(struct page *page)
{
    struct task_struct *tsk = current;
    DECLARE_WAITQUEUE(wait, tsk);

    add_wait_queue(&page->wait, &wait);
    do {
        sync_page(page);
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (!PageLocked(page))
            break;
        run_task_queue(&tq_disk);
        schedule();
    } while (PageLocked(page));
    tsk->state = TASK_RUNNING;
    remove_wait_queue(&page->wait, &wait);
}

static void __lock_page(struct page *page)
{
    struct task_struct *tsk = current;
    DECLARE_WAITQUEUE(wait, tsk);

    add_wait_queue_exclusive(&page->wait, &wait);
    for (;;) {
        sync_page(page);
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (PageLocked(page)) {
            run_task_queue(&tq_disk);




Re: [RFQ] aic7xxx driver panics under heavy swap.

2001-06-20 Thread Bulent Abali



Justin,
Your patch works for me.  The printk "Temporary Resource Shortage"
has to go, or maybe you can make it a debug option.

Here is the cleaned-up patch for 2.4.5-ac15 with the TAILQ
macros replaced with LIST macros.  Thanks for the help.
Bulent



--- aic7xxx_linux.c.save Mon Jun 18 20:25:35 2001
+++ aic7xxx_linux.c Tue Jun 19 17:35:55 2001
@@ -1516,7 +1516,11 @@
 }
 cmd->result = CAM_REQ_INPROG << 16;
 TAILQ_INSERT_TAIL(&dev->busyq, (struct ahc_cmd *)cmd, acmd_links.tqe);
-ahc_linux_run_device_queue(ahc, dev);
+if ((dev->flags & AHC_DEV_ON_RUN_LIST) == 0) {
+ LIST_INSERT_HEAD(&ahc->platform_data->device_runq, dev, links);
+ dev->flags |= AHC_DEV_ON_RUN_LIST;
+ ahc_linux_run_device_queues(ahc);
+}
 ahc_unlock(ahc, &flags);
 return (0);
 }
@@ -1532,6 +1536,9 @@
 struct ahc_tmode_tstate *tstate;
 uint16_t mask;
 
+if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
+ panic("running device on run list");
+
 while ((acmd = TAILQ_FIRST(&dev->busyq)) != NULL
     && dev->openings > 0 && dev->qfrozen == 0) {
 
@@ -1540,8 +1547,6 @@
   * running is because the whole controller Q is frozen.
   */
  if (ahc->platform_data->qfrozen != 0) {
-  if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-   return;
 
   LIST_INSERT_HEAD(&ahc->platform_data->device_runq,
     dev, links);
@@ -1552,8 +1557,6 @@
   * Get an scb to use.
   */
  if ((scb = ahc_get_scb(ahc)) == NULL) {
-  if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-   panic("running device on run list");
   LIST_INSERT_HEAD(&ahc->platform_data->device_runq,
     dev, links);
   dev->flags |= AHC_DEV_ON_RUN_LIST;













[RFQ] aic7xxx driver panics under heavy swap.

2001-06-19 Thread Bulent Abali


Justin,
When free memory is low, I get a series of aic7xxx messages followed by a
panic.  It appears to be a race condition in the code.  Should you panic?
I tried the following patch to not panic, but I am not sure if it is
functionally correct.
Bulent


scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
scsi0: Temporary Resource Shortage
Kernel panic: running device on run list


--- aic7xxx_linux.c.save Mon Jun 18 20:25:35 2001
+++ aic7xxx_linux.c Mon Jun 18 20:26:29 2001
@@ -1552,12 +1552,14 @@
   * Get an scb to use.
   */
  if ((scb = ahc_get_scb(ahc)) == NULL) {
+  ahc->flags |= AHC_RESOURCE_SHORTAGE;
   if ((dev->flags & AHC_DEV_ON_RUN_LIST) != 0)
-   panic("running device on run list");
+   return;
+   // panic("running device on run list");
   LIST_INSERT_HEAD(&ahc->platform_data->device_runq,
     dev, links);
   dev->flags |= AHC_DEV_ON_RUN_LIST;
-  ahc->flags |= AHC_RESOURCE_SHORTAGE;
+  // ahc->flags |= AHC_RESOURCE_SHORTAGE;
   printf("%s: Temporary Resource Shortage\n",
     ahc_name(ahc));
   return;









Re: Please test: workaround to help swapoff behaviour

2001-06-10 Thread Bulent Abali



>The fix is to kill the dead/orphaned swap pages before we get to
>swapoff.  At shutdown time there is practically nothing active in
> ...
>Once the dead swap pages problem is fixed it is time to optimize
>swapoff.

I think fixing the orphaned swap pages problem will eliminate the
problem altogether.  Probably there is no need to optimize
swapoff.

Because as the system is shutting down, all the processes will be
killed and their pages in swap will be orphaned.  If those pages
were reaped in a timely manner, there wouldn't be any work
left for swapoff.

Bulent








Re: Please test: workaround to help swapoff behaviour

2001-06-09 Thread Bulent Abali




>Bulent,
>
>Could you please check if 2.4.6-pre2+the schedule patch has better
>swapoff behaviour for you?

Marcelo,

It works as expected.  It doesn't lock up the box, but swapoff keeps burning
CPU cycles.  It took 4 1/2 minutes to swapoff about 256MB of swap content.
Shutdown took just as long.  I was hoping that shutdown would kill the
swapoff process, but it doesn't; it just hangs there.  Shutdown is the common
case, therefore swapoff needs to be optimized for shutdowns.  You can imagine
users' frustration waiting for a shutdown when there are gigabytes in the
swap.

So to summarize, the schedule patch is better than nothing but falls far
short.  I would put it in 2.4.6.  Read on.

--

The problem is with the try_to_unuse() algorithm, which is very inefficient.
I searched the linux-mm archives and Tweedie was on to this.  This is what
he wrote:  "it is much cheaper to find a swap entry for a given page than
to find the swap cache page for a given swap entry."  And he posted a
patch: http://mail.nl.linux.org/linux-mm/2001-03/msg00224.html
His patch is in the Redhat 7.1 kernel 2.4.2-2 but not in 2.4.5.

But in any case I believe the patch will not work as expected.
It seems to me that he is calling the function check_orphaned_swap(page)
in the wrong place.  He is calling it while scanning the
active_list in refill_inactive_scan().  The problem with that is, if you
wait 60 seconds or longer, the orphaned swap pages will move from the active
to the inactive lists.  Therefore the function will miss the orphans on the
inactive lists.  Any comments?









Re: Please test: workaround to help swapoff behaviour

2001-06-08 Thread Bulent Abali


>> I looked at try_to_unuse in swapfile.c.  I believe that the algorithm is
>> broken.
>> For each and every swap entry it is walking the entire process list
>> (for_each_task(p)).  It is also grabbing a whole bunch of locks
>> for each swap entry.  It might be worthwhile processing swap entries in
>> batches instead of one entry at a time.
>>
>> In any case, I think having this patch is worthwhile as a quick and dirty
>> remedy.
>
>Bulent,
>
>Could you please check if 2.4.6-pre2+the schedule patch has better
>swapoff behaviour for you?

No problem.  I will check it tomorrow.  I don't think it can be any worse
than it is now.  The patch looks correct in principle.
I believe it should go in to 2.4.6, but I will test it.

On small machines people don't notice it, but if you have a few
GB of memory it really hurts.  Shutdowns take forever since swapoff takes
forever.










Re: Please test: workaround to help swapoff behaviour

2001-06-07 Thread Bulent Abali





>This is for the people who has been experiencing the lockups while running
>swapoff.
>
>Please test. (against 2.4.6-pre1)
>
>
>--- linux.orig/mm/swapfile.c Wed Jun  6 18:16:45 2001
>+++ linux/mm/swapfile.c Thu Jun  7 16:06:11 2001
>@@ -345,6 +345,8 @@
> /*
>  * Find a swap page in use and read it in.
>  */
>+if (current->need_resched)
>+ schedule();
> swap_device_lock(si);
> for (i = 1; i < si->max ; i++) {
>  if (si->swap_map[i] > 0 && si->swap_map[i] != SWAP_MAP_BAD) {


I tested your patch against 2.4.5.  It works.  No more lockups.  Without the
patch it took 14 minutes 51 seconds to complete swapoff (this is to recover
1.5GB of swap space).  During this time the system was frozen.  No keyboard,
no screen, etc.  Practically locked up.

With the patch there are no more lockups.  Swapoff kept running in the
background.
This is a winner.

But here is the caveat: swapoff keeps burning 100% of the cycles until it
completes.  This is not going to be a big deal during shutdowns; it is only
going to be a problem when you enter swapoff from the command line.

I looked at try_to_unuse in swapfile.c.  I believe that the algorithm is
broken.  For each and every swap entry it is walking the entire process list
(for_each_task(p)).  It is also grabbing a whole bunch of locks
for each swap entry.  It might be worthwhile processing swap entries in
batches instead of one entry at a time.
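
To illustrate the cost, a heavily condensed paraphrase of that loop structure
(a schematic sketch; for_each_task, swap_map, and SWAP_MAP_BAD are from the
kernel source, the rest is editorial):

/* Schematic of the 2.4-era try_to_unuse(), not the verbatim source. */
static int try_to_unuse_sketch(struct swap_info_struct *si)
{
        struct task_struct *p;
        int i;

        for (i = 1; i < si->max; i++) {          /* every swap entry... */
                if (!si->swap_map[i] || si->swap_map[i] == SWAP_MAP_BAD)
                        continue;
                for_each_task(p) {               /* ...walks every task */
                        /* take p's mm locks, scan its page tables, and
                         * bring the entry back in if it is referenced */
                }
        }
        /* => O(swap entries * tasks) plus per-entry lock traffic, which
         *    is why swapping off gigabytes freezes the box for minutes. */
        return 0;
}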

In any case, I think having this patch is worthwhile as a quick and dirty
remedy.

Bulent Abali






Re: Break 2.4 VM in five easy steps

2001-06-07 Thread Bulent Abali



>> O.k.  I think I'm ready to nominate the dead swap pages for the big
>> 2.4.x VM bug award.  So we are burning cpu cycles in sys_swapoff
>> instead of being IO bound?  Just wanting to understand this the cheap way :)
>
>There's no IO being done whatsoever (that I can see with only a blinky).
>I can fire up ktrace and find out exactly what's going on if that would
>be helpful.  Eating the dead swap pages from the active page list prior
>to swapoff cures all but a short freeze.  Eating the rest (few of those)
>might cure the rest, but I doubt it.
>
>-Mike

1)  I second Mike's observation.  swapoff, either from the command line or
during shutdown, just hangs there.  No disk I/O is being done, as I could see
from the blinkers.  This is not an I/O boundness issue.  It is more like
a deadlock.

I happened to see this one with a debugger attached to the serial port.
The system was alive.  I think I was watching the free page count and
it was decreasing very slowly, maybe a couple of pages per second.  The bigger
the swap usage, the longer it takes to do swapoff.  For example, if I had
1GB in the swap space then it would take maybe half an hour to shut down...


2)  Now why I would have 1GB in the swap space, that is another problem.
Here is what I observe, and it doesn't make much sense to me.
Let's say I have 1GB of memory and plenty of swap.  And let's
say there is a process a little under 1GB in size.  Suppose the system
starts swapping because it is short a few megabytes of memory.
Within *seconds* of swapping, I see the swap disk usage balloon to
nearly 1GB.  Nearly the entire memory moves into the page cache.  If you
run xosview you will know what I mean.  Memory usage suddenly turns from
green to red :-).   And I know for a fact that my disk cannot do 1GB per
second :-).  The SHARE column of the big process in "top" goes up by hundreds
of megabytes.
So it appears to me that the MM is marking the whole process memory to be
swapped out, probably reserving nearly 1GB in the swap space, and
furthermore apparently moving entire process pages to the page cache.
You would think that if you are short a few MB of memory the MM would put
a few MB worth of pages in the swap.  But it wants to move entire processes
into swap.

When the 1GB process exits, the swap usage doesn't change (dead swap
pages?).
And shutdown or swapoff will take forever due to #1 above.

Bulent










can I call wake_up_interruptible_all from an interrupt service routine?

2001-06-05 Thread Bulent Abali

The interrupt service routine of a driver makes a wake_up_interruptible_all()
call to wake up a kernel thread.   Is that legitimate?   Thanks for any advice
you might have.  Please cc: your response to me if you decide to post to
the mailing list.
Bulent
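
For reference, the pattern in question looks roughly like this in 2.4-era
style (a sketch; the mydev_* names are made up for illustration; the key
point is that wake_up_interruptible_all() does not sleep, so it is safe to
call from interrupt context):

#include <linux/sched.h>
#include <linux/wait.h>

/* Hypothetical driver state, for illustration only. */
static DECLARE_WAIT_QUEUE_HEAD(mydev_wq);
static volatile int mydev_event;

/* Interrupt handler: must not sleep, but waking sleepers is fine. */
static void mydev_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        mydev_event = 1;
        wake_up_interruptible_all(&mydev_wq);    /* legal in IRQ context */
}

/* Kernel thread: sleeps until the ISR signals an event. */
static int mydev_thread(void *unused)
{
        for (;;) {
                wait_event_interruptible(mydev_wq, mydev_event != 0);
                mydev_event = 0;
                /* ... process the event ... */
        }
        return 0;
}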




