date:20070917

Hi,
  Sorry for my error.
  The problem is the current icmp_reply and ip_send_reply will
send out packets with wrong destination address. Not wrong source
address.
  My point is that we should always use the source address of packets we
received as the destination address of our reply packets.

On Mon, Sep 17, 2007 at 08:14:56PM -0700, [EMAIL PROTECTED] wrote:
> On Tue, 18 Sep 2007, YOSHIFUJI Hideaki / [EMAIL PROTECTED](B wrote:
> 
> >In article <[EMAIL PROTECTED]> (at Mon, 17 Sep 
> >2007 19:20:44 -0700 (PDT)), David Miller <[EMAIL PROTECTED]> says:
> >
> >>From: lepton <[EMAIL PROTECTED]>
> >>Date: Tue, 18 Sep 2007 10:16:17 +0800
> >>
> >>>Hi,
> >>>  In some situation, icmp_reply and ip_send_reply will send
> >>>  out packet with the wrong source addr, the following patch
> >>>  will fix this.
> >>>
> >>>  I don't understand why we must use rt->rt_src in the current
> >>>  code, if this is a wrong fix, please correct me.
> >>>
> >>>Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>
> >>
> >>That the address is wrong is your opinion only :-)
> >>
> >>Source address selection is a rather complex topic, and
> >>here we are definitely purposefully using the source
> >>address selected by the routing lookup for the reply.
> >
> >And, if you do think something is "wrong", you need to describe it
> >in detail, at least.
> 
> I missed the beginning of the discussion, so apologies if I'm way off 
> base.
> 
> it sounds like the question is, when a packet hits the box that causes a 
> icmp_reply (or other packet) to be generated, which IP address should be 
> used as the source
> 
> 1. the destination address of the packet that generated the message
> 
> or.
> 
> 2. the IP address that the machine would use by default if the machine 
> were to generate a new connection to the destination.
> 
> I understand that in many cases the historical approach has been #2, but 
> as more machines get multiple IP addresses on each interface, I believe 
> that it's less of a surprise to other systems if the default is #1. most 
> of the time the other systems don't care (and useusally don't want to 
> know) if the service they are contacting is on a dedicated machine or is 
> just one IP among many sharing a box.
> 
> it gets especially bad when you have load balancing going on and the 
> results could come from multiple boxes.
> 
> yes, sysadmins deal with this today, but it's a pain to do so and is a 
> continuing dribble of suprises when things don't quite work the way you 
> expect them to as you consoldate things onto more powerful systems (or 
> distribute them among multiple systems).
> 
> if the packet got to the machine and the machine is accepting it, replying 
> back from the destination IP of that packet should be legitimate (it's 
> what you would do if there was a full connection after all) and greatly 
> reduces the cases where things change.
> 
> David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

2007-09-17 Thread david

On Tue, 18 Sep 2007, YOSHIFUJI Hideaki / [EMAIL PROTECTED](B wrote:

In article <[EMAIL PROTECTED]> (at Mon, 17 Sep 2007 19:20:44 -0700 (PDT)), David 
Miller <[EMAIL PROTECTED]> says:

From: lepton <[EMAIL PROTECTED]>
Date: Tue, 18 Sep 2007 10:16:17 +0800

Hi,
  In some situation, icmp_reply and ip_send_reply will send
  out packet with the wrong source addr, the following patch
  will fix this.

  I don't understand why we must use rt->rt_src in the current
  code, if this is a wrong fix, please correct me.

Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>

That the address is wrong is your opinion only :-)

Source address selection is a rather complex topic, and
here we are definitely purposefully using the source
address selected by the routing lookup for the reply.

And, if you do think something is "wrong", you need to describe it
in detail, at least.

I missed the beginning of the discussion, so apologies if I'm way off 
base.

it sounds like the question is, when a packet hits the box that causes a 
icmp_reply (or other packet) to be generated, which IP address should be 
used as the source

1. the destination address of the packet that generated the message

or.

2. the IP address that the machine would use by default if the machine 
were to generate a new connection to the destination.

I understand that in many cases the historical approach has been #2, but 
as more machines get multiple IP addresses on each interface, I believe 
that it's less of a surprise to other systems if the default is #1. most 
of the time the other systems don't care (and useusally don't want to 
know) if the service they are contacting is on a dedicated machine or is 
just one IP among many sharing a box.

it gets especially bad when you have load balancing going on and the 
results could come from multiple boxes.

yes, sysadmins deal with this today, but it's a pain to do so and is a 
continuing dribble of suprises when things don't quite work the way you 
expect them to as you consoldate things onto more powerful systems (or 
distribute them among multiple systems).

if the packet got to the machine and the machine is accepting it, replying 
back from the destination IP of that packet should be legitimate (it's 
what you would do if there was a full connection after all) and greatly 
reduces the cases where things change.

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] CONFIG_ZONE_MOVABLE [2/2] config zone movable

2007-09-17 Thread KAMEZAWA Hiroyuki

On Mon, 17 Sep 2007 19:47:48 -0700
Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Fri, 31 Aug 2007 19:14:15 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> 
> wrote:
> 
> > Makes ZONE_MOVABLE as configurable
> > 
> > Based on "zone_ifdef_cleanup_by_renumbering.patch"
> > 
> 
> This patch causes my old dual-pIII machine to instantly reboot: 0.01 
> seconds
> uptime.
> 
> http://userweb.kernel.org/~akpm/config-vmm.txt

Ok, will find problem.

Thanks,
-Kame

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

Hi,
  sorry for my previous email.
  What I mean is icmp_reply and ip_send_reply
in some situation will send out packets with wrong 
DESTINATION address.  the source address is always
correct.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC -mm 2/2] i386/x86_64 boot: document for 32 bit boot protocol

2007-09-17 Thread Huang, Ying

On Mon, 2007-09-17 at 18:48 -0700, H. Peter Anvin wrote:
> Huang, Ying wrote:
> > 
> > OK, I will check the actual structure, and change the document
> > accordingly.
> > 
> 
> The best would probably be to fix zero-page.txt (and probably rename it
> something saner.)

Does the patch appended with the mail seems better?

If it is desired, I can move the zero page description into
zero-page.txt, and refer to it in 32-bit boot protocol description.

I delete the hd0_info and hd1_info from the zero page. If it is
undesired, I will move them back.

The field in zero page is fairly complex (such as struct edd_info). Do
you think it is necessary to document every field inside the first level
field, until the primary data type? Or we just provide the C struct
name?

Best Regards,
Huang Ying

---

Index: linux-2.6.23-rc4/Documentation/i386/boot.txt
===
--- linux-2.6.23-rc4.orig/Documentation/i386/boot.txt   2007-09-18 
10:40:34.0 +0800
+++ linux-2.6.23-rc4/Documentation/i386/boot.txt2007-09-18 
10:46:13.0 +0800
@@ -2,7 +2,7 @@
 
 
H. Peter Anvin <[EMAIL PROTECTED]>
-   Last update 2007-05-23
+   Last update 2007-09-14
 
 On the i386 platform, the Linux kernel uses a rather complicated boot
 convention.  This has evolved partially due to historical aspects, as
@@ -42,6 +42,9 @@
 Protocol 2.06: (Kernel 2.6.22) Added a field that contains the size of
the boot command line
 
+Protocol 2.07: (kernel 2.6.23) Added a field of 64-bit physical
+   pointer to single linked list of struct setup_data.
+   Added 32-bit boot protocol.
 
  MEMORY LAYOUT
 
@@ -168,6 +171,9 @@
 0234/1 2.05+   relocatable_kernel Whether kernel is relocatable or not
 0235/3 N/A pad2Unused
 0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
+023c/4 N/A pad3Unused
+0240/8 2.07+   setup_data  64-bit physical pointer to linked list
+   of struct setup_data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
 real value is 4.
@@ -480,6 +486,36 @@
   cmdline_size characters. With protocol version 2.05 and earlier, the
   maximum size was 255.
 
+Field name:setup_data
+Type:  write (obligatory)
+Offset/size:   0x240/8
+Protocol:  2.07+
+
+  The 64-bit physical pointer to NULL terminated single linked list of
+  struct setup_data. This is used to define a more extensible boot
+  parameters passing mechanism. The definition of struct setup_data is
+  as follow:
+
+  struct setup_data {
+ u64 next;
+ u32 type;
+ u32 len;
+ u8  data[0];
+  } __attribute__((packed));
+
+  Where, the next is a 64-bit physical pointer to the next node of
+  linked list, the next field of the last node is 0; the type is used
+  to identify the contents of data; the len is the length of data
+  field; the data holds the real payload.
+
+  With this field, to add a new boot parameter written by bootloader,
+  it is not needed to add a new field to real mode header, just add a
+  new setup_data type is sufficient. But to add a new boot parameter
+  read by bootloader, it is still needed to add a new field.
+
+  TODO: Where is the safe place to place the linked list of struct
+   setup_data?
+
 
  THE KERNEL COMMAND LINE
 
@@ -753,3 +789,57 @@
After completing your hook, you should jump to the address
that was in this field before your boot loader overwrote it
(relocated, if appropriate.)
+
+
+ SETUP DATA TYPES
+
+
+ 32-bit BOOT PROTOCOL
+
+For machine with some new BIOS other than legacy BIOS, such as EFI,
+LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
+based on legacy BIOS can not be used, so a 32-bit boot protocol need
+to be defined.
+
+In 32-bit boot protocol, the first step in loading a Linux kernel
+should still be to load the real-mode code and then examine the kernel
+header at offset 0x01f1. But, it is not necessary to load all
+real-mode code, just first 4K bytes traditionally known as "zero page"
+is needed.
+
+In addition to read/modify/write kernel header of the zero page as
+that of 16-bit boot protocol, the boot loader should fill the
+following additional fields of the zero page too.
+
+Offset Proto   NameMeaning
+/Size
+
+000/0402.07+   screen_info Text mode or frame buffer information
+   (struct screen_info)
+040/0142.07+   apm_bios_info   APM BIOS information (struct 
apm_bios_info)
+060/0102.07+   ist_infoIntel SpeedStep (IST) BIOS support 
information
+   (struct ist_info)
+0A0/0102.07+   sys_desc_table  System description table (struct 
sys_desc_table)
+140/0802.07+   edid_info

Re: [PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

Hi,
  sorry for lack of details.
  let's think about ip_send_reply. it is only called
by tcp_v4_send_ack and tcp_v4_reset. I don't know why
we need a source address diffrent from ip_hdr(skb)->s_addr
  icmp_reply is only called by icmp_echo and icmp_timestamp.
Is there a situation to need we use a source address diffrent
from ip_hdr(skb)->s_addr?

  My situaiton is:
  I DNAT some tcp packet to my box. some times the box will
reply reset or ack packet with tcp_v4_send_ack and tcp_v4_reset, 
when this happens, it will use the rt->s_addr instead of
ip_hdr(skb)->s_addr, then the packet will send out without change
the source addr. Becaus netfilter don't know these packets belongs
to the DNATed connection.

  Another people's situaiton is (quoted from email to me):

 While conducting a research about networking, I discovered
 improper handling of ICMP echo reply messages in Linux 2.4.26.  I
 looked into the code and noticed that the icmp_reply function sets the
 destination address in the reply packet to rt->rt_src.  This produces
 strange results in some cases as can be easily shown with hping and
 tcpdump.  Here is an example (NOTE: eth0 address is set to
 10.10.10.1/24):

  # tcpdump -n -i any icmp &

  [1] 16842
  tcpdump: WARNING: Promiscuous mode not supported on the "any" device
  tcpdump: verbose output suppressed, use -v or -vv for full protocol
  decode
  listening on any, link-type LINUX_SLL (Linux cooked), capture size 96
  bytes

  # hping2 --icmp --spoof 10.10.10.3 10.10.10.1

  HPING 10.10.10.1 (eth0 10.10.10.1): icmp mode set, 28 headers + 0
  data bytes
  02:16:53.206016 IP 10.10.10.3 > 10.10.10.1: icmp 8: echo request seq
  0
  02:16:53.206082 IP 10.10.10.1 > 10.10.10.1: icmp 8: echo reply seq 0
  02:16:54.202123 IP 10.10.10.3 > 10.10.10.1: icmp 8: echo request seq

  If ICMP echo requests with a spoofed source address are sent to the
  address of our eth0 interface (which of course happens through the
  loopback interface), the code of icmp_reply sets the destination
  address in the reply to 10.10.10.1 instead of simply reversing the
  source and destination addresses as required by the RFC.

On Tue, Sep 18, 2007 at 11:26:44AM +0900, YOSHIFUJI Hideaki / [EMAIL 
PROTECTED](B wrote:
> In article <[EMAIL PROTECTED]> (at Mon, 17 Sep 2007 19:20:44 -0700 (PDT)), 
> David Miller <[EMAIL PROTECTED]> says:
> 
> > From: lepton <[EMAIL PROTECTED]>
> > Date: Tue, 18 Sep 2007 10:16:17 +0800
> > 
> > > Hi,
> > >   In some situation, icmp_reply and ip_send_reply will send
> > >   out packet with the wrong source addr, the following patch
> > >   will fix this.
> > > 
> > >   I don't understand why we must use rt->rt_src in the current
> > >   code, if this is a wrong fix, please correct me.
> > > 
> > > Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>
> > 
> > That the address is wrong is your opinion only :-)
> > 
> > Source address selection is a rather complex topic, and
> > here we are definitely purposefully using the source
> > address selected by the routing lookup for the reply.
> 
> And, if you do think something is "wrong", you need to describe it
> in detail, at least.
> 
> --yoshfuji
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] CONFIG_ZONE_MOVABLE [2/2] config zone movable

2007-09-17 Thread Andrew Morton

On Fri, 31 Aug 2007 19:14:15 +0900 KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> Makes ZONE_MOVABLE as configurable
> 
> Based on "zone_ifdef_cleanup_by_renumbering.patch"
> 

This patch causes my old dual-pIII machine to instantly reboot: 0.01 seconds
uptime.

http://userweb.kernel.org/~akpm/config-vmm.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] powerpc: Avoid pointless WARN_ON(irqs_disabled()) from panic codepath

2007-09-17 Thread Josh Boyer

On Mon, 17 Sep 2007 18:37:49 -0700
Randy Dunlap <[EMAIL PROTECTED]> wrote:

> On Tue, 18 Sep 2007 05:13:40 +0530 (IST) Satyam Sharma wrote:
> 
> > Untested (not even compile-tested) patch.
> > Could someone point me to ppc32/64 cross-compilers for i386?
> 
> OSDL had some, but those are gone now.
> I downloaded all of them and still use them, although it would
> be good to have some more recent versions of them.
> 
> I put the power* compiler tarballs here:
> 
> http://userweb.kernel.org/~rdunlap/cross-compilers/

Crosstool is widely used.  It'll build several combinations of
gcc/binutils/glibc for you.  

http://www.kegel.com/crosstool/

There's also the ELDK from Denx:  

http://www.denx.de/en/view/Software/WebHome#Embedded_Linux_Development_Kit

josh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

2007-09-17 Thread YOSHIFUJI Hideaki / 吉藤英明

In article <[EMAIL PROTECTED]> (at Mon, 17 Sep 2007 19:20:44 -0700 (PDT)), 
David Miller <[EMAIL PROTECTED]> says:

> From: lepton <[EMAIL PROTECTED]>
> Date: Tue, 18 Sep 2007 10:16:17 +0800
> 
> > Hi,
> >   In some situation, icmp_reply and ip_send_reply will send
> >   out packet with the wrong source addr, the following patch
> >   will fix this.
> > 
> >   I don't understand why we must use rt->rt_src in the current
> >   code, if this is a wrong fix, please correct me.
> > 
> > Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>
> 
> That the address is wrong is your opinion only :-)
> 
> Source address selection is a rather complex topic, and
> here we are definitely purposefully using the source
> address selected by the routing lookup for the reply.

And, if you do think something is "wrong", you need to describe it
in detail, at least.

--yoshfuji
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

2007-09-17 Thread David Miller

From: lepton <[EMAIL PROTECTED]>
Date: Tue, 18 Sep 2007 10:16:17 +0800

> Hi,
>   In some situation, icmp_reply and ip_send_reply will send
>   out packet with the wrong source addr, the following patch
>   will fix this.
> 
>   I don't understand why we must use rt->rt_src in the current
>   code, if this is a wrong fix, please correct me.
> 
> Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>

That the address is wrong is your opinion only :-)

Source address selection is a rather complex topic, and
here we are definitely purposefully using the source
address selected by the routing lookup for the reply.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] 2.6.22.6 NETWORKING [IPV4]: Always use source addr in skb to reply packet

Hi,
  In some situation, icmp_reply and ip_send_reply will send
  out packet with the wrong source addr, the following patch
  will fix this.

  I don't understand why we must use rt->rt_src in the current
  code, if this is a wrong fix, please correct me.

Signed-off-by: Lepton Wu <[EMAIL PROTECTED]>

diff -X linux-2.6.22.6/Documentation/dontdiff -pru 
linux-2.6.22.6/net/ipv4/icmp.c linux-2.6.22.6-lepton/net/ipv4/icmp.c
--- linux-2.6.22.6/net/ipv4/icmp.c  2007-09-14 17:41:18.0 +0800
+++ linux-2.6.22.6-lepton/net/ipv4/icmp.c   2007-09-18 09:57:30.0 
+0800
@@ -382,6 +382,7 @@ static void icmp_reply(struct icmp_bxm *
struct ipcm_cookie ipc;
struct rtable *rt = (struct rtable *)skb->dst;
__be32 daddr;
+   struct iphdr *ip = ip_hdr(skb);
 
if (ip_options_echo(_param->replyopts, skb))
return;
@@ -393,7 +394,7 @@ static void icmp_reply(struct icmp_bxm *
icmp_out_count(icmp_param->data.icmph.type);
 
inet->tos = ip_hdr(skb)->tos;
-   daddr = ipc.addr = rt->rt_src;
+   daddr = ipc.addr = ip->saddr;
ipc.opt = NULL;
if (icmp_param->replyopts.optlen) {
ipc.opt = _param->replyopts;
diff -X linux-2.6.22.6/Documentation/dontdiff -pru 
linux-2.6.22.6/net/ipv4/ip_output.c linux-2.6.22.6-lepton/net/ipv4/ip_output.c
--- linux-2.6.22.6/net/ipv4/ip_output.c 2007-09-14 17:41:18.0 +0800
+++ linux-2.6.22.6-lepton/net/ipv4/ip_output.c  2007-09-18 09:57:13.0 
+0800
@@ -1337,11 +1337,12 @@ void ip_send_reply(struct sock *sk, stru
struct ipcm_cookie ipc;
__be32 daddr;
struct rtable *rt = (struct rtable*)skb->dst;
+   struct iphdr *ip = ip_hdr(skb);
 
if (ip_options_echo(, skb))
return;
 
-   daddr = ipc.addr = rt->rt_src;
+   daddr = ipc.addr = ip->saddr;
ipc.opt = NULL;
 
if (replyopts.opt.optlen) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG:] forcedeth: MCP55 not allowing DHCP

2007-09-17 Thread Casey Dahlin

Casey Dahlin wrote:
I have an Asus Striker Extreme motherboard with two built in MCP55
GigE interfaces. When I build with the original Fedora 7 release
kernel (
ftp://ftp.belnet.be/linux/fedora/linux/releases/7/Fedora/i386/os/Fedora/kernel-2.6.21-1.3194.fc7.i686.rpm
) everything works fine. However, when I boot with any updated kernels
or any other kernel (have tried building from several points in the
linus git tree between 2.6.20 and .23-rc3, and 2.6.21.2 in -stable) I
cannot get an IP address via dhcp. There is no error in dmesg. The
card shows a link and otherwise appears to be working, but it is as if
the dhcp server has been removed from the network.

On a running system there is no indication that this is a kernel bug
at all, however by varying only the kernel the bug appears and
disappears. I've run all these tests repeatedly with no intervening
updates of any other packages.

As I said I attempted to build 2.6.21.2 ( the point of divergence
between the Fedora kernel in question and -stable ) and still the card
did not work. I will next attempt to manually build the rpm for the
release kernel. If this works I will try experimenting with the
included patches to narrow it down, but at this point I'm at a
complete loss.

-Casey Dahlin

Is there any feedback to be had on this? I've gotten no reply whatsoever
from several sources now.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC -mm 2/2] i386/x86_64 boot: document for 32 bit boot protocol

2007-09-17 Thread H. Peter Anvin

Huang, Ying wrote:
> 
> OK, I will check the actual structure, and change the document
> accordingly.
> 

The best would probably be to fix zero-page.txt (and probably rename it
something saner.)

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.20 (XFS? related) crash after uptime of > 180 days during apt-get dist-upgrade on Debian Testing

2007-09-17 Thread David Chinner

On Mon, Sep 17, 2007 at 01:20:17PM -0400, Justin Piszcz wrote:
> Including the XFS mailing list in here too because it may be an XFS bug 
> looking at the call trace.
> 
> System: Debian Testing
> Kernel: 2.6.20
> Config: Attached
> 
> I was running apt-get dist-upgrade as I always do to get the latest 
> packages upgraded and the kernel OOPS'd when it was upgrading 'tzdata' and 
> the process went into D-state and I had to reboot.
> 
> The config file is from 2.6.20 but it had been moved to a 2.6.22 directory 
> for an upgrade, but all of the options have been left unchanged.
> 
> Here is the *OOPS I captured via dmesg before I rebooted:
> 
> [16201055.214559] nfsd: last server has exited
> [16201055.214566] nfsd: unexporting all filesystems
> [17341583.697472] BUG: unable to handle kernel paging request at virtual 
> address 99e00750
> [17341583.697480]  printing eip:
> [17341583.697482] c01531b0
> [17341583.697484] *pde = 
> [17341583.697488] Oops:  [#1]
> [17341583.697491] CPU:0
> [17341583.697493] EIP:0060:[]Not tainted VLI
> [17341583.697494] EFLAGS: 00210286   (2.6.20 #3)
> [17341583.697502] EIP is at __d_lookup+0x5d/0xd6
> [17341583.697505] eax: c8d7c17e   ebx: 99e00750   ecx: 0011   edx: 
> c17f9200
> [17341583.697508] esi: 99e00750   edi: d2a10016   ebp: c7fe2304   esp: 
> dba35d98
> [17341583.697511] ds: 007b   es: 007b   ss: 0068
> [17341583.697514] Process kdm_greet (pid: 22119, ti=dba34000 task=f52d4a70 
> task.ti=dba34000)
> [17341583.697516] Stack: c8d7c17e  dba35e10 f705d478 dba35db8 
> 002c d2a10016 d2a10042 [17341583.697522]dba35e10 dba35f30 
> dba35e10 c014ab6d dba35e1c c18c5240 dba35f04 c021877e [17341583.697528] 
> d2a10042 dba35e10 c8d7c17e dba35f30 c014c38f d2a10016 0101 dba35e48 
> [17341583.697534] Call Trace:
> [17341583.697537]  [] do_lookup+0x1c/0x168
> [17341583.697540]  [] xfs_vn_lookup+0x53/0x77
> [17341583.697547]  [] __link_path_walk+0x6e8/0xb1b
> [17341583.697551]  [] dput+0x18/0x121
> [17341583.697554]  [] link_path_walk+0x43/0xb8
> [17341583.697558]  [] do_path_lookup+0x75/0x181
> [17341583.697561]  [] get_empty_filp+0x2f/0xe5
> [17341583.697566]  [] __path_lookup_intent_open+0x45/0x80
> [17341583.697570]  [] path_lookup_open+0x20/0x25
> [17341583.697573]  [] open_namei+0x66/0x58a
> [17341583.697576]  [] do_filp_open+0x25/0x40
> [17341583.697580]  [] do_sys_open+0x3e/0xc7
> [17341583.697584]  [] sys_open+0x1c/0x20
> [17341583.697587]  [] syscall_call+0x7/0xb
> [17341583.697591]  ===
> [17341583.697593] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
> 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
> 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
> [17341583.697621] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 
> 0068:dba35d98
> [17341583.697626]  <1>BUG: unable to handle kernel paging request at 
> virtual address 99e00750
> [17341648.066740]  printing eip:
> [17341648.066786] c01531b0
> [17341648.066868] *pde = 
> [17341648.066916] Oops:  [#2]
> [17341648.066965] CPU:0
> [17341648.066966] EIP:0060:[]Not tainted VLI
> [17341648.066967] EFLAGS: 00010286   (2.6.20 #3)
> [17341648.067115] EIP is at __d_lookup+0x5d/0xd6
> [17341648.067165] eax: 1efcce0e   ebx: 99e00750   ecx: 0011   edx: 
> c17f9200
> [17341648.067219] esi: 99e00750   edi: cc87901a   ebp: c7fe2304   esp: 
> f7755f04
> [17341648.067271] ds: 007b   es: 007b   ss: 0068
> [17341648.067320] Process dpkg (pid: 24684, ti=f7754000 task=d9846a70 
> task.ti=f7754000)
> [17341648.067371] Stack: 1efcce0e 46dd3a20 f7755f5c e489fe28  
> 0010 cc87901a  [17341648.067715]e489fe28 0001 
> f7755f54 c014b7cb f7755f5c ef0d4098 ffd9 cc879000 [17341648.068056] 
> 0001 f7755f54 c014cf84 f7755f54 e489fe28 c18c5240 1efcce0e 0010 
> [17341648.068397] Call Trace:
> [17341648.068482]  [] __lookup_hash+0x4a/0xef
> [17341648.068563]  [] do_rmdir+0x69/0xbb
> [17341648.068642]  [] syscall_call+0x7/0xb
> [17341648.068724]  ===
> [17341648.068770] Code: 81 f2 01 00 37 9e 8b 0d 18 3f 44 c0 d3 ea 31 d0 23 
> 05 14 3f 44 c0 8b 15 1c 3f 44 c0 8b 34 82 85 f6 75 08 eb 4d 89 de 85 db 74 
> 47 <8b> 1e 0f 18 03 90 8d 6e f4 8b 04 24 3b 45 18 75 e9 8b 44 24 0c 
> [17341648.070874] EIP: [] __d_lookup+0x5d/0xd6 SS:ESP 
> 0068:f7755f04
> [17341648.070988]
> 
> I doubt I can reproduce it as it has happened after 180 days or so, and I 
> am upgrading to 2.6.22.6 but I was wondering what exactly happened here?

No idea - it looks like dkpg was trying to remove a directory on the
same path the lookup was and both have gone splat in __d_lookup on
the same dentry. Something happened in  those 180 days that left a
landmine that was tripped over here, I think. I can't see any way of
tracking it down from this, but thanks for reporting it anyway,
Justin.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To

Re: Scheduler benchmarks - a follow-up

2007-09-17 Thread Rob Hussey

On 9/17/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> * Rob Hussey <[EMAIL PROTECTED]> wrote:
>
> > http://www.healthcarelinen.com/misc/benchmarks/BOUND_hackbench_benchmark2.png
>
> heh - am i the only one impressed by the consistency of the blue line in
> this graph? :-) [ and the green line looks a bit like a .. staircase? ]
>
> i've meanwhile tested hackbench 90 and the performance difference
> between -ck and -cfs-devel seems to be mostly down to the more precise
> (but slower) sched_clock() introduced in v2.6.23 and to the startup
> penalty of freshly created tasks.
>
> Putting back the 2.6.22 version and tweaking the startup penalty gives
> this:
>
>  [hackbench 90, smaller is better]
>
> sched-devel.git  sched-devel.git+lowres-sched-clock+dsp
> ---  --
>   5.555  5.149
>   5.641  5.149
>   5.572  5.171
>   5.583  5.155
>   5.532  5.111
>   5.540  5.138
>   5.617  5.176
>   5.542  5.119
>   5.587  5.159
>   5.553  5.177
> --
>  avg: 5.572 avg: 5.150 (-8.1%)
>
> ('lowres-sched-clock' is the patch i sent in the previous mail. 'dsp' is
> a disable-startup-penalty patch that is in the latest sched-devel.git)
>
> i have used your .config to conduct this test.
>
> can you reproduce this with the (very-) latest sched-devel git tree:
>
>   git-pull 
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
>
> plus with the low-res-sched-clock patch (re-) attached below?
>
> Ingo
> ---
>  arch/i386/kernel/tsc.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux/arch/i386/kernel/tsc.c
> ===
> --- linux.orig/arch/i386/kernel/tsc.c
> +++ linux/arch/i386/kernel/tsc.c
> @@ -110,9 +110,9 @@ unsigned long long native_sched_clock(vo
>  *   very important for it to be as fast as the platform
>  *   can achive it. )
>  */
> -   if (unlikely(!tsc_enabled && !tsc_unstable))
> +   if (1 || unlikely(!tsc_enabled && !tsc_unstable))
> /* No locking but a rare wrong value is not a big deal: */
> -   return (jiffies_64 - INITIAL_JIFFIES) * (10 / HZ);
> +   return jiffies_64 * (10 / HZ);
>
> /* read the Time Stamp Counter: */
> rdtscll(this_offset);
> -

Sorry it took so long for me to get back.

Ok, to start the dmesg output for 2.6.22-ck1 is attached. The relevant
lines seem to be:
[   27.691348] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[   27.995427] Time: tsc clocksource has been installed.

I've updated to the latest sched-devel git, and applied the patch
above. I ran it through the same tests, but this time only while bound
to a single core. Some selected numbers:

lat_ctx -s 0 $i (the left most number is $i):

15  3.09
16  3.09
17  3.11
18  3.07
19  2.99
20  3.09
21  3.05
22  3.11
23  3.05
24  3.08
25  3.06

hackbench $i:

80 11.720
81 11.698
82 11.888
83 12.094
84 12.232
85 12.351
86 12.512
87 12.680
88 12.736
89 12.861
90 13.103

pipe-test (the left most number is the run #):

1  8.85
2  8.80
3  8.84
4  8.82
5  8.82
6  8.80
7  8.82
8  8.82
9  8.85
10 8.83

Once again, graphs:
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_lat_ctx_benchmark.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_hackbench_benchmark.png
http://www.healthcarelinen.com/misc/benchmarks/BOUND_PATCHED_pipe-test_benchmark.png

I saw in your other email that you'd like for me to try with
CONFIG_PREEMPT disabled. I should have a chance to try that very soon.

Regards,
Rob


dmesg-2.6.22-ck1.bz2
Description: BZip2 compressed data
<><><>

data_files2.tar.bz2
Description: BZip2 compressed data

Re: My position on general ``RAS'' tool support infrastructure

2007-09-17 Thread Randy Dunlap

On Thu, 13 Sep 2007 07:21:10 -0600 Eric W. Biederman wrote:

> Pete/Piet Delaney <[EMAIL PROTECTED]> writes:
> 
> > Jason, Eric:
> >
> > Did you read Keith Owens suggestion on RAS tools from:

Yes.  and I re-read it.

There are several things in Keith's email that make sense:

a.  all RAS tools should use a common interface
b.  it's not the kernel's job to decide which RAS tool runs first

Eric makes some good points too.  I'm mostly similar to Eric:
paranoid about trusting software/hardware after a panic (or oops).

So if someone wants to use multiple RAS tools on a panic event,
enabling an admin to set priorities is OK with me, but I'll only
trust the first one that is used, and even that one may have
problems.  IOW, I don't see a big need to support multiple RAS
tools at one time.  (speaking for myself)

> So if someone who is suggesting an implementation can absorb 
> and understand the requirements of the different groups and come
> up with solutions that meet the requirements of the different projects
> I think progress can be made.  That as far as I know takes talent.

Ack that.

---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] powerpc: Avoid pointless WARN_ON(irqs_disabled()) from panic codepath

2007-09-17 Thread Randy Dunlap

On Tue, 18 Sep 2007 05:13:40 +0530 (IST) Satyam Sharma wrote:

> Untested (not even compile-tested) patch.
> Could someone point me to ppc32/64 cross-compilers for i386?

OSDL had some, but those are gone now.
I downloaded all of them and still use them, although it would
be good to have some more recent versions of them.

I put the power* compiler tarballs here:

http://userweb.kernel.org/~rdunlap/cross-compilers/

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Fwd: Intel DQ35JOE Mainboard 82566DM-2 Gigabit Network

2007-09-17 Thread John Duthie

I'm having a few Problems with a NEW PC

Spec is:
Intel DQ35JOE Mainboard
Intel Q6600 Quad core CPU
4GB ram
3 SATA HDDs
1 SATA DVD-RW

The Integrated NIC is not found by kernel 2.6.23-rc6 or  2.6.22.1
Am I missing an option in there ??

The Intel Drivers (e1000-7.6.5)  don't compile against 2.6.23-rc6 or 2.6.22.1
/usr/src/intel/e1000- 7.6.5/src/e1000_ethtool.c:2109: error:
'ethtool_op_get_perm_addr' undeclared here (not in a function)
( I know, wrong place to report this .. )

( also SATA dvd writer does not seem to write yet )

If anyone has Patches to try I'm Currently able and willing to test
them on this hardware config!

see attached stuff
mail me for more info if required !

TIA


dmesg.gz
Description: GNU Zip compressed data


lspciv.gz
Description: GNU Zip compressed data


dotconfig.gz
Description: GNU Zip compressed data

Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

2007-09-17 Thread J. Bruce Fields

On Tue, Sep 18, 2007 at 12:54:07AM +0100, Nix wrote:
> The code which calls new_do_write() looks like this:
> 
> ,[ libio/fileops.c:_IO_new_file_xsputn() ]
> |  if (do_write)
> |{
> |  count = new_do_write (f, s, do_write);
> |  to_do -= count;
> |  if (count < do_write)
> |return n - to_do;
> |}
> `
> 
> This code handles partial writes followed by errors by returning a
> suitable nonzero value, and immediate errors by returning -1.
> 
> In either case the buffer will have been filled as much as possible by
> that point, and will still be filled when (vf)printf() is next called.

OK, I'm a little lost at this point (what's n?  What's to_do?), but I'll
take your word for it.

I'd be kinda curious when exactly the behavior changed and why.

Also I suppose we should check which version of nfs-utils that fix is in
and make sure distributions are getting the fixed nfs-utils before they
get the new libc, or we're going to see this bug a lot

> This behaviour is, IIRC, mandated by the C Standard: I can find no
> reference in the Standard to streams being flushed on error, only
> on fclose(), fflush(), or program termination.

OK!

Let me know if the problem's fixed.

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC -mm 2/2] i386/x86_64 boot: document for 32 bit boot protocol

2007-09-17 Thread Huang, Ying

On Mon, 2007-09-17 at 08:29 -0700, H. Peter Anvin wrote:
> Huang, Ying wrote:
> > This patch defines a 32-bit boot protocol and adds corresponding
> > document.
> > +
> > +In addition to read/modify/write kernel header of the zero page as
> > +that of 16-bit boot protocol, the boot loader should fill the
> > +following additional fields of the zero page too.
> > +
> > +Offset TypeDescription
> > +--     ---
> > +0  32 bytesstruct screen_info, SCREEN_INFO
> > +   ATTENTION, overlaps the following !!!
> > +2  unsigned short  EXT_MEM_K, extended memory size in Kb (from int 
> > 0x15)
> > + 0x20  unsigned short  CL_MAGIC, commandline magic number (=0xA33F)
> > + 0x22  unsigned short  CL_OFFSET, commandline offset
> > +   Address of commandline is calculated:
> > + 0x9 + contents of CL_OFFSET
> > +   (only taken, when CL_MAGIC = 0xA33F)
> > + 0x40  20 bytesstruct apm_bios_info, APM_BIOS_INFO
> > + 0x60  16 bytesIntel SpeedStep (IST) BIOS support information
> > + 0x80  16 byteshd0-disk-parameter from intvector 0x41
> > + 0x90  16 byteshd1-disk-parameter from intvector 0x46
> > +
> > + 0xa0  16 bytesSystem description table truncated to 16 bytes.
> > +   ( struct sys_desc_table_struct )
> > + 0xb0 - 0x13f  Free. Add more parameters here if you really 
> > need them.
> > + 0x140- 0x1be  EDID_INFO Video mode setup
> > +
> > +0x1c4  unsigned long   EFI system table pointer
> > +0x1c8  unsigned long   EFI memory descriptor size
> > +0x1cc  unsigned long   EFI memory descriptor version
> > +0x1d0  unsigned long   EFI memory descriptor map pointer
> > +0x1d4  unsigned long   EFI memory descriptor map size
> > +0x1e0  unsigned long   ALT_MEM_K, alternative mem check, in Kb
> > +0x1e4  unsigned long   Scratch field for the kernel setup code
> > +0x1e8  charnumber of entries in E820MAP (below)
> > +0x1e9  unsigned char   number of entries in EDDBUF (below)
> > +0x1ea  unsigned char   number of entries in EDD_MBR_SIG_BUFFER (below)
> > +0x290 - 0x2cf  EDD_MBR_SIG_BUFFER (edd.S)
> > +0x2d0 - 0xd00  E820MAP
> > +0xd00 - 0xeff  EDDBUF (edd.S) for disk signature read sector
> > +0xd00 - 0xeeb  EDDBUF (edd.S) for edd data
> > +
> > +After loading and setuping the zero page, the boot loader can load the
> > +32/64-bit kernel in the same way as that of 16-bit boot protocol.
> > +
> > +In 32-bit boot protocol, the kernel is started by jumping to the
> > +32-bit kernel entry point, which is the start address of loaded
> > +32/64-bit kernel.
> > +
> > +At entry, the CPU must be in 32-bit protected mode with paging
> > +disabled; the CS and DS must be 4G flat segments; %esi holds the base
> > +address of the "zero page"; %esp, %ebp, %edi should be zero.
> 
> This is just replicating the "zero-page.txt" document, which can best be
> described as a "total lie" -- compare with the actual structure.

OK, I will check the actual structure, and change the document
accordingly.

Best Regards,
Huang Ying
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC -mm 1/2] i386/x86_64 boot: setup data

2007-09-17 Thread Huang, Ying

On Mon, 2007-09-17 at 08:30 -0700, H. Peter Anvin wrote:
> Huang, Ying wrote:
> > This patch add a field of 64-bit physical pointer to NULL terminated
> > single linked list of struct setup_data to real-mode kernel
> > header. This is used to define a more extensible boot parameters
> > passing mechanism.
> 
> You MUST NOT add a field like this without changing the version number,
> and, since you expect to enter the kernel at the PM entrypoint, you
> better *CHECK* that version number before ever descending down the chain.
> 

I forgot changing the version number in boot/head.S. I will add it. And
I will add version number checking before descending down the chain.

Best Regards,
Huang Ying
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Wasting our Freedom

2007-09-17 Thread Al Viro

On Mon, Sep 17, 2007 at 05:03:55PM -0700, David Schwartz wrote:
> 
> > "David Schwartz" <[EMAIL PROTECTED]> writes:
> 
> > > My point is that you *cannot* prevent a recipient of a
> > > derivative work from
> > > receiving any rights under either the GPL or the BSD to any protectable
> > > elements in that work.
> >
> > Of course you can.
> 
> No you can't.

Gentlemen, please remove your wanking selves back to the gutter you've
crawled from.  This is not slashdot[1].  This is not gnu.misc.discuss.
This is not alt.sex.cartooney.sue.sue.sue.  This is a technical maillist
and that dungpile doesn't belong here.  If you insist on hitting vger,
ask davem to create a new maillist ([EMAIL PROTECTED] would fit that kind
of traffic nicely) and for pity sake, do fuck off already.  Enough is
enough.

[1] the spews from nerds, the spews that splatter...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

2007-09-17 Thread Daniel Phillips

On Friday 07 September 2007 22:12, Mike Snitzer wrote:
> Can you be specific about which changes to existing mainline code
> were needed to make recursive reclaim "work" in your tests (albeit
> less ideally than peterz's patchset in your view)?

Sorry, I was incommunicado out on the high seas all last week.  OK, the
measures that actually prevent our ddsnap driver from deadlocking are:

  - Statically prove bounded memory use of all code in the writeout
path.

  - Implement any special measures required to be able to make such a
proof.

  - All allocations performed by the block driver must have access
to dedicated memory resources.

  - Disable the congestion_wait mechanism for our code as much as
possible, at least enough to obtain the maximum memory resources
that can be used on the writeout path.

The specific measure we implement in order to prove a bound is:

  - Throttle IO on our block device to a known amount of traffic for
which we are sure that the MEMALLOC reserve will always be
adequate.

Note that the boundedness proof we use is somewhat loose at the moment. 
It goes something like "we only need at most X kilobytes of reserve and 
there are X megabytes available".  Much of Peter's patch set is aimed 
at getting more precise about this, but to be sure, handwaving just 
like this has been part of core kernel since day one without too many 
ill effects.

The way we provide guaranteed access to memory resources is:

  - Run critical daemons in PF_MEMALLOC mode, including
any userspace daemons that must execute in the block IO path
   (cluster coders take note!)

Right now, all writeout submitted to ddsnap gets handed off to a daemon
running in PF_MEMALLOC mode.  This is a needless inefficiency that we 
want to remove in future, and handle as many of those submissions as 
possible entirely in the context of the submitter.  To do this, further 
measures are needed:

  - Network writes performed by the block driver must have access to
dedicated memory resources.

We have not yet managed to trigger network read memory deadlock, but it 
is just a matter of time, additional fancy virtual block devices, and 
enough stress.  So:

  - Network reads need some fancy extra support because dedicated
memory resources must be consumed before knowing whether the
network traffic belongs to a block device or not.

Now, the interesting thing about this whole discussion is, none of the 
measures that we are actually using at the moment are implemented in 
either Peter's or Christoph's patch set.  In other words, at present we 
do not require either patch set in order to run under heavy load 
without deadlocking.  But in order to generalize our solution to a 
wider range of virtual block devices and other problematic systems such 
as userspace filesystems, we need to incorporate a number of elements 
of Peter's patch set.

As far as Christoph's proposal goes, it is not required to prevent 
deadlocks.   Whether or not it is a good optimization is an open 
question.

Of all the patches posted so far related to this work, the only 
indispensable one is the bio throttling patch developed by Evgeniy and 
I in a parallel thread.  The other essential pieces are all implemented 
in our block driver for now.  Some of those can be generalized and 
moved at least partially into core, and some cannot.

I do need to write some sort of primer on this, because there is no 
fire-and-forget magic core kernel solution.  There are helpful things 
we can do in core, but some of it can only be implemented in the 
drivers themselves.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] Consolidate host virtualization support under Virtualization menu

2007-09-17 Thread Jeremy Fitzhardinge

Charles N Wyble wrote:
>
>
> Zachary Amsden wrote:
> >
> > Virtualization is completely different, and probably needs separate
> > server (kvm, lguest) and client (kvm, lguest, xen, vmware) sections in
> > it's menu.
>
>
> So what is the differentiation between client and server above? Just
> curious what makes kvm and lguest server and client.

"Host" and "guest" are better terms, I think.  Kvm is all host, since
guests need no modification.  lguest turns the kernel into both host and
guest.  Xen Linux kernels are all guest, since the Xen hypervisor is the
host.

J
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] ext34: ensure do_split leaves enough free space in both blocks

2007-09-17 Thread hooanon05


Andreas Dilger:
> > So this looks like 2.6.22 and 2.6.23 material, but the timing is getting
> > pretty squeezy.  Could people please give this change an extra-close
> > review, let me know?
> 
> I already discussed it at length with Eric and inspected the patch, so we
> could add:
> Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> 
> Haven't actually tested the code myself.

I've just tested the patch on linux-2.6.23-rc6 (i386) with the test
program I posted a few months ago, and found it solved the problem.
Thank you very much Eric Sandeen, Andreas Dilger and all in ML.

Junjiro Okajima
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] powerpc: Avoid pointless WARN_ON(irqs_disabled()) from panic codepath



On Tue, 18 Sep 2007, Satyam Sharma wrote:
> 
> > [ cut here ]
> > Badness at arch/powerpc/kernel/smp.c:202
> 
> comes when smp_call_function_map() has been called with irqs disabled,
> which is illegal. However, there is a special case, the panic() codepath,
> when we do not want to warn about this -- warning at that time is pointless
> anyway, and only serves to scroll away the *real* cause of the panic and
> distracts from the real bug.

BTW arch/ppc/ has same issue, but that's gonna be removed by next year
anyways, so I think there's no point making a patch for that (?)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] Consolidate host virtualization support under Virtualization menu

2007-09-17 Thread Charles N Wyble

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1



Zachary Amsden wrote:
>
> Virtualization is completely different, and probably needs separate
> server (kvm, lguest) and client (kvm, lguest, xen, vmware) sections in
> it's menu.


So what is the differentiation between client and server above? Just
curious what makes kvm and lguest server and client.

>
> Zach
>
> ___
> Virtualization mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/virtualization
>
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFG7w3ckQPZV56XDBMRAvvaAJ9cHl+A321MJyw6W4J4yIDurz0K2wCcDg8J
uOR6alAGvWjxEbThiuaeIDc=
=TQ3m
-END PGP SIGNATURE-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Wasting our Freedom

2007-09-17 Thread David Schwartz

> "David Schwartz" <[EMAIL PROTECTED]> writes:

> > My point is that you *cannot* prevent a recipient of a
> > derivative work from
> > receiving any rights under either the GPL or the BSD to any protectable
> > elements in that work.
>
> Of course you can.

No you can't.

> What rights do you have to BSD-licenced works, made available
> (under BSD) to MS exclusively? You only get the binary object...

You are equating what rights I have with my ability to exercise those
rights. They are not the same thing. For example, I once bought the rights
to publically display the movie "Monty Python and the Holy Grail". To my
surprise, the rights to public display did not include an actual copy of the
film.

In any event, I never claimed that anyone has rights to a protectable
element that they do not possess a lawful copy of. That's a complete
separate issue and one that has nothing to do with what's being discussed
here because these are all cases where you have the work.

> You know, this is quite common practice - instead of assigning
> copyright, you can grant a BSD-style licence (for some fee,
> something like "do what you want but I will do what I want with
> my code").

Sure, *you* can grant a BSD-style license to any protectable elements *you*
authored. But unless your recpients can obtain a BSD-style license to all
protectable elements in the work from their respective authors, they cannot
modify or distribute it.

*You* cannot grant any rights to protectable elements authored by someone
else, unless you have a relicensing agreement. Neither the GPL nor the BSD
is one of those.

> >> If A sold a BSD licence to B only and this B sold a proprietary
> >> licence (for a derived work) to C, C (without that clause) wouldn't
> >> have a BSD licence to the original work. This is BTW common scenario.
> >
> > C most certainly would have a BSD license, should he choose to
> > comply with
> > terms, to every protectable element that is in both the
> > original work and
> > the work he received.

> But he may have received only binary program image - or the source
> under NDA.
> Sure, NDA doesn't cover public information, but BSD doesn't mean public.
> Now what?

What the hell does that have to do with anything? Are you just trying to be
deliberately dense or waste time? Is it not totally obvious how the
principles I explain apply to a case like that?

Only someone who signs an NDA must comply with it. If you signed an NDA, you
must comply with it. An NDA can definitely subtract rights. It's a complex
question whether an NDA can subtract GPL rights, but again, that has nothing
to do with what we're talking about here.

Sure, you can have the right from me to do X and still not be allowed to do
X because you agreed with someone else not to do it. So what?

> > C has no right to license any protectable element he did not author to
> > anyone else. He cannot set the license terms for those elements to C.

> Sure, the licence covers the >>>entire work<<<, not some "elements".

This is a misleading statement. The phrase "entire work" has two senses. The
license definitely does not cover the "entire work" in the sense of every
protectable element in the work unless each individual author of those
elements chose to offer that element under that license.

If by "entire work", you mean any compilation or derivative work copyright
the "final" author has, then yes, that's available under whatever license
the "final" author places it under. But that license does not actually
permit you to distribute the work.

This is really complicated and I wish I had a clear way to explain it.
Suppose I write a work and then you modify it. Assume your modification
includes adding new protectable elements to that work. When someone
distributes that new derivative work, they are distributing protectable
elements authored by both you and me.

Absent a relicensing agreement, they must obtain some rights from you and
some rights from me to do that. You cannot license the protectable elements
that I authored that are still in the resulting derivative work.

> > Neither the BSD nor the GPL ever give you the right to change the actual
> > license a work is offered under by the original author.
>
> Of course, that's a very distant thing.

Exactly. Every protectable element in the final work is licensed by the
original author to every recipient who takes advantage of the license offer.

> >> BTW: a work by multiple authors is a different thing than a work
> >> derived from another.
> >
> > In practice it doesn't matter.
>
> Of course it does. Only author of a (derived) work can licence
> it, in this case he/she could change the licence back to BSD,
> or sell it to MS (if not based on GPL etc).

Only the author of any protectable element can license it, whether it's in a
derivated work or by itself.

You are seriously confused if you think that just because you create a
derivative work that includes my protectable elements you can then license
the

Re: Wasting our Freedom

2007-09-17 Thread Ingo Schwarze

[EMAIL PROTECTED] wrote on Sun, Sep 16, 2007 at 04:40:38PM -0700:
> On Sun, 16 Sep 2007, Jacob Meuser wrote:

>> so the linux community is morally equivilent to a corporation?
>> that's what it sounds like you are all legally satisfied with.
>
> if it's legal it's legal. it's not a matter of the Linux community being 
> satisfied with it, it's a matter of the BSD people desiring it based on 
> their selection of license (and the repeated statements that this feature 
> of the BSD license being an advantage compared to the GPL makes it clear 
> that this isn't an unknown side effect, it's an explicit desire).

Indeed, that argument is often paraphrased in a way that makes it
hard to understand.  What i heard people say is not "If people make
derivative works based on BSD code, they should make them less free
instead of fully free", but it is: "If people caring nothing about
free software in the first place are building their own commercial
systems anyway, they should rather reuse BSD code than hacking up
their own bricolage of bug-ridden insecure stuff."

Granted, that's a different approach than taken by the GPL, which
essentially says "... anyway, they deserve to be on their own."

> so the Linux community is following the desires of the BSD community
> by following their license but the BSD community is unhappy, why?

Be careful not to confuse "desires" with "legal requirements"...  :-(

Given BSD code, BSD-licensed substantial improvements
make happier than restrictively licensed substantial improvements
make happier than derived non-free closed-source software
make happier than license violations.

Besides, the Linux communities neither qualify as "caring nothing
about free software" nor as "hacking up their own bricolage of
bug-ridden insecure stuff" (hopefully ;-).  So that argument
simply doesn't apply to you.  Probably, that's why Jacob talked
about "morally equivalent to a corporation".

> you claim that it's unethical for the linux community to use the
> code, but brag about NetApp useing the code.  what makes NetApp ok
> and Linux evil?  many people honestly don't understand the logic
> behind this.  please explain it.

Several people have already explained this nicely; the degree
of happiness may also depend on the level of cooperation and
understanding you expect from the people building on the code,
given their own intentions and goals.  I may well be thankful
towards an enemy just for not killing me, but at the same time
sad about a friend leaving me out in the rain.

( This just being stated in general; i'm not sure what the state
  of discussions in the various Linux communities is just now. )
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

2007-09-17 Thread Nix

On 17 Sep 2007, J. Bruce Fields stated:
> On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
>> A while later we start seeing runs of malloc failures, which I think
>> correlated with the unexplained pauses in NFS response:
>
> Actually, they're nothing to do with malloc failures--the message
> printed here is misleading, and isn't even an error; it gets printed
> whenever an upcall to mountd is made.

Indeed, with more debugging, all the failures I see come from the call
to exp_find(), which is digging out exports...

>The problem is almost certainly a
> problem with kernel<->mountd communication--the kernel depends on mountd
> to answer questions about exported filesystems as part of the fh_verify
> code.

Ah! I keep forgetting that mountd isn't just used at mount time: damned
misleading names, grumble.

Restarting mountd clears the problem up temporarily, so you are
definitely right.

> commit dd087896285da9e160e13ee9f7d75381b67895e3
> Author: J. Bruce Fields <[EMAIL PROTECTED]>
> Date:   Thu Jul 26 16:30:46 2007 -0400

Aha! I'm on 3b55934b9baefecee17aefc3ea139e261a4b03b8, over a month older.

> On a recent Debian/Sid machine, I saw libc retrying stdio writes that
> returned write errors.

Debian Sid recently upgraded to glibc 2.6.x, as did I... earlier
versions of glibc will have had this behaviour too, but it may have been
less frequent.

> I don't know whether this libc behavior is correct or expected, but it
> seems safest to add the __fpurge() (suggested by Neil) to ensure data is
> thrown away.

It is expected, judging from my reading of the
code. stdio-common/vfprintf.c emits single chars using the outchar()
macro, and strings using the outstring() macro, using functions in libio
to do the job. The string output routine then calls _IO_file_xsputn(),
which, tracing through libio's jump tables and symbol aliases, ends up
calling _IO_new_file_xsputn() in libio/fileops.c. (I've only just
started to understand libio. It's basically undocumented as far as I can
tell, but it's deeply nifty. Think of stdio, only made entirely out of
hookable components. :) )

(Actual writing then thunks down through _IO_new_do_write() and
new_do_write() in the same file, which finally calls __write().  If
there's any kind of error this returns EOF after some opaque messing
about with a _cur_column value, which is as far as I can tell never
used!)

The code which calls new_do_write() looks like this:

,[ libio/fileops.c:_IO_new_file_xsputn() ]
|  if (do_write)
|{
|  count = new_do_write (f, s, do_write);
|  to_do -= count;
|  if (count < do_write)
|return n - to_do;
|}
`

This code handles partial writes followed by errors by returning a
suitable nonzero value, and immediate errors by returning -1.

In either case the buffer will have been filled as much as possible by
that point, and will still be filled when (vf)printf() is next called.

This behaviour is, IIRC, mandated by the C Standard: I can find no
reference in the Standard to streams being flushed on error, only
on fclose(), fflush(), or program termination.

I'm upgrading now: thank you!

-- 
`Some people don't think performance issues are "real bugs", and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.23-rc4-mm1 OOPS in forcedeth?

On Mon, 17 Sep 2007, Denis V. Lunev wrote:
> Dhaval Giani wrote:
> > On Thu, Sep 13, 2007 at 11:51:33PM -0400, Andrew James Wade wrote:

> >> EIP: [] tcp_rto_min+0xb/0x15 SS:ESP 0068:c0596dec

As Vlad Yasevich mentioned, this one is already fixed in 23-rc6.

The forcedeth oops is unrelated, but multiple people have reported that
same oops now -- adding Manfred Spraul to CC. [ original thread is at:
http://lkml.org/lkml/2007/9/1/115 ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Wasting our Freedom

2007-09-17 Thread Theodore Tso

On Mon, Sep 17, 2007 at 03:06:37PM -0700, Can E. Acar wrote:
> The only remaining issue is whether Nick & Jiri have enough
> original contributions to the code to be added to the Copyright.
> 
> I believe this needs to be resolved between Reyk and Nick and Jiri.
> 
> The main reason of Theo's message, linked earlier, was the
> lack of response on this issue. It seems that the SFLC is
> dismissing this issue, and thus stalling its resolution by the
> developers.

OK, so all of this flaming, and digging up of "licenses ripped off",
and chaff thrown up in the air, and moaning and bewailing about
"theft", is now down to these two lines regarding Nick and Jiri:

> * Copyright (c) 2004-2007 Reyk Floeter <[EMAIL PROTECTED]>
> * Copyright (c) 2006-2007 Nick Kossifidis <[EMAIL PROTECTED]>
> * Copyright (c) 2007 Jiri Slaby <[EMAIL PROTECTED]>
> [snip rest of BSD license]

It's under a BSD license; what material difference does those two
lines make, for goodness sake?  It's under a BSD license, so it's not
like anything won't be "given back".  Whether or not they have made
enough for changes is really a question for the lawyers, and may
differ from one jurisdiction to another --- but whether or not they
have now, or maybe will not make until later --- does it really make a
difference?  Who gets hurt if someone gets they get a bit more credit
than they deserve?  Certainly the most important thing is that Reyk is
given proper credit, right?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue

2007-09-17 Thread John Blackwood

Roland Dreier wrote:

Thanks for the explanation...

 > But basically, with CONFIG_PREEMPT_RT enabled, the lock points, such as
 > aqcuiring a spinlock, potentially become places where the current task
 > may be context switched out / preempted.
 > 
 > Therefore, when a call is made to lock a spinlock for example, the

 > caller should not currently have irqs disabled, or preemption disabled,
 > since a context switch may occur.

this doesn't seem relevant here...

Hi Roland,

right.  just some background info.

 > void fastcall rt_downgrade_write(struct rw_semaphore *rwsem)
 > {
 > BUG();
 > }

this seems to be the problem... the -rt patch turns downgrade_write()
into a BUG().

I need to look at the locking in user_mad.c again, but I think it may
be possible to replace both places that do downgrade_write() with
up_write() followed by down_read().

 - R.

that sounds like it would be a good solution for both preempt rt and 
non-preempt rt kernels.

thanks again for looking at this for us.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] powerpc: Avoid pointless WARN_ON(irqs_disabled()) from panic codepath


> [ cut here ]
> Badness at arch/powerpc/kernel/smp.c:202

comes when smp_call_function_map() has been called with irqs disabled,
which is illegal. However, there is a special case, the panic() codepath,
when we do not want to warn about this -- warning at that time is pointless
anyway, and only serves to scroll away the *real* cause of the panic and
distracts from the real bug.

* So let's extract the WARN_ON() from smp_call_function_map() into all its
  callers -- smp_call_function() and smp_call_function_single()

* Also, introduce another caller of smp_call_function_map(), namely
  __smp_call_function() (and make smp_call_function() a wrapper over this)
  which does *not* warn about disabled irqs

* Use this __smp_call_function() from the panic codepath's smp_send_stop()

We also end having to move code of smp_send_stop() below the definition
of __smp_call_function().

Signed-off-by: Satyam Sharma <[EMAIL PROTECTED]>

---

Untested (not even compile-tested) patch.
Could someone point me to ppc32/64 cross-compilers for i386?

 arch/powerpc/kernel/smp.c |   27 ++-
 1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1ea4316..b24dcba 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -152,11 +152,6 @@ static void stop_this_cpu(void *dummy)
;
 }
 
-void smp_send_stop(void)
-{
-   smp_call_function(stop_this_cpu, NULL, 1, 0);
-}
-
 /*
  * Structure and data for smp_call_function(). This is designed to minimise
  * static memory requirements. It also looks cleaner.
@@ -198,9 +193,6 @@ int smp_call_function_map(void (*func) (void *info), void 
*info, int nonatomic,
int cpu;
u64 timeout;
 
-   /* Can deadlock when called with interrupts disabled */
-   WARN_ON(irqs_disabled());
-
if (unlikely(smp_ops == NULL))
return ret;
 
@@ -270,10 +262,19 @@ int smp_call_function_map(void (*func) (void *info), void 
*info, int nonatomic,
return ret;
 }
 
+static int __smp_call_function(void (*func)(void *info), void *info,
+  int nonatomic, int wait)
+{
+   return smp_call_function_map(func,info,nonatomic,wait,cpu_online_map);
+}
+
 int smp_call_function(void (*func) (void *info), void *info, int nonatomic,
int wait)
 {
-   return smp_call_function_map(func,info,nonatomic,wait,cpu_online_map);
+   /* Can deadlock when called with interrupts disabled */
+   WARN_ON(irqs_disabled());
+
+   return __smp_call_function(func, info, nonatomic, wait);
 }
 EXPORT_SYMBOL(smp_call_function);
 
@@ -283,6 +284,9 @@ int smp_call_function_single(int cpu, void (*func) (void 
*info), void *info, int
cpumask_t map = CPU_MASK_NONE;
int ret = 0;
 
+   /* Can deadlock when called with interrupts disabled */
+   WARN_ON(irqs_disabled());
+
if (!cpu_online(cpu))
return -EINVAL;
 
@@ -299,6 +303,11 @@ int smp_call_function_single(int cpu, void (*func) (void 
*info), void *info, int
 }
 EXPORT_SYMBOL(smp_call_function_single);
 
+void smp_send_stop(void)
+{
+   __smp_call_function(stop_this_cpu, NULL, 1, 0);
+}
+
 void smp_call_function_interrupt(void)
 {
void (*func) (void *info);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Wasting our Freedom

2007-09-17 Thread Krzysztof Halasa

"David Schwartz" <[EMAIL PROTECTED]> writes:

> My point is that you *cannot* prevent a recipient of a derivative work from
> receiving any rights under either the GPL or the BSD to any protectable
> elements in that work.

Of course you can.
What rights do you have to BSD-licenced works, made available
(under BSD) to MS exclusively? You only get the binary object...

You know, this is quite common practice - instead of assigning
copyright, you can grant a BSD-style licence (for some fee,
something like "do what you want but I will do what I want with
my code").

>> If A sold a BSD licence to B only and this B sold a proprietary
>> licence (for a derived work) to C, C (without that clause) wouldn't
>> have a BSD licence to the original work. This is BTW common scenario.
>
> C most certainly would have a BSD license, should he choose to comply with
> terms, to every protectable element that is in both the original work and
> the work he received.

But he may have received only binary program image - or the source
under NDA.
Sure, NDA doesn't cover public information, but BSD doesn't mean public.
Now what?

> C has no right to license any protectable element he did not author to
> anyone else. He cannot set the license terms for those elements to C.

Sure, the licence covers the >>>entire work<<<, not some "elements".

> Neither the BSD nor the GPL ever give you the right to change the actual
> license a work is offered under by the original author.

Of course, that's a very distant thing.

>> BTW: a work by multiple authors is a different thing than a work
>> derived from another.
>
> In practice it doesn't matter.

Of course it does. Only author of a (derived) work can licence
it, in this case he/she could change the licence back to BSD,
or sell it to MS (if not based on GPL etc).

> Would you argue that I can license Disney's "The Lion King" movie to you if
> I promise not to sue you over any (no) rights that I possess to it?

Sure you can :-) that doesn't mean it would protect me from Disney,
but you can.

> You are confusing licenses of two very different types. The BSD and GPL
> licenses only cover modification and distribution, two rights you do not get
> to MS Windows at all. *Use* is not restricted under copyright.

I'm told in the USA use = copying from disk to RAM = distribution,
isn't it true? :-)
It doesn't matter of course.

> There is simply nothing remotely comparable to the BSD or GPL license in the
> case of MS Windows. There is no grant of additional rights beyond those you
> get automatically with lawful possession (such as use).

I don't compare them (though you can). You don't get a licence for
"original elements" in MS-Windows, do you?

> If MS wished to grant someone the right to modify or redistribute Windows,
> that person would also need to obtain the right to modify or distribute
> protectable elements not authored by Microsoft. The only way they could
> obtain those rights, assuming Microsoft didn't have written relicensing
> agreements, is from the original author under the original licenses.

Yes, but it isn't automatic. Imagine you have received something
from MS, under more permissive licence (I think such things did
happen). How do you, for example, recognice boundaries of the
elements, IOW what additional rights do you have to each line in
the code or pixel in the font?

The file itself only states:
(C) MS
portions (C) e.g. Bitstream
licenced under their special agreement

What extra rights do you receive from Bitstream? Perhaps you should
ask them if they have given you some licence? :-)

Or another example, redistributable runtime libraries. What extra
rights do you have?

What you write is true for GPL, but it doesn't mean it's true
everytime. It's just that clause in the GPL.
-- 
Krzysztof Halasa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] modpost: detect unterminated device id lists

2007-09-17 Thread Mauro Carvalho Chehab

Hi Andrew,

Em Seg, 2007-09-17 às 14:50 -0700, Andrew Morton escreveu:
> On Tue, 18 Sep 2007 03:15:14 +0530 (IST)
> Satyam Sharma <[EMAIL PROTECTED]> wrote:
> 
> > 
> > 
> > On Sun, 16 Sep 2007, Andrew Morton wrote:
> > 
> > > On Mon, 17 Sep 2007 05:54:45 +0530 "Satyam Sharma" <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > On 9/17/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > I'm getting this:
> > > > >
> > > > > rusb2/pvrusb2: struct usb_device_id is 20 bytes.  The last of 3 is:
> > > > > 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
> > > > > 0x00
> > > > > 0x00 0x00 0x00 0x00 0x00
> > > > > FATAL: drivers/media/video/pvrusb2/pvrusb2: struct usb_device_id is 
> > > > > not terminated
> > > > > with a NULL entry!
> > > > >
> > > > > ("rusb2/pvrusb2" ??)
> > > > 
> > > > Hmm? Are you sure you didn't see any "drivers/media/video/pv" before the
> > > > "rusb2/pvrusb2" bit?
> > > 
> > > Fairly.  I looked twice.
> > 
> > "drivers/media/video/pvrusb2/pvrusb2" comes out correctly here ...
> > 
> > 
> > > > Looking at Kees' patch (and the existing code), I've no
> > > > clue how/why this should happen ... will try to reproduce here ...
> > > > 
> > > > 
> > > > > but:
> > > > >
> > > > > struct usb_device_id pvr2_device_table[] = {
> > > > > [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
> > > > > [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> > > > > { USB_DEVICE(0, 0) },
> > > > > };
> > > > >
> > > > > looks OK?
> > > > >
> > > > > Using plain old "{ }" shut the warning up.
> > > > 
> > > > USB_DEVICE(0, 0) is not empty termination, actually, and this looks like
> > > > a genuine bug caught by the patch. As that dump shows, USB_DEVICE(0, 0)
> > > > assigns "0x03 0x00" (in little endian) to usb_device_id.match_flags. And
> > > > I don't think the USB code treats such an entry as an empty entry (?)
> > > > 
> > > > Interestingly, the "USB_DEVICE(0, 0)" thing is absent from latest -git
> > > > tree and also in my copy of 23-rc4-mm1 -- so this looks like something
> > > > you must've merged recently.
> > > 
> > > git-dvb very carefully does
> > > 
> > > --- a/drivers/media/video/pvrusb2/pvrusb2-hdw.c~git-dvb
> > > +++ a/drivers/media/video/pvrusb2/pvrusb2-hdw.c
> > > @@ -44,7 +44,7 @@
> > >  struct usb_device_id pvr2_device_table[] = {
> > >   [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
> > >   [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> > > -   { }
> > > +   { USB_DEVICE(0, 0) },
> > > };
> > >
> > > MODULE_DEVICE_TABLE(usb, pvr2_device_table);
> > 
> > Ok, this is a false positive indeed, the core USB code does in fact
> > treat such an entry as an empty entry (usb_match_id() tests only the
> > .idVendor, .bDeviceClass, .bInterfaceClass and .driver_info members
> > for non-zero and not the .match_flags member).
> > 
> > However, a quick-grep-and-glance tells us that none of the other 2213
> > occurrences of USB_DEVICE() in the tree ever do this "(0,0)" thing,
> > so it does make sense to change this one to a simple "{ }" as well --
> > that's clearer style anyway, and the "standard" way to empty-terminate
> > in the rest of the tree, if nothing else.
> > 
> 
> yeah, I think so.  Mauro, could you please drop that change?

Patch dropped from my tree.

Cheers,
Mauro.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6.23-rc4-mm1][Bug] kernel BUG at include/linux/netdevice.h:339!

2007-09-17 Thread David Miller

From: Andrew Morton <[EMAIL PROTECTED]>
Date: Mon, 17 Sep 2007 14:16:22 -0700

> On Mon, 17 Sep 2007 17:46:38 +0530
> Kamalesh Babulal <[EMAIL PROTECTED]> wrote:
> 
> > Kernel Bug is hit with 2.6.23-rc4-mm1 kernel on ppc64 machine.
> > 
> > kernel BUG at include/linux/netdevice.h:339!
> 
> (please cc [EMAIL PROTECTED] on networking-related matters)
> 
> You died here:
> 
> static inline void napi_complete(struct napi_struct *n)
> {
> BUG_ON(!test_bit(NAPI_STATE_SCHED, >state));
> 
> The NAPI changes have had a few problems and hopefully things have
> been fixed up since then.  I'll try to get rc6-mm1 out this evening,
> so please retest that?

And if you trigger this still it is absolutely critical that
you tell us what networking device driver you are using at
the time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 048/104] KVM: Add and use pr_unimpl for standard formatting of unimplemented features

2007-09-17 Thread Rusty Russell

On Mon, 2007-09-17 at 09:16 -0700, Joe Perches wrote:
> On Mon, 2007-09-17 at 10:31 +0200, Avi Kivity wrote:
> > diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
> > index cfda3ab..6d25826 100644
> > --- a/drivers/kvm/kvm.h
> > +++ b/drivers/kvm/kvm.h
> > @@ -474,6 +474,14 @@ struct kvm_arch_ops {
> >  
> >  extern struct kvm_arch_ops *kvm_arch_ops;
> >  
> > +/* The guest did something we don't support. */
> > +#define pr_unimpl(vcpu, fmt, ...)  \
> > + do {  
> > \
> > +   if (printk_ratelimit()) \
> > +   printk(KERN_ERR "kvm: %i: cpu%i " fmt,  \
> > +  current->tgid, (vcpu)->vcpu_id , ## __VA_ARGS__); \
> > + } while(0)
> > +
> >  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
> >  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
> >  
> 
> This converts all KERN_ uses to KERN_ERR.
> It seems better to add a  argument to kvm_printf.
> pr_unimpl is perhaps a poor name choice.
> perhaps vcpu_printk_ratelimit(vcpu, level, fmt, ...)

Possibly, but remember that printk() is an admission of failure.  It's
only useful to developers, and the only reason for printk over
pr_debug() is for users to report to developers when guests crash.

pr_unimpl() means exactly what it says: the guest asked for something we
don't support.  If that turns out to be the last thing in the logs
before a crash, it's a clue.  The rest of the printks should probably
move to pr_debug().

Hope that helps,
Rusty.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Sun, 16 Sep 2007, Nick Piggin wrote:

> > > So if you argue that vmap is a downside, then please tell me how you
> > > consider the -ENOMEM of your approach to be better?
> >
> > That is again pretty undifferentiated. Are we talking about low page
> 
> In general.

There is no -ENOMEM approach. Lower order page allocation (< 
PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer 
will be activated. That is the nature of the failures that we saw early in 
the year when this was first merged into mm.

> > With the ZONE_MOVABLE you can remove the unmovable objects into a defined
> > pool then higher order success rates become reasonable.
> 
> OK, if you rely on reserve pools, then it is not 1st class support and hence
> it is a non-solution to VM and IO scalability problems.

ZONE_MOVABLE creates two memory pools in a machine. One of them is for 
movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first 
call support for order 0 pages?

> > > If, by special software layer, you mean the vmap/vunmap support in
> > > fsblock, let's see... that's probably all of a hundred or two lines.
> > > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > > pagecache and its new special mmap layer... Hmm, seems like a no
> > > brainer to me. You really still want to persue the "extra layer"
> > > argument as a point against fsblock here?
> >
> > Yes sure. You code could not live without these approaches. Without the
> 
> Actually: your code is the one that relies on higher order allocations. Now
> you're trying to turn that into an argument against fsblock?

fsblock also needs contiguous pages in order to have a beneficial 
effect that we seem to be looking for.

> > antifragmentation measures your fsblock code would not be very successful
> > in getting the larger contiguous segments you need to improve performance.
> 
> Complely wrong. *I* don't need to do any of that to improve performance.
> Actually the VM is well tuned for order-0 pages, and so seeing as I have
> sane hardware, 4K pagecache works beautifully for me.

Sure the system works fine as is. Not sure why we would need fsblock then.

> > (There is no new mmap layer, the higher order pagecache is simply the old
> > API with set_blocksize expanded).
> 
> Yes you add another layer in the userspace mapping code to handle higher
> order pagecache.

That would imply a new API or something? I do not see it.

> > Why: It is the same approach that you use.
> 
> Again, rubbish.

Ok the logical conclusion from the above is that you think your approach 
is rubbish Is there some way you could cool down a bit?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Mingming Cao

On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > > 
> > > From: Christoph Lameter <[EMAIL PROTECTED]>
> > > 
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > > 
> > 
> > Currently memory allocation for committed_data(and frozen_buffer) for
> > bufferhead is done through jbd slab management, as Christoph Hellwig
> > pointed out that this is broken as jbd should not pass slab pages down
> > to IO layer. and suggested to use get_free_pages() directly.
> > 
> > The problem with this patch, as Andreas Dilger pointed today in ext4
> > interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> > 1/3-1/2 page space. 
> > 
> > What was the originally intention to set up slabs for committed_data(and
> > frozen_buffer) in JBD? Why not using kmalloc?
> > 
> > Mingming
> 
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
> 
> Thanks,
> Badari
> 

Here is the incremental small cleanup patch. 

Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc.


Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
---
 fs/jbd/journal.c  |8 +---
 fs/jbd/revoke.c   |   12 ++--
 fs/jbd2/journal.c |8 +---
 fs/jbd2/revoke.c  |   12 ++--
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-17 14:32:16.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-17 14:33:59.0 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
-   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+   GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
-   journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+   GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
-   kfree(journal->j_wbuf);
+   jbd_kfree(journal->j_wbuf);
jbd_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-17 14:32:22.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-17 14:35:13.0 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
 
journal->j_revoke->hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ
 
journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
if (!journal->j_revoke_table[1]) {
-   kfree(journal->j_revoke_table[0]->hash_table);
+   jbd_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
return -ENOMEM;
}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;
 
journal->j_revoke->hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
-   kfree(journal->j_revoke_table[0]->hash_table);
+   jbd_kfree(journal->j_revoke_table[0]->hash_table);

Re: [patch 1/8] Immediate Values - Global Modules List and Module Mutex

2007-09-17 Thread Rusty Russell

On Fri, 2007-09-14 at 11:32 -0400, Mathieu Desnoyers wrote:
> * Rusty Russell ([EMAIL PROTECTED]) wrote:
> > Alternatively, if you called it "immediate_init" then the semantics
> > change slightly, but are more obvious (ie. only use this when the value
> > isn't being accessed yet).  But it can't be __init then anyway.
> > 
> 
> I think your idea is good. immediate_init() could be used to update the
> immediate values at boot time _and_ at module load time, and we could
> use an architecture specific arch_immediate_update_init() to support it.

Right.

> As for "when" to use this, it should be used at boot time when
> interrupts are still disabled, still running in UP. It can also be used
> at module load time before any of the module code is executed, as long
> as the module code pages are writable (which they always are, for
> now..). Therefore, the flag seems inappropriate for module load
> arch_immediate_update_init. It cannot be put in __init section neither
> though if we use it like this.

I think from a user's POV it would be nice to have a 1:1 mapping with
normal initialization semantics (ie. it will work as long as you don't
access this value until initialized).  And I think this would be the
case.  eg:

int foo_func(void)
{
if (immediate_read(_immediate))
return 0;
...
}

int some_init(void)
{
immediate_init(some_immediate, 0);
register_foo(foo_func);
...
}


> > On an unrelated note, did you consider simply IPI-ing and doing the
> > substitution with all CPUs stopped?  If you only updated the immediate
> > references to this particular var, it should be fast enough not to upset
> > the RT guys, even.
> > 
> 
> Yes, I thought about this, but since I use immediate values in the
> kernel markers, which can be put in exception handlers (including nmi,
> mce handler), which cannot be disabled without important side-effects, I
> don't think trying to stop the CPUs is a workable solution.

OK, but can you justify the use of immediates within the nmi or mce
handlers?  They don't strike me as useful candidates for optimization.

> > Well, you can do that in asm without gcc support.  It's a little nasty:
> > since gcc will know nothing about the function call, it can't have side
> > effects which are visible in this function, and you'll have to save and
> > restore *all* regs if you decide to do the function call.  But it's
> > possible (a 5-byte nop gets changed to a call, the call does the pushes
> > and sets the args regs, calls the function, then pops everything and
> > rets).
> 
> GCC support is required if we want to embed inline functions inside
> unlikely branches depending on immediate values (no function call
> there). It also permits passing local variables as arguments to the
> function call (stack setup), which would be tricky, instrumentation site
> specific and non portable if done in assembly.

Well if this is the slow path, you don't want inline anyway.  But it
would be horribly, horribly arch-specific, yes.

Rusty.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

2007-09-17 Thread J. Bruce Fields

On Mon, Sep 17, 2007 at 11:23:46PM +0100, Nix wrote:
> Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
> Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 
>  d32ff38f 404811a6 a88d96ab 0x1f
> Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 
>  d32ff38f 404811a6 a88d96ab)
> Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc 
> failure!
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still 
> trying
> Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still 
> trying
> Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still 
> trying
> Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still 
> trying
> 
> 
> >From then on, *every* fh_verify() request fails the same way, and
> obviously if you can't verify any fds you can't do much with NFS.
> 
> Looking back in the log I see intermittent malloc failures starting
> almost as soon as I've booted (allowing a couple of minutes for me to
> turn debugging on):
> 
> Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
> [...]
> Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
> Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3)   28: 00070001 000fb001 
>  d32ff38f 404811a6 a88d96ab
> Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 
>  d32ff38f 404811a6 a88d96ab)
> Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc 
> failure!
> 
> A while later we start seeing runs of malloc failures, which I think
> correlated with the unexplained pauses in NFS response:

Actually, they're nothing to do with malloc failures--the message
printed here is misleading, and isn't even an error; it gets printed
whenever an upcall to mountd is made.  The problem is almost certainly a
problem with kernel<->mountd communication--the kernel depends on mountd
to answer questions about exported filesystems as part of the fh_verify
code.

It's just a shot in the dark, but you might try the latest nfs-utils
(get the latest out of git://linux-nfs.org/nfs-utils if you're already
on the most recent your distro will give you).  Or just apply the
following--which did fix a problem whose symptoms varied depending on
libc behavior.

If that doesn't work, I'd try

strace -s0 `pidof rpc.mountd`

and also look at the contents of /proc/net/rpc/nfsd.fh/contents.

--b.

commit dd087896285da9e160e13ee9f7d75381b67895e3
Author: J. Bruce Fields <[EMAIL PROTECTED]>
Date:   Thu Jul 26 16:30:46 2007 -0400

Use __fpurge to ensure single-line writes to cache files

On a recent Debian/Sid machine, I saw libc retrying stdio writes that
returned write errors.  The result is that if an export downcall returns
an error (which it can in normal operation, since it currently
(incorrectly) returns -ENOENT on any negative downcall), then subsequent
downcalls will write multiple lines (including the original line that
received the error).

The result is that the server fails to respond to any rpc call that
refers to an unexported mount point (such as a readdir of a directory
containing such a mountpoint), so client commands hang.

I don't know whether this libc behavior is correct or expected, but it
seems safest to add the __fpurge() (suggested by Neil) to ensure data is
thrown away.

Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

diff --git a/support/nfs/cacheio.c b/support/nfs/cacheio.c
index a76915b..9d271cd 100644
--- a/support/nfs/cacheio.c
+++ b/support/nfs/cacheio.c
@@ -17,6 +17,7 @@

 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -111,7 +112,18 @@ void qword_printint(FILE *f, int num)

 int qword_eol(FILE *f)
 {
+   int err;
+
fprintf(f,"\n");
+   err = fflush(f);
+   /*
+* We must send one line (and one line only) in a single write
+* call.  In case of a write error, libc may accumulate the
+* unwritten data and try to write it again later, resulting in a
+* multi-line write.  So we must explicitly ask it to throw away
+* any such cached data:
+*/
+   __fpurge(f);
return fflush(f);
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH mm] fix swapoff breakage; however...

2007-09-17 Thread Hugh Dickins

On Tue, 18 Sep 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > 
> > What would make sense is (what I meant when I said swap counted
> > along with RSS) not to count pages out and back in as they are
> > go out to swap and back in, just keep count of instantiated pages
> > 
> 
> I am not sure how you define instantiated pages. I suspect that
> you mean RSS + pages swapped out (swap_pte)?

That's it.  (Whereas file pages counted out when paged out,
then counted back in when paged back in.)

> If a swapoff is going to push a container over it's limit, then
> we break the container and the isolation it provides.

Is it just my traditional bias, that makes me prefer you break
your container than my swapoff?  I'm not sure.

> Upon swapoff
> failure, may be we could get the container to print a nice
> little warning so that anyone else with CAP_SYS_ADMIN can fix the
> container limit and retry swapoff.

And then they hit the next one... rather like trying to work out
the dependencies of packages for oneself: a very tedious process.

If the swapoff succeeds, that does mean there was actually room
in memory (+ other swap) for everyone, even if some have gone over
their nominal limits.  (But if the swapoff runs out of memory in
the middle, yes, it might well have assigned the memory unfairly.)

The appropriate answer may depend on what you do when a container
tries to fault in one more page than its limit.  Apparently just
fail it (no attempt to page out another page from that container).

So, if the whole system is under memory pressure, kswapd will
be keeping the RSS of all tasks low, and they won't reach their
limits; whereas if the system is not under memory pressure,
tasks will easily approach their limits and so fail.

Please tell me my understanding is wrong!

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting

2007-09-17 Thread Andi Kleen


> Handling policy ref counts for hugepages is a bit trickier.
> huge_zonelist() returns a zone list that might come from a 
> shared or vma 'BIND policy.  In this case, we should hold the
> reference until after the huge page allocation in 
> dequeue_hugepage().  The patch modifies huge_zonelist() to
> return a pointer to the mempolicy if it needs to be unref'd
> after allocation.

Acked-by: Andi Kleen <[EMAIL PROTECTED]>

Andrew, can you please queue that for .23?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting

2007-09-17 Thread Andi Kleen


> The patch does require concurrent increments and decrements in the main 
> fault patch. The potential is to create another bouncing cacheline for 
> concurrent faults. This looks like it would cause a performance issue.

While may be true correctness is always more important than performance.
So I think this is the right thing for .23. Any performance improvements
if needed can come later.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[2.6.22.6] nfsd: fh_verify() `malloc failure' with lots of free memory leads to NFS hang

2007-09-17 Thread Nix

Back in early 2006 I reported persistent hangs on the NFS server,
whereby all of a sudden about ten minutes after boot my primary NFS
server would cease responding to NFS requests until it was rebooted.


That time, the problem vanished when I switched to NFS-over-TCP:


Well, I just rebooted --- post-glibc-upgrade from 2.5 to 2.6.1, no
kernel upgrade or anything, so this bug has been latent during at
least my last three weeks of uptime. And it's back. (I've been
seeing strange long pauses doing things like ls, and I suspect
they are related: see below.)

/proc/sys/nfsd/exports on the freezing server:

# Version 1.1
# Path Client(Flags) # IPs
/usr/packages.bin/non-free  
hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=90a98d8a:a8be4806:aea3a4e1:fe3437a0)
/home/.loki.wkstn.nix   
esperi.srvr.nix(rw,root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/home/.loki.wkstn.nix   
hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=8ff32fd3:a6114840:ab968da8:25b41721)
/usr/lib/X11/fonts  
hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xplanet  
hades.wkstn.nix(ro,root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/share/xemacs   
hades.wkstn.nix(rw,no_root_squash,async,wdelay,uuid=87553c5b:d84740fc:b7d9f7e8:4f749689)
/usr/packages   
hades.wkstn.nix(rw,no_root_squash,async,wdelay,no_subtree_check,uuid=2a35f82a:cca144df:a1123587:23527f53)

I turned on ALL-class nfsd debugging and here's what I see as it freezes
up:

Sep 17 22:57:00 loki warning: kernel: nfsd_dispatch: vers 3 proc 1
Sep 17 22:57:00 loki warning: kernel: nfsd: GETATTR(3)  36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab
Sep 17 22:57:00 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:44 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:44 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:44 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:45 loki warning: kernel: nfsd: Dropping request due to malloc 
failure!
Sep 17 22:57:52 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:52 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:52 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:52 loki warning: kernel: nfsd: Dropping request due to malloc 
failure!
Sep 17 22:57:55 loki warning: kernel: nfsd_dispatch: vers 3 proc 4
Sep 17 22:57:55 loki warning: kernel: nfsd: ACCESS(3)   36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab 0x1f
Sep 17 22:57:55 loki warning: kernel: nfsd: fh_verify(36: 01070001 000fb001 
 d32ff38f 404811a6 a88d96ab)
Sep 17 22:57:55 loki warning: kernel: nfsd: Dropping request due to malloc 
failure!
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still 
trying
Sep 17 22:58:50 hades notice: kernel: nfs: server loki not responding, still 
trying
Sep 17 22:58:55 hades notice: kernel: nfs: server loki not responding, still 
trying
Sep 17 22:59:40 hades notice: kernel: nfs: server loki not responding, still 
trying


>From then on, *every* fh_verify() request fails the same way, and
obviously if you can't verify any fds you can't do much with NFS.

Looking back in the log I see intermittent malloc failures starting
almost as soon as I've booted (allowing a couple of minutes for me to
turn debugging on):

Sep 17 22:25:50 hades notice: kernel: nfs: server loki OK
[...]
Sep 17 22:28:09 loki warning: kernel: nfsd_dispatch: vers 3 proc 19
Sep 17 22:28:09 loki warning: kernel: nfsd: FSINFO(3)   28: 00070001 000fb001 
 d32ff38f 404811a6 a88d96ab
Sep 17 22:28:09 loki warning: kernel: nfsd: fh_verify(28: 00070001 000fb001 
 d32ff38f 404811a6 a88d96ab)
Sep 17 22:28:09 loki warning: kernel: nfsd: Dropping request due to malloc 
failure!

A while later we start seeing runs of malloc failures, which I think
correlated with the unexplained pauses in NFS response:

Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 
 5b3c5587 fc4047d8 e8f7d9b7 20480 bytes at 4096
Sep 17 22:33:59 loki warning: kernel: nfsd: fh_verify(44: 02070001 0001ce75 
 5b3c5587 fc4047d8 e8f7d9b7)
Sep 17 22:33:59 loki warning: kernel: nfsd: Dropping request due to malloc 
failure!
Sep 17 22:33:59 loki warning: kernel: nfsd_dispatch: vers 3 proc 6
Sep 17 22:33:59 loki warning: kernel: nfsd: READ(3) 44: 02070001 0001ce75 
 5b3c5587 fc4047d8 e8f7d9b7 28672 bytes at 53248
Sep 17 22:33:59 loki

Add all thread stats for TASKSTATS_CMD_ATTR_TGID (v5)

2007-09-17 Thread Guillaume Chazarain

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
the basic and extended accounting.  With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
of a thread group.  This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V4 (http://lkml.org/lkml/2007/9/15/171):
- Revert gratuitous user interface change (returning exit_code >> 8 instead of
exit_code). Thanks Oleg Nesterov.
- Revert useless heavyweight locking (lock_task_sighand() in fill_tgid_exit).
Thanks Oleg.
- Correctly fill the TGID in taskstats_exit(). Thanks Oleg.

Changelog since V3 (http://lkml.org/lkml/2007/8/31/121):
- Removed userspace example, either it gets accepted in util-linux-ng or I'll
maintain it elsewhere.
- Added kerneldoc for fill_threadgroup() and add_tsk().
- Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton
and Oleg Nesterov.
- Use lock_task_sighand() instead of spin_lock_irqsave(>sighand->siglock)
for consistency with the locking of task->signal->stats in fill_tgid().
- Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg.
- Documented double accounting race seen by Oleg.
- Rephrased the fill_tgid_exit() comment as per Oleg's recommendation.
- Documented the special case for the AFORK ac_flag.
- Use the exit status (code >> 8) instead of the exit code as documented in
Documentation/accounting/taskstats-struct.txt.
- Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as
suggested by Oleg.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads.

Signed-off-by: Guillaume Chazarain <[EMAIL PROTECTED]>
Cc: Balbir Singh <[EMAIL PROTECTED]>
Cc: Jay Lan <[EMAIL PROTECTED]>
Cc: Jonathan Lim <[EMAIL PROTECTED]>
Cc: Oleg Nesterov <[EMAIL PROTECTED]>
---

 include/linux/tsacct_kern.h |   12 ++-
 kernel/taskstats.c  |  135 +-
 kernel/tsacct.c |  113 
 3 files changed, 159 insertions(+), 101 deletions(-)

diff -r 2908770b8fc2 include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h   Sun Sep 16 22:24:49 2007 -0700
+++ b/include/linux/tsacct_kern.h   Tue Aug 28 20:35:27 2007 +0200
@@ -10,17 +10,23 @@
 #include 
 
 #ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
 #else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct 
*tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct 
task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct 
*task)
 {}
 #endif /* CONFIG_TASKSTATS */
 
 #ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
 extern void acct_update_integrals(struct task_struct *tsk);
 extern void acct_clear_integrals(struct task_struct *tsk);
 #else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct 
task_struct *task)
+{}
 static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct 
*p)
 {}
 static inline void acct_update_integrals(struct task_struct *tsk)
diff -r 2908770b8fc2 kernel/taskstats.c
--- a/kernel/taskstats.cSun Sep 16 22:24:49 2007 -0700
+++ b/kernel/taskstats.cMon Sep 17 22:55:04 2007 +0200
@@ -168,6 +168,68 @@ static void send_cpu_listeners(struct sk
up_write(>sem);
 }
 
+/**
+ * fill_threadgroup - initialize some common stats for the thread group
+ * @stats: the taskstats to write into
+ * @task: the thread representing the whole group
+ *
+ * There are two types of taskstats fields when considering a thread group:
+ * - those that can be aggregated from each thread in the group (like CPU
+ * times),
+ * - those that cannot be aggregated (like UID) or are identical (like
+ * memory usage), so are taken from the group leader.
+ * XXX_threadgroup() methods deal with the first type while XXX_add_tsk() with
+ * the second.
+ */
+static void fill_threadgroup(struct taskstats *stats, struct

Re: [PATCH 1/3] IB/ehca: Fix large page HW cap defines

obviously OK...applied.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

 > The IGMP enabling patch posted by me on September 2nd isn't on your list
 > http://lists.openfabrics.org/pipermail/general/2007-September/040250.html
 > can you add it?

Yes, I lost that somehow.  I will add it to my list of things to take
a look at (no opinion yet).

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Mon, 17 Sep 2007, Bernd Schmidt wrote:

> Christoph Lameter wrote:
> > True. That is why we want to limit the number of unmovable allocations and
> > that is why ZONE_MOVABLE exists to limit those. However, unmovable
> > allocations are already rare today. The overwhelming majority of allocations
> > are movable and reclaimable. You can see that f.e. by looking at
> > /proc/meminfo and see how high SUnreclaim: is (does not catch everything but
> > its a good indicator).
> 
> Just to inject another factor into the discussion, please remember that Linux
> also runs on nommu systems, where things like user space allocations are
> neither movable nor reclaimable.

Hmmm However, sorting of the allocations would result in avoiding 
defragmentation to some degree?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Wasting our Freedom

2007-09-17 Thread Ingo Schwarze

Adrian Bunk wrote on Mon, Sep 17, 2007 at 02:57:14PM +0200:

> But stating in your licence that noone has to give back but then 
> complaining to some people on ethical grounds that they should give
> back is simply dishonest.
> 
> Is your intention to allow people to include your code into GPL'ed code 
> and never give back, or is your intention that this shouldn't happen?
> 
> And whatever your intention is should be stated in your licence.

As this is a recurring argument in the present discussion, let's
address it, even though it lies somewhat beside the main topic.
What i wish and what i try to enforce by legal contracts are two
completely different things.  In particular, it is _not_ a smart
idea to try to enforce all one's wishes by legal means.

For example, i wish that as much as possible of the code i write be
freely available such that others can use it, too, and i wish that
others write useful code and make it free such that i can use it.
When i publish code, i wish bugfixes to be fed back to me, and i
hope that others might free their derivative works, too.  Besides,
i might hope that people at large behave in human and rational ways
and refrain from doing harm to others.  In particular i might wish
the fruits of my work not to be abused to harm or oppress people.
Quite probably, lots of software developers share similar wishes,
whatever licenses they happen to be employing.

But this doesn't imply i should be putting any of the above into
the license for my code.  Once people attach additional conditions
to their licences, sooner or later i get stuck when trying to
combine different code covered by different licences.  However well
intentioned, in practice, those additional conditions habitually
turn out to be incompatible - even when, regarded seperately, all
of them might appear to make some sense.

Now doubtless, the two main additional conditions imposed by the GPL -
derivative works may only be distributed if they are made as open and
as free as the original - are among those making the most sense of all
the additional conditions you might imagine, in the sense that nearly
any developer of free software will wish that anybody holding the
copyright on a derivative work would make that free.  Still, when
trying to combine code with different licences, even the GPL at times
turns out to be a bother.  This does not only apply to the case of
non-free closed-source commercial code, but also to cases where
authors intended to make their code free, but, be it by inexperience
or because they failed to restrain themselves, unfortunately added
some uncommon condition to the license.  Combining such code with ISC
or BSD code is hardly ever problem, combining such code with GPL code
may well be.

Thus, even when wishing derivative works to be free in their turn,
i still see a strong theoretical and a strong practical argument to
choose the ISC license over the GPL: Theoretically, it's just the
categorical imperative: If everybody would be adding her or his
favorite condition to her or his license, we would not end up in
free software, but in chaos.  Practically, i'm quite fed up with
GPL license incompatibility issues always popping up at the most
inconvenient places, and still more, with all those license
compatibility discussions.  With the ISC license, there are no
incompatibility issues and no incompatibility discussions, it just
works.  Of course, i lose the option to sue people to open up
derivative works, but i keep the hope that some people (especially
those engaged in free software themselves) understand and keep up
the spirit, and above all, i avoid lots of legalese worries.
Ultimately, it's kind of a trade-off.

To summarize, there are valid reasons to wish that people would make
derivative works free, but to not require it in the license.  Just
like there are valid reasons to wish that people should not use the
code for waging war, but to not require that in the license.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Sun, 16 Sep 2007, Nick Piggin wrote:

> > > fsblock doesn't need any of those hacks, of course.
> >
> > Nor does mine for the low orders that we are considering. For order >
> > MAX_ORDER this is unavoidable since the page allocator cannot manage such
> > large pages. It can be used for lower order if there are issues (that I
> > have not seen yet).
> 
> Or we can just avoid all doubt (and doesn't have arbitrary limitations
> according to what you think might be reasonable or how well the
> system actually behaves).

We can avoid all doubt in this patchset as well by adding support for 
fallback to a vmalloced compound page.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Sun, 16 Sep 2007, Jörn Engel wrote:

> I bet!  My (false) assumption was the same as Goswin's.  If non-movable
> pages are clearly seperated from movable ones and will evict movable
> ones before polluting further mixed superpages, Nick's scenario would be
> nearly infinitely impossible.
> 
> Assumption doesn't reflect current code.  Enforcing this assumption
> would cost extra overhead.  The amount of effort to make Christoph's
> approach work reliably seems substantial and I have no idea whether it
> would be worth it.

My approach is based on Mel's code and is already working the way you 
describe. Page cache allocs are marked __GFP_MOVABLE by Mel's work.

Re: Wasting our Freedom

2007-09-17 Thread Can E. Acar

Theodore Tso wrote:
> On Mon, Sep 17, 2007 at 09:23:41PM +0200, Claudio Jeker wrote:
>> Because they put their copyright plus license on code that they barely
>> modified. If they would have added substantial work into the OpenHAL code
>> and by doing that creating something new I would not say much.
> 
> Number 1, some of the Linux wireless developers screwed up earlier
> versions.  No denying that, the problems were pointed out during the
> patch reviewed problem, AND THEY WERE FIXED.

Not all, see below:

> Number 2, if you take a look at their latest set of changes (which
> have still not been accepted), the HAL code is under a pure BSD
> license (ath5k_hw.c).  Other portions are dual licensed, but not the
> HAL --- if people would only take a look at
> 
> http://git.kernel.org/?p=linux/kernel/git/linville/wireless-dev.git;a=tree;f=drivers/net/wireless;h=2d6caeba0924c34b9539960b9ab568ab3d193fc8;hb=everything
> 

from latest ath5k_hw.c:

* Copyright (c) 2004-2007 Reyk Floeter <[EMAIL PROTECTED]>
* Copyright (c) 2006-2007 Nick Kossifidis <[EMAIL PROTECTED]>
* Copyright (c) 2007 Jiri Slaby <[EMAIL PROTECTED]>
[snip rest of BSD license]

The only remaining issue is whether Nick & Jiri have enough
original contributions to the code to be added to the Copyright.

I believe this needs to be resolved between Reyk and Nick and Jiri.

The main reason of Theo's message, linked earlier, was the
lack of response on this issue. It seems that the SFLC is
dismissing this issue, and thus stalling its resolution by the
developers.

The rest is, as you say, history.

Can

-- 
In theory, there is no difference between theory and practice.
But, in practice, there is.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Sun, 16 Sep 2007, Nick Piggin wrote:

> I don't know how it would prevent fragmentation from building up
> anyway. It's commonly the case that potentially unmovable objects
> are allowed to fill up all of ram (dentries, inodes, etc).

Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from 
ZONE_MOVABLE and thus the memory that can be allocated for them is 
limited.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Userspace tuner

2007-09-17 Thread Bill Davidsen


Dâniel Fraga wrote:

Well, I'd like to see Linus' opinion about this, because while
programmers keep discussing this, users are waiting forever... so if
Markus has a concrete and better solution, why don't use it?

And as far as I know, Markus is the programmer who is most
interested in this code. I didn't see anybody else in the world doing
his work...

And I always had a impression that if most of things could be
done in user space, than it will be better (for example, devfs -> udev).
Why do everything in kernel space? Lets put *less* code in the kernel,
not more code. And besides that, code in user space can be changed
easily. Code in kernel has to wait a long time for Linus to accept (*if*
he accepts).

The problem with user space drivers is that it encourages binary only 
drivers, drivers which work only for a limited set of hardware, and 
other means to reduce choice for the user. There's a reason why binary 
modules make the kernel tainted, I have to feel that this is more and 
worse of same.


Linus will have an opinion, no doubt.

--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Badari Pulavarty

On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > jbd/jbd2: Replace slab allocations with page cache allocations
> > 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > 
> > JBD should not pass slab pages down to the block layer.
> > Use page allocator pages instead. This will also prepare
> > JBD for the large blocksize patchset.
> > 
> 
> Currently memory allocation for committed_data(and frozen_buffer) for
> bufferhead is done through jbd slab management, as Christoph Hellwig
> pointed out that this is broken as jbd should not pass slab pages down
> to IO layer. and suggested to use get_free_pages() directly.
> 
> The problem with this patch, as Andreas Dilger pointed today in ext4
> interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> 1/3-1/2 page space. 
> 
> What was the originally intention to set up slabs for committed_data(and
> frozen_buffer) in JBD? Why not using kmalloc?
> 
> Mingming

Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] selinux: Improving SELinux read/write performance

2007-09-17 Thread James Morris

On Mon, 17 Sep 2007, Stephen Smalley wrote:

> > It reduces the selinux overhead on read/write by only revalidating
> > permissions in selinux_file_permission if the task or inode labels have
> > changed or the policy has changed since the open-time check.  A new LSM
> > hook, security_dentry_open, is added to capture the necessary state at
> > open time to allow this optimization.
> > 
> > Signed-off-by: Yuichi Nakamura<[EMAIL PROTECTED]>
> 
> Thanks, looks good.
> 
> Acked-by:  Stephen Smalley <[EMAIL PROTECTED]>

Applied to 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6.git#for-akpm


-- 
James Morris
<[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [kvm-devel] [PATCH 2/3] Refactor hypercall infrastructure (v3)

2007-09-17 Thread Nakajima, Jun

Anthony Liguori wrote:
> This patch refactors the current hypercall infrastructure to better
support
> live 
> migration and SMP.  It eliminates the hypercall page by trapping the
UD
> exception that would occur if you used the wrong hypercall instruction
for the
> underlying architecture and replacing it with the right one lazily.
> 
> It also introduces the infrastructure to probe for hypercall available
via
> CPUID leaves 0x4000.  CPUID leaf 0x4001 should be filled out
by
> userspace.
> 
> A fall-out of this patch is that the unhandled hypercalls no longer
trap to
> userspace.  There is very little reason though to use a hypercall to
> communicate 
> with userspace as PIO or MMIO can be used.  There is no code in tree
that uses
> userspace hypercalls.
> 
> Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>
> 
> diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
> index 3b29256..cc5dfb4 100644
> --- a/include/linux/kvm_para.h
> +++ b/include/linux/kvm_para.h
> @@ -1,73 +1,110 @@
>  #ifndef __LINUX_KVM_PARA_H
>  #define __LINUX_KVM_PARA_H
> 
> -/*
> - * Guest OS interface for KVM paravirtualization
> - *
> - * Note: this interface is totally experimental, and is certain to
change
> - *   as we make progress.
> +/* This CPUID returns the signature 'KVMKVMKVM' in ebx, ecx, and edx.
It
> + * should be used to determine that a VM is running under KVM.



> +
> +static inline int kvm_para_available(void)
> +{
> + unsigned int eax, ebx, ecx, edx;
> + char signature[13];
> +
> + cpuid(KVM_CPUID_SIGNATURE, , , , );
> + memcpy(signature + 0, , 4);
> + memcpy(signature + 4, , 4);
> + memcpy(signature + 8, , 4);
> + signature[12] = 0;
> +
> + if (strcmp(signature, "KVMKVMKVM") == 0)

> + return 1;
> +
> + return 0;
> +}

I think we should compare 12 characters (not just 9, as far as my eyes
tell), and can we use some cute one, like "FantasticKVM"? ;-)

Jun
---
Intel Open Source Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 11/11] eCryptfs: Replace magic numbers

Replace some magic numbers with sizeof() equivalents.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index 3b3cf27..425a144 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -1426,10 +1426,10 @@ static int parse_header_metadata(struct 
ecryptfs_crypt_stat *crypt_stat,
u32 header_extent_size;
u16 num_header_extents_at_front;
 
-   memcpy(_extent_size, virt, 4);
+   memcpy(_extent_size, virt, sizeof(u32));
header_extent_size = be32_to_cpu(header_extent_size);
-   virt += 4;
-   memcpy(_header_extents_at_front, virt, 2);
+   virt += sizeof(u32);
+   memcpy(_header_extents_at_front, virt, sizeof(u16));
num_header_extents_at_front = be16_to_cpu(num_header_extents_at_front);
crypt_stat->num_header_extents_at_front =
(int)num_header_extents_at_front;
-- 
1.5.1.6

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 10/11] eCryptfs: Remove unused functions and kmem_cache

The switch to read_write.c routines and the persistent file make a
number of functions unnecessary. This patch removes them.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c  |  150 --
 fs/ecryptfs/ecryptfs_kernel.h |   21 +---
 fs/ecryptfs/file.c|   28 
 fs/ecryptfs/main.c|5 -
 fs/ecryptfs/mmap.c|  336 -
 5 files changed, 1 insertions(+), 539 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index b3014d7..3b3cf27 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -353,119 +353,6 @@ out:
return rc;
 }
 
-static void
-ecryptfs_extent_to_lwr_pg_idx_and_offset(unsigned long *lower_page_idx,
-int *byte_offset,
-struct ecryptfs_crypt_stat *crypt_stat,
-unsigned long extent_num)
-{
-   unsigned long lower_extent_num;
-   int extents_occupied_by_headers_at_front;
-   int bytes_occupied_by_headers_at_front;
-   int extent_offset;
-   int extents_per_page;
-
-   bytes_occupied_by_headers_at_front =
-   (crypt_stat->extent_size
-* crypt_stat->num_header_extents_at_front);
-   extents_occupied_by_headers_at_front =
-   ( bytes_occupied_by_headers_at_front
- / crypt_stat->extent_size );
-   lower_extent_num = extents_occupied_by_headers_at_front + extent_num;
-   extents_per_page = PAGE_CACHE_SIZE / crypt_stat->extent_size;
-   (*lower_page_idx) = lower_extent_num / extents_per_page;
-   extent_offset = lower_extent_num % extents_per_page;
-   (*byte_offset) = extent_offset * crypt_stat->extent_size;
-   ecryptfs_printk(KERN_DEBUG, " * crypt_stat->extent_size = "
-   "[%d]\n", crypt_stat->extent_size);
-   ecryptfs_printk(KERN_DEBUG, " * crypt_stat->"
-   "num_header_extents_at_front = [%d]\n",
-   crypt_stat->num_header_extents_at_front);
-   ecryptfs_printk(KERN_DEBUG, " * extents_occupied_by_headers_at_"
-   "front = [%d]\n", extents_occupied_by_headers_at_front);
-   ecryptfs_printk(KERN_DEBUG, " * lower_extent_num = [0x%.16x]\n",
-   lower_extent_num);
-   ecryptfs_printk(KERN_DEBUG, " * extents_per_page = [%d]\n",
-   extents_per_page);
-   ecryptfs_printk(KERN_DEBUG, " * (*lower_page_idx) = [0x%.16x]\n",
-   (*lower_page_idx));
-   ecryptfs_printk(KERN_DEBUG, " * extent_offset = [%d]\n",
-   extent_offset);
-   ecryptfs_printk(KERN_DEBUG, " * (*byte_offset) = [%d]\n",
-   (*byte_offset));
-}
-
-static int ecryptfs_write_out_page(struct ecryptfs_page_crypt_context *ctx,
-  struct page *lower_page,
-  struct inode *lower_inode,
-  int byte_offset_in_page, int bytes_to_write)
-{
-   int rc = 0;
-
-   if (ctx->mode == ECRYPTFS_PREPARE_COMMIT_MODE) {
-   rc = ecryptfs_commit_lower_page(lower_page, lower_inode,
-   ctx->param.lower_file,
-   byte_offset_in_page,
-   bytes_to_write);
-   if (rc) {
-   ecryptfs_printk(KERN_ERR, "Error calling lower "
-   "commit; rc = [%d]\n", rc);
-   goto out;
-   }
-   } else {
-   rc = ecryptfs_writepage_and_release_lower_page(lower_page,
-  lower_inode,
-  ctx->param.wbc);
-   if (rc) {
-   ecryptfs_printk(KERN_ERR, "Error calling lower "
-   "writepage(); rc = [%d]\n", rc);
-   goto out;
-   }
-   }
-out:
-   return rc;
-}
-
-static int ecryptfs_read_in_page(struct ecryptfs_page_crypt_context *ctx,
-struct page **lower_page,
-struct inode *lower_inode,
-unsigned long lower_page_idx,
-int byte_offset_in_page)
-{
-   int rc = 0;
-
-   if (ctx->mode == ECRYPTFS_PREPARE_COMMIT_MODE) {
-   /* TODO: Limit this to only the data extents that are
-* needed */
-   rc = ecryptfs_get_lower_page(lower_page, lower_inode,
-ctx->param.lower_file,
-lower_page_idx,
-byte_offset_in_page,
-

[PATCH 9/11] eCryptfs: Initialize persistent lower file on inode create

Initialize persistent lower file on inode create.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/super.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
index b97e210..f8cdab2 100644
--- a/fs/ecryptfs/super.c
+++ b/fs/ecryptfs/super.c
@@ -47,15 +47,16 @@ struct kmem_cache *ecryptfs_inode_info_cache;
  */
 static struct inode *ecryptfs_alloc_inode(struct super_block *sb)
 {
-   struct ecryptfs_inode_info *ecryptfs_inode;
+   struct ecryptfs_inode_info *inode_info;
struct inode *inode = NULL;
 
-   ecryptfs_inode = kmem_cache_alloc(ecryptfs_inode_info_cache,
- GFP_KERNEL);
-   if (unlikely(!ecryptfs_inode))
+   inode_info = kmem_cache_alloc(ecryptfs_inode_info_cache, GFP_KERNEL);
+   if (unlikely(!inode_info))
goto out;
-   ecryptfs_init_crypt_stat(_inode->crypt_stat);
-   inode = _inode->vfs_inode;
+   ecryptfs_init_crypt_stat(_info->crypt_stat);
+   mutex_init(_info->lower_file_mutex);
+   inode_info->lower_file = NULL;
+   inode = _info->vfs_inode;
 out:
return inode;
 }
-- 
1.5.1.6

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 8/11] eCryptfs: Convert mmap functions to use persistent file

Convert readpage, prepare_write, and commit_write to use read_write.c
routines. Remove sync_page; I cannot think of a good reason for
implementing that in eCryptfs.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/mmap.c |  199 +++-
 1 files changed, 103 insertions(+), 96 deletions(-)

diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 60e635e..dd68dd3 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -267,9 +267,78 @@ static void set_header_info(char *page_virt,
 }
 
 /**
+ * ecryptfs_copy_up_encrypted_with_header
+ * @page: Sort of a ``virtual'' representation of the encrypted lower
+ *file. The actual lower file does not have the metadata in
+ *the header.
+ * @crypt_stat: The eCryptfs inode's cryptographic context
+ *
+ * The ``view'' is the version of the file that userspace winds up
+ * seeing, with the header information inserted.
+ */
+static int
+ecryptfs_copy_up_encrypted_with_header(struct page *page,
+  struct ecryptfs_crypt_stat *crypt_stat)
+{
+   loff_t extent_num_in_page = 0;
+   loff_t num_extents_per_page = (PAGE_CACHE_SIZE
+  / crypt_stat->extent_size);
+   int rc = 0;
+
+   while (extent_num_in_page < num_extents_per_page) {
+   loff_t view_extent_num = ((page->index * num_extents_per_page)
+ + extent_num_in_page);
+
+   if (view_extent_num < crypt_stat->num_header_extents_at_front) {
+   /* This is a header extent */
+   char *page_virt;
+
+   page_virt = kmap_atomic(page, KM_USER0);
+   memset(page_virt, 0, PAGE_CACHE_SIZE);
+   /* TODO: Support more than one header extent */
+   if (view_extent_num == 0) {
+   rc = ecryptfs_read_xattr_region(
+   page_virt, page->mapping->host);
+   set_header_info(page_virt, crypt_stat);
+   }
+   kunmap_atomic(page_virt, KM_USER0);
+   flush_dcache_page(page);
+   if (rc) {
+   ClearPageUptodate(page);
+   printk(KERN_ERR "%s: Error reading xattr "
+  "region; rc = [%d]\n", __FUNCTION__, rc);
+   goto out;
+   }
+   SetPageUptodate(page);
+   } else {
+   /* This is an encrypted data extent */
+   loff_t lower_offset =
+   ((view_extent_num -
+ crypt_stat->num_header_extents_at_front)
+* crypt_stat->extent_size);
+
+   rc = ecryptfs_read_lower_page_segment(
+   page, (lower_offset >> PAGE_CACHE_SHIFT),
+   (lower_offset & ~PAGE_CACHE_MASK),
+   crypt_stat->extent_size, page->mapping->host);
+   if (rc) {
+   printk(KERN_ERR "%s: Error attempting to read "
+  "extent at offset [%lld] in the lower "
+  "file; rc = [%d]\n", __FUNCTION__,
+  lower_offset, rc);
+   goto out;
+   }
+   }
+   extent_num_in_page++;
+   }
+out:
+   return rc;
+}
+
+/**
  * ecryptfs_readpage
- * @file: This is an ecryptfs file
- * @page: ecryptfs associated page to stick the read data into
+ * @file: An eCryptfs file
+ * @page: Page from eCryptfs inode mapping into which to stick the read data
  *
  * Read in a page, decrypting if necessary.
  *
@@ -277,59 +346,35 @@ static void set_header_info(char *page_virt,
  */
 static int ecryptfs_readpage(struct file *file, struct page *page)
 {
+   struct ecryptfs_crypt_stat *crypt_stat =
+   
_inode_to_private(file->f_path.dentry->d_inode)->crypt_stat;
int rc = 0;
-   struct ecryptfs_crypt_stat *crypt_stat;
 
-   BUG_ON(!(file && file->f_path.dentry && file->f_path.dentry->d_inode));
-   crypt_stat = _inode_to_private(file->f_path.dentry->d_inode)
-   ->crypt_stat;
if (!crypt_stat
|| !(crypt_stat->flags & ECRYPTFS_ENCRYPTED)
|| (crypt_stat->flags & ECRYPTFS_NEW_FILE)) {
ecryptfs_printk(KERN_DEBUG,
"Passing through unencrypted page\n");
-   rc = ecryptfs_do_readpage(file, page, page->index);
-   if (rc) {
-   ecryptfs_printk(KERN_ERR, "Error reading page; rc = "
-

[PATCH 7/11] eCryptfs: Make open, truncate, and setattr use persistent file

Rather than open a new lower file for every eCryptfs file that is
opened, truncated, or setattr'd, instead use the existing lower
persistent file for the eCryptfs inode. Change truncate to use
read_write.c functions. Change ecryptfs_getxattr() to use the common
ecryptfs_getxattr_lower() function.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c |2 +-
 fs/ecryptfs/file.c   |   50 --
 fs/ecryptfs/inode.c  |  113 +++---
 3 files changed, 44 insertions(+), 121 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index 6b4d310..b3014d7 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -1674,7 +1674,7 @@ out:
 /**
  * ecryptfs_read_xattr_region
  * @page_virt: The vitual address into which to read the xattr data
- * @ecryptfs_dentry: The eCryptfs dentry
+ * @ecryptfs_inode: The eCryptfs inode
  *
  * Attempts to read the crypto metadata from the extended attribute
  * region of the lower file.
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index df70bfa..95be9a9 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -187,11 +187,7 @@ static int ecryptfs_open(struct inode *inode, struct file 
*file)
/* Private value of ecryptfs_dentry allocated in
 * ecryptfs_lookup() */
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(ecryptfs_dentry);
-   struct inode *lower_inode = NULL;
-   struct file *lower_file = NULL;
-   struct vfsmount *lower_mnt;
struct ecryptfs_file_info *file_info;
-   int lower_flags;
 
mount_crypt_stat = _superblock_to_private(
ecryptfs_dentry->d_sb)->mount_crypt_stat;
@@ -219,26 +215,12 @@ static int ecryptfs_open(struct inode *inode, struct file 
*file)
if (!(crypt_stat->flags & ECRYPTFS_POLICY_APPLIED)) {
ecryptfs_printk(KERN_DEBUG, "Setting flags for stat...\n");
/* Policy code enabled in future release */
-   crypt_stat->flags |= ECRYPTFS_POLICY_APPLIED;
-   crypt_stat->flags |= ECRYPTFS_ENCRYPTED;
+   crypt_stat->flags |= (ECRYPTFS_POLICY_APPLIED
+ | ECRYPTFS_ENCRYPTED);
}
mutex_unlock(_stat->cs_mutex);
-   lower_flags = file->f_flags;
-   if ((lower_flags & O_ACCMODE) == O_WRONLY)
-   lower_flags = (lower_flags & O_ACCMODE) | O_RDWR;
-   if (file->f_flags & O_APPEND)
-   lower_flags &= ~O_APPEND;
-   lower_mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
-   /* Corresponding fput() in ecryptfs_release() */
-   rc = ecryptfs_open_lower_file(_file, lower_dentry, lower_mnt,
- lower_flags);
-   if (rc) {
-   ecryptfs_printk(KERN_ERR, "Error opening lower file\n");
-   goto out_puts;
-   }
-   ecryptfs_set_file_lower(file, lower_file);
-   /* Isn't this check the same as the one in lookup? */
-   lower_inode = lower_dentry->d_inode;
+   ecryptfs_set_file_lower(
+   file, ecryptfs_inode_to_private(inode)->lower_file);
if (S_ISDIR(ecryptfs_dentry->d_inode->i_mode)) {
ecryptfs_printk(KERN_DEBUG, "This is a directory\n");
crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
@@ -260,7 +242,7 @@ static int ecryptfs_open(struct inode *inode, struct file 
*file)
   "and plaintext passthrough mode is not "
   "enabled; returning -EIO\n");
mutex_unlock(_stat->cs_mutex);
-   goto out_puts;
+   goto out_free;
}
rc = 0;
crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
@@ -272,11 +254,8 @@ static int ecryptfs_open(struct inode *inode, struct file 
*file)
ecryptfs_printk(KERN_DEBUG, "inode w/ addr = [0x%p], i_ino = [0x%.16x] "
"size: [0x%.16x]\n", inode, inode->i_ino,
i_size_read(inode));
-   ecryptfs_set_file_lower(file, lower_file);
goto out;
-out_puts:
-   mntput(lower_mnt);
-   dput(lower_dentry);
+out_free:
kmem_cache_free(ecryptfs_file_info_cache,
ecryptfs_file_to_private(file));
 out:
@@ -296,20 +275,9 @@ static int ecryptfs_flush(struct file *file, fl_owner_t td)
 
 static int ecryptfs_release(struct inode *inode, struct file *file)
 {
-   struct file *lower_file = ecryptfs_file_to_lower(file);
-   struct ecryptfs_file_info *file_info = ecryptfs_file_to_private(file);
-   struct inode *lower_inode = ecryptfs_inode_to_lower(inode);
-   int rc;
-
-   rc = ecryptfs_close_lower_file(lower_file);
-   if (rc) {
-   printk(KERN_ERR "Error closing lower_file\n");
-   goto out;
-   }
-   inode->i_blocks =

Re: [PATCH] add consts where appropriate in sound/pci/hda/*

2007-09-17 Thread Denys Vlasenko

On Monday 17 September 2007 11:01, Takashi Iwai wrote:
> > There is a lot of data structures in that code,
> > and most of them seems to be read-only.
> > 
> > I added const modifiers to most of such places:
> > 
> >textdata bss dec hex filename
> >  106315  179564  36  285915   45cdb snd-hda-intel.o
> >  2830512624  36  285711   45c0f snd-hda-intel_patched.o
> > 
> > Patch is attached.
> > 
> > It moves "static struct hda_codec_preset *hda_preset_tables[]"
> > from hda_patch.h to hda_codec.c, and then adds
> > #include "hda_patch.h"
> > in a few .c files so that definitions of e.g.
> > const struct hda_codec_preset snd_hda_preset_analog[]
> > are checked to match declarations in hda_patch.h
> > 
> > The rest of the patch (bulk of it) adds "const"
> > in many places.
> > 
> > Patch is compile tested. Please apply.
> > 
> > Signed-off-by: Denys Vlasenko <[EMAIL PROTECTED]>
> 
> Sorry for the late reply.
> 
> First, thanks for your patch.  Although I have also a similar patch
> pending on my tree, but it wasn't applied, because I'd like to mark
> these functions/data rather as __devinit*.  And, sadly, init and const
> don't like with each other.

Unless we will go to the pains of implementing __devrodata,
which doesn't sound encouraging.

> So, my plan is to apply __devinit but 
> without const.

Yes, I see. const as it stands is not very useful in kernel anyway
(only a small code reduction sometimes).
ro or rw, the data is still taking space.

Well, maybe someday ld will be sooo clever that it will actually
merge rodata which is identical, but so far it is not implemented.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 6/11] eCryptfs: Update metadata read/write functions

Update the metadata read/write functions and grow_file() to use the
read_write.c routines. Do not open another lower file; use the
persistent lower file instead. Provide a separate function for
crypto.c::ecryptfs_read_xattr_region() to get to the lower xattr
without having to go through the eCryptfs getxattr.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c  |  126 +++--
 fs/ecryptfs/ecryptfs_kernel.h |   15 +++--
 fs/ecryptfs/file.c|2 +-
 fs/ecryptfs/inode.c   |  101 +++--
 fs/ecryptfs/mmap.c|2 +-
 5 files changed, 113 insertions(+), 133 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index d6a0680..6b4d310 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -1344,21 +1344,28 @@ out:
return rc;
 }
 
-int ecryptfs_read_and_validate_header_region(char *data, struct dentry *dentry,
-struct vfsmount *mnt)
+int ecryptfs_read_and_validate_header_region(char *data,
+struct inode *ecryptfs_inode)
 {
+   struct ecryptfs_crypt_stat *crypt_stat =
+   &(ecryptfs_inode_to_private(ecryptfs_inode)->crypt_stat);
int rc;
 
-   rc = ecryptfs_read_header_region(data, dentry, mnt);
-   if (rc)
+   rc = ecryptfs_read_lower(data, 0, crypt_stat->extent_size,
+ecryptfs_inode);
+   if (rc) {
+   printk(KERN_ERR "%s: Error reading header region; rc = [%d]\n",
+  __FUNCTION__, rc);
goto out;
-   if (!contains_ecryptfs_marker(data + ECRYPTFS_FILE_SIZE_BYTES))
+   }
+   if (!contains_ecryptfs_marker(data + ECRYPTFS_FILE_SIZE_BYTES)) {
rc = -EINVAL;
+   ecryptfs_printk(KERN_DEBUG, "Valid marker not found\n");
+   }
 out:
return rc;
 }
 
-
 void
 ecryptfs_write_header_metadata(char *virt,
   struct ecryptfs_crypt_stat *crypt_stat,
@@ -1443,24 +1450,18 @@ static int ecryptfs_write_headers_virt(char *page_virt, 
size_t *size,
 
 static int
 ecryptfs_write_metadata_to_contents(struct ecryptfs_crypt_stat *crypt_stat,
-   struct file *lower_file, char *page_virt)
+   struct dentry *ecryptfs_dentry,
+   char *page_virt)
 {
-   mm_segment_t oldfs;
int current_header_page;
int header_pages;
-   ssize_t size;
-   int rc = 0;
+   int rc;
 
-   lower_file->f_pos = 0;
-   oldfs = get_fs();
-   set_fs(get_ds());
-   size = vfs_write(lower_file, (char __user *)page_virt, PAGE_CACHE_SIZE,
-_file->f_pos);
-   if (size < 0) {
-   rc = (int)size;
-   printk(KERN_ERR "Error attempting to write lower page; "
-  "rc = [%d]\n", rc);
-   set_fs(oldfs);
+   if ((rc = ecryptfs_write_lower(ecryptfs_dentry->d_inode, page_virt,
+  0, PAGE_CACHE_SIZE))) {
+   printk(KERN_ERR "%s: Error attempting to write header "
+  "information to lower file; rc = [%d]\n", __FUNCTION__,
+  rc);
goto out;
}
header_pages = ((crypt_stat->extent_size
@@ -1469,18 +1470,19 @@ ecryptfs_write_metadata_to_contents(struct 
ecryptfs_crypt_stat *crypt_stat,
memset(page_virt, 0, PAGE_CACHE_SIZE);
current_header_page = 1;
while (current_header_page < header_pages) {
-   size = vfs_write(lower_file, (char __user *)page_virt,
-PAGE_CACHE_SIZE, _file->f_pos);
-   if (size < 0) {
-   rc = (int)size;
-   printk(KERN_ERR "Error attempting to write lower page; "
-  "rc = [%d]\n", rc);
-   set_fs(oldfs);
+   loff_t offset;
+
+   offset = (current_header_page << PAGE_CACHE_SHIFT);
+   if ((rc = ecryptfs_write_lower(ecryptfs_dentry->d_inode,
+  page_virt, offset,
+  PAGE_CACHE_SIZE))) {
+   printk(KERN_ERR "%s: Error attempting to write header "
+  "information to lower file; rc = [%d]\n",
+  __FUNCTION__, rc);
goto out;
}
current_header_page++;
}
-   set_fs(oldfs);
 out:
return rc;
 }
@@ -1500,7 +1502,6 @@ ecryptfs_write_metadata_to_xattr(struct dentry 
*ecryptfs_dentry,
 /**
  * ecryptfs_write_metadata
  * @ecryptfs_dentry: The eCryptfs dentry
- * @lower_file: The lower file struct, which was returned from dentry_open
  *
  * Write the file headers out.

[PATCH 5/11] eCryptfs: Set up and destroy persistent lower file

This patch sets up and destroys the persistent lower file for each
eCryptfs inode.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/inode.c |   23 +++---
 fs/ecryptfs/main.c  |   65 +++
 fs/ecryptfs/super.c |   22 +++--
 3 files changed, 103 insertions(+), 7 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 7192a81..c746b5d 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -119,10 +119,23 @@ ecryptfs_do_create(struct inode *directory_inode,
}
rc = ecryptfs_create_underlying_file(lower_dir_dentry->d_inode,
 ecryptfs_dentry, mode, nd);
-   if (unlikely(rc)) {
-   ecryptfs_printk(KERN_ERR,
-   "Failure to create underlying file\n");
-   goto out_lock;
+   if (rc) {
+   struct inode *ecryptfs_inode = ecryptfs_dentry->d_inode;
+   struct ecryptfs_inode_info *inode_info =
+   ecryptfs_inode_to_private(ecryptfs_inode);
+
+   printk(KERN_WARNING "%s: Error creating underlying file; "
+  "rc = [%d]; checking for existing\n", __FUNCTION__, rc);
+   if (inode_info) {
+   mutex_lock(_info->lower_file_mutex);
+   if (!inode_info->lower_file) {
+   mutex_unlock(_info->lower_file_mutex);
+   printk(KERN_ERR "%s: Failure to set underlying "
+  "file; rc = [%d]\n", __FUNCTION__, rc);
+   goto out_lock;
+   }
+   mutex_unlock(_info->lower_file_mutex);
+   }
}
rc = ecryptfs_interpose(lower_dentry, ecryptfs_dentry,
directory_inode->i_sb, 0);
@@ -252,6 +265,8 @@ ecryptfs_create(struct inode *directory_inode, struct 
dentry *ecryptfs_dentry,
 {
int rc;
 
+   /* ecryptfs_do_create() calls ecryptfs_interpose(), which opens
+* the crypt_stat->lower_file (persistent file) */
rc = ecryptfs_do_create(directory_inode, ecryptfs_dentry, mode, nd);
if (unlikely(rc)) {
ecryptfs_printk(KERN_WARNING, "Failed to create file in"
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 967bad0..3e324f8 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -98,6 +98,64 @@ void __ecryptfs_printk(const char *fmt, ...)
 }
 
 /**
+ * ecryptfs_init_persistent_file
+ * @ecryptfs_dentry: Fully initialized eCryptfs dentry object, with
+ *   the lower dentry and the lower mount set
+ *
+ * eCryptfs only ever keeps a single open file for every lower
+ * inode. All I/O operations to the lower inode occur through that
+ * file. When the first eCryptfs dentry that interposes with the first
+ * lower dentry for that inode is created, this function creates the
+ * persistent file struct and associates it with the eCryptfs
+ * inode. When the eCryptfs inode is destroyed, the file is closed.
+ *
+ * The persistent file will be opened with read/write permissions, if
+ * possible. Otherwise, it is opened read-only.
+ *
+ * This function does nothing if a lower persistent file is already
+ * associated with the eCryptfs inode.
+ *
+ * Returns zero on success; non-zero otherwise
+ */
+int ecryptfs_init_persistent_file(struct dentry *ecryptfs_dentry)
+{
+   struct ecryptfs_inode_info *inode_info =
+   ecryptfs_inode_to_private(ecryptfs_dentry->d_inode);
+   int rc = 0;
+
+   mutex_lock(_info->lower_file_mutex);
+   if (!inode_info->lower_file) {
+   struct dentry *lower_dentry;
+   struct vfsmount *lower_mnt =
+   ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
+
+   lower_dentry = ecryptfs_dentry_to_lower(ecryptfs_dentry);
+   /* Corresponding dput() and mntput() are done when the
+* persistent file is fput() when the eCryptfs inode
+* is destroyed. */
+   dget(lower_dentry);
+   mntget(lower_mnt);
+   inode_info->lower_file = dentry_open(lower_dentry,
+lower_mnt,
+(O_RDWR | O_LARGEFILE));
+   if (IS_ERR(inode_info->lower_file))
+   inode_info->lower_file = dentry_open(lower_dentry,
+lower_mnt,
+(O_RDONLY
+ | O_LARGEFILE));
+   if (IS_ERR(inode_info->lower_file)) {
+   printk(KERN_ERR "Error opening lower persistent file "
+  "for lower_dentry [0x%p] and lower_mnt [0x%p]\n",

[PATCH 4/11] eCryptfs: Replace encrypt, decrypt, and inode size write

Replace page encryption and decryption routines and inode size write
routine with versions that utilize the read_write.c functions.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c  |  427 ++--
 fs/ecryptfs/ecryptfs_kernel.h |   14 +-
 fs/ecryptfs/inode.c   |   12 +-
 fs/ecryptfs/mmap.c|  131 -
 fs/ecryptfs/read_write.c  |   12 +-
 5 files changed, 290 insertions(+), 306 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index 5d8a553..b829d3c 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -467,8 +467,91 @@ out:
 }
 
 /**
+ * ecryptfs_lower_offset_for_extent
+ *
+ * Convert an eCryptfs page index into a lower byte offset
+ */
+void ecryptfs_lower_offset_for_extent(loff_t *offset, loff_t extent_num,
+ struct ecryptfs_crypt_stat *crypt_stat)
+{
+   (*offset) = ((crypt_stat->extent_size
+ * crypt_stat->num_header_extents_at_front)
++ (crypt_stat->extent_size * extent_num));
+}
+
+/**
+ * ecryptfs_encrypt_extent
+ * @enc_extent_page: Allocated page into which to encrypt the data in
+ *   @page
+ * @crypt_stat: crypt_stat containing cryptographic context for the
+ *  encryption operation
+ * @page: Page containing plaintext data extent to encrypt
+ * @extent_offset: Page extent offset for use in generating IV
+ *
+ * Encrypts one extent of data.
+ *
+ * Return zero on success; non-zero otherwise
+ */
+static int ecryptfs_encrypt_extent(struct page *enc_extent_page,
+  struct ecryptfs_crypt_stat *crypt_stat,
+  struct page *page,
+  unsigned long extent_offset)
+{
+   unsigned long extent_base;
+   char extent_iv[ECRYPTFS_MAX_IV_BYTES];
+   int rc;
+
+   extent_base = (page->index
+  * (PAGE_CACHE_SIZE / crypt_stat->extent_size));
+   rc = ecryptfs_derive_iv(extent_iv, crypt_stat,
+   (extent_base + extent_offset));
+   if (rc) {
+   ecryptfs_printk(KERN_ERR, "Error attempting to "
+   "derive IV for extent [0x%.16x]; "
+   "rc = [%d]\n", (extent_base + extent_offset),
+   rc);
+   goto out;
+   }
+   if (unlikely(ecryptfs_verbosity > 0)) {
+   ecryptfs_printk(KERN_DEBUG, "Encrypting extent "
+   "with iv:\n");
+   ecryptfs_dump_hex(extent_iv, crypt_stat->iv_bytes);
+   ecryptfs_printk(KERN_DEBUG, "First 8 bytes before "
+   "encryption:\n");
+   ecryptfs_dump_hex((char *)
+ (page_address(page)
+  + (extent_offset * crypt_stat->extent_size)),
+ 8);
+   }
+   rc = ecryptfs_encrypt_page_offset(crypt_stat, enc_extent_page, 0,
+ page, (extent_offset
+* crypt_stat->extent_size),
+ crypt_stat->extent_size, extent_iv);
+   if (rc < 0) {
+   printk(KERN_ERR "%s: Error attempting to encrypt page with "
+  "page->index = [%ld], extent_offset = [%ld]; "
+  "rc = [%d]\n", __FUNCTION__, page->index, extent_offset,
+  rc);
+   goto out;
+   }
+   rc = 0;
+   if (unlikely(ecryptfs_verbosity > 0)) {
+   ecryptfs_printk(KERN_DEBUG, "Encrypt extent [0x%.16x]; "
+   "rc = [%d]\n", (extent_base + extent_offset),
+   rc);
+   ecryptfs_printk(KERN_DEBUG, "First 8 bytes after "
+   "encryption:\n");
+   ecryptfs_dump_hex((char *)(page_address(enc_extent_page)), 8);
+   }
+out:
+   return rc;
+}
+
+/**
  * ecryptfs_encrypt_page
- * @ctx: The context of the page
+ * @page: Page mapped from the eCryptfs inode for the file; contains
+ *decrypted content that needs to be encrypted (to a temporary
+ *page; not in place) and written out to the lower file
  *
  * Encrypt an eCryptfs page. This is done on a per-extent basis. Note
  * that eCryptfs pages may straddle the lower pages -- for instance,
@@ -478,128 +561,121 @@ out:
  * file, 24K of page 0 of the lower file will be read and decrypted,
  * and then 8K of page 1 of the lower file will be read and decrypted.
  *
- * The actual operations performed on each page depends on the
- * contents of the ecryptfs_page_crypt_context struct.
- *
  * Returns zero on success; negative on error
  */
-int ecryptfs_encrypt_page(struct ecryptfs_page_crypt_context *ctx)
+int ecryptfs_encrypt_page(struct

[PATCH 3/11] eCryptfs: read_write.c routines

Add a set of functions through which all I/O to lower files is
consolidated. This patch adds a new inode_info reference to a
persistent lower file for each eCryptfs inode; another patch later in
this series will set that up. This persistent lower file is what the
read_write.c functions use to call vfs_read() and vfs_write() on the
lower filesystem, so even when reads and writes come in through
aops->readpage and aops->writepage, we can satisfy them without
resorting to direct access to the lower inode's address space.
Several function declarations are going to be changing with this
patchset. For now, in order to keep from breaking the build, I am
putting dummy parameters in for those functions.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/Makefile  |2 +-
 fs/ecryptfs/ecryptfs_kernel.h |   18 ++
 fs/ecryptfs/mmap.c|2 +-
 fs/ecryptfs/read_write.c  |  359 +
 4 files changed, 379 insertions(+), 2 deletions(-)
 create mode 100644 fs/ecryptfs/read_write.c

diff --git a/fs/ecryptfs/Makefile b/fs/ecryptfs/Makefile
index 1f11072..7688570 100644
--- a/fs/ecryptfs/Makefile
+++ b/fs/ecryptfs/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_ECRYPT_FS) += ecryptfs.o
 
-ecryptfs-objs := dentry.o file.o inode.o main.o super.o mmap.o crypto.o 
keystore.o messaging.o netlink.o debug.o
+ecryptfs-objs := dentry.o file.o inode.o main.o super.o mmap.o read_write.o 
crypto.o keystore.o messaging.o netlink.o debug.o
diff --git a/fs/ecryptfs/ecryptfs_kernel.h b/fs/ecryptfs/ecryptfs_kernel.h
index a618ab7..e6a68a8 100644
--- a/fs/ecryptfs/ecryptfs_kernel.h
+++ b/fs/ecryptfs/ecryptfs_kernel.h
@@ -260,6 +260,8 @@ struct ecryptfs_crypt_stat {
 struct ecryptfs_inode_info {
struct inode vfs_inode;
struct inode *wii_inode;
+   struct file *lower_file;
+   struct mutex lower_file_mutex;
struct ecryptfs_crypt_stat crypt_stat;
 };
 
@@ -653,5 +655,21 @@ int ecryptfs_keyring_auth_tok_for_sig(struct key 
**auth_tok_key,
  char *sig);
 int ecryptfs_write_zeros(struct file *file, pgoff_t index, int start,
 int num_zeros);
+int ecryptfs_write_lower(struct inode *ecryptfs_inode, char *data,
+loff_t offset, size_t size);
+int ecryptfs_write_lower_page_segment(struct inode *ecryptfs_inode,
+ struct page *page_for_lower,
+ size_t offset_in_page, size_t size);
+int ecryptfs_write(struct file *ecryptfs_file, char *data, loff_t offset,
+  size_t size);
+int ecryptfs_read_lower(char *data, loff_t offset, size_t size,
+   struct inode *ecryptfs_inode);
+int ecryptfs_read_lower_page_segment(struct page *page_for_ecryptfs,
+pgoff_t page_index,
+size_t offset_in_page, size_t size,
+struct inode *ecryptfs_inode);
+int ecryptfs_read(char *data, loff_t offset, size_t size,
+ struct file *ecryptfs_file);
+struct page *ecryptfs_get1page(struct file *file, loff_t index);
 
 #endif /* #ifndef ECRYPTFS_KERNEL_H */
diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
index 307f7ee..0c53320 100644
--- a/fs/ecryptfs/mmap.c
+++ b/fs/ecryptfs/mmap.c
@@ -44,7 +44,7 @@ struct kmem_cache *ecryptfs_lower_page_cache;
  * Returns unlocked and up-to-date page (if ok), with increased
  * refcnt.
  */
-static struct page *ecryptfs_get1page(struct file *file, int index)
+struct page *ecryptfs_get1page(struct file *file, loff_t index)
 {
struct dentry *dentry;
struct inode *inode;
diff --git a/fs/ecryptfs/read_write.c b/fs/ecryptfs/read_write.c
new file mode 100644
index 000..e59c94a
--- /dev/null
+++ b/fs/ecryptfs/read_write.c
@@ -0,0 +1,359 @@
+/**
+ * eCryptfs: Linux filesystem encryption layer
+ *
+ * Copyright (C) 2007 International Business Machines Corp.
+ *   Author(s): Michael A. Halcrow <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+ * 02111-1307, USA.
+ */
+
+#include 
+#include 
+#include "ecryptfs_kernel.h"
+
+/**
+ * ecryptfs_write_lower
+ * @ecryptfs_inode: The eCryptfs inode
+ * @data: Data to write
+ * @offset: Byte offset in the lower file

Re: [PATCH] modpost: detect unterminated device id lists

2007-09-17 Thread Andrew Morton

On Tue, 18 Sep 2007 03:15:14 +0530 (IST)
Satyam Sharma <[EMAIL PROTECTED]> wrote:

> 
> 
> On Sun, 16 Sep 2007, Andrew Morton wrote:
> 
> > On Mon, 17 Sep 2007 05:54:45 +0530 "Satyam Sharma" <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > On 9/17/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I'm getting this:
> > > >
> > > > rusb2/pvrusb2: struct usb_device_id is 20 bytes.  The last of 3 is:
> > > > 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
> > > > 0x00
> > > > 0x00 0x00 0x00 0x00 0x00
> > > > FATAL: drivers/media/video/pvrusb2/pvrusb2: struct usb_device_id is not 
> > > > terminated
> > > > with a NULL entry!
> > > >
> > > > ("rusb2/pvrusb2" ??)
> > > 
> > > Hmm? Are you sure you didn't see any "drivers/media/video/pv" before the
> > > "rusb2/pvrusb2" bit?
> > 
> > Fairly.  I looked twice.
> 
> "drivers/media/video/pvrusb2/pvrusb2" comes out correctly here ...
> 
> 
> > > Looking at Kees' patch (and the existing code), I've no
> > > clue how/why this should happen ... will try to reproduce here ...
> > > 
> > > 
> > > > but:
> > > >
> > > > struct usb_device_id pvr2_device_table[] = {
> > > > [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
> > > > [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> > > > { USB_DEVICE(0, 0) },
> > > > };
> > > >
> > > > looks OK?
> > > >
> > > > Using plain old "{ }" shut the warning up.
> > > 
> > > USB_DEVICE(0, 0) is not empty termination, actually, and this looks like
> > > a genuine bug caught by the patch. As that dump shows, USB_DEVICE(0, 0)
> > > assigns "0x03 0x00" (in little endian) to usb_device_id.match_flags. And
> > > I don't think the USB code treats such an entry as an empty entry (?)
> > > 
> > > Interestingly, the "USB_DEVICE(0, 0)" thing is absent from latest -git
> > > tree and also in my copy of 23-rc4-mm1 -- so this looks like something
> > > you must've merged recently.
> > 
> > git-dvb very carefully does
> > 
> > --- a/drivers/media/video/pvrusb2/pvrusb2-hdw.c~git-dvb
> > +++ a/drivers/media/video/pvrusb2/pvrusb2-hdw.c
> > @@ -44,7 +44,7 @@
> >  struct usb_device_id pvr2_device_table[] = {
> > [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
> > [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> > -   { }
> > +   { USB_DEVICE(0, 0) },
> > };
> >  
> > MODULE_DEVICE_TABLE(usb, pvr2_device_table);
> 
> Ok, this is a false positive indeed, the core USB code does in fact
> treat such an entry as an empty entry (usb_match_id() tests only the
> .idVendor, .bDeviceClass, .bInterfaceClass and .driver_info members
> for non-zero and not the .match_flags member).
> 
> However, a quick-grep-and-glance tells us that none of the other 2213
> occurrences of USB_DEVICE() in the tree ever do this "(0,0)" thing,
> so it does make sense to change this one to a simple "{ }" as well --
> that's clearer style anyway, and the "standard" way to empty-terminate
> in the rest of the tree, if nothing else.
> 

yeah, I think so.  Mauro, could you please drop that change?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/11] eCryptfs: Remove assignments in if-statements

Remove assignments in if-statements.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c|   17 --
 fs/ecryptfs/file.c  |8 --
 fs/ecryptfs/inode.c |   35 ++
 fs/ecryptfs/keystore.c  |   55 +-
 fs/ecryptfs/main.c  |   28 ++-
 fs/ecryptfs/messaging.c |5 ++-
 fs/ecryptfs/mmap.c  |5 ++-
 7 files changed, 89 insertions(+), 64 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index 3dbb21a..5d8a553 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -1277,8 +1277,8 @@ static int ecryptfs_read_header_region(char *data, struct 
dentry *dentry,
mm_segment_t oldfs;
int rc;
 
-   if ((rc = ecryptfs_open_lower_file(_file, dentry, mnt,
-  O_RDONLY))) {
+   rc = ecryptfs_open_lower_file(_file, dentry, mnt, O_RDONLY);
+   if (rc) {
printk(KERN_ERR
   "Error opening lower_file to read header region\n");
goto out;
@@ -1289,7 +1289,8 @@ static int ecryptfs_read_header_region(char *data, struct 
dentry *dentry,
rc = lower_file->f_op->read(lower_file, (char __user *)data,
  ECRYPTFS_DEFAULT_EXTENT_SIZE, _file->f_pos);
set_fs(oldfs);
-   if ((rc = ecryptfs_close_lower_file(lower_file))) {
+   rc = ecryptfs_close_lower_file(lower_file);
+   if (rc) {
printk(KERN_ERR "Error closing lower_file\n");
goto out;
}
@@ -1951,9 +1952,10 @@ ecryptfs_add_new_key_tfm(struct ecryptfs_key_tfm 
**key_tfm, char *cipher_name,
strncpy(tmp_tfm->cipher_name, cipher_name,
ECRYPTFS_MAX_CIPHER_NAME_SIZE);
tmp_tfm->key_size = key_size;
-   if ((rc = ecryptfs_process_key_cipher(_tfm->key_tfm,
- tmp_tfm->cipher_name,
- _tfm->key_size))) {
+   rc = ecryptfs_process_key_cipher(_tfm->key_tfm,
+tmp_tfm->cipher_name,
+_tfm->key_size);
+   if (rc) {
printk(KERN_ERR "Error attempting to initialize key TFM "
   "cipher with name = [%s]; rc = [%d]\n",
   tmp_tfm->cipher_name, rc);
@@ -1988,7 +1990,8 @@ int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct 
crypto_blkcipher **tfm,
}
}
mutex_unlock(_tfm_list_mutex);
-   if ((rc = ecryptfs_add_new_key_tfm(_tfm, cipher_name, 0))) {
+   rc = ecryptfs_add_new_key_tfm(_tfm, cipher_name, 0);
+   if (rc) {
printk(KERN_ERR "Error adding new key_tfm to list; rc = [%d]\n",
   rc);
goto out;
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 12ba7e3..59c846d 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -230,8 +230,9 @@ static int ecryptfs_open(struct inode *inode, struct file 
*file)
lower_flags &= ~O_APPEND;
lower_mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
/* Corresponding fput() in ecryptfs_release() */
-   if ((rc = ecryptfs_open_lower_file(_file, lower_dentry, lower_mnt,
-  lower_flags))) {
+   rc = ecryptfs_open_lower_file(_file, lower_dentry, lower_mnt,
+ lower_flags);
+   if (rc) {
ecryptfs_printk(KERN_ERR, "Error opening lower file\n");
goto out_puts;
}
@@ -300,7 +301,8 @@ static int ecryptfs_release(struct inode *inode, struct 
file *file)
struct inode *lower_inode = ecryptfs_inode_to_lower(inode);
int rc;
 
-   if ((rc = ecryptfs_close_lower_file(lower_file))) {
+   rc = ecryptfs_close_lower_file(lower_file);
+   if (rc) {
printk(KERN_ERR "Error closing lower_file\n");
goto out;
}
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index abac91c..d70f599 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -202,8 +202,9 @@ static int ecryptfs_initialize_file(struct dentry 
*ecryptfs_dentry)
lower_flags = ((O_CREAT | O_TRUNC) & O_ACCMODE) | O_RDWR;
lower_mnt = ecryptfs_dentry_to_lower_mnt(ecryptfs_dentry);
/* Corresponding fput() at end of this function */
-   if ((rc = ecryptfs_open_lower_file(_file, lower_dentry, lower_mnt,
-  lower_flags))) {
+   rc = ecryptfs_open_lower_file(_file, lower_dentry, lower_mnt,
+ lower_flags);
+   if (rc) {
ecryptfs_printk(KERN_ERR,
"Error opening dentry; rc = [%i]\n", rc);
goto out;
@@ -229,7 +230,8 @@ static int ecryptfs_initialize_file(struct dentry 
*ecryptfs_dentry)

[PATCH 1/11] eCryptfs: Remove header_extent_size

There is no point to keeping a separate header_extent_size and an
extent_size. The total size of the header can always be represented as
some multiple of the regular data extent size.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>
---
 fs/ecryptfs/crypto.c  |   40 
 fs/ecryptfs/ecryptfs_kernel.h |   39 +++
 fs/ecryptfs/inode.c   |7 ---
 fs/ecryptfs/mmap.c|2 +-
 4 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
index 8e9b36d..3dbb21a 100644
--- a/fs/ecryptfs/crypto.c
+++ b/fs/ecryptfs/crypto.c
@@ -366,8 +366,8 @@ ecryptfs_extent_to_lwr_pg_idx_and_offset(unsigned long 
*lower_page_idx,
int extents_per_page;
 
bytes_occupied_by_headers_at_front =
-   ( crypt_stat->header_extent_size
- * crypt_stat->num_header_extents_at_front );
+   (crypt_stat->extent_size
+* crypt_stat->num_header_extents_at_front);
extents_occupied_by_headers_at_front =
( bytes_occupied_by_headers_at_front
  / crypt_stat->extent_size );
@@ -376,8 +376,8 @@ ecryptfs_extent_to_lwr_pg_idx_and_offset(unsigned long 
*lower_page_idx,
(*lower_page_idx) = lower_extent_num / extents_per_page;
extent_offset = lower_extent_num % extents_per_page;
(*byte_offset) = extent_offset * crypt_stat->extent_size;
-   ecryptfs_printk(KERN_DEBUG, " * crypt_stat->header_extent_size = "
-   "[%d]\n", crypt_stat->header_extent_size);
+   ecryptfs_printk(KERN_DEBUG, " * crypt_stat->extent_size = "
+   "[%d]\n", crypt_stat->extent_size);
ecryptfs_printk(KERN_DEBUG, " * crypt_stat->"
"num_header_extents_at_front = [%d]\n",
crypt_stat->num_header_extents_at_front);
@@ -899,15 +899,17 @@ void ecryptfs_set_default_sizes(struct 
ecryptfs_crypt_stat *crypt_stat)
crypt_stat->extent_size = ECRYPTFS_DEFAULT_EXTENT_SIZE;
set_extent_mask_and_shift(crypt_stat);
crypt_stat->iv_bytes = ECRYPTFS_DEFAULT_IV_BYTES;
-   if (PAGE_CACHE_SIZE <= ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE) {
-   crypt_stat->header_extent_size =
-   ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE;
-   } else
-   crypt_stat->header_extent_size = PAGE_CACHE_SIZE;
if (crypt_stat->flags & ECRYPTFS_METADATA_IN_XATTR)
crypt_stat->num_header_extents_at_front = 0;
-   else
-   crypt_stat->num_header_extents_at_front = 1;
+   else {
+   if (PAGE_CACHE_SIZE <= ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE)
+   crypt_stat->num_header_extents_at_front =
+   (ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE
+/ crypt_stat->extent_size);
+   else
+   crypt_stat->num_header_extents_at_front =
+   (PAGE_CACHE_SIZE / crypt_stat->extent_size);
+   }
 }
 
 /**
@@ -1319,7 +1321,7 @@ ecryptfs_write_header_metadata(char *virt,
u32 header_extent_size;
u16 num_header_extents_at_front;
 
-   header_extent_size = (u32)crypt_stat->header_extent_size;
+   header_extent_size = (u32)crypt_stat->extent_size;
num_header_extents_at_front =
(u16)crypt_stat->num_header_extents_at_front;
header_extent_size = cpu_to_be32(header_extent_size);
@@ -1415,7 +1417,7 @@ ecryptfs_write_metadata_to_contents(struct 
ecryptfs_crypt_stat *crypt_stat,
set_fs(oldfs);
goto out;
}
-   header_pages = ((crypt_stat->header_extent_size
+   header_pages = ((crypt_stat->extent_size
 * crypt_stat->num_header_extents_at_front)
/ PAGE_CACHE_SIZE);
memset(page_virt, 0, PAGE_CACHE_SIZE);
@@ -1532,17 +1534,16 @@ static int parse_header_metadata(struct 
ecryptfs_crypt_stat *crypt_stat,
virt += 4;
memcpy(_header_extents_at_front, virt, 2);
num_header_extents_at_front = be16_to_cpu(num_header_extents_at_front);
-   crypt_stat->header_extent_size = (int)header_extent_size;
crypt_stat->num_header_extents_at_front =
(int)num_header_extents_at_front;
-   (*bytes_read) = 6;
+   (*bytes_read) = (sizeof(u32) + sizeof(u16));
if ((validate_header_size == ECRYPTFS_VALIDATE_HEADER_SIZE)
-   && ((crypt_stat->header_extent_size
+   && ((crypt_stat->extent_size
 * crypt_stat->num_header_extents_at_front)
< ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE)) {
rc = -EINVAL;
-   ecryptfs_printk(KERN_WARNING, "Invalid header extent size: "
-   "[%d]\n", crypt_stat->header_extent_size);
+   printk(KERN_WARNING "Invalid number

[PATCH 0/11] eCryptfs: Introduce persistent lower files for each eCryptfs inode

Currently, eCryptfs directly accesses the lower inode address space,
doing things like grab_cache_page() on lower_inode->i_mapping. It
really should not do that. The main point of this patch set is to make
all I/O with the lower files go through vfs_read() and vfs_write()
instead.

In order to accomplish this, eCryptfs needs a way to call vfs_read()
and vfs_write() on the lower file when ecryptfs_aops->readpage() and
ecryptfs_aops->writepage() are called. I propose keeping a persistent
lower file around for each eCryptfs inode. This is the only lower file
that eCryptfs will open for any given eCryptfs inode; multiple
eCryptfs files may map to this one persistent lower file. When the
eCrypfs inode is destroyed, this persistent lower file is closed.

Consolidating all reads and writes to the lower file to a single
execution path simplifies the code. This should also make it easier to
port eCryptfs to use the asynchronous crypto API functions. Note that
this patch set also removes all direct calls to lower prepare_write()
and commite_write(), fixing an oops when mounted on NFS.

Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: InfiniBand/RDMA merge plans for 2.6.24

 > > IPoIB CM handles this properly by gathering together single pages in
 > > skbs' fragment lists.

 > Then can we reuse IPoIB CM code here?

Yes, if possible, refactoring things so that the rx skb allocation
code becomes common between CM and non-CM would definitely make sense.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] revert ath5k ioread32()/iowrite32() usage - use readl()/writel(), we're MMIO-only

2007-09-17 Thread Jiri Slaby

On 09/17/2007 10:59 PM, Jeff Garzik wrote:
> Jiri Slaby wrote:
>> NACK, this is wrong. iomap returns platform dependant return value,
>> which may or
> 
> Incorrect.  readl() and writel() work just fine on all existing
> platforms where Atheros may be used.

Ok, this is what Alan Cox wrote about that and you didn't reply to it, so I
thought he's right. Anyway I wouldn't rely on iomap that it will never be
changed even on x86 -- what's the (performance) impact of having ioread instead
of readl? How much data are transferred this way?

http://lkml.org/lkml/2007/8/25/50;>
On Sat, 25 Aug 2007 04:56:19 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> If the driver knows its MMIO, using readX/writeX after pci_iomap() is
> just fine, for all current implementations, and it makes sense that way.

There is nothing that guarantees this is permitted, any more than there
is anything saying not to use outb/outl. Some of the implementations do
quite strange things. It may happen to work but its not in the
documentation or the comments.

If you want to change this then you need to check the existing usages and
update all the docs if its safe, oh and tell the sparc64 pcmcia people to
take a hike, which is probably not a big problem.

Please, can anybody clarify it?

thanks,
-- 
Jiri Slaby ([EMAIL PROTECTED])
Faculty of Informatics, Masaryk University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.23 alpha unistd.h changes

2007-09-17 Thread Adrian Bunk

On Mon, Sep 17, 2007 at 10:33:07PM +0200, Oliver Falk wrote:
> Hi!

Hi Oliver!

>...
> As these additions are quite new to upstream kernel, but at Alphacore we
> have patched it since a while now (I don't know about other Alpha ports;
> Debian folks may speak up now!), I would suggest to use the same
> 'ordering' of the syscalls upstream and add the new syscalls that we had
> not in place, but are now upstream to the end of our 'old' list.
>...

I just checked:

It seems Debian didn't patch them into the kernel at all, and since two 
months Debian unstable ships kernel 2.6.22 with the upstream syscall 
numbers.

> Best,
>  Oliver

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] modpost: detect unterminated device id lists



On Sun, 16 Sep 2007, Andrew Morton wrote:

> On Mon, 17 Sep 2007 05:54:45 +0530 "Satyam Sharma" <[EMAIL PROTECTED]> wrote:
> 
> > On 9/17/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm getting this:
> > >
> > > rusb2/pvrusb2: struct usb_device_id is 20 bytes.  The last of 3 is:
> > > 0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> > > 0x00 0x00 0x00 0x00 0x00
> > > FATAL: drivers/media/video/pvrusb2/pvrusb2: struct usb_device_id is not 
> > > terminated
> > > with a NULL entry!
> > >
> > > ("rusb2/pvrusb2" ??)
> > 
> > Hmm? Are you sure you didn't see any "drivers/media/video/pv" before the
> > "rusb2/pvrusb2" bit?
> 
> Fairly.  I looked twice.

"drivers/media/video/pvrusb2/pvrusb2" comes out correctly here ...


> > Looking at Kees' patch (and the existing code), I've no
> > clue how/why this should happen ... will try to reproduce here ...
> > 
> > 
> > > but:
> > >
> > > struct usb_device_id pvr2_device_table[] = {
> > > [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
> > > [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> > > { USB_DEVICE(0, 0) },
> > > };
> > >
> > > looks OK?
> > >
> > > Using plain old "{ }" shut the warning up.
> > 
> > USB_DEVICE(0, 0) is not empty termination, actually, and this looks like
> > a genuine bug caught by the patch. As that dump shows, USB_DEVICE(0, 0)
> > assigns "0x03 0x00" (in little endian) to usb_device_id.match_flags. And
> > I don't think the USB code treats such an entry as an empty entry (?)
> > 
> > Interestingly, the "USB_DEVICE(0, 0)" thing is absent from latest -git
> > tree and also in my copy of 23-rc4-mm1 -- so this looks like something
> > you must've merged recently.
> 
> git-dvb very carefully does
> 
> --- a/drivers/media/video/pvrusb2/pvrusb2-hdw.c~git-dvb
> +++ a/drivers/media/video/pvrusb2/pvrusb2-hdw.c
> @@ -44,7 +44,7 @@
>  struct usb_device_id pvr2_device_table[] = {
>   [PVR2_HDW_TYPE_29XXX] = { USB_DEVICE(0x2040, 0x2900) },
>   [PVR2_HDW_TYPE_24XXX] = { USB_DEVICE(0x2040, 0x2400) },
> -   { }
> +   { USB_DEVICE(0, 0) },
> };
>
> MODULE_DEVICE_TABLE(usb, pvr2_device_table);

Ok, this is a false positive indeed, the core USB code does in fact
treat such an entry as an empty entry (usb_match_id() tests only the
.idVendor, .bDeviceClass, .bInterfaceClass and .driver_info members
for non-zero and not the .match_flags member).

However, a quick-grep-and-glance tells us that none of the other 2213
occurrences of USB_DEVICE() in the tree ever do this "(0,0)" thing,
so it does make sense to change this one to a simple "{ }" as well --
that's clearer style anyway, and the "standard" way to empty-terminate
in the rest of the tree, if nothing else.


Satyam
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ofa-general] [PATCH] [WORKAROUND] CONFIG_PREEMPT_RT and ib_umad_close() issue

Thanks for the explanation...

 > But basically, with CONFIG_PREEMPT_RT enabled, the lock points, such as
 > aqcuiring a spinlock, potentially become places where the current task
 > may be context switched out / preempted.
 > 
 > Therefore, when a call is made to lock a spinlock for example, the
 > caller should not currently have irqs disabled, or preemption disabled,
 > since a context switch may occur.

this doesn't seem relevant here...

 > void fastcall rt_downgrade_write(struct rw_semaphore *rwsem)
 > {
 > BUG();
 > }

this seems to be the problem... the -rt patch turns downgrade_write()
into a BUG().

I need to look at the locking in user_mad.c again, but I think it may
be possible to replace both places that do downgrade_write() with
up_write() followed by down_read().

 - R.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 18/33] containers implement namespace tracking subsystem

2007-09-17 Thread Paul Menage

From: "Serge E. Hallyn" <[EMAIL PROTECTED]>
(container->cgroup renaming by Paul Menage <[EMAIL PROTECTED]>)

When a task enters a new namespace via a clone() or unshare(), a new cgroup
is created and the task moves into it.

This version names cgroups which are automatically created using
cgroup_clone() as "node_" where pid is the pid of the unsharing or
cloned process.  (Thanks Pavel for the idea) This is safe because if the
process unshares again, it will create

/cgroups/(...)/node_/node_

The only possibilities (AFAICT) for a -EEXIST on unshare are

1. pid wraparound
2. a process fails an unshare, then tries again.

Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
node_ will be empty and can be rmdir'ed to make the subsequent unshare()
succeed.

Changelog:
Name cloned cgroups as "node_".

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
Signed-off-by: Paul Menage <[EMAIL PROTECTED]>

---

 include/linux/cgroup_subsys.h |6 +
 include/linux/nsproxy.h  |7 ++
 init/Kconfig |9 ++
 kernel/Makefile  |1 
 kernel/ns_cgroup.c|  100 +
 kernel/nsproxy.c |   17 
 6 files changed, 139 insertions(+), 1 deletion(-)

diff -puN 
include/linux/cgroup_subsys.h~cgroups-implement-namespace-tracking-subsystem 
include/linux/cgroup_subsys.h
--- 
a/include/linux/cgroup_subsys.h~cgroups-implement-namespace-tracking-subsystem
+++ a/include/linux/cgroup_subsys.h
@@ -24,3 +24,9 @@ SUBSYS(debug)
 #endif
 
 /* */
+
+#ifdef CONFIG_CGROUP_NS
+SUBSYS(ns)
+#endif
+
+/* */
diff -puN 
include/linux/nsproxy.h~cgroups-implement-namespace-tracking-subsystem 
include/linux/nsproxy.h
--- a/include/linux/nsproxy.h~cgroups-implement-namespace-tracking-subsystem
+++ a/include/linux/nsproxy.h
@@ -55,4 +55,11 @@ static inline void exit_task_namespaces(
put_nsproxy(ns);
}
 }
+
+#ifdef CONFIG_CGROUP_NS
+int ns_cgroup_clone(struct task_struct *tsk);
+#else
+static inline int ns_cgroup_clone(struct task_struct *tsk) { return 0; }
+#endif
+
 #endif
diff -puN init/Kconfig~cgroups-implement-namespace-tracking-subsystem 
init/Kconfig
--- a/init/Kconfig~cgroups-implement-namespace-tracking-subsystem
+++ a/init/Kconfig
@@ -323,6 +323,15 @@ config SYSFS_DEPRECATED
  If you are using a distro that was released in 2006 or later,
  it should be safe to say N here.
 
+config CGROUP_NS
+bool "Namespace cgroup subsystem"
+select CGROUPS
+help
+  Provides a simple namespace cgroup subsystem to
+  provide hierarchical naming of sets of namespaces,
+  for instance virtual servers and checkpoint/restart
+  jobs.
+
 config PROC_PID_CPUSET
bool "Include legacy /proc//cpuset file"
depends on CPUSETS
diff -puN kernel/Makefile~cgroups-implement-namespace-tracking-subsystem 
kernel/Makefile
--- a/kernel/Makefile~cgroups-implement-namespace-tracking-subsystem
+++ a/kernel/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
+obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff -puN /dev/null kernel/ns_cgroup.c
--- /dev/null
+++ a/kernel/ns_cgroup.c
@@ -0,0 +1,100 @@
+/*
+ * ns_cgroup.c - namespace cgroup subsystem
+ *
+ * Copyright 2006, 2007 IBM Corp
+ */
+
+#include 
+#include 
+#include 
+
+struct ns_cgroup {
+   struct cgroup_subsys_state css;
+   spinlock_t lock;
+};
+
+struct cgroup_subsys ns_subsys;
+
+static inline struct ns_cgroup *cgroup_to_ns(
+   struct cgroup *cgroup)
+{
+   return container_of(cgroup_subsys_state(cgroup, ns_subsys_id),
+   struct ns_cgroup, css);
+}
+
+int ns_cgroup_clone(struct task_struct *task)
+{
+   return cgroup_clone(task, _subsys);
+}
+
+/*
+ * Rules:
+ *   1. you can only enter a cgroup which is a child of your current
+ * cgroup
+ *   2. you can only place another process into a cgroup if
+ * a. you have CAP_SYS_ADMIN
+ * b. your cgroup is an ancestor of task's destination cgroup
+ *   (hence either you are in the same cgroup as task, or in an
+ *ancestor cgroup thereof)
+ */
+static int ns_can_attach(struct cgroup_subsys *ss,
+   struct cgroup *new_cgroup, struct task_struct *task)
+{
+   struct cgroup *orig;
+
+   if (current != task) {
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   if (!cgroup_is_descendant(new_cgroup))
+   return -EPERM;
+   }
+
+   if (atomic_read(_cgroup->count) != 0)
+   return -EPERM;
+
+   orig = task_cgroup(task, ns_subsys_id);
+   if (orig && orig != new_cgroup->parent)
+

Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting

On Mon, 17 Sep 2007, Lee Schermerhorn wrote:

> Only for vma policy, right?  show_numa_maps() isn't a performance path,
> and shared policies are already reference counted--just not unref'd!

Right.

> I do have some ideas for enhancements to memtoy to test vma policies in
> a multi-threaded task.  I have the basic multi-threading infrastructure
> that binds threads to cpus, allocates node local stacks, thread state
> structs, ... in my mmtrace tool that I can probably hack for use in
> memtoy to provoke cacheline bouncing of the mem policy.  But, if pft
> does the trick, I won't rush the memtoy enhancments...

Well pft is old and limited in what it can do. I'd be glad if you could 
put it into memtoy. Then it may perhaps be useful in the future.

> Meanwhile, we do have a mem policy ref counting bug in the mainline.

But we have had this ref counting issue forever with no ill effect. Memory 
policies were designed to have almost no overhead for the default 
allocation paths. Incrementing and decrementing refcounters makes that 
design no longer light weight as it was intended to be.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Configurable reclaim batch size