Re: [PATCH] target: Update copyright ownership to 2012

2012-11-10 Thread Kyle Moffett
On Fri, Nov 9, 2012 at 3:00 PM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
> This patch to update copyright year to current for principal target core
> ownership is now being pushed into target-pending/for-next.

Pardon me, but you were just publicly accused of violating the GPL, so
your response is to send a patch removing the copyright notices of all
other organizations from the SCSI-target code?  Have you obtained
ownership of all the relevant copyrights for Linux-iSCSI.org, PyX
Technologies, Inc, and SBE, Inc?  If not, then this patch is an
attempted violation of those organizations' copyrights and of the GPL
(which requires that you preserve copyright notices).

Further, while these notices are the only ones listed in those files,
the parties they name are not the only ones outside of RisingTide
Systems that have a significant copyright interest in this code.  If
your goal is to obtain exclusive copyright ownership over this code
then there are a great many other people you must contact and convince
first.

I would encourage you to talk privately with the Software Freedom
Conservancy before sending more patches of this nature.

Cheers,
Kyle Moffett

> diff --git a/drivers/target/target_core_alua.c 
> b/drivers/target/target_core_alua.c
> - * Copyright (c) 2009-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_configfs.c 
> b/drivers/target/target_core_configfs.c
> - * Copyright (c) 2008-2011 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_device.c 
> b/drivers/target/target_core_device.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005-2006 SBE, Inc.  All Rights Reserved.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_fabric_configfs.c 
> b/drivers/target/target_core_fabric_configfs.c
> - * Copyright (c) 2010,2011 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_fabric_lib.c 
> b/drivers/target/target_core_fabric_lib.c
> - * Copyright (c) 2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_file.c 
> b/drivers/target/target_core_file.c
> - * Copyright (c) 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005-2006 SBE, Inc.  All Rights Reserved.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_hba.c 
> b/drivers/target/target_core_hba.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_iblock.c 
> b/drivers/target/target_core_iblock.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
> - * Copyright (c) 2009, 2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_pscsi.c 
> b/drivers/target/target_core_pscsi.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_rd.c b/drivers/target/target_core_rd.c
> - * Copyright (c) 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_sbc.c 
> b/drivers/target/target_core_sbc.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_spc.c 
> b/drivers/target/target_core_spc.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_stat.c 
> b/drivers/target/target_core_stat.c
> - * Copyright (c) 2011 Linux-iSCSI.org
> - * Copyright (c) 2006-2007 SBE, Inc.  All Rights Reserved.
> diff --git a/drivers/target/target_core_tmr.c 
> b/drivers/target/target_core_tmr.c
> - * Copyright (c) 2009,2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_tpg.c 
> b/drivers/target/target_core_tpg.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_transport.c 
> b/drivers/target/target_core_transport.c
> - * Copyright (c) 2002, 2003, 2004, 2005 PyX Technologies, Inc.
> - * Copyright (c) 2005, 2006, 2007 SBE, Inc.
> - * Copyright (c) 2008-2010 Linux-iSCSI.org
> diff --git a/drivers/target/target_core_ua.c b/drivers/target/target_core_ua.c
> - * Copyright (c) 2009,2010 Linux-iSCSI.org

[RFC] Best method to control a "transmit-only" mode on fiber NICs (specifically sky2)

2008-02-15 Thread Kyle Moffett
Hi,

The company I'm working for has an unusual fiber NIC configuration
that we use for one of our network appliances.  We connect only a
single fiber from the TX port on one NIC to the RX port on another
NIC, providing a physically-one-way path for enhanced security.
Unfortunately this doesn't work with most NIC drivers, as even with
auto-negotiation off they look for link probe pulses before they
consider the link "up" and are willing to send packets.  We have been
able to use Myricom 10GigE NICs with a custom firmware image.  More
recently we have patched the sky2 driver to turn on the FIB_FORCE_LNK
flag in the PHY control register; this seems to work on the
Marvell-chipset boards we have here.

What would be the preferred way to control this "force link" flag?
Right now we are accessing it using ethtool; we have added an
additional "duplex" mode: "DUPLEX_TXONLY", with a value of 2.  When
you specify a speed and turn off autonegotiation ("./patched-ethtool
-s eth2 speed 1000 autoneg off duplex txonly"), it will turn on the
specified bit in the PHY control register and the link will
automatically come up.  We also have one related bug-fix^Wdirty hack
for sky2 to reset the PHY a second time during netif-up after enabling
interrupts; otherwise the immediate "link up" interrupt gets lost.
Once I get approval from the company I will post the patch itself for
review.
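
As a rough illustration of the interface described above, here is a minimal user-space sketch in C. DUPLEX_HALF and DUPLEX_FULL use the values from <linux/ethtool.h>; DUPLEX_TXONLY with value 2 is the extra mode from this post and is not a mainline constant. The helper names parse_duplex() and should_force_link() are hypothetical and are not taken from ethtool or the sky2 driver.

```c
#include <assert.h>
#include <string.h>

/* DUPLEX_HALF and DUPLEX_FULL mirror the values in <linux/ethtool.h>.
 * DUPLEX_TXONLY = 2 is the additional mode described in the post; it is
 * an assumption of this sketch, not a mainline constant. */
enum duplex_mode {
	DUPLEX_HALF   = 0,
	DUPLEX_FULL   = 1,
	DUPLEX_TXONLY = 2,
};

/* Hypothetical helper: map the "duplex" command-line argument to its
 * numeric value.  Returns -1 for an unrecognised name. */
int parse_duplex(const char *name)
{
	assert(name != NULL);
	if (strcmp(name, "half") == 0)
		return DUPLEX_HALF;
	if (strcmp(name, "full") == 0)
		return DUPLEX_FULL;
	if (strcmp(name, "txonly") == 0)
		return DUPLEX_TXONLY;
	return -1;
}

/* Hypothetical driver-side decision: force the link up only when
 * autonegotiation is off and the transmit-only mode was requested. */
int should_force_link(int autoneg_enabled, int duplex)
{
	return !autoneg_enabled && duplex == DUPLEX_TXONLY;
}
```

With these definitions, "autoneg off duplex txonly" is the only combination for which should_force_link() returns 1; how the driver then sets the force-link bit in the PHY control register is left out on purpose.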

I look forward to your comments and suggestions.

Cheers,
Kyle Moffett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[NET/IPv6] Race condition with flow_cache_genid?

2008-02-06 Thread Kyle Moffett
Whoops, I accidentally sent this to [EMAIL PROTECTED] instead of
[EMAIL PROTECTED]  Original email below:


Hi, I was poking around trying to figure out how to install the Mobile
IPv6 daemons this evening and noticed they required a kernel patch,
although upon further inspection the kernel patch seemed to already be
applied in 2.6.24.  Unfortunately the flow cache appears to be
horribly racy.  Attached below are the only uses of the variable
"flow_cache_genid" in 2.6.24.

Now, I am no expert in this particular area of the code, but the
"atomic_t flow_cache_genid" variable is ONLY ever used with
atomic_inc() and atomic_read().  There are no memory barriers or other
dec_and_test()-style functions, so that variable could just as easily
be replaced with a plain old C int.

Basically either there is some missing locking here or it does not
need to be "atomic_t".  Judging from the way it *appears* to be used
to check if cache entries are up-to-date with the latest changes in
policy, I would guess the former.

In particular that whole "flow_cache_lookup()" thing looks racy as
hell, since somebody could be in the middle of that looking at "if
(fle->genid == atomic_read(&flow_cache_genid))".  It does the
atomic_read(), which BTW is literally implemented as:
  #define atomic_read(atomicvar) ((atomicvar)->value)
on some platforms.  Immediately after the atomic read (or even before,
since there's no cache-flush or read-modify-write), somebody calls
into "selinux_xfrm_notify_policyload()" and increments the
flow_cache_genid because selinux just loaded a security policy.  Now
we're accepting a cache entry which applies to the PREVIOUS security
policy.  I can only assume that's really bad.
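
The pattern under discussion can be modelled in user space with C11 atomics. This is only a sketch of a generation-stamped cache entry, not the kernel code: cache_genid, struct cache_entry, entry_fill(), entry_lookup() and policy_changed() are names invented here for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* User-space stand-in for the kernel's flow_cache_genid. */
static atomic_int cache_genid;

struct cache_entry {
	int genid;	/* generation under which the entry was filled */
	void *object;	/* cached result */
};

/* Record an object together with the generation it was computed under. */
void entry_fill(struct cache_entry *e, void *object)
{
	assert(e != NULL);
	e->object = object;
	e->genid = atomic_load(&cache_genid);
}

/* Return the cached object if the entry is current, NULL otherwise.
 * The generation can still change between this comparison and the
 * caller's use of the result; the comparison by itself does not close
 * that window. */
void *entry_lookup(const struct cache_entry *e)
{
	assert(e != NULL);
	if (e->genid == atomic_load(&cache_genid))
		return e->object;
	return NULL;
}

/* A policy change makes every previously filled entry stale. */
void policy_changed(void)
{
	atomic_fetch_add(&cache_genid, 1);
}
```

In a single thread the check behaves as expected: a filled entry is returned until policy_changed() runs, after which entry_lookup() yields NULL. The concern raised above is about what happens when policy_changed() runs on another CPU right after the comparison has already succeeded.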

Even worse, there seems to be a race between SELinux loading a new
policy and calling selinux_xfrm_notify_policyload(), since we could
easily get packets and process them according to the old cache entry
on one CPU before SELinux has had a chance to update the generation ID
from the other.  Furthermore, there's no guarantee the CPU caches will
get updated in reasonable time.  Clearly SELinux needs to have some
way of atomically invalidating the flow cache of all CPUs
*simultaneously* with loading a new policy, which probably means they
both need to be under the same lock, or something.

The same problem appears to occur with updating the XFRM policy and
incrementing flow_cache_genid.  Probably the fastest solution is to
put the flow cache under the xfrm_policy_lock (which already disables
local bottom-halves), and either take that lock during SELinux policy
load or if there are lock ordering problems then add a variable
"flow_cache_ignore" and change the xfrm_notify hooks:

void selinux_xfrm_notify_policyload_pre(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_genid++;
	flow_cache_ignore = 1;
	write_unlock_bh(&xfrm_policy_lock);
}

void selinux_xfrm_notify_policyload_post(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_ignore = 0;
	write_unlock_bh(&xfrm_policy_lock);
}

Cheers,
Kyle Moffett


BEGIN QUOTED CODE INVOLVING flow_cache_genid:

include/net/flow.h:94:
extern atomic_t flow_cache_genid;

net/core/flow.c:39:
atomic_t flow_cache_genid = ATOMIC_INIT(0);

net/core/flow.c:169:flow_cache_lookup():
	if (flow_hash_rnd_recalc(cpu))
		flow_new_hash_rnd(cpu);
	hash = flow_hash_code(key, cpu);

	head = &flow_table(cpu)[hash];
	for (fle = *head; fle; fle = fle->next) {
		if (fle->family == family &&
		    fle->dir == dir &&
		    flow_key_compare(key, &fle->key) == 0) {
			if (fle->genid == atomic_read(&flow_cache_genid)) {
				void *ret = fle->object;

				if (ret)
					atomic_inc(fle->object_ref);
				local_bh_enable();

				return ret;
			}
			break;
		}
	}

net/xfrm/xfrm_policy.c:1025:
int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
{
	write_lock_bh(&xfrm_policy_lock);
	pol = __xfrm_policy_unlink(pol, dir);
	write_unlock_bh(&xfrm_policy_lock);
	if (pol) {
		if (dir < XFRM_POLICY_MAX)
			atomic_inc(&flow_cache_genid);
		xfrm_policy_kill(pol);
		return 0;
	}
	return -ENOENT;
}

net/ipv6/inet6_connection_sock.c:142:
static inline
void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
			   struct in6_addr *daddr, struct in6_addr *saddr)
{
	__ip6_dst_store(sk, dst, daddr, saddr);

#ifdef CONFIG_XFRM
	{
		struct rt6_info *rt = (struct rt6_info *)dst;
		rt->rt6i_flow_cache_genid = atomic_read(&flow_cache

Re: [PATCH] Allow NBD to be used locally

2008-02-02 Thread Kyle Moffett
Whoops, only hit "Reply" on the first email, sorry Jan.

On Feb 2, 2008 7:54 PM, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> On Feb 2 2008 18:31, [EMAIL PROTECTED] wrote:
> >
> >> How will that work? Fuse makes up a filesystem - not helpful
> >> if you have a raw disk without a known fs to mount.
> >
> >take zfs-fuse or ntfs-3g for example.
> >you have a blockdevice or backing-file containing data structures and fuse 
> >makes those show up as a filesystem.
> >i think vmware-mount is not different here.
>
> vmware-mount IS different, it provides the _block_ device,
> which is then mounted through the usual mount(2) mechanism
> (if there is a filesystem driver for it).

As far as I can tell, vmware-mount should be re-implemented as a
little perl script around "dmsetup" and/or "losetup", possibly with
"dm-userspace" patched into the kernel to allow you to handle
non-mapped blocks in your userspace daemon when somebody tries to
access them.  If you don't need that ability then straight dm-loop and
dm-linear will work.

Cheers,
Kyle Moffett


[NET/IPv6] Race condition with flow_cache_genid?

2008-02-02 Thread Kyle Moffett
Hi, I was poking around trying to figure out how to install the Mobile
IPv6 daemons this evening and noticed they required a kernel patch,
although upon further inspection the kernel patch seemed to already be
applied in 2.6.24.  Unfortunately the flow cache appears to be
horribly racy.  Attached below are the only uses of the variable
"flow_cache_genid" in 2.6.24.

Now, I am no expert in this particular area of the code, but the
"atomic_t flow_cache_genid" variable is ONLY ever used with
atomic_inc() and atomic_read().  There are no memory barriers or other
dec_and_test()-style functions, so that variable could just as easily
be replaced with a plain old C int.

Basically either there is some missing locking here or it does not
need to be "atomic_t".  Judging from the way it *appears* to be used
to check if cache entries are up-to-date with the latest changes in
policy, I would guess the former.

In particular that whole "flow_cache_lookup()" thing looks racy as
hell, since somebody could be in the middle of that looking at "if
(fle->genid == atomic_read(&flow_cache_genid))".  It does the
atomic_read(), which BTW is literally implemented as:
  #define atomic_read(atomicvar) ((atomicvar)->value)
on some platforms.  Immediately after the atomic read (or even before,
since there's no cache-flush or read-modify-write), somebody calls
into "selinux_xfrm_notify_policyload()" and increments the
flow_cache_genid because selinux just loaded a security policy.  Now
we're accepting a cache entry which applies to the PREVIOUS security
policy.  I can only assume that's really bad.

Even worse, there seems to be a race between SELinux loading a new
policy and calling selinux_xfrm_notify_policyload(), since we could
easily get packets and process them according to the old cache entry
on one CPU before SELinux has had a chance to update the generation ID
from the other.  Furthermore, there's no guarantee the CPU caches will
get updated in reasonable time.  Clearly SELinux needs to have some
way of atomically invalidating the flow cache of all CPUs
*simultaneously* with loading a new policy, which probably means they
both need to be under the same lock, or something.

The same problem appears to occur with updating the XFRM policy and
incrementing flow_cache_genid.  Probably the fastest solution is to
put the flow cache under the xfrm_policy_lock (which already disables
local bottom-halves), and either take that lock during SELinux policy
load or if there are lock ordering problems then add a variable
"flow_cache_ignore" and change the xfrm_notify hooks:

void selinux_xfrm_notify_policyload_pre(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_genid++;
	flow_cache_ignore = 1;
	write_unlock_bh(&xfrm_policy_lock);
}

void selinux_xfrm_notify_policyload_post(void)
{
	write_lock_bh(&xfrm_policy_lock);
	flow_cache_ignore = 0;
	write_unlock_bh(&xfrm_policy_lock);
}
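
One way to read the proposed hooks is that lookups must treat the cache as unusable while flow_cache_ignore is set. The single-threaded C model below sketches that reading. It leaves out the locking entirely (in the proposal both variables would only be touched with xfrm_policy_lock held), and struct flow_entry, flow_fill() and flow_lookup() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plain variables: this model assumes every access happens under one
 * lock, which is omitted here because the model is single-threaded. */
static int flow_cache_genid;
static bool flow_cache_ignore;

struct flow_entry {
	int genid;	/* generation under which the entry was filled */
	void *object;	/* cached result */
};

void policyload_pre(void)
{
	flow_cache_genid++;		/* invalidate everything cached so far */
	flow_cache_ignore = true;	/* bypass the cache during the load */
}

void policyload_post(void)
{
	flow_cache_ignore = false;
}

/* Hypothetical fill: stamp the entry with the current generation. */
void flow_fill(struct flow_entry *fle, void *object)
{
	assert(fle != NULL);
	fle->genid = flow_cache_genid;
	fle->object = object;
}

/* Hypothetical lookup: a hit needs a current generation and a clear
 * ignore flag. */
void *flow_lookup(const struct flow_entry *fle)
{
	assert(fle != NULL);
	if (flow_cache_ignore || fle->genid != flow_cache_genid)
		return NULL;
	return fle->object;
}
```

The flag matters for entries filled while the policy load is still in progress: they carry the new generation number, so the generation check alone would accept them, and only the ignore flag keeps them from being used until policyload_post() runs.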

Cheers,
Kyle Moffett


BEGIN QUOTED CODE INVOLVING flow_cache_genid:

include/net/flow.h:94:
extern atomic_t flow_cache_genid;

net/core/flow.c:39:
atomic_t flow_cache_genid = ATOMIC_INIT(0);

net/core/flow.c:169:flow_cache_lookup():
	if (flow_hash_rnd_recalc(cpu))
		flow_new_hash_rnd(cpu);
	hash = flow_hash_code(key, cpu);

	head = &flow_table(cpu)[hash];
	for (fle = *head; fle; fle = fle->next) {
		if (fle->family == family &&
		    fle->dir == dir &&
		    flow_key_compare(key, &fle->key) == 0) {
			if (fle->genid == atomic_read(&flow_cache_genid)) {
				void *ret = fle->object;

				if (ret)
					atomic_inc(fle->object_ref);
				local_bh_enable();

				return ret;
			}
			break;
		}
	}

net/xfrm/xfrm_policy.c:1025:
int xfrm_policy_delete(struct xfrm_policy *pol, int dir)
{
	write_lock_bh(&xfrm_policy_lock);
	pol = __xfrm_policy_unlink(pol, dir);
	write_unlock_bh(&xfrm_policy_lock);
	if (pol) {
		if (dir < XFRM_POLICY_MAX)
			atomic_inc(&flow_cache_genid);
		xfrm_policy_kill(pol);
		return 0;
	}
	return -ENOENT;
}

net/ipv6/inet6_connection_sock.c:142:
static inline
void __inet6_csk_dst_store(struct sock *sk, struct dst_entry *dst,
			   struct in6_addr *daddr, struct in6_addr *saddr)
{
	__ip6_dst_store(sk, dst, daddr, saddr);

#ifdef CONFIG_XFRM
	{
		struct rt6_info *rt = (struct rt6_info *)dst;
		rt->rt6i_flow_cache_genid = atomic_read(&flow_cache_genid);
	}
#endif
}

security/selinux/include/xfrm.h:41:
static inline void selinux_xfrm_notify_policyload(void)
{
	atomic_inc(&flow_cache_genid);
}

Re: [PATCH 5/9] bfs: move function prototype to the proper header file

2008-01-24 Thread Kyle Moffett

On Jan 24, 2008, at 18:13, Dmitri Vorobiev wrote:
> Heikki Orsila wrote:
>> On Fri, Jan 25, 2008 at 01:32:04AM +0300, Dmitri Vorobiev wrote:
>>> +/* inode.c */
>>> +extern void dump_imap(const char *, struct super_block *);
>>> +
>>
>> Functions should not be externed, remove extern keyword.
>
> Care to explain why?
>
> Following is an explanation why the contrary is probably true:
>
> 1) We have lots of precedents in existing code:
>
> [EMAIL PROTECTED]:~/Projects/misc/linux$ git-grep 'extern void' include | wc -l
> 5523
> [EMAIL PROTECTED]:~/Projects/misc/linux$

The "extern" keyword on functions is *completely* redundant.

For C variables:
  Declaration:  extern int foo;
  Definition:   int foo;
  File-scoped:  static int foo;

For C functions:
  Declaration:  void foo(int x);
  Definition:   void foo(int x) { /*...body...*/ }
  File-scoped:  static void foo(int x) { /*...body...*/ }

The compiler will *allow* you to use "extern" on the function
prototype, but the presence or absence of a function body already
tells it whether it is looking at a declaration or a definition, so
the "extern" keyword is not required and is therefore redundant.


For maximum readability and cleanliness I recommend that you leave off  
the "extern" on the function declarations; it makes the lines much  
longer without obvious gain.
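
A small, self-contained C sketch of the point above. The names add_one, add_two and counter are made up for the illustration; the variable is included only to contrast with the function case, where "extern" changes nothing.

```c
#include <assert.h>
#include <limits.h>

/* Two declarations of the same function.  Both give it external
 * linkage, so the compiler accepts them side by side. */
extern int add_one(int x);
int add_one(int x);

/* A declaration written without "extern" is just as valid. */
int add_two(int x);

/* Variables are different: without "extern" this line would be a
 * (tentative) definition rather than a pure declaration. */
extern int counter;

int counter = 0;

int add_one(int x)
{
	assert(x < INT_MAX);	/* x + 1 would overflow otherwise */
	counter++;
	return x + 1;
}

int add_two(int x)
{
	return add_one(add_one(x));
}
```

Either style of prototype can be placed in a header; the definition that follows is matched against both in exactly the same way.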


Cheers,
Kyle Moffett




Re: [PATCH 00/26] Permit filesystem local caching

2008-01-15 Thread Kyle Moffett

On Jan 15, 2008, at 18:46, David Howells wrote:

 (*) 01-keys-inc-payload.diff
 (*) 02-keys-search-keyring.diff
 (*) 03-keys-callout-blob.diff


One vaguely related question:  Is there presently any way to adjust  
the per-user max-key-data limit? I've been tinkering with using the  
new-ish MIT kerberos "KEYRING:" credentials-cache code to hold keys  
for persistent daemons.  Unfortunately "root" keeps hitting the limit  
even with only about 16 keys allocated across a few sessions.  After  
perusing the docs I can't find any documentation on adjusting the  
limits.


I'd really like some way to specifically allow root to allocate up to  
several megs worth of non-swappable key data, although I suppose just  
increasing the global limit slightly wouldn't be bad either.  If such  
functionality already exists then I'd appreciate a pointer to it (and
would possibly respond in kind with documentation patches).


Cheers,
Kyle Moffett


Re: The ext3 way of journalling

2008-01-08 Thread Kyle Moffett

On Jan 08, 2008, at 15:51:53, Andi Kleen wrote:

Theodore Tso <[EMAIL PROTECTED]> writes:
Now, there are good reasons for doing periodic checks every N  
mounts and after M months.  And it has to do with PC class  
hardware.  (Ted's aphorism: "PC class hardware is cr*p").


If these reasons are good ones (some skepticism here) then the  
correct way to really handle this would be to do regular background  
scrubbing during runtime; ideally with metadata checksums so that  
you can actually detect all corruption.


Poor man's background scrubbing:

(A)  Use LVM, as virtually all modern distros offer
(B)  Leave some extra space in your LVM volume group (enough for 1  
snapshot over the time it takes to do an FSCK).

(C)  Periodically run the following scriptlet:

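# VG and VOLUME name the volume to check.  SNAP_SIZE is an added
# variable: classic (non-thin) snapshots need an explicit size.
set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L "${SNAP_SIZE}" -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice -n 19 fsck -fy "/dev/${VG}/${VOLUME}-snap"; then
	echo 'Background scrubbing succeeded!'
	tune2fs -T "${START}" "/dev/${VG}/${VOLUME}"
else
	echo 'Background scrubbing failed!  Reboot to fsck soon!'
	# Max out the mount count and backdate the last check so that
	# the next boot forces a full fsck.
	tune2fs -C 16383 -T "19000101" "/dev/${VG}/${VOLUME}"
fi
lvremove -f "${VG}/${VOLUME}-snap"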

Basically you can fsck the offline snapshot in the background.  If it  
succeeds you can adjust the "last checked" date to the time when the  
snapshot was taken and if it fails you can schedule an FSCK at next  
reboot (and possibly remount the filesystem read-only or reboot  
immediately).


You can do the same thing for your /boot volume, although you  
probably have to manually use dmsetup since most bootloaders can't  
interpret LVM volumes.


I've always been surprised that distros like Red Hat which
automatically use LVM don't stuff this in their weekly or monthly
checks on desktop systems.  User experience could also be
dramatically improved with automated smartd configuration and
user-interactive logging and warning messages.



But since fsck is so slow and disks are so big this whole thing is  
a ticking time bomb now. e.g. it is not uncommon to require tens of  
minutes or even hours of fsck time and some server that reboots  
only every few months will eat that when it happens to reboot. This  
means you get a quite long downtime.


My servers all have an "interval-between-checks" of 2-6 weeks and are
configured to run low-priority (niced) background "fsck" checks during
off-hours between once every few days and once every few weeks.  I also
have the "max mount count" numbers set to primes between 7 and 37
(depending on the filesystem) so that troubled or frequently-rebooted
systems are more frequently verified.  The end result is that I
almost never have the dreaded 4-hour-fsck-on-boot problem.  Every drive
has been fscked within the last few weeks of operation, and only very
rarely will multiple large filesystems all be fscked on the same boot
(once every least-common-multiple of their max-mount-counts).
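
To put a number on "very rarely", here is a small sketch of the
arithmetic.  It assumes both filesystems are mounted exactly once per
boot and that their mount counters start in sync; the function names
are made up for illustration.  Two filesystems hit their
max-mount-count on the same boot once every lcm(a, b) mounts, which
for distinct primes is simply a * b:

```c
/* Greatest common divisor, Euclid's algorithm. */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
	while (b != 0) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Number of mounts between boots on which both filesystems reach
 * their max-mount-count together: the least common multiple.
 */
unsigned long mounts_between_joint_fscks(unsigned long count_a,
					 unsigned long count_b)
{
	return count_a / gcd_ul(count_a, count_b) * count_b;
}
```

With max-mount-counts of 7 and 37 the two checks coincide only every
259 boots, whereas counts of 20 and 30 would coincide every 60.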


Cheers,
Kyle Moffett



Re: freeze vs freezer

2008-01-04 Thread Kyle Moffett

On Jan 04, 2008, at 15:54:06, Oliver Neukum wrote:

On Thursday, 3 January 2008 23:06:07, Nigel Cunningham wrote:

Hi.

a) mount fuse on /tmp/first
b) mount fuse on /tmp/second

Then the server task for (a) does "ls /tmp/second". So it will be  
frozen, right? How do you then freeze (a)? And keep in mind that  
the server task may have forked.


I guess I should first ask, is this a real life problem or a  
hypothetical twisted web? I don't see why you would want to make  
two filesystems interdependent - it sounds like the way to create  
livelock and deadlocks in normal use, before we even begin to  
think about hibernating.


Good questions. I personally don't use fuse, but I do care about  
power management. The problem I see is that an unprivileged user  
could make that dependency, even inadvertently.


I don't think it makes sense for the kernel to try to keep track of  
hard data dependencies for FUSE filesystems, or to even *attempt* to  
auto-suspend them.  You should instead allow a privileged program to  
initiate a "freeze-and-flush" operation on a particular FUSE  
filesystem and optionally wait for it to finish.  Then your userspace  
would be configured with the appropriate data dependencies and would  
stop FUSE filesystems in the appropriate order.
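
A rough sketch of what "stop them in the appropriate order" could look
like in userspace, assuming each filesystem sits on at most one
backing filesystem.  The function name, the backing[] encoding and
MAX_FS are all invented for this example; nothing here is an existing
kernel or FUSE interface.  A filesystem is frozen only once everything
stacked on top of it has been frozen:

```c
#define MAX_FS 16

/*
 * backing[i] is the index of the filesystem that i stores its data
 * on, or -1 if i sits directly on hardware.  Writes the freeze order
 * into order[] and returns the number of entries, or -1 on bad input
 * or a dependency cycle.
 */
int freeze_order(const int *backing, int n, int *order)
{
	int users[MAX_FS] = {0};	/* filesystems stacked on top of i */
	int frozen[MAX_FS] = {0};
	int count = 0;
	int i, progress;

	if (n < 0 || n > MAX_FS)
		return -1;
	for (i = 0; i < n; i++) {
		if (backing[i] >= n)
			return -1;
		if (backing[i] >= 0)
			users[backing[i]]++;
	}

	do {
		progress = 0;
		for (i = 0; i < n; i++) {
			if (frozen[i] || users[i] != 0)
				continue;
			/* Nothing unfrozen depends on i any more. */
			frozen[i] = 1;
			order[count++] = i;
			if (backing[i] >= 0)
				users[backing[i]]--;
			progress = 1;
		}
	} while (progress);

	return count == n ? count : -1;
}
```

For a stack of fuse (index 0), loopback on fuse (index 1) and ext3 on
loopback (index 2), backing = {-1, 0, 1} yields the order ext3,
loopback, fuse.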


In addition, the kernel would automatically understand
ext3=>loopback=>fuse, and when asked to freeze the "fuse" part, it
would first freeze the "ext3" and the "loopback" parts using
mechanisms similar to those device-mapper currently uses when you do
"dmsetup suspend mydev" followed by
"echo 0 $SIZE snapshot /dev/mapper/mydev-base /dev/mapper/mydev-snap-back p 8 | dmsetup load mydev"
(i.e. when you create a snapshot of a given device).


Naturally userspace could deadlock itself (although not the kernel)  
by freezing a block device and then attempting to access it, but  
since the "freeze" operation is limited to root this is not a big  
issue.  The way to freeze all filesystems safely would be to clone a  
new mount namespace, mlockall(), mount a tmpfs, pivot_root() into the  
tmpfs, bind-mount the filesystems you want to freeze directly onto  
subdirectories of the tmpfs, and then freeze them in an appropriate  
order.


Besides which the worst-case is a pretty straightforward non-critical  
failure; you might fail to fully sync a FUSE filesystem because its  
daemon is asleep waiting on something (possibly even just sitting in  
a "sleep(1)" call with all signals masked).  You simply need to  
make sure that all tasks are asleep outside of driver critical  
sections so that you can properly suspend your device tree.


Cheers,
Kyle Moffett


Re: Get physical MAC address

2008-01-01 Thread Kyle Moffett

On Jan 01, 2008, at 21:42:18, Jon Masters wrote:

On Mon, 2007-12-31 at 12:39 +0700, Theewara Vorakosit wrote:
I get the MAC address from ioctl. However, ifconfig can change this
MAC address. Can I get the real physical MAC address of the NIC?


Forgive me reading into your mail...this smells a bit like some  
kind of licensing/compliance thing. Just bear in mind that using  
the MAC to verify the identity of a machine is utterly useless and  
pointless - anyone can trivially fool your software[0] to see what  
it "wants".


Not necessarily;  I can easily see distros wanting to have a "Restore
defaults" button in their network config windows which also includes
restoring the default MAC address to the NIC.  It should also be
pointed out that anybody with one of a selection of re-flashable NICs
(or NICs with removable EEPROMs) can easily change the MAC address on
their NIC.  Other alternatives include renaming eth0 to mynet0 and
creating a downed dummy interface called "eth0" with the desired MAC
addr.



[0] We used to have to do far worse kludgery in college, in order  
to prevent the silly powers that be who "banned" network cards  
other than those made by one manufacturer from being used on their  
little network.


Well for basically any userspace-level check, all it takes is
somebody who knows ASM and has about 5 minutes to track down the
problematic branch instructions.  Then they just have to write a
10-line GDB script which starts the program, traps the appropriate
instructions, and then changes a "0" to a "1" (or vice versa) before
the conditional branch.  On Windows it's vaguely practical (albeit
crash-prone) to load a kernel hack which prevents your program from
being debugged, but under Linux it's effectively impossible.


Cheers,
Kyle Moffett



Re: yield API

2007-12-12 Thread Kyle Moffett

On Dec 12, 2007, at 17:39:15, Jesper Juhl wrote:

On 02/10/2007, Ingo Molnar <[EMAIL PROTECTED]> wrote:
sched_yield() has been around for a decade (about three times  
longer than futexes were around), so if it's useful, it sure  
should have grown some 'crown jewel' app that uses it and shows  
off its advantages, compared to other locking approaches, right?


I have one example of sched_yield() use in a real app.  
Unfortunately it's proprietary so I can't show you the source, but  
I can tell you how it's used.


The case is this:  Process A forks process B. Process B does some  
work that takes approximately between 50 and 1000ms to complete
(varies), then it creates a file and continues to do other work.   
Process A needs to wait for the file B creates before it can  
continue. Process A *could* immediately go into some kind of "check  
for file; sleep n ms" loop, but instead it starts off by calling  
sched_yield() to give process B a chance to run and hopefully get  
to the point where it has created the file before process A is  
again scheduled and starts to look for it - after the single sched  
yield call, process A does indeed go into a "check for file; sleep  
250ms;" loop, but most of the time the initial sched_yield() call  
actually results in the file being present without having to loop  
like that.


That is a *terrible*, disgusting way to use yield.  Better options:
  (1) inotify/dnotify
  (2) create a "foo.lock" file and put the mutex in that
  (3) just start with the check-file-and-sleep loop.


Now is this the best way to handle this situation? No.  Does it  
work better than just doing the wait loop from the start? Yes.


It works better than doing the wait-loop from the start?  What
evidence do you provide to support this assertion?  Specifically, in
the first case you tell the kernel "I'm waiting for something but I
don't know what it is or how long it will take"; while in the second
case you tell the kernel "I'm waiting for something that will take
exactly X milliseconds, even though I don't know what it is".  If you
really want something similar to the old behavior then just replace
the "sched_yield()" call with a proper sleep for the estimated time
it will take the program to create the file.



Is this a good way to use sched_yield()? Maybe, maybe not.  But it  
*is* an actual use of the API in a real app.


We weren't looking for "actual uses", especially not in binary-only  
apps.  What we are looking for is optimal uses of sched_yield(); ones  
where that is the best alternative.  This... certainly isn't.


Cheers,
Kyle Moffett



Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Kyle Moffett

On Dec 06, 2007, at 00:30:16, Renzo Davoli wrote:
AF_IPN is different.  AF_IPN is the broadcast and peer-to-peer  
extension of AF_UNIX. It supports communication among *user*  
processes.


Ok, you say it's different, but then you describe how IP unicast and  
broadcast work.  Both are frequently used for communication among  
"*user* processes".  Please provide significantly more details about  
exactly *how* it's different.




Example:

Qemu, User-Mode Linux, Kvm, our umview machines can use IPN as an  
Ethernet Hub and communicate among themselves with the hosting  
computer and the world by a tap like interface.


You say "tap like" interface, but people do this already with  
existing infrastructure.  You can connect Qemu, UML, and KVM to a  
standard Linux "tap" interface, and then use the standard Linux
bridging code to connect the "tap" interface to your existing network  
interfaces.  Alternatively you could use the standard and well-tested  
IP routing/firewalling/NAT code to move your packets around.  None of  
this requires new network infrastructure in the slightest.  If you  
have problems with the existing code, please improve it instead of  
creating a slightly incompatible replacement which has different bugs  
and workarounds.



You can also grab an interface (say eth1) and use eth0 for your  
hosting computer and eth1 for the IPN network of virtual machines.


You can do that already with the bridging code.


If you load the kvde_switch submodule IPN can be a virtual Ethernet  
switch.


As I described above, this can be done with the existing bridging and  
tun/tap code.




Another Example:

You have a continuous stream of data packets generated by a
process, and you want to send this data to many processes.  Maybe
the set of processes is not known in advance, you want to send the
data to any interested process. Some kind of publish/subscribe
communication service (among unix processes not on TCP-IP). Without
IPN you need a server. With IPN the sender creates the socket,
connects to it and feeds it with data packets. All the interested
receivers connect to it and start reading. That's all.


This is already done frequently in userspace.  Just register a port  
number with IANA on which to implement a "registration" server and  
write a little daemon to listen on 127.0.0.1:${YOUR_PORT}.  Your  
interconnecting programs then use either unicast or multicast sockets  
to bind, then report to the registration server what service you are  
offering and what port it's on.  Your "receivers" then connect to the  
registration server, ask what port a given service is on, and then  
multicast-listen or unicast-connect to access that service.  The best  
part is that all of the performance implications are already  
thoroughly understood.  Furthermore, if you want to extend your  
communication protocol to other hosts as well, you just have to  
replace the 127.0.0.1 bind with a global bind.  This is exactly how  
the standard-specified multiple-participant "SIP" protocol works, for  
example.
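
A minimal sketch of the unicast data path described above, using
nothing but ordinary UDP sockets on the loopback address.  The
registration lookup is reduced to handing the port number over
directly, and the function names are invented for this example:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Receiver side: bind a UDP socket on 127.0.0.1 to a kernel-chosen
 * port.  Returns the socket and stores the port, which is what the
 * receiver would report to the registration server.
 */
int open_receiver(unsigned short *port_out)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	struct timeval tv = { 1, 0 };	/* give up after 1s of silence */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
		close(fd);
		return -1;
	}
	*port_out = ntohs(addr.sin_port);
	return fd;
}

/* Sender side: deliver one datagram to a port learned from the registry. */
int publish(unsigned short port, const char *msg)
{
	struct sockaddr_in dst;
	ssize_t sent;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	dst.sin_port = htons(port);
	sent = sendto(fd, msg, strlen(msg), 0,
		      (struct sockaddr *)&dst, sizeof(dst));
	close(fd);
	return sent == (ssize_t)strlen(msg) ? 0 : -1;
}
```

Replacing INADDR_LOOPBACK with INADDR_ANY on the receiver and a real
host address on the sender is all it takes to extend this to other
machines.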



So if you really think this is something that belongs in the kernel  
you need to provide much more detailed descriptions and use-cases for  
why it cannot be implemented in user-space or with small  
modifications to existing UDP/TCP networking.


Cheers,
Kyle Moffett



Re: [PATCH] Reduce stack used by lib/hexdump.c

2007-12-05 Thread Kyle Moffett

On Dec 05, 2007, at 21:42:35, Joe Perches wrote:

On Wed, 2007-12-05 at 18:18 -0800, Randy Dunlap wrote:

Joe Perches wrote:
Maybe just eliminate the 16 or 32 byte width option and force it  
to only 16 byte widths.
Have you checked users (callers)?  I'm pretty sure that one of the  
callers wanted 32 and that's why it's there.


I did.  There is only 1 subsystem.  That's easy to change.

drivers/mtd/ubi/debug.c:  print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1,
drivers/mtd/ubi/io.c:     print_hex_dump(KERN_DEBUG, "", DUMP_PREFIX_OFFSET, 32, 1,


Long lines in the log file are not too easy to read anyway.  Using  
16 byte dumps per line instead of 32 isn't painful.


It gets rid of the allocation, reduces the argument count and makes  
the kernel smaller.  I think it's all good.


Every current caller would have to change though.


Alternatively, since print_hex_dump is not a performance-critical
path (and usually indicates an error/debug condition), you could
probably just make the line buffer static and protect it with a static
"hexdump_lock" spinlock taken with
spin_lock_irqsave()/spin_unlock_irqrestore().  It would always nest
inside any other lock (except during crash, where we break locks
already for printk()), and I doubt any of the callers would notice
the serialization since they're already serialized on the printk buffer.
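
A userspace analogue of that idea, purely as a sketch: one static
line buffer shared by all callers, guarded by a trivial spinlock
built from a C11 atomic_flag.  In the kernel this would be a
DEFINE_SPINLOCK() taken with spin_lock_irqsave(); the function name
hex_dump_row and the buffer layout are made up for this example:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define ROWSIZE 16

/* One shared line buffer instead of one per call on the stack. */
static char linebuf[ROWSIZE * 3];
/* Userspace stand-in for a static kernel spinlock. */
static atomic_flag hexdump_lock = ATOMIC_FLAG_INIT;

/* Format up to ROWSIZE bytes as "xx xx xx ..." into out. */
void hex_dump_row(const unsigned char *data, size_t len,
		  char *out, size_t outlen)
{
	size_t i, pos = 0;

	if (len > ROWSIZE)
		len = ROWSIZE;

	/* Spin until we own the shared buffer. */
	while (atomic_flag_test_and_set_explicit(&hexdump_lock,
						 memory_order_acquire))
		;

	linebuf[0] = '\0';
	for (i = 0; i < len; i++)
		pos += snprintf(linebuf + pos, sizeof(linebuf) - pos,
				i ? " %02x" : "%02x", data[i]);
	snprintf(out, outlen, "%s", linebuf);

	atomic_flag_clear_explicit(&hexdump_lock, memory_order_release);
}
```

The stack cost per call drops to a few locals, at the price of
serializing concurrent dumps on the lock.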


Cheers,
Kyle Moffett



Re: [PATCH] Reduce stack used by lib/hexdump.c

2007-12-05 Thread Kyle Moffett

On Dec 05, 2007, at 21:42:35, Joe Perches wrote:

On Wed, 2007-12-05 at 18:18 -0800, Randy Dunlap wrote:

Joe Perches wrote:
Maybe just eliminate the 16 or 32 byte width option and force it  
to only 16 byte widths.
Have you checked users (callers)?  I'm pretty sure that one of the  
callers wanted 32 and that's why it's there.


I did.  There is only 1 subsystem.  That's easy to change.

drivers/mtd/ubi/debug.c:  print_hex_dump(KERN_DEBUG, ,  
DUMP_PREFIX_OFFSET, 32, 1,
drivers/mtd/ubi/io.c: print_hex_dump(KERN_DEBUG, ,  
DUMP_PREFIX_OFFSET, 32, 1,


Long lines in the log file are not too easy to read anyway.  Using  
16 byte dumps per line instead of 32 isn't painful.


It gets rid of the allocation, reduces the argument count and makes  
the kernel smaller.  I think it's all good.


Every current caller would have to change though.


Alternatively, since print_hex_dump is not a performance-critical  
path (and usually indicates an error/debug condition), you could  
probably just make a static hexdump_lock spinlock and  
spin_lock_irqsave()/spin_unlock_irqrestore().  It would always nest  
inside any other lock (except during crash, where we break locks  
already for printk()), and I doubt any of the callers would notice  
the serialization since they're already serialized on the printk buffer.


Cheers,
Kyle Moffett

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Kyle Moffett

On Dec 06, 2007, at 00:30:16, Renzo Davoli wrote:
AF_IPN is different.  AF_IPN is the broadcast and peer-to-peer  
extension of AF_UNIX. It supports communication among *user*  
processes.


Ok, you say it's different, but then you describe how IP unicast and  
broadcast work.  Both are frequently used for communication among  
*user* processes.  Please provide significantly more details about  
exactly *how* it's different.




Example:

Qemu, User-Mode Linux, Kvm, our umview machines can use IPN as an  
Ethernet Hub and communicate among themselves with the hosting  
computer and the world by a tap like interface.


You say tap like interface, but people do this already with  
existing infrastructure.  You can connect Qemu, UML, and KVM to a  
standard linus tap interface, and then use the standard Linux  
bridging code to connect the tap interface to your existing network  
interfaces.  Alternatively you could use the standard and well-tested  
IP routing/firewalling/NAT code to move your packets around.  None of  
this requires new network infrastructure in the slightest.  If you  
have problems with the existing code, please improve it instead of  
creating a slightly incompatible replacement which has different bugs  
and workarounds.



You can also grab an interface (say eth1) and use eth0 for your  
hosting computer and eth1 for the IPN network of virtual machines.


You can do that already with the bridging code.


If you load the kvde_switch submodule IPN can be a virtual Ethernet  
switch.


As I described above, this can be done with the existing bridging and  
tun/tap code.




Another Example:

You have a continuous stream of data packets generated by a  
process, and you want to send this data to many processes.  Maybe  
the set of processes is not known in advance, you want to send the  
data to any interested process. Some kind of publishsubscribe  
communication service (among unix processes not on TCP-IP). Without  
IPN you need a server. With IPN the sender creates the socket  
connects to it and feed it with data packets. All the interested  
receivers connects to it and start reading. That's all.


This is already done frequently in userspace.  Just register a port  
number with IANA on which to implement a registration server and  
write a little daemon to listen on 127.0.0.1:${YOUR_PORT}.  Your  
interconnecting programs then use either unicast or multicast sockets  
to bind, then report to the registration server what service you are  
offering and what port it's on.  Your receivers then connect to the  
registration server, ask what port a given service is on, and then  
multicast-listen or unicast-connect to access that service.  The best  
part is that all of the performance implications are already  
thoroughly understood.  Furthermore, if you want to extend your  
communication protocol to other hosts as well, you just have to  
replace the 127.0.0.1 bind with a global bind.  This is exactly how  
the standard-specified multiple-participant SIP protocol works, for  
example.
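
A minimal sketch of the data path described above, assuming UDP unicast on the loopback address.  The registration server is left out: the port number that service_bind() hands back is what a service would report to it, and what a receiver would later ask for.  All function names here are hypothetical, and demo_exchange() runs both ends in one process purely for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Service side: bind a UDP socket on 127.0.0.1 and let the kernel pick
 * the port.  *port is the value to report to the registration server. */
int service_bind(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Receiver side: "unicast-connect" to a port learned from the registry. */
int receiver_connect(unsigned short port)
{
    struct sockaddr_in addr;
    struct timeval tv = { 2, 0 };  /* demo only: never block forever */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
        connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* One subscribe/publish round trip.  Returns the number of bytes the
 * receiver got, or -1 on error. */
int demo_exchange(char *out, size_t outlen)
{
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    unsigned short port = 0;
    char hello[16];
    int n = -1;
    int svc = service_bind(&port);
    int rcv = svc < 0 ? -1 : receiver_connect(port);

    if (rcv >= 0 &&
        send(rcv, "subscribe", 9, 0) == 9 &&
        recvfrom(svc, hello, sizeof(hello), 0,
                 (struct sockaddr *)&peer, &plen) > 0 &&
        sendto(svc, "data", 4, 0, (struct sockaddr *)&peer, plen) == 4)
        n = (int)recv(rcv, out, outlen, 0);

    if (rcv >= 0)
        close(rcv);
    if (svc >= 0)
        close(svc);
    return n;
}
```

Replacing INADDR_LOOPBACK with INADDR_ANY in service_bind() is the "global bind" change mentioned above; the rest of the exchange stays the same.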



So if you really think this is something that belongs in the kernel  
you need to provide much more detailed descriptions and use-cases for  
why it cannot be implemented in user-space or with small  
modifications to existing UDP/TCP networking.


Cheers,
Kyle Moffett

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Relax permissions for reading hard drive serial number?

2007-12-04 Thread Kyle Moffett

On Dec 02, 2007, at 13:45:44, Matti Aarnio wrote:
The lack of a stable(*) unique system identifier available to
applications is one of the small details that make node-locked
commercial software delivery a challenging thing in UNIX environments.


*) "stable" as both stable data, and stable API to get it.


Well... There's that.  There's also the fact that anybody with a
modicum of ASM programming skills can get clever with GDB, compare
traces from "Correct HW serial" and "Incorrect HW serial" runs, and
write a 10-line GDB script to make it work regardless.  I did something similar
with a popular FPS (which I legitimately own) on one of my Mac  
systems after having left the DVD behind when going to a LAN party.   
Addresses removed to protect the innocent^Wguilty, but they took  
maybe 15 minutes to acquire:


break *END_OF_CDKEY_CODE_DECRYPTION
run
delete 1
advance *JUST_AFTER_CDKEY_CHECK
set $r3 = 0
detach

At some point every such "locked" computer program has code like this:

if (program_is_not_authorized()) {
    display_nasty_dialog();
    exit(1);
}


All it takes for somebody with a debugger is to identify the last
instruction of the "program_is_not_authorized()" function and change $r3
(or whatever return register your system uses) from a 1 to a 0.  The
fact remains that once the software is running on *THEIR* computer  
there is nothing you can practically do to forcibly prevent them from  
using it in whatever fashion they desire.  Typically if you price  
your software reasonably people will be willing to pay for multiple  
copies but there are no foolproof technical measures to enforce that  
they do so.


Cheers,
Kyle Moffett



Re: Kernel Development & Objective-C

2007-11-30 Thread Kyle Moffett

On Nov 30, 2007, at 13:40:07, H. Peter Anvin wrote:

Kyle Moffett wrote:
With that said, there is a significant performance penalty as all  
Objective-C method calls are looked up symbolically at runtime for  
every single call.


GACK!

At least C++ has vtables.


In a tight loop there is a way to do a single symbolic lookup and  
just call directly through a function pointer, but typically it isn't  
necessary for GUI programs and the like.  The flexibility of being  
able to dynamically add new methods to an existing class (at least  
for desktop user interfaces) significantly outweighs the performance  
cost.  Any performance-sensitive code is typically written in  
straight C anyways.
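
To make the "single symbolic lookup" idea concrete, here is a small sketch in plain C.  It is not the Objective-C runtime: the method table, the selector strings and all function names are invented for illustration, and the strcmp() search merely stands in for whatever lookup a real runtime performs.  (In Objective-C itself the cached pointer would typically come from -methodForSelector:.)

```c
#include <stddef.h>
#include <string.h>

typedef long (*method_fn)(long self, long arg);

static long add_impl(long self, long arg) { return self + arg; }
static long mul_impl(long self, long arg) { return self * arg; }

/* Toy "class": a table mapping selector names to implementations. */
static const struct {
    const char *sel;
    method_fn imp;
} methods[] = {
    { "add:", add_impl },
    { "mul:", mul_impl },
};

/* Symbolic lookup by selector name; returns NULL if nothing matches. */
method_fn lookup(const char *sel)
{
    size_t i;

    for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
        if (strcmp(methods[i].sel, sel) == 0)
            return methods[i].imp;
    return NULL;
}

/* Looks the method up on every iteration, like a plain message send. */
long sum_dynamic(long n)
{
    long i, acc = 0;

    for (i = 0; i < n; i++)
        acc = lookup("add:")(acc, i);
    return acc;
}

/* Looks the method up once, then calls through the cached pointer. */
long sum_cached(long n)
{
    method_fn add = lookup("add:");
    long i, acc = 0;

    for (i = 0; i < n; i++)
        acc = add(acc, i);
    return acc;
}
```

Both loops compute the same result; the second one pays for the lookup once instead of n times, at the price of not noticing if the method is replaced while the loop runs.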


Cheers,
Kyle Moffett



Re: Kernel Development & Objective-C

2007-11-30 Thread Kyle Moffett

On Nov 30, 2007, at 09:34:45, Lennart Sorensen wrote:

On Thu, Nov 29, 2007 at 12:14:16PM +, Ben Crowhurst wrote:

Has Objective-C ever been considered for kernel development?


Doesn't objective C essentially require a runtime to provide a lot  
of the features of the language?  If it does (as I suspect) then it  
is totally unsuitable for kernel development.


That and object oriented languages in general are badly designed  
and a bad idea.  Having not used objective C I have no idea if it  
qualifies as badly designed or not.  Certainly C++ and java are  
both very badly designed.


Objective-C is actually a pretty minimal wrapper around C; it was  
originally implemented as a C preprocessor.  It generally does not  
have any kind of memory management, garbage collection, or anything  
else (although typically a "runtime" will provide those features).   
There are no first-class exceptions, so there would be nothing to  
worry about there (the exceptions used in GUI programs are built  
around the setjmp/longjmp primitives).  Objective-C is also almost  
completely backwards-compatible with C, much more so than C++ ever  
was.  As far as the runtime goes the kernel would be expected to  
write its own, the same way that it implements "kmalloc()" as part of  
a "C runtime".  Since the runtime itself never does any implicit  
memory allocation, I think it would conceivably even be relatively  
safe for kernel usage.


With that said, there is a significant performance penalty as all  
Objective-C method calls are looked up symbolically at runtime for  
every single call.  For GUI programs where large chunks of the code  
are event-loops and not performance-sensitive that provides a huge  
amount of extra flexibility.  In the kernel though, there are many  
codepaths where *every* *single* instruction counts; that could be a  
serious performance hit.


Cheers,
Kyle Moffett



Re: git guidance

2007-11-29 Thread Kyle Moffett

On Nov 29, 2007, at 00:27:04, Al Boldi wrote:

Jakub Narebski wrote:
Besides, you can always use "git show <revision>:<file>". For
example gitweb (and I think other web interfaces) can show any  
version of a file or a directory, accessing only repository.


Sure, browsing is the easy part, but Version Control starts when  
things become writable.


But... git history is inherently and completely immutable once
created... that's the only way you can index everything with a simple  
SHA-1.  If you want to write to the "git filesystem" by adding new  
commits then you need to use the appropriate commands, same as every  
other VCS on the planet.
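
The link between immutability and hash-based indexing can be sketched in a few lines of C.  This is only a model: FNV-1a is used as a short stand-in for SHA-1 (git does not use it), a "commit" is reduced to a parent ID plus a content string, and commit_id() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, 64-bit.  A stand-in for SHA-1 to keep the sketch short;
 * this is NOT the hash git uses. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;          /* FNV prime */
    }
    return h;
}

/* A commit's ID covers both its parent's ID and its own content, so
 * the ID of the newest commit depends on everything before it. */
uint64_t commit_id(uint64_t parent_id, const char *content)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */

    h = fnv1a(h, &parent_id, sizeof(parent_id));
    return fnv1a(h, content, strlen(content));
}
```

Editing an old commit changes its ID, which changes the ID of every descendant, so the "edited" history is simply a different set of objects.  New history can only be added on top, which is what the commit commands do.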


Cheers,
Kyle Moffett



Re: freeze vs freezer

2007-11-27 Thread Kyle Moffett

On Nov 27, 2007, at 17:49:18, Jeremy Fitzhardinge wrote:

Rafael J. Wysocki wrote:
Well, this is more-or-less how we all imagine that should be done  
eventually.


The main problem is how to implement it without causing too much  
breakage.  Also, there are some dirty details that need to be  
taken into consideration.


For Xen suspend/resume, I'd like to use the freezer to get all  
threads into a known consistent state (where, specifically, they  
don't have any outstanding pagetable updates pending).  In other  
words, the freezer as it currently stands is what I want, modulo  
some of these issues where it gets caught up unexpectedly.  If  
threads end up getting frozen anywhere preempt isn't explicitly  
disabled, it wouldn't work for me.


The problem with "one freezer" is that "known consistent state" means  
something completely different to every single driver and subsystem.   
Xen wants it to mean "No pending page table updates and no more  
updates from this point forward".  A network driver wants it to mean
"All pending network packets DMAed out or in and the device shut down
with all remaining packets queued".  A SATA controller wants it to
mean "All DMA quiesced and no more commands", etc.


The only way to have that work is to put minimal definitions of what  
state you care about in the drivers themselves.  For Xen this means  
that you need to have an appropriately-timed suspend handler which  
hooks into Xen code very precisely to create and preserve the "No  
pending page table updates" state that you care about.  It will be  
more work in the short term but it's the only maintainable solution  
in the long term IMO.


Cheers,
Kyle Moffett



Re: freeze vs freezer

2007-11-27 Thread Kyle Moffett

On Nov 27, 2007, at 12:40:24, Rafael J. Wysocki wrote:

On Tuesday, 27 of November 2007, Matthew Garrett wrote:

On Mon, Nov 26, 2007 at 10:53:34PM +0100, Rafael J. Wysocki wrote:

On Monday, 26 of November 2007, David Chinner wrote:
So how do you handle threads that are blocked on I/O or a lock  
during the system freeze process, then?


We wait until they can continue.


So if I have a process blocked on an unavailable NFS mount, I can't
suspend?


That's correct, you can't.

[And I know what you're going to say. ;-)]


Why exactly does suspend/hibernation depend on "TASK_INTERRUPTIBLE"  
instead of a zero preempt_count()?  Really what we should do is just  
iterate over all of the actual physical devices and tell each one  
"Block new IO requests preemptibly, finish pending DMA, put the
hardware in low-power mode, and prepare for suspend/hibernate".  As  
long as each driver knows how to do those simple things we can have  
an entirely consistent kernel image for both suspend and for  
hibernation.


When all tasks are preemptible we can very trivially rely on the
drivers to enforce the "Stop new IO submission" with a dirt-simple  
semaphore or waitqueue.  The sleep itself will be  
TASK_UNINTERRUPTIBLE, but it will be done from a preemptible context.


That way the system suspend time is the sum of the suspend times of  
the devices on the system, and the suspend time of any given device  
is the sum of its maximum non-preemptible critical section and the  
time to flush all of its remaining pending DMA/etc.  This is almost  
completely independent of the load-level of the machine, and it does  
not depend on things like NFS filesystems.  The one gotcha is that it  
does not flush dirty filesystem pages to disk first, although that  
could be fixed with a few VFS and blockdev hooks which hierarchically  
flush and "freeze" block devices and filesystems before actually  
disabling devices much the way that device-mapper can pause a device  
to take a snapshot and end up with a clean journal on the filesystem  
afterwards.
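
The timing claim above is simple arithmetic, and a small model may make it easier to follow.  Everything here is hypothetical: struct pm_device, its fields and both functions are invented for the sketch and do not correspond to any kernel interface; times are in microseconds.

```c
#include <stddef.h>

/* Hypothetical per-device model; all times in microseconds. */
struct pm_device {
    const char *name;
    unsigned int max_critical_us;  /* longest non-preemptible section */
    unsigned int dma_flush_us;     /* time to drain pending DMA */
    int accepting_io;              /* 1 until suspend closes the gate */
};

/* Per-device suspend: stop accepting new I/O first, then the cost is
 * the worst-case critical section plus the time to flush pending DMA. */
unsigned int pm_device_suspend(struct pm_device *dev)
{
    dev->accepting_io = 0;
    return dev->max_critical_us + dev->dma_flush_us;
}

/* Whole-system suspend: visit every device and sum the per-device
 * worst cases.  Note that nothing here depends on how many tasks are
 * runnable or on what any filesystem is doing. */
unsigned int pm_suspend_all(struct pm_device *devs, size_t n)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += pm_device_suspend(&devs[i]);
    return total;
}
```

With a disk controller at 50 + 2000 us and a network card at 20 + 300 us, the model gives a bound of 2370 us for the pair, whatever the system load happens to be.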


Cheers,
Kyle Moffett



Re: + smack-version-11c-simplified-mandatory-access-control-kernel.patch added to -mm tree

2007-11-26 Thread Kyle Moffett

On Nov 24, 2007, at 22:36:43, Crispin Cowan wrote:

Kyle Moffett wrote:
Actually, a fully-secured strict-mode SELinux system will have no  
unconfined_t processes; none of my test systems have any.   
Generally "unconfined_t" is used for situations similar to what  
AppArmor was designed for, where the only "interesting" security  
is that of the daemon (which is properly labelled) and one or more  
of the users are unconfined.


Interesting. In a Targeted Policy, you do your policy  
administration from unconfined_t. But how do you administer a  
Strict Policy machine? I can think of 2 ways:


[snip]


* there is some type that is tighter than unconfined_t but none the
  less has sufficient privilege to change policy

To me, this would be semantically equivalent to unconfined_t,  
because any rogue code or user with this type could then fabricate  
unconfined_t and do what they want


Well, in a strict SELinux system, someone who has been permitted the  
"Security Administrator" role (secadm_r) and who has logged in  
through a "login_t" process may modify and reload the policy.  They  
are also permitted to view all files up to their clearance, write  
files below their level, and relabel files.  On the other hand, they  
do not have any system-administration privileges (those are reserved
for sysadm_r).


Under the default policy the security administrator may disable  
SELinux completely, although that too can be adjusted as "load  
policy" is yet another specialized permission.


Cheers,
Kyle Moffett


Re: + smack-version-11c-simplified-mandatory-access-control-kernel.patch added to -mm tree

2007-11-24 Thread Kyle Moffett

On Nov 24, 2007, at 06:39:34, Crispin Cowan wrote:

Andrew Morgan wrote:
It feels to me as if a MAC "override capability" is, if true to  
its name, extra to the MAC model; any MAC model that needs an  
'override' to function seems under-specified... SELinux clearly  
feels no need for one,


That's not quite right. More specifically, it already has one in  
the form of unconfined_t. AppArmor has a similar escape hatch in  
the "Ux" permission. It's not that they don't need one, it is that
they already have one. They get to have one because they allow you  
to actually write a policy that is more nuanced than "process label  
must dominate object label".


Actually, a fully-secured strict-mode SELinux system will have no  
unconfined_t processes; none of my test systems have any.  Generally  
"unconfined_t" is used for situations similar to what AppArmor was  
designed for, where the only "interesting" security is that of the  
daemon (which is properly labelled) and one or more of the users are  
unconfined.


Even then "unconfined_t" is not an implicit part of the policy, it is  
explicitly given the ability to take any action on any object by  
rules in the policy, and it typically still falls under a few MLS  
labeling restrictions even in the targeted policy.


Cheers,
Kyle Moffett



Re: [RFC] Documentation about unaligned memory access

2007-11-22 Thread Kyle Moffett

On Nov 22, 2007, at 20:29:11, Alan Cox wrote:
Most architectures are unable to perform unaligned memory  
accesses. Any unaligned access causes a processor exception.


Not all. Some simply produce the wrong answer - that's oh so much
more exciting.


As one example, the MicroBlaze soft-core processor family designed  
for use on Xilinx FPGAs will (by default) simply forcibly zero the  
lower bits of the unaligned address, such that the following code  
will fail mysteriously:


const char foo[] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };
printf("0x%08lx 0x%08lx 0x%08lx 0x%08lx\n",
       (unsigned long)*((const u32 *)(foo+0)),
       (unsigned long)*((const u32 *)(foo+1)),
       (unsigned long)*((const u32 *)(foo+2)),
       (unsigned long)*((const u32 *)(foo+3)));

Instead of outputting (as a big-endian CPU that handles unaligned loads would):
0x00010203 0x01020304 0x02030405 0x03040506

It will output:
0x00010203 0x00010203 0x00010203 0x00010203

Other embedded architectures have very similar problems.  Some may  
provide an "unaligned data access" exception, but offer insufficient  
information to repair the damage and resume execution.
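
For comparison, here is a sketch of a load that is safe on any of these machines, next to a small emulation of the address-masking behaviour described above.  Both function names are made up for this example; load_be32_masked() assumes the base pointer itself is 4-byte aligned, so that masking the offset is equivalent to masking the address.  In kernel code the get_unaligned() family of helpers serves the same purpose as the first function.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable big-endian 32-bit load.  It only performs byte accesses, so
 * it has no alignment requirement and does not depend on the host's
 * byte order. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Emulates a CPU that silently clears the low two address bits of a
 * word load.  Assumes `base` is 4-byte aligned (an assumption of this
 * sketch), so masking the offset models masking the address. */
uint32_t load_be32_masked(const unsigned char *base, size_t offset)
{
    return load_be32(base + (offset & ~(size_t)3));
}
```

With the eight bytes 0x00 through 0x07, load_be32() at offsets 0 to 3 yields the four distinct values from the first output line, while load_be32_masked() returns 0x00010203 for all four offsets, matching the second.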


Cheers,
Kyle Moffett



Re: Futexes and network filesystems.

2007-11-20 Thread Kyle Moffett

On Nov 20, 2007, at 17:53:52, Eric W. Biederman wrote:
I had a chance to think about this a bit more, and realized that  
the problem is that futexes don't appear to work on network  
filesystems, even if the network filesystems provide coherent  
shared memory.


It seems to me that we need to have a call that gets a unique token
for a process per filesystem for use in futexes
(especially robust futexes).  Say get_fs_task_id(const char *path);


On local filesystems this could just be the pid as we use today,  
but for filesystems that can be accessed from contexts with  
potentially overlapping pid values this could be something else.   
It is an extra syscall in the preparation path, but it should be
hardly more expensive than the current getpid().


Once we have fixed the futex infrastructure to be able to handle  
futexes on network filesystems, the pid namespace case will be  
trivial to implement.


Actually, I would think that get_vm_task_id(void *addr) would be a  
more useful interface.  The call would still be a relatively simple  
lookup to find the struct file associated with the particular virtual  
mapping, but it would be race-free from the perspective of userspace  
and would not require that we somehow figure out the file descriptor  
associated with a particular mmap() (which may be closed by this  
point in time).  Useful extensions would be get_fd_task_id(int fd)
and get_fs_task_id(const char *path), but those are less important.


The other important thing is to ensure that somehow the numbers are  
considered unique only within the particular domain of a container,  
such that you can migrate a container from one system to another even  
using a simple local ext3 filesystem (on a networked block device)  
and still be able to have things work properly even after the  
migration.  Naturally this would only work with an upgraded libc but  
I think that's a reasonable requirement to enforce for migration of  
futexes and cross-network futexes.


Even for network filesystems which don't implement coherent shared  
memory, you might add a memexcl() system call which (when used by  
multiple cooperating processes) ensures that a given page is only  
ever mapped by at most one computer accessing a given network  
filesystem.  The page-outs and page-ins when shuttling that page  
across the network would be expensive, but I believe the cost would  
be reasonable for many applications and it would allow traditional  
atomic ops on the mapped pages to take and release futexes in the  
uncontended case.
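
To illustrate what taking and releasing a futex with plain atomic ops in the uncontended case looks like, here is a rough userspace sketch using C11 atomics. The lock-word layout (0 for unlocked, otherwise the owner's task id) and the names lock_try_fast() and unlock_try_fast() are assumptions for illustration only. A real futex-based lock would also need a waiter indication and a fallback to the futex() syscall whenever these fast paths fail.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Fast path for acquire: swing the word from 0 (unlocked) to our task id.
 * Returns false if somebody else already owns the lock; the caller would
 * then have to block in the kernel.
 */
static bool lock_try_fast(_Atomic uint32_t *word, uint32_t task_id)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong(word, &expected, task_id);
}

/*
 * Fast path for release: swing the word from our task id back to 0.
 * Returns false if the word no longer holds exactly our id (for example
 * because we are not the owner), in which case a slow path is needed.
 */
static bool unlock_try_fast(_Atomic uint32_t *word, uint32_t task_id)
{
	uint32_t expected = task_id;

	return atomic_compare_exchange_strong(word, &expected, 0);
}
```

Both operations touch only the shared page, which is why they only make sense if that page is coherent (or exclusively mapped) across every machine using it.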


Cheers,
Kyle Moffett



Re: High priority tasks break SMP balancer?

2007-11-15 Thread Kyle Moffett
First of all, since Ingo Molnar seems to be one of the head scheduler  
gurus, you might CC him on this.  Also added a couple other useful  
CCs for regression reports.


On Nov 09, 2007, at 19:11:03, Micah Dowty wrote:
As I said, YMMV. I haven't been able to find a single set of  
parameters for the demo program which cause the problem to occur  
100% of the time on all systems.


In general, boosting the MAINTHREAD_PRIORITY even more and  
increasing the WAKE_HZ should exaggerate the problem. These  
parameters reproduce the problem very reliably on my system:


#define NUM_BUSY_THREADS        2
#define MAINTHREAD_PRIORITY   -20
#define MAINTHREAD_WAKE_HZ   1024
#define MAINTHREAD_LOAD_PERCENT 5
#define MAINTHREAD_LOAD_CYCLES  2


Well, from these statistics, if you are requesting wakeups that often
then it is probably *not* correct to try to move another thread to
that CPU in the meantime.  Essentially the migration cost will
likely far outweigh the advantage of letting it run a little bit of
extra time, and in addition it will push the high-priority thread's
data out of the cache.  As per the description, I think that an
increased priority and increased WAKE_HZ will certainly cause the
"problem" to occur more, simply because it reduces the time between
wakeups of the high-priority process and makes it less helpful to
migrate another process over to that CPU during the sleep periods.
This will also depend on your hardware and possibly other
configuration parameters.
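
To put rough numbers on that, here is a small calculation. It assumes MAINTHREAD_WAKE_HZ is the wakeup frequency and MAINTHREAD_LOAD_PERCENT is the share of each period the main thread stays busy; that is my reading of the names, not something checked against the demo program. The function names are made up for this sketch.

```c
/* Length of one wakeup period in microseconds. */
static double wake_period_us(unsigned int wake_hz)
{
	return 1e6 / wake_hz;
}

/*
 * Time per period during which the high-priority thread sleeps and the
 * CPU could, in principle, run a migrated thread.
 */
static double idle_window_us(unsigned int wake_hz, unsigned int load_percent)
{
	return wake_period_us(wake_hz) * (100u - load_percent) / 100.0;
}
```

For 1024 Hz and 5% load this gives a period of roughly 977 us, of which roughly 928 us is idle. Whatever is migrated in has to pay for the migration and the cache refill inside that window, and then gets preempted again.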


I'm not really that much of an expert in this particular area,  
though, so it's entirely possible that one of the above-mentioned  
scheduler head-honchos will poke holes in my argument and give a  
better explanation or a possible patch.


Cheers,
Kyle Moffett



[PATCH] Fix isspace() and other ctype.h functions to ignore chars 128-255

2007-11-07 Thread Kyle Moffett
Originally isspace() and other similar functions in ctype.h ignored  
any character with the high bit set; however this was changed during  
the Linux 2.1 days to map Latin-1.  As following Latin-1 will most
likely break UTF-8 and any *other* encoding that is
backwards-compatible with 7-bit-ASCII, change ctype.c to ignore such characters
completely (the way they were before).  Linus seems to think this is  
a good thing, and he's the one that wrote the code in the first place.


Signed-off-by: Kyle Moffett <[EMAIL PROTECTED]>

---

On Nov 06, 2007, at 10:53:08, Linus Torvalds wrote:

On Tue, 6 Nov 2007, Kyle Moffett wrote:

Personally I think that isspace() accepting character 0xA0 is a bug


I think I agree with you. As far as the kernel is concerned,  
"isspace()" should just accept the obvious spaces (hardspace, tab,  
newline), and *perhaps* the VT/FF kind of things.


You should realize that the kernel <ctype.h> thing is *ancient*.
It's basically there from v0.01, and while the really original one  
(I just checked) had all the non-ascii characters not trigger  
anything, it was converted to be latin1 in the 2.1.x timeframe.


That's a *loong* time ago. Way before UTF-8 and other things were  
really common.


So we should probably just make all the upper 128 bytes go back to  
"don't trigger anything in ctype.h" - they'd not be spaces, but  
they'd not be control characters or anything else either.


 lib/ctype.c |   17 +++--
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/ctype.c b/lib/ctype.c
index d02ace1..ce2807a 100644
--- a/lib/ctype.c
+++ b/lib/ctype.c
@@ -24,13 +24,18 @@ _P,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L|_X,_L,  /* 96-103 */
 _L,_L,_L,_L,_L,_L,_L,_L,   /* 104-111 */
 _L,_L,_L,_L,_L,_L,_L,_L,   /* 112-119 */
 _L,_L,_L,_P,_P,_P,_P,_C,   /* 120-127 */
+
+/*
+ * None of these match any type bits to avoid screwing up UTF-8 or any other
+ * 7-bit-ASCII-compatible encoding.
+ */
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 128-143 */
 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 144-159 */
-_S|_SP,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,   /* 160-175 */
-_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,_P,   /* 176-191 */
-_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,_U,   /* 192-207 */
-_U,_U,_U,_U,_U,_U,_U,_P,_U,_U,_U,_U,_U,_U,_U,_L,   /* 208-223 */
-_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,_L,   /* 224-239 */
-_L,_L,_L,_L,_L,_L,_L,_P,_L,_L,_L,_L,_L,_L,_L,_L};  /* 240-255 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 160-175 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 176-191 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 192-207 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 208-223 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,   /* 224-239 */
+0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};  /* 240-255 */
 
 EXPORT_SYMBOL(_ctype);
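
As an illustration of the intended behaviour (this is not part of the patch), a whitespace test that can only ever match 7-bit ASCII might look like the sketch below. The name ascii_isspace() is hypothetical. The accepted set, space plus the control characters 9 through 13, mirrors the entries carrying _S in the lower half of the table.

```c
#include <stdbool.h>

/*
 * True only for ASCII whitespace: space, \t, \n, \v, \f and \r.
 * Every byte with the high bit set (128-255) is rejected, so bytes that
 * belong to a multi-byte UTF-8 sequence can never be taken for a space.
 */
static bool ascii_isspace(unsigned char c)
{
	return c == ' ' || (c >= '\t' && c <= '\r');
}
```

In particular ascii_isspace(0xA0) is false, whereas the Latin-1 table treated 0xA0 as a space.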




Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser

2007-11-06 Thread Kyle Moffett

On Nov 06, 2007, at 07:23:36, Ahmed S. Darwish wrote:

On 11/6/07, Adrian Bunk <[EMAIL PROTECTED]> wrote:

On Tue, Nov 06, 2007 at 01:34:05PM +0200, Ahmed S. Darwish wrote:
As far as I understand the problem now, isspace() accepts the  
0xa0 character which might collide with some of UTF-8 encoded  
characters cause the high bit is set.


I admit I'm not experienced in such encoding stuff, but shouldn't  
the ASCII and the ASCII-compatible UTF-8 encodings be enough for  
the labels?


It would not work if someone would e.g. give you UTF-16 encoded  
strings, but I don't see this happening in practice.


Won't this complicate the code too much ?


Well the VFS (for example) certainly doesn't support any encodings  
other than various extended-ASCII forms (which includes UTF-8).   
Something like UTF-16 has extra null characters in-between every  
normal character, and as such would fail completely if passed to the  
VFS.


Personally I think that isspace() accepting character 0xA0 is a bug,  
as there are several variants of extended ASCII only one of which has  
that character as a space.  Others have it as á (accented A), etc.   
In addition the "canonical" internal text format of the kernel is  
UTF-8 as that encoding can represent any character in any other  
encoding and it is backwards-compatible with traditional ASCII.


Cheers,
Kyle Moffett


Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser

2007-11-06 Thread Kyle Moffett

On Nov 06, 2007, at 01:33:05, Adrian Bunk wrote:

Can you limit this to 7bit ASCII and use isascii() somewhere?

Otherwise I'd expect funny things to happen when you e.g. use  
isspace() on the UTF-8 encoded character à.


Actually, you don't need to.  You document that it expects UTF-8
encoded strings and are done with it.  All US-ASCII characters from 0
through 127 (i.e. high bit clear) are exactly the same in UTF-8, and
every byte of a multi-byte UTF-8 character has the high bit set.
Therefore you just assume that anything with the high bit set is part
of a word and you can handle basic UTF-8.  (It doesn't work on special
UTF-8 space characters like nonbreaking space and similar, but
handling those is significantly more complicated).
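
A minimal sketch of that rule, assuming the parser only needs to split on ASCII whitespace: count_words() is a made-up helper, not anything from the Smack patch, and it treats every byte with the high bit set as part of the current word.

```c
#include <stddef.h>

/*
 * Count whitespace-separated words in a UTF-8 string.  Only ASCII
 * whitespace (space and the 9-13 control range) separates words; bytes
 * 128-255 are always word characters, so multi-byte sequences such as
 * 0xC3 0xA0 are never split in the middle.
 */
static size_t count_words(const char *s)
{
	size_t words = 0;
	int in_word = 0;

	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;
		int is_space = (c == ' ' || (c >= '\t' && c <= '\r'));

		if (is_space) {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			words++;
		}
	}
	return words;
}
```

With a Latin-1 isspace() the byte pair 0xC3 0xA0 followed by "b" would be seen as two words, because 0xA0 counts as a space there; with the rule above it is one word.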


Cheers,
Kyle Moffett



Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser

2007-11-03 Thread Kyle Moffett

On Nov 03, 2007, at 12:43:06, Ahmed S. Darwish wrote:
Bashv3 builtin "echo" behaves very strangely on -EINVAL. It re-sends
all the buffers that caused -EINVAL in subsequent echo
invocations.


i.e.
echo "Invalid Rule" > /smack/load  # -EINVAL returned
echo "Valid Rule" > /smack/load

In the second iteration, echo sends the first invalid buffer again
then sends the new one. This causes an "Invalid Rule\nValid Rule"
buffer to be sent to write().


IMHO, this is a bug in builtin echo. The external /bin/echo doesn't  
cause such strange behaviour.


Actually, what causes problems here is something between a bug and a  
feature in libc's buffering.  Basically the -EINVAL error causes libc  
to leave its data in the file-output buffer despite the file being  
closed and reopened.  Since a standalone echo just exits, that buffer
is discarded, but for the bash builtin it hangs around in the buffer
for a while and ends up getting prepended to the following echo
statement.  There are actually multiple ways to make this fail; this is
just the simplest.
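
One way around this from C is to skip stdio entirely, so there is no userspace buffer that can go stale. The sketch below assumes the rule file takes one rule per write() call. write_rule() is a hypothetical helper, not part of Smack.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Send one rule with a single unbuffered write().
 * Returns 0 on success, -1 on failure with errno set (by write() itself,
 * or to EIO if the kernel accepted only part of the rule).
 * Nothing is retained between calls, so a rejected rule cannot leak into
 * the next one.
 */
static int write_rule(int fd, const char *rule)
{
	size_t len = strlen(rule);
	ssize_t n = write(fd, rule, len);

	if (n < 0)
		return -1;
	if ((size_t)n != len) {
		errno = EIO;
		return -1;
	}
	return 0;
}
```

A caller would open the file once, call write_rule() per rule, and check each return value; after an -EINVAL the next rule still goes out on its own.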


Cheers,
Kyle Moffett



Re: Linux Security *Module* Framework (Was: LSM conversion to static interface)

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 17:37:04, Serge E. Hallyn wrote:
The scariest thing to consider is programs which don't  
appropriately handle failure.  So I don't know, maybe the system  
runs a remote logger to which the multiadm policy gives some extra  
privs, but now the portac module prevents it from sending its  
data.  And maybe, since the authors never saw this failure as  
possible, the program happens to dump sensitive data in a public  
readable place.  I *could* be more vague but it'd be tough :)  But  
you get the idea.


Well, there *was* that problem with sendmail where it did not  
properly check the result of setuid() and just assumed it had  
succeeded.  So instead of running as "smtpd" it was running as  
"root".   Not a happy memory.
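
The defensive pattern that avoids this class of bug is short. Here is a sketch using only the POSIX setuid(), getuid() and geteuid() calls; drop_privileges() is a name chosen for this example, and a real daemon would also deal with groups and supplementary groups first.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Switch to the given uid and verify that the switch really happened.
 * Returns 0 on success and -1 on any failure; the caller must treat -1
 * as fatal instead of carrying on with its old privileges.
 */
static int drop_privileges(uid_t uid)
{
	if (setuid(uid) != 0)
		return -1;

	/* Paranoia: do not trust the return value alone. */
	if (getuid() != uid || geteuid() != uid)
		return -1;

	return 0;
}
```

The important part is what the caller does with -1: exit, rather than assume the call cannot fail.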


Cheers,
Kyle Moffett



Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 17:21:10, Matthew Wilcox wrote:

On Wed, Oct 24, 2007 at 04:59:48PM -0400, Kyle Moffett wrote:
This seems unlikely to work reliably as the various "v*printf"  
functions modify the va_list argument they are passed.  It may  
happen to work on your particular architecture depending on how  
that argument data is passed and stored, but you probably actually  
want to make a copy of the varargs list for the first vsnprintf call.


I based what I did on how printk works:

va_start(args, fmt);
r = vprintk(fmt, args);
va_end(args);

It doesn't call va_* anywhere else.  I don't claim to be a varargs  
expert, but if I'm wrong, I'm at least wrong the same way that  
printk is, so not in any way that's significant for any other  
architecture Linux runs on.


No, the problem is what happens when you don't have enough space  
allocated:  you call "vsnprintf(s, len, format, args);" and then  
later call "vsprintf(s, format, args);" with the *SAME* "args".   
That's what's broken.


So this is wrong:

va_list args;
va_start(args, fmt);
r1 = vprintk(fmt, args);
r2 = vprintk(fmt, args);
va_end(args);


To fix it, you have 2 options.

Option 1:

va_list args;
va_start(args, fmt);
r1 = vprintk(fmt, args);
va_end(args);
va_start(args, fmt);
r2 = vprintk(fmt, args);
va_end(args);


Option 2:

va_list args, argscopy;
va_start(args, fmt);
va_copy(argscopy, args);
r1 = vprintk(fmt, argscopy);
va_end(argscopy);
r2 = vprintk(fmt, args);
va_end(args);


Now in a function which *receives* a va_list from one of its callers,  
"Option 1" isn't an option because you don't have the original stack  
frame, so the result looks like this:



void func1(const char *fmt, ...)
{
	va_list ap;
	va_start(ap, fmt);
	func2(fmt, ap);
	va_end(ap);
}

void func2(const char *fmt, va_list ap)
{
	va_list ap2;
	va_copy(ap2, ap);
	vprintk(fmt, ap2);
	va_end(ap2);
	vprintk(fmt, ap);
}
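
For anyone who wants to try the same pattern outside the kernel, here is a userspace sketch: measure with a copy of the va_list, then format with the original. format_alloc_v() and format_alloc() are names invented for this example. They rely only on standard C99 behaviour, namely va_copy() and vsnprintf() returning the required length when given a zero-sized buffer.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Format into a freshly allocated buffer.  The va_list is consumed twice,
 * so the first pass works on a copy and the original stays valid for the
 * second pass.  Returns NULL on formatting or allocation failure; the
 * caller frees the result.
 */
static char *format_alloc_v(const char *fmt, va_list ap)
{
	va_list ap2;
	char *buf;
	int len;

	/* Pass 1: find out how much space is needed, using the copy. */
	va_copy(ap2, ap);
	len = vsnprintf(NULL, 0, fmt, ap2);
	va_end(ap2);
	if (len < 0)
		return NULL;

	buf = malloc((size_t)len + 1);
	if (!buf)
		return NULL;

	/* Pass 2: format for real, using the caller's va_list. */
	vsnprintf(buf, (size_t)len + 1, fmt, ap);
	return buf;
}

static char *format_alloc(const char *fmt, ...)
{
	va_list ap;
	char *s;

	va_start(ap, fmt);
	s = format_alloc_v(fmt, ap);
	va_end(ap);
	return s;
}
```

Dropping the va_copy() and passing ap to both vsnprintf() calls is exactly the bug described above: it may appear to work where va_list is a plain pointer and break where it is an array type.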


Cheers,
Kyle Moffett



Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 15:59:49, Matthew Wilcox wrote:
+static void sb_vprintf(struct stringbuf *sb, gfp_t gfp, const char *format, va_list args)
+{

[...]

+	s = sb->buf + sb->len;
+	size = vsnprintf(s, sb->alloc - sb->len, format, args);

[...]

+	/* Point to the end of the old string since we already updated ->len */
+	s += sb->len - size;
+	vsprintf(s, format, args);

[...]

+void sb_printf(struct stringbuf *sb, gfp_t gfp, const char *format, ...)
+{
+	va_list args;
+
+	va_start(args, format);
+	sb_vprintf(sb, gfp, format, args);
+	va_end(args);
+}


This seems unlikely to work reliably as the various "v*printf"  
functions modify the va_list argument they are passed.  It may happen  
to work on your particular architecture depending on how that  
argument data is passed and stored, but you probably actually want to  
make a copy of the varargs list for the first vsnprintf call.   
Example below:



va_list argscopy;
va_copy(argscopy, args);

[...]

size = vsnprintf(s, sb->alloc - sb->len, format, argscopy);

[...]

s += sb->len - size;
vsprintf(s, format, args);

[...]

va_end(argscopy);


Cheers,
Kyle Moffett


Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 17:21:10, Matthew Wilcox wrote:

On Wed, Oct 24, 2007 at 04:59:48PM -0400, Kyle Moffett wrote:
This seems unlikely to work reliably as the various v*printf  
functions modify the va_list argument they are passed.  It may  
happen to work on your particular architecture depending on how  
that argument data is passed and stored, but you probably actually  
want to make a copy of the varargs list for the first vsnprintf call.


I based what I did on how printk works:

va_start(args, fmt);
r = vprintk(fmt, args);
va_end(args);

It doesn't call va_* anywhere else.  I don't claim to be a varargs  
expert, but if I'm wrong, I'm at least wrong the same way that  
printk is, so not in any way that's significant for any other  
architecture Linux runs on.


No, the problem is what happens when you don't have enough space  
allocated:  you call vsnprintf(s, len, format, args); and then  
later call vsprintf(s, format, args); with the *SAME* args.   
That's what's broken.


So this is wrong:

va_list args;
va_start(args, fmt);
r1 = vprintk(fmt, args);
r2 = vprintk(fmt, args);
va_end(args);


To fix it, you have 2 options.

Option 1:

va_list args;
va_start(args, fmt);
r1 = vprintk(fmt, args);
va_end(args);
va_start(args, fmt);
r2 = vprintk(fmt, args);
va_end(args);


Option 2:

va_list args, argscopy;
va_start(args, fmt);
va_copy(argscopy, args);
r1 = vprintk(fmt, argscopy);
va_end(argscopy);
r2 = vprintk(fmt, args);
va_end(args);


Now in a function which *receives* a va_list from one of its callers,  
Option 1 isn't an option because you don't have the original stack  
frame, so the result looks like this:



void func1(const char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
func2(fmt, ap);
va_end(ap);
}

void func2(const char *fmt, va_list ap)
{
va_list ap2;
va_copy(ap2, ap);
vprintk(fmt, ap2);
va_end(ap2);
vprintk(fmt, ap);
}


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux Security *Module* Framework (Was: LSM conversion to static interface)

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 17:37:04, Serge E. Hallyn wrote:
The scariest thing to consider is programs which don't  
appropriately handle failure.  So I don't know, maybe the system  
runs a remote logger to which the multiadm policy gives some extra  
privs, but now the portac module prevents it from sending its  
data.  And maybe, since the authors never saw this failure as  
possible, the program happens to dump sensitive data in a public  
readable place.  I *could* be more vague but it'd be tough :)  But  
you get the idea.


Well, there *was* that problem with sendmail where it did not  
properly check the result of setuid() and just assumed it had  
succeeded.  So instead of running as smtpd it was running as  
root.   Not a happy memory.


Cheers,
Kyle Moffett



Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-24 Thread Kyle Moffett

On Oct 24, 2007, at 15:59:49, Matthew Wilcox wrote:
> +static void sb_vprintf(struct stringbuf *sb, gfp_t gfp, const char *format, va_list args)
> +{

[...]

> +   s = sb->buf + sb->len;
> +   size = vsnprintf(s, sb->alloc - sb->len, format, args);

[...]

> +   /* Point to the end of the old string since we already updated ->len */
> +   s += sb->len - size;
> +   vsprintf(s, format, args);

[...]

> +void sb_printf(struct stringbuf *sb, gfp_t gfp, const char *format, ...)
> +{
> +   va_list args;
> +
> +   va_start(args, format);
> +   sb_vprintf(sb, gfp, format, args);
> +   va_end(args);
> +}


This seems unlikely to work reliably as the various v*printf  
functions modify the va_list argument they are passed.  It may happen  
to work on your particular architecture depending on how that  
argument data is passed and stored, but you probably actually want to  
make a copy of the varargs list for the first vsnprintf call.   
Example below:



va_list argscopy;
va_copy(argscopy, args);

[...]

size = vsnprintf(s, sb->alloc - sb->len, format, argscopy);

[...]

s += sb->len - size;
vsprintf(s, format, args);

[...]

va_end(argscopy);
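
Here is a self-contained userspace sketch of the same two-pass idea, in case it helps to see the whole thing in one place.  It is not the code from the patch: struct strbuf, strbuf_vprintf() and strbuf_printf() are hypothetical names, realloc() stands in for the kernel allocator, and it measures first with vsnprintf(NULL, 0, ...) instead of trying the existing space first:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable string buffer, loosely modelled on the patch. */
struct strbuf {
	char *buf;
	size_t len;	/* bytes used, not counting the trailing NUL */
	size_t alloc;	/* bytes allocated */
};

static int strbuf_vprintf(struct strbuf *sb, const char *format, va_list args)
{
	va_list argscopy;
	int size;

	assert(sb != NULL && format != NULL);

	/* Pass 1: measure using a copy, leaving "args" untouched. */
	va_copy(argscopy, args);
	size = vsnprintf(NULL, 0, format, argscopy);
	va_end(argscopy);
	if (size < 0)
		return -1;

	/* Grow the buffer if the new text plus NUL does not fit. */
	if (sb->len + (size_t)size + 1 > sb->alloc) {
		size_t want = sb->len + (size_t)size + 1;
		char *p = realloc(sb->buf, want);

		if (!p)
			return -1;
		sb->buf = p;
		sb->alloc = want;
	}

	/* Pass 2: format for real with the original, still-unused list. */
	vsnprintf(sb->buf + sb->len, sb->alloc - sb->len, format, args);
	sb->len += (size_t)size;
	return size;
}

static int strbuf_printf(struct strbuf *sb, const char *format, ...)
{
	va_list args;
	int r;

	va_start(args, format);
	r = strbuf_vprintf(sb, format, args);
	va_end(args);
	return r;
}
```

Whichever pass gets the copy does not matter much; what matters is that no va_list is walked twice.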


Cheers,
Kyle Moffett


Re: [PATCH] Reserve N process to root

2007-10-11 Thread Kyle Moffett

On Oct 12, 2007, at 01:37:23, Al Boldi wrote:
> Kyle Moffett wrote:
> > This isn't really necessary any more with the new CFS scheduler.
> > If you want to prevent excess memory usage then you limit memory
> > usage, not process count, so just set the system max process count
> > to something absurdly high and leave the user counts down at the
> > maximum a user might run.  Then as long as the sum of the user
> > processes is less than the max number of processes (which you just
> > set absurdly high or unlimited), you may still log in.  With the
> > per-user scheduling enabled CFS allows you to run an
> > optimistically-real-time game as one user and several thousand
> > busy-loops as another user and get almost picture perfect 50% CPU
> > distribution between the users. To me that seems a much better
> > DoS-prevention system than limits which don't scale based on how
> > many people are requesting resources.
>
> You have a point, and resource-controllers can probably control DoS
> a lot better, but they also incur more overhead.  Think of this
> "lockout prevention" patch as a near zero overhead safety valve.


But why do you need to add "lockout prevention" if it already
exists?  With CFS' extremely efficient per-user-scheduling (hopefully
soon to be the default) there are only two forms of lockout by
non-root processes:  (1) Running out of PIDs in the box's PID-space
(think tens or hundreds of thousands of processes), or (2)
Swap-storming the box to death.  To put it bluntly, trying to reserve
free PID slots is attacking the wrong end of the problem, and your
so-called "lockout prevention" could very easily ensure that 10 PIDs
are available even if the user has swapstormed the box with the PIDs
he does have.


Cheers,
Kyle Moffett


Re: [PATCH] Reserve N process to root

2007-10-11 Thread Kyle Moffett

Please don't trim CC lists

On Oct 11, 2007, at 17:02:37, Al Boldi wrote:
> David Newall wrote:
> > [EMAIL PROTECTED] wrote:
> > > What David meant was that "root will always have a slot" doesn't
> > > *actually* help unless you *also* have a way to actually *spawn*
> > > such a process.  In order to do the ps, kill, and so on that you
> > > need to recover, you need to already have either a root shell
> > > available, or a way to *get* a root shell that doesn't rely on a
> > > non-root process (so /bin/su doesn't help here).
> >
> > That's right, although it's worse than that.  You need to have a
> > process with CAP_SYS_ADMIN.  If root processes normally have that
> > capability then the reserved slots may well disappear before you
> > notice a problem.  If root processes normally don't have it, then
> > you need to guarantee that one is already running.
>
> I once posted a patch to handle this DoS, but, as usual, it wasn't
> accepted.  Go figure...


This isn't really necessary any more with the new CFS scheduler.  If  
you want to prevent excess memory usage then you limit memory usage,  
not process count, so just set the system max process count to  
something absurdly high and leave the user counts down at the maximum  
a user might run.  Then as long as the sum of the user processes is  
less than the max number of processes (which you just set absurdly  
high or unlimited), you may still log in.  With the per-user  
scheduling enabled CFS allows you to run an optimistically-real-time  
game as one user and several thousand busy-loops as another user and  
get almost picture perfect 50% CPU distribution between the users.   
To me that seems a much better DoS-prevention system than limits  
which don't scale based on how many people are requesting resources.
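
To make "leave the user counts down at the maximum a user might run" a bit more concrete, here is a small sketch of one userspace knob for it.  It assumes the per-user process limit is applied through setrlimit(RLIMIT_NPROC) by whatever sets up the login session; the helper name cap_soft_limit() is hypothetical:

```c
#include <sys/resource.h>

/* Hypothetical helper: set the soft limit for "resource" to "want",
 * clamped so that it never exceeds the current hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the syscall). */
static int cap_soft_limit(int resource, rlim_t want)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) != 0)
		return -1;

	rl.rlim_cur = (want < rl.rlim_max) ? want : rl.rlim_max;
	return setrlimit(resource, &rl);
}
```

Something like cap_soft_limit(RLIMIT_NPROC, 200) in a login wrapper would keep one user well below any sane system-wide maximum; the 200 is an arbitrary number for the example.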


Cheers,
Kyle Moffett



Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-11 Thread Kyle Moffett

On Oct 11, 2007, at 11:41:34, Casey Schaufler wrote:

--- Kyle Moffett <[EMAIL PROTECTED]> wrote:

[snipped]


I'm still waiting to see the proposed SELinux policy that does what  
Smack does.


That *is* the SELinux policy which does what Smack does.  I keep  
having bugs in the perl-script I'm writing on account of not having  
the time to really get around to fixing it, but that is exactly the  
procedure for generating an SELinux policy from a SMACK policy.


I can accept that you don't see anything that can't be implemented  
thus, but that's not the point. You've provided some really clear  
design notes, and that's great, but it ain't the code. You said  
that you could write a 500 line perl script that would do the whole  
thing, and that left some people with an impression that Smack is a  
subset of SELinux.  Well, I'm already finding myself digging out  
from under that misunderstanding, and with people who are assuming  
that your policy has been done, "proving" the point.


I'd love to have time to finish the script but unfortunately real  
life keeps interfering and I'm going to have to go back to lurking on  
this thread.


Cheers,
Kyle Moffett



Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-11 Thread Kyle Moffett
Ok, finally getting some time to work on this stuff once again (life  
gets really crazy sometimes).  I would like to postulate that you can  
restate any SMACK policy as a functionally equivalent SELinux policy  
(with a few slight technical differences, see below).  I've been  
working on a script to do this but keep getting stuck tracking down  
minor bugs and then get dragged off on other things I need to do.   
Here is the method I am presently trying to implement:


First divide the SELinux access vectors into 7 groups based on which  
ones SMACK wishes to influence:

(R) Requires "read" permissions (the 'r' bit)
(W) Requires "write" permissions (the 'w' bit)
(X) Requires "execute" permissions (the 'x' bit)
(A) Requires "append" OR "write" permissions (the 'a' bit)
(P) Requires CAP_MAC_OVERRIDE
(K) May not be performed by a non-CAP_MAC_OVERRIDE process on a
    CAP_MAC_OVERRIDE process
(N) Does not require any special permissions

The letters in front indicate the names I will use in the rest of  
this document to describe the sets of access vectors.


Next define a single SELinux user "smack", and two independent roles,
"priv" and "nopriv".  We create the set of SMACK equivalence-classes
defined as various SELinux types with substitutions for "*", "^",
"_", and "?", and then completely omit the MLS portions of the
SELinux policy.


The next step is to establish the fundamental constraints of the  
policy.  To prevent processes from gaining CAP_MAC_OVERRIDE we  
iterate over the access vectors in (K) and add the following  
constraint for each vector:

constrain $OBJECT_CLASS $ACCESS_VECTOR ((r1 == r2) || (r1 == priv))

This also includes:
constrain process transition ((r1 == r2) || (r1 == priv))

Then we require privilege to access the (P) vectors; for each vector  
in (P) we add a constraint:

constrain $OBJECT_CLASS $ACCESS_VECTOR (r1 == priv)

At this point the only rules left to add are the between-type rules.   
Here it gets mildly complicated because SMACK is a linear-lookup  
system (each rule must be matched in order) whereas SELinux is a  
globally-unique-lookup system (all rules are mutually exclusive and  
matched simultaneously).  Essentially for each SMACK rule:

$SOURCE $DEST $PERM_BITS

We iterate over all of the classes represented in the access vector  
lists in $PERM_BITS and create rules for each one:

allow { $SOURCE } { $DEST }:$PERM_CLASS { $PERM_VECTORS };

If you need SMACK to allow subtractive permissions then you need to
expand that further; however, I believe that as an initial cut it is
sufficient.
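
To give a feel for the per-rule translation step, here is a tiny sketch in C (the real script is perl and unfinished, so this is not it).  It assumes a single object class, "file", and a hypothetical mapping of the 'r', 'w', 'x' and 'a' bits onto the read, write, append and execute permissions; the function name smack_rule_to_allow() is made up:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical translation of one SMACK rule "$SOURCE $DEST $PERM_BITS"
 * into one SELinux allow rule for the "file" class only.
 * Returns the rule length, or 0 if the bits grant nothing. */
static int smack_rule_to_allow(char *out, size_t n, const char *src,
			       const char *dst, const char *bits)
{
	char vectors[64] = "";

	if (strchr(bits, 'r'))
		strcat(vectors, " read");
	/* Group (A) is satisfied by either 'a' or 'w', so 'w' implies append. */
	if (strchr(bits, 'w'))
		strcat(vectors, " write append");
	else if (strchr(bits, 'a'))
		strcat(vectors, " append");
	if (strchr(bits, 'x'))
		strcat(vectors, " execute");

	if (vectors[0] == '\0')
		return 0;

	return snprintf(out, n, "allow { %s } { %s }:file {%s };",
			src, dst, vectors);
}
```

A full version would loop over every object class that has vectors in the (R), (W), (X) and (A) groups instead of hard-coding one.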


The only other task is to prepend the auto-generated object-class and  
access-vector lists to the policy and append the initial SIDs that  
smack wants various objects to have, as well as allowing the "smack"  
user the "priv" and "nopriv" roles and allowing those two roles entry  
into all of the SMACK types.  The resulting SELinux-ified SMACK  
labels would go from:


SomeLabel (with CAP_MAC_OVERRIDE)
AnotherLabel
YetAnotherLabel

to:

smack:priv:SomeLabel
smack:nopriv:AnotherLabel
smack:nopriv:YetAnotherLabel
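
The label rewrite itself is trivial; as a sketch, with a hypothetical helper name and the CAP_MAC_OVERRIDE state passed in as a plain flag:

```c
#include <stdio.h>

/* Hypothetical helper: build the "smack:<role>:<label>" context string.
 * "has_mac_override" says whether the subject holds CAP_MAC_OVERRIDE.
 * Returns the length snprintf() reports. */
static int smack_label_to_context(char *out, size_t n, const char *label,
				  int has_mac_override)
{
	return snprintf(out, n, "smack:%s:%s",
			has_mac_override ? "priv" : "nopriv", label);
}
```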


Casey, hopefully this gives you some ideas about how I think you  
could modify the SELinux code to compile out the "user" field and  
simplify the "role" field as needed.  I'm still not seeing anything  
which SELinux cannot directly implement without additional code, even  
the "CAP_MAC_OVERRIDE" bit.  If the semantics don't seem quite right,  
please provide details about how you think the models differ and I  
will try to address the concerns.


Cheers,
Kyle Moffett



Re: "mount --bind" with user/group/mode definition?

2007-10-11 Thread Kyle Moffett

On Oct 11, 2007, at 04:35:37, Ph. Marek wrote:
> is there some way to duplicate a directory somewhere else (like
> with "mount --bind"), but having different owner/group/mode bits?
>
> I'd like to mount a directory I have no control over (think NFS, or
> floppy, ...) with clearly defined rights - like root:<some group>,
> mode 0550 for all directories, and 0440 for all files. (Here I want
> to have full *read* control, regardless of the original permissions).
> [ I know that this special case can be (mostly) done by a read-only
> binding mount; the part that is missing is eg. files with a
> different owner being 0700. ]
>
> I know that something like this is possible for eg. VFAT, which has
> no right descriptors for itself; but I'd need that for arbitrary
> directory trees, who themselves *have* permissions set.
>
> Is there some way to achieve that?


Not at the moment, unfortunately.  I suspect that with the recent  
developments in user container support and/or overlay mounting it  
will become possible to either write a UID/GID-translation overlay  
filesystem or grant cross-UID-container keys to achieve what you  
want.  On the other hand that probably won't fully happen for up to a  
year or so.


Cheers,
Kyle Moffett


Re: idio{,ma}tic typos (was Re: + fix-vm_can_nonlinear-check-in-sys_remap_file_pages.patch added to -mm tree)

2007-10-11 Thread Kyle Moffett

On Oct 11, 2007, at 03:35:37, Alexey Dobriyan wrote:
> Sadly, yes.
>
> [PATCH] smctr: fix "|| 0x" typo
>
> IBM_PASS_SOURCE_ADDR is 1, so logically ORing it with status bits is
> pretty useless. Do bitwise OR, instead.
>
> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
> ---
>
>  drivers/net/tokenring/smctr.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/drivers/net/tokenring/smctr.c
> +++ b/drivers/net/tokenring/smctr.c
> @@ -3413,7 +3413,7 @@ static int smctr_make_tx_status_code(struct net_device *dev,
>  tsv->svi = TRANSMIT_STATUS_CODE;
>  tsv->svl = S_TRANSMIT_STATUS_CODE;
>
> -tsv->svv[0] = ((tx_fstatus & 0x0100 >> 6) || IBM_PASS_SOURCE_ADDR);
> +tsv->svv[0] = ((tx_fstatus & 0x0100 >> 6) | IBM_PASS_SOURCE_ADDR);
>
>  /* Stripped frame status of Transmitted Frame */
>  tsv->svv[1] = tx_fstatus & 0xff;


Hmm, here's a question for you:  The old code was equivalent to
"tsv->svv[0] = 1;", what's your proof that we don't rely on this "bug"
elsewhere in the code?  In other words, this is a significant
behavior change (albeit fixing an apparent bug) from what we've done
for a while.  You might want to do a git-blame on this bit of code to
see who the last person to modify it was and ask them to test or
confirm the patch first.  The same general questions apply to the
other logical-op bugs.
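
For reference, the difference between the two expressions is easy to check in isolation.  The sketch below copies both expressions from the patch into hypothetical helper functions and takes IBM_PASS_SOURCE_ADDR to be 1, as the patch description says:

```c
/* Value as stated in the patch description. */
#define IBM_PASS_SOURCE_ADDR 1

/* Old expression: logical OR with a nonzero constant is always 1. */
static unsigned int old_svv0(unsigned int tx_fstatus)
{
	return ((tx_fstatus & 0x0100 >> 6) || IBM_PASS_SOURCE_ADDR);
}

/* New expression: bitwise OR keeps whatever bit the mask selects.
 * Note that ">>" binds tighter than "&", so the mask here is
 * (0x0100 >> 6) == 0x4 in both versions. */
static unsigned int new_svv0(unsigned int tx_fstatus)
{
	return ((tx_fstatus & 0x0100 >> 6) | IBM_PASS_SOURCE_ADDR);
}
```

So the two versions only differ when bit 2 of tx_fstatus is set, which is exactly the case any code relying on the old constant 1 would need to be tested against.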


Cheers,
Kyle Moffett



Re: [PATCH] Replace __attribute_pure__ with __pure

2007-10-06 Thread Kyle Moffett

Trimmed the CC list a bit

On Oct 05, 2007, at 20:51:21, H. Peter Anvin wrote:
> Ralf Baechle wrote:
> > To be consistent with the use of attributes in the rest of the
> > kernel replace all use of __attribute_pure__ with __pure and
> > delete the definition of __attribute_pure__.
>
> Concern: __attribute_pure__ is very similar to __attribute_const__,
> which is almost completely, but not totally unlike the keyword
> "const"...


Yes, there's also the fact that __pure is a reserved GCC keyword.   
Essentially according to GCC docs all of the GCC-specific keywords  
are equivalently defined as "keyword", "__keyword", and  
"__keyword__", with only the latter two defined in strict-ANSI mode.   
The following is valid according to GCC docs:


static int __attribute__((__pure)) my_strlen(const char *str);

With the proposed definition of __pure, that becomes a noticeably
invalid __attribute__((__attribute__((__pure__)))).
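
As a small compilable illustration (GCC or Clang only), here is the attribute applied directly using the documented double-underscore spelling "__pure__".  The macro name MY_PURE is hypothetical and chosen so that it cannot collide with anything an attribute list might contain:

```c
#include <stddef.h>

/* Hypothetical wrapper macro with a name no attribute list would use. */
#define MY_PURE __attribute__((__pure__))

/* "pure": no side effects, result depends only on arguments and memory. */
static size_t my_strlen(const char *str) MY_PURE;

static size_t my_strlen(const char *str)
{
	size_t n = 0;

	while (str[n] != '\0')
		n++;
	return n;
}
```

If the macro were instead named after something that is itself legal inside __attribute__((...)), the preprocessor would expand it there too, which is the nesting problem described above.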



Cheers,
Kyle Moffett



Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel

2007-10-04 Thread Kyle Moffett

On Oct 05, 2007, at 00:45:17, Eric W. Biederman wrote:

Kyle Moffett <[EMAIL PROTECTED]> writes:


On Oct 04, 2007, at 21:44:02, Eric W. Biederman wrote:
SElinux is not all encompassing or it is generally  
incomprehensible I don't know which.  Or someone long ago would  
have said a better  way to implement containers was with a  
selinux ruleset, here is a  selinux ruleset that does that.   
Although it is completely possible  to implement all of the  
isolation with the existing LSM hooks as  Serge showed.


The difference between SELinux and containers is that SELinux (and  
LSM as a whole) returns -EPERM to operations outside the scope of  
the  subject, whereas containers return -ENOENT (because it's not  
even in  the same namespace).


Yes.  However if you look at what the first implementations were.   
Especially something like linux-vserver.  All they provided was  
isolation.  So perhaps you would not see every process ps but they  
all had unique pid values.


I'm pretty certain Serge at least prototyped a simplified version  
of that using the LSM hooks.  Is there something I'm not remembering  
in those hooks that allows hiding of information like processes?


Yes. Currently with containers we are taking that one step farther  
as that solves a wider set of problems.


IMHO, containers have a subtly different purpose from LSM even though  
both are about information hiding.  Basically a container is  
information hiding primarily for administrative reasons; either as a  
convenience to help prevent errors or as a way of describing  
administrative boundaries.  For example, even in an environment where  
all sysadmins are trusted employees, a few head-honcho sysadmins  
would get root container access, and all others would get access to  
specific containers as a way of preventing "oops" errors.  Basically  
a container is about "full access inside this box and no access  
outside".


By contrast, LSM is more strictly about providing *limited* access to  
resources.  For an accounting business all client records would be
grouped and associated together, however those which have passed this
year's review are read-only except by specific staff and others may  
have information restricted to some subset of the employees.


So containers are exclusive subsets of "the system" while LSM should  
be about non-exclusive information restriction.
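
To make the -EPERM versus -ENOENT distinction concrete, here is a toy
userspace model.  It is only a sketch: struct object, lsm_style_access()
and ns_style_access() are made-up names for illustration, and neither
function corresponds to real LSM or namespace code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical object: an ID plus the namespace and label it belongs to. */
struct object {
	int id;
	int ns;
	int label;
};

/*
 * LSM-style check: every subject can name every object, so the lookup
 * succeeds first and only then is access refused with -EPERM.
 */
static int lsm_style_access(const struct object *objs, size_t n,
			    int id, int subject_label)
{
	assert(objs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (objs[i].id != id)
			continue;
		return objs[i].label == subject_label ? 0 : -EPERM;
	}
	return -ENOENT;
}

/*
 * Container-style check: objects in other namespaces are never found at
 * all, so the caller cannot tell "hidden" from "does not exist".
 */
static int ns_style_access(const struct object *objs, size_t n,
			   int id, int subject_ns)
{
	assert(objs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (objs[i].id == id && objs[i].ns == subject_ns)
			return 0;
	}
	return -ENOENT;
}
```

The only difference between the two functions is where the filtering
happens: after the lookup (LSM) or as part of it (namespace).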



We also have in the kernel another parallel security mechanism  
(for what is generally a different class of operations) that has  
been  quite successful, and different groups get along quite  
well, and  ordinary mortals can understand it.   The linux  
firewalling code.


Well, I wouldn't go so far as the "ordinary mortals can understand  
it" part; it's still pretty high on the obtuse-o-meter.


True.  Probably a more accurate statement is: unix command line
power users can and do handle it after reading the docs.  That's  
not quite ordinary mortals but it feels like it some days.  It  
might all be perception...


I have seen more *wrong* iptables firewalls than I've seen correct  
ones.  Securing TCP/IP traffic properly requires either a lot of  
training/experience or a good out-of-the-box system like Shorewall  
which structures the necessary restrictions for you based on an  
abstract description of the desired functionality.  For instance what  
percentage of admins do you think could correctly set up their  
netfilter firewalls to log christmas-tree packets, smurfs, etc  
without the help of some external tool?  Hell, I don't trust myself  
to reliably do it without a lot of reading of docs and testing, and  
I've been doing netfilter firewalls for a while.


The bottom line is that with iptables it is *CRITICAL* to have a good  
set of interface tools to take the users' "My system is set up  
like..." description in some form and turn it into the necessary set  
of efficient security rules.  The *exact* same issue applies to  
SELinux, with 2 major additional problems:


1)  Half the tools are still somewhat beta-ish and under heavy  
development.  Furthermore the semi-official reference policy is  
nowhere near comprehensive and pretty ugly to read (go back to the  
point about the tools being beta-ish).


2)  If you break your system description or translation tools then  
instead of just your network dying your entire *system* dies.



The linux firewalling code has hooks all throughout the
networking stack, just like the LSM has hooks all throughout the
rest of the linux kernel.  There is a difference however.  The linux
firewalling code in addition to hooks has tables behind those  
hooks that it  consults. There is generic code to walk those  
tables and consult with different kernel modules to decide if we  
should drop a packet.  Each of those kernel modules provides a  
different capability that can be used to generate a firewall.
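
The "tables behind the hooks" arrangement described above can be shown
in miniature.  This is a sketch under stated assumptions, not netfilter
code: struct rule, walk_table() and the two match callbacks are
hypothetical names, and each callback stands in for one of the kernel
modules that contributes a capability.

```c
#include <assert.h>
#include <stddef.h>

enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

/* Hypothetical, heavily simplified packet. */
struct packet {
	unsigned int src_port;
	unsigned int dst_port;
	unsigned int protocol;
};

/* Each "module" supplies one match callback. */
typedef int (*match_fn)(const struct packet *pkt, unsigned int arg);

struct rule {
	match_fn match;
	unsigned int arg;
	enum verdict verdict;
};

static int match_dst_port(const struct packet *pkt, unsigned int port)
{
	return pkt->dst_port == port;
}

static int match_protocol(const struct packet *pkt, unsigned int proto)
{
	return pkt->protocol == proto;
}

/*
 * Generic walker: it knows nothing about ports or protocols.  It asks
 * each rule's callback in order and returns the verdict of the first
 * rule that matches, or the table policy if none does.
 */
static enum verdict walk_table(const struct rule *table, size_t nrules,
			       const struct packet *pkt, enum verdict policy)
{
	assert(pkt != NULL);
	assert(table != NULL || nrules == 0);
	for (size_t i = 0; i < nrules; i++) {
		if (table[i].match(pkt, table[i].arg))
			return table[i].verdict;
	}
	return policy;
}
```

New matching capabilities are added by writing another callback; the
walker and the hook that calls it stay unchanged.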


This is almost *EXACTLY* what SELinux provides as an LSM module.
The one difference is that with SELinux [...]

[...] fair amount of what we need is already done in SELinux, and
efforts would be better spent in figuring out what seems too
complicated in SELinux and making it simpler.  Probably a fair amount
of that just means better tools.


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] New kernel-message logging API (take 2)

2007-09-28 Thread Kyle Moffett

On Sep 28, 2007, at 03:31:11, Geert Uytterhoeven wrote:
Can't you store the loglevel in the kprint_block and check it in  
all successive kprint_*() macros? If gcc knows it's constant, it  
can optimize the non-wanted code away. As other fields in struct  
kprint_block cannot be constant (they store internal state), you  
have to split it like:


struct kprint_block {
int loglevel;
struct real_kprint_block real;  /* internal state */
}

and pass &block.real instead of &block to all successive internal
functions.  I haven't tried this, so let's hope gcc is actually  
smart enough...


Well actually, I believe you could just do:

struct kprint_block {
const int loglevel;
[...];
};

Then cast away the constness to actually set it initially:
*((int *)&block.loglevel) = LOGLEVEL;
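
A variant of the same idea that needs no cast, as a minimal sketch.
Strictly speaking ISO C leaves writing to a const-declared member
through a cast undefined, so this version sets the member with a
designated initializer instead.  All names here (KPRINT_MAX_LOGLEVEL,
KPRINT_BLOCK_INIT, kprint_line, nr_lines) are hypothetical and are not
part of the proposed kprint API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build-time threshold: levels above it are discarded. */
#define KPRINT_MAX_LOGLEVEL 4

struct kprint_block {
	const int loglevel;	/* fixed for the lifetime of the block */
	int nr_lines;		/* stand-in for the real internal state */
};

/* A designated initializer can set the const member without any cast. */
#define KPRINT_BLOCK_INIT(level) { .loglevel = (level), .nr_lines = 0 }

/*
 * Returns 1 if the line was emitted, 0 if it was filtered out.  When the
 * compiler can see that b->loglevel is a constant above the threshold it
 * is free to drop everything after the first test.
 */
static int kprint_line(struct kprint_block *b, const char *msg)
{
	assert(b != NULL && msg != NULL);
	if (b->loglevel > KPRINT_MAX_LOGLEVEL)
		return 0;
	b->nr_lines++;
	printf("<%d>%s\n", b->loglevel, msg);
	return 1;
}
```

Usage would be along the lines of
"struct kprint_block blk = KPRINT_BLOCK_INIT(3);" followed by calls to
kprint_line(&blk, ...).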

Cheers,
Kyle Moffett


Re: [PATCHSET 4/4] sysfs: implement new features

2007-09-27 Thread Kyle Moffett

On Sep 25, 2007, at 18:50:05, Greg KH wrote:

On Thu, Sep 20, 2007 at 05:31:37PM +0900, Tejun Heo wrote:
* Name-formatting for symlinks.  e.g. symlink pointing to
/dira/dirb/leaf can be named as "symlink:%1-%0" and it will show up as
"symlink:dirb-leaf".  This only applies when new interface is used.


Is this really necessary?  It looks like we are adding a "special"  
type of parser here that no one uses.


IMHO this would be nicer if it could reuse existing sprintf code to  
handle all the nice shiny sprintf format specifiers.  The only  
challenge would be how to dynamically build a varargs list from an  
array of component names although perhaps there could be an internal  
__csprintf function which took a callback for retrieving arguments.   
Also since all of the path components are strings I don't know that  
numeric specifiers could be made useful, so perhaps it's not the  
greatest idea.
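
For reference, the "%N" substitution on its own is small.  A minimal
sketch, assuming "%0" names the leaf, "%1" its parent and so on, and
that only single-digit indices are needed; format_link_name() is a
hypothetical helper written for this mail, not sysfs code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * comps[] holds the target path split into components, root first.
 * "%0" expands to the last component (the leaf), "%1" to its parent,
 * and so on.  Any other character, including a '%' not followed by a
 * digit, is copied literally.  Returns 0 on success, -1 on a bad index
 * or if the result does not fit in out[].
 */
static int format_link_name(const char *fmt, const char *const *comps,
			    size_t ncomps, char *out, size_t outlen)
{
	size_t used = 0;

	assert(fmt != NULL && comps != NULL && out != NULL);
	if (outlen == 0)
		return -1;

	while (*fmt) {
		const char *piece = fmt;
		size_t piece_len = 1;

		if (fmt[0] == '%' && fmt[1] >= '0' && fmt[1] <= '9') {
			size_t idx = (size_t)(fmt[1] - '0');

			if (idx >= ncomps)
				return -1;
			piece = comps[ncomps - 1 - idx];
			piece_len = strlen(piece);
			fmt += 2;
		} else {
			fmt += 1;
		}

		/* Always leave room for the terminating NUL. */
		if (used + piece_len >= outlen)
			return -1;
		memcpy(out + used, piece, piece_len);
		used += piece_len;
	}
	out[used] = '\0';
	return 0;
}
```

With the components "dira", "dirb", "leaf" and the format
"symlink:%1-%0" this produces "symlink:dirb-leaf", matching the example
quoted above.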


I think the primary importance for this functionality is:

* Autorenaming of symlinks according to the name format string  
when target or one of its ancestors is renamed or moved.  This  
only applies when new interface is used.


Nice.


Cheers,
Kyle Moffett


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Kyle Moffett

On Sep 27, 2007, at 17:34:45, Greg KH wrote:

On Thu, Sep 27, 2007 at 02:37:42PM -0400, Theodore Tso wrote:
The fact that sysfs is all laid out in a directory, but for which
some directories/symlinks are OK to use, and some are NOT OK to  
use --- is why I call the sysfs interface "an open pit".


And because of the original design mistakes, we have only been able  
to change things for the better in a slow manner.  We have had  
userspace programs fixed up for _years_ before we are able to make  
the corresponding changes in the kernel, so as to not break the  
distros that are slow to upgrade packages and kernels (like Debian.)


Hey!  No poking fingers at Debian here; it's been *MUCH* improved  
lately.  I far more frequently have problems with boxes still running  
some ancient release of RHEL-4 or something than I do with those  
running Debian stable (virtually always the latest Debian stable).


Cheers,
Kyle Moffett


Re: [PATCH 10/25] Unionfs: add un/likely conditionals on copyup ops

2007-09-26 Thread Kyle Moffett

On Sep 26, 2007, at 09:40:20, Erez Zadok wrote:

In message <[EMAIL PROTECTED]>, "Kok, Auke" writes:
I've been told several times that adding these is almost always  
bogus - either it messes up the CPU branch prediction or the  
compiler/CPU just does a lot better at finding the right way  
without these hints.


Adding them as a blanket seems rather strange. Have you got any  
numbers that this really improves performance?


Auke, that's a good question, but I found it hard to find any info  
about it.  There's no discussion on it in Documentation/, and very  
little I could find elsewhere.  I did see one url explaining what  
un/likely does precisely, but no guidelines.  My understanding is  
that it can improve performance, as long as it's used carefully  
(otherwise it may hurt performance).


Hmm, even still I agree with Auke, you probably use it too much.


Recently we've done a full audit of the entire code, and added
un/likely where we felt that the chance of succeeding is 95% or better
(e.g., error conditions that should rarely happen, and such).


Actually due to the performance penalty on some systems I think you  
only want to use it if the chance of succeeding is 99% or better, as  
the benefit if predicted is a cycle or two and the harm if  
mispredicted can be more than 50 cycles, depending on the CPU.  You  
should also remember that in filesystems many "failures" are
triggered by things like the ld.so library searches, where it  
literally calls access() 20 different times on various possible paths  
for library files, failing the first 19.  It does this once for each  
necessary library.


Typically you only want to add unlikely() or likely() for about 2  
reasons:
  (A)  It's a hot path and the unlikely case is just going to burn a  
bunch of CPU anyways
  (B)  It really is extremely unlikely that it fails (Think physical  
hardware failure)


Anything else is just bogus.
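
As a rough illustration of where a hint does and does not belong, here
is a small userspace sketch.  The macros have the same shape as the
kernel's and rely on the GCC/Clang builtin __builtin_expect; the two
helper functions, sum_bytes() and first_present(), are hypothetical
examples and not Unionfs code.

```c
#include <assert.h>
#include <stddef.h>

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/*
 * Hot-path helper.  A NULL buffer here would be a caller bug, so the
 * error branch really is expected to be (almost) never taken and is a
 * reasonable candidate for unlikely().
 */
static long sum_bytes(const unsigned char *buf, size_t len)
{
	long sum = 0;

	if (unlikely(buf == NULL))
		return -1;
	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/*
 * Path probe in the style of a library search: most probes miss before
 * one hits, so "not present" is the common case.  The branch is
 * deliberately left without a hint.
 */
static int first_present(const int *present, size_t n)
{
	assert(present != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (present[i])
			return (int)i;
	}
	return -1;
}
```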

Cheers,
Kyle Moffett


Re: Chroot bug

2007-09-26 Thread Kyle Moffett

On Sep 26, 2007, at 09:11:33, Miloslav Semler wrote:
+ long directory_is_out(struct vfsmount *wdmnt, struct dentry *wdentry,
+               struct vfsmount *rootmnt, struct dentry *root)
+ {
+       struct nameidata oldentry, newentry;
+       long ret = 1;
+
+       read_lock(&current->fs->lock);
+       oldentry.dentry = dget(wdentry);
+       oldentry.mnt = mntget(wdmnt);
+       read_unlock(&current->fs->lock);
+       newentry.dentry = oldentry.dentry;
+       newentry.mnt = oldentry.mnt;
+
+       follow_dotdot(&newentry);
+       /* check it */
+       if(newentry.dentry == root &&
+               newentry.mnt == rootmnt){
+               ret = 0;
+               goto out;
+       }
+
+       while(oldentry.mnt != newentry.mnt ||
+               oldentry.dentry != newentry.dentry){
+
+               memcpy(&oldentry, &newentry, sizeof(struct nameidata));
+               follow_dotdot(&newentry);
+
+               /* check it */
+               if(newentry.dentry == root &&
+                       newentry.mnt == rootmnt){
+                       ret = 0;
+                       goto out;
+               }
+       }
+ out:
+       dput(newentry.dentry);
+       mntput(newentry.mnt);
+       return ret;
+ }


This is basically both painfully racy and easily broken with umount  
and/or access to proc.  See this busybox-compatible example:


## Set up chroot
mkdir /root1
mount -o mode=0750 -t tmpfs tmpfs /root1
cp -a /bin/busybox /root1/busybox

## Enter chroot
chroot /root1 /busybox

## Mount proc
/busybox mkdir /proc
/busybox mount -t proc proc /proc

## Poke around root filesystem (this may be all you need)
/busybox ls /proc/1/root/

## Detach our chroot so we're no longer a sub-directory
/busybox umount -l /proc/1/root/root1

## Now we can easily chroot to the original root, since it isn't in our ".." path
exec /busybox chroot /proc/1/root /bin/sh


See how easy that is?  Unless you apply the above parent-directory
check (which is still racy against directories being moved around)
to *EVERY* directory component of *EVERY* open/chdir-ish syscall,
your check is still going to be easily worked around through many
different methods.


Cheers,
Kyle Moffett


Re: Chroot bug

2007-09-26 Thread Kyle Moffett

On Sep 26, 2007, at 06:27:38, David Newall wrote:

Kyle Moffett wrote:
David, please do tell myself and Adrian how "locking down"
chroot() the way you want will avoid letting root break out through any
of the above ways?


As has been said, there are thousands of ways to break out of a  
chroot.  It's just that one of them should not be that chroot lets  
you walk out.  I can't explain it clearer than that.  If you don't  
see it now you probably never will.


Let me put it this way:  You *CANNOT* enforce chroot() the way you  
want to without a completely unacceptable performance penalty.  Let's  
start with the simplest example of:


fd = open("/", O_DIRECTORY);
chroot("/foo");
fchdir(fd);
chroot(".");

If you had ever actually looked at the Linux VFS, you would know that it
is completely *impossible* to tell whether "fd" at the time of the chroot
is inside or outside of "/foo" without tracking an enormous amount of extra
state.  Even then, any such determination may not be valid since an  
FD may be opened to an inode which is hardlinked at multiple  
locations in the directory tree.  It could also be bind-mounted at  
multiple locations, or it may not even be mounted at all in this  
namespace (CDROM that was lazy-unmounted).  That FD may be later  
passed over an open UNIX-domain socket from another process.   
Moreover, arbitrarily closing FDs would break a huge number of  
programs.  Furthermore, since you can't fix the "trivial" case of  
'fchdir()', then there's no point in even *attempting* to fix the  
"cwd is outside of chroot" problem, although that is basically  
equivalent in difficulty to fixing the "dir-fd is outside of chroot"  
problem.
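
The underlying reason is that a directory FD refers to the directory
object and not to any path.  That part can be checked without root.
The following is a sketch only: fd_follows_directory() is a
hypothetical demo helper, it assumes a writable /tmp, and it does not
call chroot() at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Creates a temporary directory, opens it, renames it while the
 * descriptor is still open, then fchdir()s through the descriptor.
 * On success returns 0 and stores the resulting working directory in
 * cwd_out; returns -1 on any failure.  Note that it changes the calling
 * process's working directory (it ends up in "/").
 */
static int fd_follows_directory(char *cwd_out, size_t len)
{
	char old_path[] = "/tmp/fd-demo-XXXXXX";
	char new_path[sizeof(old_path) + sizeof(".moved")];
	int fd, ret = -1;

	assert(cwd_out != NULL && len > 0);
	if (mkdtemp(old_path) == NULL)
		return -1;
	snprintf(new_path, sizeof(new_path), "%s.moved", old_path);

	fd = open(old_path, O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		rmdir(old_path);
		return -1;
	}
	if (rename(old_path, new_path) != 0) {
		close(fd);
		rmdir(old_path);
		return -1;
	}

	/* The old name is gone, yet the descriptor still reaches the directory. */
	if (fchdir(fd) == 0 && getcwd(cwd_out, len) != NULL)
		ret = 0;

	if (chdir("/") != 0)
		ret = -1;
	rmdir(new_path);
	close(fd);
	return ret;
}
```

After the rename the reported working directory is the new name, even
though the descriptor was opened under the old one.  Deciding whether
such a descriptor is "inside" some other directory therefore means
walking the tree at the moment of the check, and the answer can change
immediately afterwards.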


As for the nested-chroot() bit, the root user inside of a chroot is
always allowed to chroot().  This is necessary for test-suites for
various distro installers: chroot once to enter the installer
playpen, then the installer chroots again to configure the
test-installed-system.  Once you allow a second chroot, you're back at
the "can't reliably and efficiently track directory sub-tree members"
problem.


So if you think it can and should be fixed, then PROVIDE THE CODE.

Cheers,
Kyle Moffett


Re: Chroot bug

2007-09-25 Thread Kyle Moffett

On Sep 25, 2007, at 20:55:51, Adrian Bunk wrote:

On Wed, Sep 26, 2007 at 09:20:54AM +0930, David Newall wrote:
Good call.  Though I suppose, since it's used 24x7 to aid security  
on countless production servers, that security dwarfs testing.   
Still, debugging, yes that's valid.


Incompetent people implementing security solutions are a real problem.

I don't suppose it makes any difference; whatever the purpose, a
chroot that doesn't change the root is buggy.


It does change the root.

But it does not limit what the root user can do after the root was  
changed.


This is required for most distro installers to work:

*Procedure to install files*
chroot /target
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t tmpfs tmpfs /dev
udevd --daemon
udevtrigger
udevsettle
mount /dev/cdrom0 /media/cdrom0
*Load more kernel modules*
*Procedure to configure newly-installed system*
*Do other highly-privileged operations*
*Configure networking and submit installation report*
*Reboot*

David, please do tell Adrian and me how "locking down" chroot()
the way you want would keep root from breaking out through any of the
above.


Hell, after you chroot one could probably just run:
  mount --bind /minimal_root /minimal_root
  cd /minimal_root
  mkdir old
  pivot_root . old
  cd /old
  mkdir old_minimal_root
  pivot_root . old_minimal_root
  umount /old_minimal_root
  rmdir /old_minimal_root
Now, like magic, the entire system is once more accessible.

Alternatively you could:
  mount -t proc proc /proc
  cat /proc/1/mounts
  mount -t $ROOTFS_FROM_PROC $ROOTDEV_FROM_PROC /

Either way root can trivially break out of any chroot using  
FUNDAMENTAL PRIMITIVES that he/she always has access to.  If you want  
to take those away you have to use SELinux or capabilities, in which  
case you could just take away the CAP_SYS_CHROOT capability in the  
first place!


Cheers,
Kyle Moffett



Re: [PATCH 1/3] Fix coding style

2007-09-25 Thread Kyle Moffett

On Sep 25, 2007, at 15:16:20, Ingo Oeser wrote:
> On Tuesday 25 September 2007, Srivatsa Vaddagiri wrote:
>> @@ -297,7 +293,7 @@ static int __init init_sched_debug_procf
>>         pe->proc_fops = &sched_debug_fops;
>>
>>  #ifdef CONFIG_FAIR_USER_SCHED
>> -       pe = create_proc_entry("root_user_share", 0644, NULL);
>> +       pe = create_proc_entry("root_user_cpu_share", 0644, NULL);
>>         if (!pe)
>>                 return -ENOMEM;
>
> What about moving this debug stuff under debugfs?  Please consider
> using the functions in <linux/debugfs.h>.  They compile into
> nothing if DEBUGFS is not compiled in, and they already have useful
> functions for reading/writing integers and booleans.


Umm, that's not a debugging thing.  It appears to be a tunable
that lets you configure what percentage of the total CPU UID 0
gets, which is likely to be useful to configure on production systems,
at least until better group-scheduling tools are produced.


Cheers,
Kyle Moffett



Re: [PATCH 1/2] bnx2: factor out gzip unpacker

2007-09-24 Thread Kyle Moffett

On Sep 24, 2007, at 13:32:23, Lennart Sorensen wrote:
> On Fri, Sep 21, 2007 at 11:37:52PM +0100, Denys Vlasenko wrote:
>> But I compile net/* into bzImage. I like netbooting :)
>
> Isn't it possible to netboot with an initramfs image?  I am pretty
> sure I have seen some systems do exactly that.


Yeah, I've got Debian boxes that have never *not* netbooted (one Dell
Op^?^?Craptiplex box whose BIOS and ACPI suck so badly that it can't even
load GRUB/LILO, although Windows somehow works fine).  So they boot
PXELinux using the PXE boot ROM on the NICs, and it loads both a
kernel and an initramfs into memory.  The kernel is stock Debian and
hardly has enough built in to spit at you, let alone find
network/disks, but it manages to load everything it needs off the
automagically-generated initramfs.


Cheers,
Kyle Moffett



Re: [PATCH] Uninline kcalloc()

2007-09-24 Thread Kyle Moffett

On Sep 24, 2007, at 01:35:08, [EMAIL PROTECTED] wrote:
> On Sun, 23 Sep 2007 00:03:49 +0400, Alexey Dobriyan said:
>> -static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
>> -{
>> -       if (n != 0 && size > ULONG_MAX / n)
>> -               return NULL;
>> -       return __kmalloc(n * size, flags | __GFP_ZERO);
>> -}
>> +void *kcalloc(size_t n, size_t size, gfp_t flags);
>
> NAK.
>
> This busticates some pretty subtle code in mm/slab.c that uses
> __builtin_return_address() for debugging - if you do this, then the
> "calling function" gets listed as "kcalloc()" rather than the much
> more useful "function that called kcalloc()" (which is what you
> care about).
>
> (I remember going around and around multiple times getting those
> stupid inlines set up right, so that feature actually did something
> useful, otherwise kcalloc and kzalloc didn't report where they were
> called from).


Proper fix is to give __kmalloc a "void *caller" parameter and have  
all of the various wrapper functions pass in the value of  
__builtin_return_address() appropriately.  I believe that even works  
properly for inline functions which may or may not be inlined.
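
A rough userspace sketch of that idea, assuming GCC or Clang for
__builtin_return_address() and the noinline attribute.  alloc_tracked(),
zalloc_array() and last_caller are hypothetical names standing in for
__kmalloc(), kcalloc() and the slab debugging record; this is not the
actual mm/slab.c code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debugging record: the most recent caller address seen. */
void *last_caller;

/* Stand-in for a __kmalloc() variant that is told who the caller is. */
void *alloc_tracked(size_t size, void *caller)
{
	last_caller = caller;	/* a real allocator would keep this per object */
	return malloc(size);
}

/*
 * Kept out of line in this sketch so that __builtin_return_address(0)
 * is this wrapper's own return address, i.e. a location inside
 * whichever function called zalloc_array().
 */
__attribute__((noinline))
void *zalloc_array(size_t n, size_t size)
{
	void *p;

	/* Same kind of overflow guard as the kcalloc() quoted above. */
	if (n != 0 && size > SIZE_MAX / n)
		return NULL;

	p = alloc_tracked(n * size, __builtin_return_address(0));
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

The allocator never has to guess who called it; each wrapper decides
which address is the interesting one and hands it down.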


Cheers,
Kyle Moffett



Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-24 Thread Kyle Moffett

On Sep 23, 2007, at 02:22:12, Goswin von Brederlow wrote:
> [EMAIL PROTECTED] (Mel Gorman) writes:
>> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>>> But when you already have, say, 10% of the RAM in mixed groups, then
>>> it is a sign that external fragmentation is happening and some time
>>> should be spent on moving movable objects.
>>
>> I'll play around with it on the side and see what sort of results
>> I get.  I won't be pushing anything any time soon in relation to
>> this though.  For now, I don't intend to fiddle more with grouping
>> pages by mobility for something that may or may not be of benefit
>> to a feature that hasn't been widely tested with what exists today.
>
> I watched the videos you posted. A nice and quite clear improvement
> with and without your logic. Kudos.
>
> When you play around with it, may I suggest a change to the display
> of the memory information. I think it would be valuable to use a
> Hilbert curve to arrange the pages into pixels. Like this:
>
> # #  0  3
> # #
> ###  1  2
>
> ### ###  0 1 E F
>   # #
> ### ###  3 2 D C
> #     #
> # ### #  4 7 8 B
> # # # #
> ### ###  5 6 9 A


Here's an excellent example of a Hilbert curve numbered 0-255, used
to enumerate the various top-level allocations of IPv4 space:

http://xkcd.com/195/
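
For anyone wanting to try the layout, here is a minimal sketch of the
usual iterative index-to-coordinate conversion for a Hilbert curve.  The
struct and function names are made up for this example; with y growing
downward it reproduces the 2x2 and 4x4 numbering in the diagrams.

```c
/* Hypothetical result type: one cell on the grid. */
struct hilbert_point {
	unsigned int x;
	unsigned int y;
};

/*
 * Map index d (0 .. n*n-1) to a cell on an n x n grid, where n is a
 * power of two.  y grows downward, as in the ASCII diagrams.
 */
struct hilbert_point hilbert_d2xy(unsigned int n, unsigned int d)
{
	struct hilbert_point p = { 0, 0 };
	unsigned int s, t = d;

	for (s = 1; s < n; s *= 2) {
		unsigned int rx = 1 & (t / 2);
		unsigned int ry = 1 & (t ^ rx);

		/* Rotate or flip the sub-square built so far. */
		if (ry == 0) {
			unsigned int tmp;

			if (rx == 1) {
				p.x = s - 1 - p.x;
				p.y = s - 1 - p.y;
			}
			tmp = p.x;
			p.x = p.y;
			p.y = tmp;
		}

		/* Move into the quadrant selected by this pair of bits. */
		p.x += s * rx;
		p.y += s * ry;
		t /= 4;
	}
	return p;
}
```

For example, hilbert_d2xy(4, 0xE) yields x = 2, y = 0, matching the
position of E in the top row of the 4x4 diagram.  Neighbouring indices
always land in neighbouring cells, which is why contiguous page ranges
would show up as compact blobs rather than thin stripes.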

Cheers,
Kyle Moffett


